Commit

Signed-off-by: SumanthRH <[email protected]>
SumanthRH committed Feb 6, 2025
1 parent 7a4ee28 commit 33b24fb
Showing 3 changed files with 31 additions and 7 deletions.
24 changes: 21 additions & 3 deletions skythought/skythought_evals/README.md
@@ -33,17 +33,25 @@ We further recommend streaming all outputs to a log file for reference:
```shell
python -m skythought_evals.eval --model Qwen/QwQ-32B-Preview --evals=aime,math500,gpqa_diamond --tp=8 --result-dir ./ 2>&1 | tee mylogs.txt
```


Example result: `{"AIME": <aime_accuracy>, "MATH500": <math500_accuracy>, "GPQADiamond": <gpqa_diamond_accuracy>}`

### Scaling evaluation with Ray

You can scale evaluations across multiple model replicas (and across multiple nodes) with `inference_and_check` using [ray](https://docs.ray.io):

```shell
python -m skythought_evals.inference_and_check --task math500 --model Qwen/Qwen2-7B-Instruct --max_tokens 4096 --split test --result-dir ./ --temperatures 0.7 --use-ray
```

By default, we use the configuration in [ray_configs/ray_config.yaml](./ray_configs/ray_config.yaml). You can customize it with `--ray-config /path/to/ray_config.yaml`.

#### Best-of-N Evaluation

While we are actively working on a better CLI interface, you can use `python -m skythought_evals.inference_and_check` for Best-of-N evaluation.

```bash
python -m skythought_evals.inference_and_check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --split test --result-dir ./ --inference --temperatures 0.7 --n 64
python -m skythought_evals.inference_and_check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --split test --result-dir ./ --check --temperatures 0.7 --n 8
python -m skythought_evals.inference_and_check --task math500 --model Qwen/Qwen2-7B-Instruct --tp 4 --max_tokens 4096 --split test --result-dir ./ --temperatures 0.7 --n 64
```
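The first command generates 64 samples per problem (`--inference --n 64`), the second scores the saved generations (`--check`), and the third runs both steps together. Best-of-N runs enable metrics such as pass@k (see `pass_at_k_metrics` in `inference_and_check.py`). For reference, below is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021); names are illustrative and the exact metric computation in this repo may differ.

```python
# Sketch of the unbiased pass@k estimator (Chen et al., 2021).
# Function and variable names are illustrative, not from skythought_evals.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n total samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k samples drawn without replacement are incorrect.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 64 generations per problem, 11 correct; estimate pass@8.
print(round(pass_at_k(n=64, c=11, k=8), 4))
```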

### Distillation and Rejection Sampling
@@ -63,3 +71,13 @@ python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32

python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32B-Preview --tp 8 --max_tokens 16384 --split train --end 20000 --source olympiads --filter-difficulty --result-dir $SKYT_HOME/data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9
```

### Reproducibility Issues

We've noticed that results on reasoning benchmarks can be hard to reproduce. Beyond the current lack of agreed-upon sampling parameters and metrics in the field, there can be significant differences in results across evaluation codebases, and even for the same codebase with a different set of dependencies. In bfloat16/half-precision, numerical error accumulation changes outputs ever so slightly, which can dramatically alter final performance. In particular, we've noticed the following factors affect results:

- Long-context generations: Numerical errors can accumulate enough to change the output beyond ~1k tokens, and these differences compound as generation continues. Since we typically set the maximum number of tokens to 16k or 32k, the final solution can change significantly.
- vLLM settings: With vLLM at half precision, we've noticed that different batch sizes can shift downstream evaluation results by a few percentage points. Different tensor parallelism settings can likewise change half-precision results.

To avoid this, we recommend running all evaluation benchmarks at full precision, i.e. `float32`. By default, we run evaluation in `float32`; this can be customized with the `--dtype` flag. At full precision, evaluation results should be robust to changes in batch size, tensor parallel size, etc.
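As a concrete illustration of the accumulation effect, here is a small standalone snippet (assuming PyTorch is available; it is not part of this repo): summing the same values in `bfloat16` and `float32` yields slightly different totals, and over a 16k-token generation such small discrepancies can flip a sampled token and cascade into a different final solution.

```python
# Standalone illustration (not from this repo): half-precision accumulation
# error vs. full precision.
import torch

torch.manual_seed(0)
x = torch.randn(16384)  # stand-in for a long sequence of activations

sum_fp32 = x.sum()  # accumulate in float32
sum_bf16 = x.to(torch.bfloat16).sum().float()  # accumulate in bfloat16

print(f"float32 sum:  {sum_fp32.item():.6f}")
print(f"bfloat16 sum: {sum_bf16.item():.6f}")
print(f"abs diff:     {(sum_fp32 - sum_bf16).abs().item():.6f}")
```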
9 changes: 7 additions & 2 deletions skythought/skythought_evals/eval.py
@@ -40,7 +40,10 @@ def parse_arguments():
         "--result-dir", type=str, default=".", help="Directory to save result files."
     )
     parser.add_argument(
-        "--output-file", type=str, default="", help="[OBSOLETE] Output file to save results to."
+        "--output-file",
+        type=str,
+        default="",
+        help="[OBSOLETE] Output file to save results to.",
     )
     return parser.parse_args()

@@ -71,7 +74,9 @@ def write_logs_to_file(logs, output_file):
 def main():
     args = parse_arguments()
     if args.output_file:
-        warnings.warn("`output-file` CLI argument is obsolete and will be ignored.")
+        warnings.warn(
+            "`output-file` CLI argument is obsolete and will be ignored.", stacklevel=1
+        )
     # Extract the arguments
     model_path = args.model
     evals = args.evals.split(",")
5 changes: 3 additions & 2 deletions skythought/skythought_evals/inference_and_check.py
@@ -251,7 +251,7 @@ def perform_inference_and_check(
     print(f"Final acc: {total_correct}/{total_finish}")

     acc = round(total_correct / total_finish, 4) if total_finish > 0 else 0
-    temperature_to_acc[temp] = acc
+    temperature_to_acc[f"{temp=}"] = acc
     print(json.dumps({"acc": acc}))

     pass_at_k_metrics = None
@@ -607,7 +607,8 @@ def main():
     # load ray config
     if args.use_ray:
         warnings.warn(
-            "`tp` CLI argument is not compatible with `use-ray` and will be ignored. Please configure tensor parallel size in the `ray_config` YAML",stacklevel=1
+            "`tp` CLI argument is not compatible with `use-ray` and will be ignored. Please configure tensor parallel size in the `ray_config` YAML",
+            stacklevel=1,
         )
     if not args.ray_config:
         # load default
