Commit

Signed-off-by: SumanthRH <[email protected]>
SumanthRH committed Feb 7, 2025
1 parent 55692a2 commit d796ceb
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions skythought/skythought_evals/README.md
@@ -94,7 +94,7 @@ python -m skythought_evals.inference_and_check --task numina --model Qwen/QwQ-32
We've noticed that it can be hard to reproduce results on reasoning benchmarks. Beyond the lack of agreed-upon sampling parameters and metrics in the field at the moment, there can be significant differences in results across different evaluation codebases, and even for the same codebase with a different set of dependencies. In half-precision (bfloat16 or float16), numerical error accumulation changes outputs ever so slightly, which can dramatically alter final performance. There are three factors we've noticed that affect results:

- Long-context generations: Errors can accumulate so that the output changes at 1k+ tokens, and they compound as you keep generating. Since we typically set max tokens to 16k or 32k, the final solution can change significantly.
- - vLLM settings: With vLLM, we’ve also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.\
- - vLLM version: Different versions of vLLM will use different CUDA-Toolkit/ Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.
+ - vLLM settings: With vLLM, we’ve also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.
+ - vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.

We recommend running all evaluation benchmarks at full precision, i.e. `float32`, to avoid this. By default, we run evaluation in `float32`; this can be customized with the `--dtype` flag. At full precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
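
For concreteness, here is a minimal invocation sketch of the full-precision run recommended above. It uses only flags that appear in this README excerpt (`--task`, `--model`, `--dtype`); the model name is an assumed example, since the command in the hunk header is truncated here, and `float32` is already the default.

```bash
# Minimal sketch: evaluate at full precision so results are robust to
# batch size, tensor-parallel size, and vLLM version differences.
# The model name below is illustrative; --dtype float32 is the default
# and is passed explicitly only for clarity.
python -m skythought_evals.inference_and_check \
    --task numina \
    --model Qwen/QwQ-32B-Preview \
    --dtype float32
```

If you do evaluate in half-precision, the implication of the factors listed above is that batch size, tensor-parallel setting, and vLLM version should be held fixed across any runs you intend to compare.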
