Failure to reproduce AIME score on Sky-T1-32B-Preview #38

Open
RyanMarten opened this issue Jan 21, 2025 · 4 comments
@RyanMarten

python eval.py --model NovaSky-AI/Sky-T1-32B-Preview --evals=AIME --tp=8 --output_file=results.txt --temperatures 0.7

{"acc": 0.3667}

The reported score was 43.3, which is significantly different from the 36.67 (acc 0.3667) I get with the command above.

@RyanMarten
Author

Looks like @fanqiwan similarly got a lower AIME score:

#33 (comment)

@2proveit

My score is similar to yours: 0.333.

@rucnyz

rucnyz commented Jan 23, 2025

My score is even worse: 0.30.
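
To put the numbers in this thread in context, here is a rough sketch that converts each quoted accuracy into a count of solved problems and a binomial standard error. It assumes the AIME eval set has 30 problems, which is not stated in this thread, and it treats the reported 43.3 as an accuracy of 0.433. The names `N_PROBLEMS`, `solved`, and `std_error` are made up for illustration and are not part of `eval.py`.

```python
import math

# Assumption: the AIME eval set has 30 problems (not stated in this thread).
N_PROBLEMS = 30

# Accuracies quoted in this thread; the reported 43.3 is taken as 0.433.
scores = {
    "reported": 0.433,
    "RyanMarten": 0.3667,
    "2proveit": 0.333,
    "rucnyz": 0.30,
}


def solved(acc, n=N_PROBLEMS):
    """Number of problems implied by an accuracy measured over n problems."""
    return round(acc * n)


def std_error(acc, n=N_PROBLEMS):
    """Binomial standard error of an accuracy measured on n problems."""
    return math.sqrt(acc * (1 - acc) / n)


for name, acc in scores.items():
    print(f"{name}: {solved(acc)}/{N_PROBLEMS} solved, "
          f"std. error ~{std_error(acc):.3f}")
```

Under that assumption, the gap between the reported score and the runs here is a handful of problems, and a single run at temperature 0.7 carries a standard error of roughly 0.09, so comparing several runs may be more informative than comparing one.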

@tyler-griggs
Collaborator

Hi all, thanks for flagging these discrepancies! We have also found problems in our evaluation framework and are currently working to organize and refactor it in #47.
