
Add code for evaluating pass @ k to inference_and_check #64 #65

Closed
wants to merge 1 commit

Conversation

erictang000 (Collaborator)

Fixes a minor bug in perform_check and adds code for computing the pass @ k metric for n > 1 samples.

For example, if we run the following with a saved file DeepSeek-R1-Distill-Qwen-7B_aime_train_None_False_0_-1.json containing n=128 samples per question

python inference_and_check.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --task aime  --split train --max_tokens 32768   --inference --n 128  --temperatures 0.6 --tp 1 --check

we now get the following output:

Temperature: [0.6]
Loaded 30 existing results.
Found 3840 responses requiring reject sampling...
Processing Reject Sampling: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3840/3840 [00:02<00:00, 1432.24it/s]
Final reject-sampling accuracy: 2052/3840
Actual accuracy: 0.534375
Final pass @ k:
k: 128, pass @ k: 90.0
k: 64, pass @ k: 84.999
k: 32, pass @ k: 82.379
k: 16, pass @ k: 80.524
k: 8, pass @ k: 78.576
k: 4, pass @ k: 74.26
k: 2, pass @ k: 65.496
k: 1, pass @ k: 53.438
Temperature 0.6 acc: 27/30 (0.9)
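
For reference, the per-k values above come from the unbiased pass @ k estimator, 1 - C(N - c, k) / C(N, k), computed per question and averaged; a minimal illustrative sketch (the helper name here is just for exposition, the actual diff is shown below):

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without replacement
    # from n total samples (c of them correct) is correct.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With N=128 samples per question, pass @ 1 is just the per-question fraction
# of correct samples, so its average over questions lines up with the
# "Actual accuracy" line above (2052/3840 = 0.534375, i.e. ~53.4%).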

Comment on lines +292 to +317
# pass at k per temperature
scores = list(correct[temp].values())
num_scores = len(scores[0])
N = num_scores
k = num_scores

actual_accuracy = sum([sum(sample) for sample in scores]) / (
    len(scores) * num_scores
)
print(f"Actual accuracy: {actual_accuracy}")
final_bon_scores = {}

while k > 0:
    new_scores = []
    for sample in scores:
        # calculate pass @ k
        num_correct = np.sum(sample)
        pass_k = 1 - (math.comb(N - num_correct, k) / math.comb(N, k))
        new_scores.append(pass_k)
    final_bon_scores[k] = round(np.mean(new_scores) * 100, 3)
    k = k // 2

print("Final pass @ k:")
for k, s in final_bon_scores.items():
    print(f"k: {k}, pass @ k: {s}")
temp_correct = sum([any(x) for x in scores])
Collaborator

Can we just wait on this?

For now it's very trivial to get custom metrics like pass @ k once you save all the scores for all N responses in a file.

It would actually be better to have the user specify a list of metrics they want, and we would compute them and save the final metrics somewhere.

If you prefer we can just have a utility script for now that will compute pass @ k given the saved results json.
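
A minimal sketch of what such a standalone utility could look like, assuming the saved results json maps each problem to a list of per-sample correctness flags under a "correctness" key (the script name, file layout, and key name are assumptions for illustration, not the repo's actual schema):

# compute_pass_at_k.py (hypothetical utility, not part of this PR)
import argparse
import json
import math

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("results_json", help="path to the saved results json")
    args = parser.parse_args()

    with open(args.results_json) as f:
        results = json.load(f)

    # Assumed layout: {problem_id: {"correctness": [true, false, ...]}, ...}
    scores = [entry["correctness"] for entry in results.values()]
    n = len(scores[0])

    k = n
    while k > 0:
        vals = [pass_at_k(n, int(np.sum(s)), k) for s in scores]
        print(f"k: {k}, pass @ k: {round(np.mean(vals) * 100, 3)}")
        k //= 2


if __name__ == "__main__":
    main()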

Collaborator Author

well this function is essentially the utility script no?

erictang000 (Collaborator Author), Feb 4, 2025

i agree we should add and save additional metrics, and have a nicer way to organize them code-wise than this, yeah

Collaborator Author

but i was thinking we could punt on that until we need to report some other metrics (r1 does cons @ k?)

Collaborator

> well this function is essentially the utility script no?

Yeah... why have it here rather than just the utility script itself, if we already know we're gonna change this?

Collaborator Author

sure, i can separate the code out from the perform_check function into a get_metrics function that it calls, so the metrics can either be computed from a standalone script or as part of perform_check (which also has to update correctness for samples).

Does that work?
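
A rough sketch of how that split could look; the function and argument names here are illustrative, not the eventual implementation:

import math

import numpy as np


def get_metrics(correct_by_problem: dict) -> dict:
    # correct_by_problem maps each problem to a list of per-sample booleans
    # (this mirrors correct[temp] in the diff above).
    scores = list(correct_by_problem.values())
    n = len(scores[0])
    accuracy = sum(sum(s) for s in scores) / (len(scores) * n)

    pass_at_k = {}
    k = n
    while k > 0:
        vals = [
            1 - math.comb(n - int(np.sum(s)), k) / math.comb(n, k)
            for s in scores
        ]
        pass_at_k[k] = round(np.mean(vals) * 100, 3)
        k //= 2
    return {"accuracy": accuracy, "pass_at_k": pass_at_k}

# perform_check() would update per-sample correctness and then call
# get_metrics(correct[temp]); a standalone script could call it directly
# on results loaded from the saved json.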

SumanthRH (Collaborator)

Closing in favour of #67

SumanthRH closed this on Feb 7, 2025