Add code for evaluating pass @ k to inference_and_check #64 #65
Conversation
# pass at k per temperature
# (assumes `import math` and `import numpy as np` at module level; `correct[temp]`
# maps each problem to its list of per-sample correctness flags)
scores = list(correct[temp].values())
num_scores = len(scores[0])
N = num_scores
k = num_scores

actual_accuracy = sum([sum(sample) for sample in scores]) / (
    len(scores) * num_scores
)
print(f"Actual accuracy: {actual_accuracy}")
final_bon_scores = {}

while k > 0:
    new_scores = []
    for sample in scores:
        # calculate pass @ k
        num_correct = np.sum(sample)
        pass_k = 1 - (math.comb(N - num_correct, k) / math.comb(N, k))
        new_scores.append(pass_k)
    final_bon_scores[k] = round(np.mean(new_scores) * 100, 3)
    k = k // 2

print("Final pass @ k:")
for k, s in final_bon_scores.items():
    print(f"k: {k}, pass @ k: {s}")
temp_correct = sum([any(x) for x in scores])
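For reference, the expression inside the loop is the standard unbiased pass @ k estimator: with N samples per problem, of which c are correct, pass@k = 1 - C(N - c, k) / C(N, k). A minimal standalone sketch of the same calculation (assuming Python 3.8+ for math.comb):

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        # chance that at least one of k samples drawn without replacement
        # from n total samples (c of them correct) is correct
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # e.g. n=4 samples, 2 correct: 1 - C(2, 2) / C(4, 2) = 1 - 1/6 ≈ 0.833
    print(pass_at_k(4, 2, 2))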
Can we just wait on this?
For now it's very trivial to get custom metrics like pass @ k once you save all the scores for all N responses in a file.
It would actually be better to have the user specify a list of metrics they want, and we would compute those and save the final metrics somewhere.
If you prefer, we can just have a utility script for now that computes pass @ k given the saved results JSON.
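A standalone utility along those lines could look roughly like the sketch below; the JSON layout here (problem id mapped to a list of per-sample correctness flags) is only an assumption for illustration, not the actual saved-results format:

    import json
    import math
    import sys

    def pass_at_k(n, c, k):
        # unbiased pass@k estimator
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    def main(path):
        with open(path) as f:
            results = json.load(f)  # assumed: {problem_id: [true, false, ...]}
        scores = list(results.values())
        n = len(scores[0])
        k = n
        while k > 0:
            mean_pass_k = sum(pass_at_k(n, sum(s), k) for s in scores) / len(scores)
            print(f"k: {k}, pass @ k: {round(mean_pass_k * 100, 3)}")
            k //= 2

    if __name__ == "__main__":
        main(sys.argv[1])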
well this function is essentially the utility script no?
i agree we should add + save additional metrics and have a nicer way to organize them code-wise than this, though, yeah
but i was thinking we could punt on that until we need to report some other metrics (r1 does cons @ k?)
well this function is essentially the utility script no?
Yeah... why have it here rather than just the utility script itself, if we already know we're gonna change this?
sure, i can separate the code out from the perform_check function and have it call a get_metrics function, so it can either be called from a standalone script or run as part of perform_check, which also has to update correctness for samples.
Does that work?
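That split might end up shaped roughly like the sketch below; the metric names, signature, and score layout are illustrative assumptions rather than the final interface:

    import math
    from typing import Dict, List

    def _pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    def get_metrics(scores: List[List[bool]], metrics: List[str]) -> Dict[str, float]:
        # scores: one list of per-sample correctness flags per problem
        out: Dict[str, float] = {}
        n = len(scores[0])
        if "accuracy" in metrics:
            out["accuracy"] = sum(sum(s) for s in scores) / (len(scores) * n)
        if "pass_at_k" in metrics:
            k = n
            while k > 0:
                out[f"pass@{k}"] = sum(_pass_at_k(n, sum(s), k) for s in scores) / len(scores)
                k //= 2
        return out

    # get_metrics could then be called from perform_check after it updates
    # correctness, or from a standalone script that loads saved results.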
Closing in favour of #67
Fixes minor bug in perform_check and adds code for checking the pass @ k metric for n > 1 samples.
For example, if we run the following with a saved file DeepSeek-R1-Distill-Qwen-7B_aime_train_None_False_0_-1.json with n=128 examples per question, we will get the following output now: