
Add code for evaluating pass @ k to inference_and_check #64 #65

Closed
wants to merge 1 commit

Conversation

erictang000 (Collaborator)

Fixes a minor bug in perform_check and adds code for computing the pass @ k metric for n > 1 samples.

For example, if we run the following with a saved file DeepSeek-R1-Distill-Qwen-7B_aime_train_None_False_0_-1.json containing n=128 samples per question

python inference_and_check.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --task aime  --split train --max_tokens 32768   --inference --n 128  --temperatures 0.6 --tp 1 --check

we now get the following output:

Temperature: [0.6]
Loaded 30 existing results.
Found 3840 responses requiring reject sampling...
Processing Reject Sampling: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3840/3840 [00:02<00:00, 1432.24it/s]
Final reject-sampling accuracy: 2052/3840
Actual accuracy: 0.534375
Final pass @ k:
k: 128, pass @ k: 90.0
k: 64, pass @ k: 84.999
k: 32, pass @ k: 82.379
k: 16, pass @ k: 80.524
k: 8, pass @ k: 78.576
k: 4, pass @ k: 74.26
k: 2, pass @ k: 65.496
k: 1, pass @ k: 53.438
Temperature 0.6 acc: 27/30 (0.9)
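
For reference, the per-k values above come from the unbiased pass @ k estimator, 1 - C(N - c, k) / C(N, k), computed per question and averaged; a minimal illustrative sketch (the helper name here is just for exposition, the actual diff is shown below):

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without replacement
    # from n total samples (c of them correct) is correct.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With N=128 samples per question, pass @ 1 is just the per-question fraction
# of correct samples, so its average over questions lines up with the
# "Actual accuracy" line above (2052/3840 = 0.534375, i.e. ~53.4%).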

Comment on lines +292 to +317
# pass at k per temperature
scores = list(correct[temp].values())
num_scores = len(scores[0])
N = num_scores
k = num_scores

actual_accuracy = sum([sum(sample) for sample in scores]) / (
    len(scores) * num_scores
)
print(f"Actual accuracy: {actual_accuracy}")
final_bon_scores = {}

while k > 0:
    new_scores = []
    for sample in scores:
        # calculate pass @ k
        num_correct = np.sum(sample)
        pass_k = 1 - (math.comb(N - num_correct, k) / math.comb(N, k))
        new_scores.append(pass_k)
    final_bon_scores[k] = round(np.mean(new_scores) * 100, 3)
    k = k // 2

print("Final pass @ k:")
for k, s in final_bon_scores.items():
    print(f"k: {k}, pass @ k: {s}")
temp_correct = sum([any(x) for x in scores])
Collaborator

Can we just wait on this?

For now it's very trivial to get custom metrics like pass @ k once you save all the scores for all N responses in a file.

It would actually be better to have the user specify a list of metrics they want, and we would compute them and save the final metrics somewhere.

If you prefer we can just have a utility script for now that will compute pass @ k given the saved results json.
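
A minimal sketch of what such a standalone utility could look like, assuming the saved results json maps each problem to a list of per-sample correctness flags under a "correctness" key (the script name, file layout, and key name are assumptions for illustration, not the repo's actual schema):

# compute_pass_at_k.py (hypothetical utility, not part of this PR)
import argparse
import json
import math

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("results_json", help="path to the saved results json")
    args = parser.parse_args()

    with open(args.results_json) as f:
        results = json.load(f)

    # Assumed layout: {problem_id: {"correctness": [true, false, ...]}, ...}
    scores = [entry["correctness"] for entry in results.values()]
    n = len(scores[0])

    k = n
    while k > 0:
        vals = [pass_at_k(n, int(np.sum(s)), k) for s in scores]
        print(f"k: {k}, pass @ k: {round(np.mean(vals) * 100, 3)}")
        k //= 2


if __name__ == "__main__":
    main()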

Collaborator Author

well this function is essentially the utility script no?

erictang000 (Collaborator Author), Feb 4, 2025

i agree we should add and save additional metrics, and have a nicer way to organize them code-wise than this, yeah

Collaborator Author

but i was thinking we could punt on that until we need to report some other metrics (r1 does cons @ k?)

Collaborator

> well this function is essentially the utility script no?

Yeah... why have it here rather than just the utility script itself, if we already know we're gonna change this?

Collaborator Author

sure, i can separate the code out from the perform_check function into a get_metrics function that it calls, so the metrics can either be computed from a standalone script or as part of perform_check (which also has to update correctness for samples).

Does that work?
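
A rough sketch of how that split could look; the function and argument names here are illustrative, not the eventual implementation:

import math

import numpy as np


def get_metrics(correct_by_problem: dict) -> dict:
    # correct_by_problem maps each problem to a list of per-sample booleans
    # (this mirrors correct[temp] in the diff above).
    scores = list(correct_by_problem.values())
    n = len(scores[0])
    accuracy = sum(sum(s) for s in scores) / (len(scores) * n)

    pass_at_k = {}
    k = n
    while k > 0:
        vals = [
            1 - math.comb(n - int(np.sum(s)), k) / math.comb(n, k)
            for s in scores
        ]
        pass_at_k[k] = round(np.mean(vals) * 100, 3)
        k //= 2
    return {"accuracy": accuracy, "pass_at_k": pass_at_k}

# perform_check() would update per-sample correctness and then call
# get_metrics(correct[temp]); a standalone script could call it directly
# on results loaded from the saved json.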

SumanthRH (Collaborator)

Closing in favour of #67

SumanthRH closed this on Feb 7, 2025