Pass@k #519

clefourrier · 2025-01-27T08:44:23Z

No description provided.

clefourrier · 2025-01-27T08:56:40Z

src/lighteval/metrics/metrics_sample.py

+        strip_strings: bool = False,
+        sample_scoring_function: Union[Callable[[str, str], float], str] = None,
+    ):
+        """Computing pass at k


I made it as exhaustive/customizable as the other metrics (full exact match on the individual predictions by default, plus options to normalize strings in case you use it for math evals, for example), but I can remove some options if you feel that's too much complexity.

clefourrier · 2025-01-27T08:56:53Z

src/lighteval/metrics/metrics_sample.py

+            self.score_sample = self.default_sample_scoring
+
+    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict[str, float]:
+        """Computes the metric over a list of golds and predictions for one single item with possibly many samples.


Core logic here

clefourrier · 2025-01-27T08:57:09Z

src/lighteval/metrics/metrics_sample.py

+        return 1 if gold == pred else 0
+
+    def pass_at_k(self, all_scores: list[int]) -> float:
+        """Algo from https://arxiv.org/pdf/2107.03374"""


Pass@k here, literally the one from the Codex paper.
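For readers who don't have the paper at hand, here is a minimal sketch of the unbiased estimator it describes, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The standalone function name `pass_at_k_estimate` and the explicit `k` argument are assumptions for illustration; the method in this PR takes only `all_scores` and its body is not shown in the snippet above.

```python
from math import comb


def pass_at_k_estimate(all_scores: list[int], k: int) -> float:
    """Hypothetical standalone version of the pass@k estimator.

    all_scores holds one 0/1 score per sample for a single item.
    """
    n = len(all_scores)                        # total samples drawn
    c = sum(1 for s in all_scores if s == 1)   # samples scored as correct
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The paper itself uses an equivalent running-product form to avoid very large binomial coefficients; with Python's arbitrary-precision integers the direct ratio is fine for a sketch.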

HuggingFaceDocBuilderDev · 2025-01-27T09:04:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

hynky1999 · 2025-01-27T20:36:59Z

Wouldn't it be best to make it dynamic?
E.g. you just wrap an existing metric with it, so that it's more flexible?
Then I could create it like:
metric=pass_at_k(my_existing_metric, k=10), for example.

clefourrier · 2025-01-28T08:07:55Z

You would be able to create a custom metric like so, with a custom sample level metric:

    your_custom_pass_at = SampleLevelMetric(
        metric_name="pass@",
        sample_level_fn=PassAtK(k=10, sample_scoring_function=my_existing_metric).compute,
        category=MetricCategory.GENERATIVE_SAMPLING,
        use_case=MetricUseCase.REASONING,
        corpus_level_fn=np.mean,
        higher_is_better=True,
    )

Unless you need something else?
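As an illustration of what `my_existing_metric` could look like, here is a sketch of a scorer matching the `Callable[[str, str], float]` type from the signature in the diff. The name `numeric_match` and its behavior are hypothetical; it is written to be symmetric in its two arguments because the argument order PassAtK uses when calling the scorer is not shown in this thread.

```python
from typing import Callable


def numeric_match(a: str, b: str) -> float:
    """Hypothetical scorer: 1.0 if both strings parse to the same number.

    Falls back to exact match on the stripped strings when either
    side is not a number.
    """
    try:
        return float(float(a.strip()) == float(b.strip()))
    except ValueError:
        return float(a.strip() == b.strip())


# The function satisfies the callable type accepted by sample_scoring_function.
scorer: Callable[[str, str], float] = numeric_match
```

It would then be passed as `sample_scoring_function=numeric_match` in the snippet above.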

hynky1999 · 2025-01-28T11:08:52Z

Ahhh, I missed that arg. Good by me; I thought it was exclusively for string-to-string comparison.

clefourrier added 3 commits January 27, 2025 09:41

init

0a49ac0

correct typing

93d108e

added defaults

5bea2f6

small fix

2bd6eb6

clefourrier requested a review from NathanHB January 27, 2025 10:01
