Refactor Eval Suite #23

Open
caoshiyi opened this issue Jan 17, 2025 · 2 comments · Fixed by #47
caoshiyi (Member) commented Jan 17, 2025

Currently, we implement the task handlers for every task in a single task_handler.py, which will keep growing as we add more tasks. I am proposing the following refactors:

  1. Further modularize it: each task gets its own TASK_NAME.py that contains that task's handler (a rough sketch of how this and the config loading in point 2 could fit together follows this list).
  2. Use configuration files (e.g., JSON, YAML) to define task-specific parameters such as dataset names, split types, and special prompts. This avoids hardcoding details in the code. An example:

     ```yaml
     tasks:
       - name: MATH500
         dataset: qq8933/MATH500
         split: test
     ```

  3. Include tests for the different tasks.
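To make points 1 and 2 concrete, here is a minimal sketch of how per-task modules, a registry, and a YAML config could fit together. All names below (`TaskHandler`, `MATH500Handler`, `TASK_REGISTRY`, the `tasks.yaml` schema) are hypothetical and only illustrate the idea; they are not existing code.

```python
# Hypothetical sketch -- class/module names and the YAML schema are illustrative.
from dataclasses import dataclass

import yaml  # pip install pyyaml


@dataclass
class TaskConfig:
    name: str
    dataset: str
    split: str


class TaskHandler:
    """Base class; each task module (e.g. math500.py) would subclass this."""

    def __init__(self, config: TaskConfig):
        self.config = config

    def make_prompt(self, problem: dict) -> str:
        raise NotImplementedError

    def check_correctness(self, problem: dict, response: str) -> bool:
        raise NotImplementedError


class MATH500Handler(TaskHandler):
    """Would live in its own math500.py instead of a monolithic task_handler.py."""

    def make_prompt(self, problem: dict) -> str:
        return f"Return your final answer within \\boxed{{}}.\n\n{problem['problem']}"

    def check_correctness(self, problem: dict, response: str) -> bool:
        return problem["answer"] in response  # placeholder for real answer checking


# Registry mapping task names (as they appear in the YAML) to handler classes.
TASK_REGISTRY = {"MATH500": MATH500Handler}


def load_handlers(config_path: str) -> dict:
    """Instantiate one handler per task entry in the YAML config file."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return {
        entry["name"]: TASK_REGISTRY[entry["name"]](TaskConfig(**entry))
        for entry in config["tasks"]
    }
```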

We can start the refactor after #19, #20, and #21 are merged.

This is a draft proposal for now; I will add more later. Feel free to discuss here.
@erictang000 @SumanthRH @richardliaw @kouroshHakha

caoshiyi added the enhancement (New feature or request) label Jan 17, 2025
caoshiyi self-assigned this Jan 17, 2025
SumanthRH (Collaborator) commented

Makes sense! One example that does this well is lm-evaluation-harness, where many of the prompt formatting / string post-processing rules live in YAML files. For example, GSM8K: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
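As an illustration of that pattern (the field names below are made up for this sketch, not taken from the linked file), the prompt formatting for a task could be a template string in the task's YAML rather than Python code:

```python
# Illustrative only: the YAML fields below are hypothetical, not lm-evaluation-harness's schema.
import yaml

task_yaml = """
name: GSM8K
dataset: gsm8k
split: test
prompt_template: "Question: {question}\\nAnswer:"
"""

cfg = yaml.safe_load(task_yaml)
record = {"question": "Natalia sold clips to 48 friends. ..."}
prompt = cfg["prompt_template"].format(**record)
print(prompt)
# Question: Natalia sold clips to 48 friends. ...
# Answer:
```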

DachengLi1 (Collaborator) commented

Yes, I was about to raise an issue: currently, when users add new models, they need to hard-code the path matching the model name in model_utils.py. @SumanthRH
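For reference, a minimal sketch of what fixing this could look like, with model metadata looked up by name from a YAML file instead of hard-coded branches in model_utils.py (the file name, fields, and template keys below are hypothetical):

```python
# Hypothetical models.yaml lookup; fields and template keys are illustrative.
import yaml

MODELS_YAML = """
models:
  Sky-T1-32B-Preview:
    path: NovaSky-AI/Sky-T1-32B-Preview
    system_prompt_template: sky-t1
  Qwen2.5-72B-Instruct:
    path: Qwen/Qwen2.5-72B-Instruct
    system_prompt_template: qwen
"""


def get_model_config(name: str) -> dict:
    """Look up a model by name; an unknown model raises a clear error
    instead of requiring a code change in model_utils.py."""
    registry = yaml.safe_load(MODELS_YAML)["models"]
    if name not in registry:
        raise ValueError(f"Unknown model '{name}'; add an entry to models.yaml")
    return registry[name]


print(get_model_config("Sky-T1-32B-Preview")["path"])  # NovaSky-AI/Sky-T1-32B-Preview
```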

SumanthRH reopened this Feb 1, 2025
SumanthRH added a commit that referenced this issue Feb 4, 2025
Corrects README for new CLI. Fixes APPS task handler and corrects Math task handler's key to "math"
SumanthRH added a commit that referenced this issue Feb 5, 2025
Another refactor PR on top of #23, now focused on model-specific configurations and data generation.

- Model-specific system prompts, user templates, etc. are best kept in a YAML file.
- TaskHandler should be model-agnostic, since we want consistent evaluation logic across all tasks.
- Data curation scripts for the different Sky-T1 models should live outside the `skythought_evals` package. These are mostly scripts focused on a particular data curation step, like filtering or rewriting. My proposal is to place common scripts in `scripts/`. A guide for obtaining the final training data, plus the training commands for the different Sky-T1 models, should be placed in `recipes/`. For now, all data curation scripts are in the `scripts` folder.
- Adds a new `system-prompt-template` CLI flag. Users can leverage available templates, such as those for Sky-T1 or Qwen, when evaluating a different model (a rough sketch of how this flag could be wired up follows this list).
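As a rough sketch of how such a flag could be wired up (this is not the actual CLI code; the template names and contents below are placeholders):

```python
# Hypothetical wiring for a --system-prompt-template flag; not the actual CLI code.
import argparse

SYSTEM_PROMPT_TEMPLATES = {
    "sky-t1": "Your role as an assistant involves thoroughly exploring questions ...",
    "qwen": "You are a helpful and harmless assistant.",
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--system-prompt-template",
    choices=sorted(SYSTEM_PROMPT_TEMPLATES),
    default="qwen",
    help="Named system prompt to prepend, independent of the task handler.",
)

args = parser.parse_args(["--system-prompt-template", "sky-t1"])
system_prompt = SYSTEM_PROMPT_TEMPLATES[args.system_prompt_template]
print(system_prompt)
```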