Refactor Eval Suite #23

Open
caoshiyi opened this issue Jan 17, 2025 · 2 comments · Fixed by #47
caoshiyi (Member) commented Jan 17, 2025

Currently, we implement the task handlers for every task in a single task_handler.py, which will keep growing as we add more tasks. I am proposing the following refactors:

  1. Further modularize it: each task gets its own TASK_NAME.py that contains that task's handler (a rough sketch of how this and the config loading in point 2 could fit together follows this list).
  2. Use configuration files (e.g., JSON, YAML) to define task-specific parameters such as dataset names, split types, and special prompts. This avoids hardcoding details in the code. An example:

     ```yaml
     tasks:
       - name: MATH500
         dataset: qq8933/MATH500
         split: test
     ```

  3. Include tests for the different tasks.
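To make points 1 and 2 concrete, here is a minimal sketch of how per-task modules, a registry, and a YAML config could fit together. All names below (`TaskHandler`, `MATH500Handler`, `TASK_REGISTRY`, the `tasks.yaml` schema) are hypothetical and only illustrate the idea; they are not existing code.

```python
# Hypothetical sketch -- class/module names and the YAML schema are illustrative.
from dataclasses import dataclass

import yaml  # pip install pyyaml


@dataclass
class TaskConfig:
    name: str
    dataset: str
    split: str


class TaskHandler:
    """Base class; each task module (e.g. math500.py) would subclass this."""

    def __init__(self, config: TaskConfig):
        self.config = config

    def make_prompt(self, problem: dict) -> str:
        raise NotImplementedError

    def check_correctness(self, problem: dict, response: str) -> bool:
        raise NotImplementedError


class MATH500Handler(TaskHandler):
    """Would live in its own math500.py instead of a monolithic task_handler.py."""

    def make_prompt(self, problem: dict) -> str:
        return f"Return your final answer within \\boxed{{}}.\n\n{problem['problem']}"

    def check_correctness(self, problem: dict, response: str) -> bool:
        return problem["answer"] in response  # placeholder for real answer checking


# Registry mapping task names (as they appear in the YAML) to handler classes.
TASK_REGISTRY = {"MATH500": MATH500Handler}


def load_handlers(config_path: str) -> dict:
    """Instantiate one handler per task entry in the YAML config file."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return {
        entry["name"]: TASK_REGISTRY[entry["name"]](TaskConfig(**entry))
        for entry in config["tasks"]
    }
```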

We can start the refactor after #19, #20, and #21 are merged.

This is a draft proposal for now; I will add more later. Feel free to discuss here.
@erictang000 @SumanthRH @richardliaw @kouroshHakha

caoshiyi added the enhancement (New feature or request) label Jan 17, 2025
caoshiyi self-assigned this Jan 17, 2025
SumanthRH (Collaborator) commented

Makes sense! One example that does this well is lm-evaluation-harness, where many of the prompt formatting / string post-processing rules live in YAML files. For example, GSM8K: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
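As an illustration of that pattern (the field names below are made up for this sketch, not taken from the linked file), the prompt formatting for a task could be a template string in the task's YAML rather than Python code:

```python
# Illustrative only: the YAML fields below are hypothetical, not lm-evaluation-harness's schema.
import yaml

task_yaml = """
name: GSM8K
dataset: gsm8k
split: test
prompt_template: "Question: {question}\\nAnswer:"
"""

cfg = yaml.safe_load(task_yaml)
record = {"question": "Natalia sold clips to 48 friends. ..."}
prompt = cfg["prompt_template"].format(**record)
print(prompt)
# Question: Natalia sold clips to 48 friends. ...
# Answer:
```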

DachengLi1 (Collaborator) commented

Yes, I was about to raise an issue: currently, when users add new models, they need to hard-code the path matching the model name in model_utils.py. @SumanthRH
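For reference, a minimal sketch of what fixing this could look like, with model metadata looked up by name from a YAML file instead of hard-coded branches in model_utils.py (the file name, fields, and template keys below are hypothetical):

```python
# Hypothetical models.yaml lookup; fields and template keys are illustrative.
import yaml

MODELS_YAML = """
models:
  Sky-T1-32B-Preview:
    path: NovaSky-AI/Sky-T1-32B-Preview
    system_prompt_template: sky-t1
  Qwen2.5-72B-Instruct:
    path: Qwen/Qwen2.5-72B-Instruct
    system_prompt_template: qwen
"""


def get_model_config(name: str) -> dict:
    """Look up a model by name; an unknown model raises a clear error
    instead of requiring a code change in model_utils.py."""
    registry = yaml.safe_load(MODELS_YAML)["models"]
    if name not in registry:
        raise ValueError(f"Unknown model '{name}'; add an entry to models.yaml")
    return registry[name]


print(get_model_config("Sky-T1-32B-Preview")["path"])  # NovaSky-AI/Sky-T1-32B-Preview
```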

SumanthRH reopened this Feb 1, 2025
SumanthRH added a commit that referenced this issue Feb 4, 2025
Corrects README for new CLI. Fixes APPS task handler and corrects Math task handler's key to "math"
SumanthRH added a commit that referenced this issue Feb 5, 2025
Another refactor PR on top of #23, now focused on model-specific configurations and data generation.

- Model-specific system prompts, user templates, etc. are best kept in a YAML file.
- TaskHandler should be model-agnostic, since we want consistent evaluation logic across all tasks.
- Data curation scripts for the different Sky-T1 models should live outside the `skythought_evals` package. These are mostly scripts focused on a particular data curation step, like filtering or rewriting. My proposal is to place common scripts in `scripts/`. A guide for obtaining the final training data, plus the training commands for the different Sky-T1 models, should be placed in `recipes/`. For now, all data curation scripts are in the `scripts` folder.
- Adds a new `system-prompt-template` CLI flag. Users can leverage available templates, such as those for Sky-T1 or Qwen, when evaluating a different model (a rough sketch of how this flag could be wired up follows this list).
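As a rough sketch of how such a flag could be wired up (this is not the actual CLI code; the template names and contents below are placeholders):

```python
# Hypothetical wiring for a --system-prompt-template flag; not the actual CLI code.
import argparse

SYSTEM_PROMPT_TEMPLATES = {
    "sky-t1": "Your role as an assistant involves thoroughly exploring questions ...",
    "qwen": "You are a helpful and harmless assistant.",
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--system-prompt-template",
    choices=sorted(SYSTEM_PROMPT_TEMPLATES),
    default="qwen",
    help="Named system prompt to prepend, independent of the task handler.",
)

args = parser.parse_args(["--system-prompt-template", "sky-t1"])
system_prompt = SYSTEM_PROMPT_TEMPLATES[args.system_prompt_template]
print(system_prompt)
```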