Minor fixes after #23 (#58)
Corrects the README for the new CLI. Fixes the APPS task handler and corrects the Math task handler's key to "math".
SumanthRH authored Feb 4, 2025
1 parent a85f0f4 commit a399909
Showing 4 changed files with 26 additions and 11 deletions.
23 changes: 20 additions & 3 deletions README.md
@@ -35,14 +35,31 @@

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.
- ``/data``: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
-- ``skythought/tools``: Training data curation and evaluation for Sky-T1. To generate our training data, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality.
+- ``skythought/skythought_evals``: Our data generation and evaluation library. To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning and use a rejection sampling procedure to improve data quality.
- ``skythought/train``: Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Our model training was completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offloading, costing approximately $450 as per Lambda Cloud pricing.


# Evaluation
-Following, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

+## Usage
+
+First, clone the repository and install the package:
+
+```shell
+git clone https://github.com/NovaSky-AI/SkyThought.git
+cd SkyThought
+# installs shown for conda
+conda create -n eval python==3.10
+conda activate eval
+pip install -e .
+```
+
+For running evaluation, please refer to [skythought_evals/README.md](skythought/skythought_evals/README.md).
+
+
+### Evaluation results
+Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|-----------------------|---------------------|--------|-------|------------|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
@@ -51,7 +68,7 @@
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
-| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | - |
+| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |

#### Results on non-reasoning benchmarks

8 changes: 2 additions & 6 deletions skythought/skythought_evals/README.md
@@ -2,12 +2,8 @@
This document describes the training data curation and evaluation scripts for Sky-T1.

## Requirements
-First create the environment as follows.
-```shell
-conda create -n eval python==3.10
-conda activate eval
-pip install -r requirements.txt
-```

+Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md).

To run OpenAI models, export the OpenAI key.
```shell
2 changes: 1 addition & 1 deletion skythought/skythought_evals/tasks/__init__.py
@@ -17,7 +17,7 @@
"numina": NUMINATaskHandler,
"apps": APPSTaskHandler,
"taco": TACOTaskHandler,
"math500": MathTaskHandler,
"math": MathTaskHandler,
"aime": AIMETaskHandler,
"gpqa_diamond": GPQADiamondTaskHandler,
"mmlu": MMLUTaskHandler,
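
The mapping above is the task registry: evaluation task names are the keys and handler classes are the values, so a requested task resolves only if its key matches exactly. Renaming "math500" to "math" is what lets the "math" task resolve to `MathTaskHandler`. Below is a minimal sketch of how such a registry is typically consumed; the `get_task_handler` helper and the stub class are assumptions for illustration, not code from this repository:

```python
# Sketch of a name -> handler registry lookup. The "math" key mirrors the
# diff above; get_task_handler and the stub class are illustrative only.

class MathTaskHandler:
    """Stub standing in for the real handler class."""

TASK_HANDLER_MAP = {
    "math": MathTaskHandler,  # was "math500" before this commit
}

def get_task_handler(task_name: str):
    try:
        return TASK_HANDLER_MAP[task_name]()
    except KeyError:
        raise ValueError(
            f"Unknown task {task_name!r}; expected one of {sorted(TASK_HANDLER_MAP)}"
        ) from None

handler = get_task_handler("math")  # resolves only with the corrected key
```
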
4 changes: 3 additions & 1 deletion skythought/skythought_evals/tasks/apps/apps_handler.py
@@ -101,7 +101,7 @@ def make_conversations(self, data, system_prompt, model=None):
    def load_and_filter_dataset(
        self, start, end, split=None, subset=None, difficulty=None, args=None
    ):
-        train_data = self.load_dataset(subset=subset, split=split).to_pandas()
+        train_data = self.load_dataset(subset=subset, split=split)
        if difficulty or "difficulty" in self.task_config.preprocess_config:
            difficulty = (
                self.task_config.preprocess_config["difficulty"]
@@ -110,6 +110,8 @@ def load_and_filter_dataset(
            )
            train_data = train_data.filter(lambda x: x["difficulty"] == difficulty)

+        train_data = train_data.to_pandas()
+
        return train_data.iloc[start:end] if end > 0 else train_data.iloc[start:]

    def process_remaining_data(self, train_data, results):
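
The `.to_pandas()` move fixes a type bug: `datasets.Dataset.filter` takes a per-example predicate, but after `.to_pandas()` the object is a pandas DataFrame, whose `filter()` method selects rows and columns by label rather than by predicate, so the difficulty filter could not work on the converted frame. Filtering first and converting afterwards keeps predicate-based filtering on the Hugging Face dataset. A self-contained sketch of the corrected order follows; the toy dataset is invented for illustration:

```python
# Filter the Hugging Face Dataset first, then convert to pandas.
# The toy data below is made up to illustrate the corrected ordering.
from datasets import Dataset

train_data = Dataset.from_dict(
    {"problem": ["p1", "p2", "p3"], "difficulty": ["easy", "hard", "easy"]}
)

difficulty = "easy"
# Dataset.filter applies a per-example predicate, which the handler relies on.
train_data = train_data.filter(lambda x: x["difficulty"] == difficulty)

# Convert only after filtering; DataFrame.filter matches labels, not predicates.
train_df = train_data.to_pandas()

start, end = 0, -1
print(train_df.iloc[start:end] if end > 0 else train_df.iloc[start:])
```
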
