Initial commit of GPT-NeoX
Viatcheslav Gurev committed Aug 29, 2023
1 parent 303d7be commit 1ab9747
Showing 42 changed files with 4,398 additions and 14 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,6 +5,10 @@ __pycache__/

# C extensions
*.so
save

# zarr file output
*.zarr

# Distribution / packaging
.Python
416 changes: 416 additions & 0 deletions LICENSE


120 changes: 117 additions & 3 deletions README.md
@@ -1,7 +1,121 @@
[![GitHub issues](https://img.shields.io/github/issues/EleutherAI/gpt-neox)](https://github.com/EleutherAI/gpt-neox/issues)
[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Weights & Biases monitoring" height=20>](https://wandb.ai/eleutherai/neox)
# Times-NeoX

This repository is a fork of [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) that implements a model for univariate time series forecasting. The model is based on [lag-GPT](https://github.com/kashif/pytorch-transformer-ts/tree/main/lag-gpt) and [GluonTS](https://ts.gluon.ai/), with [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) as the training engine. To create the Times-NeoX model, we replaced the embedding and softmax layers of the GPT model with a projection layer and a density model head, respectively.

![GPT vs Times-NeoX](images/TimesNeoX.svg?raw=true "GPT vs Times")
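
For illustration, a minimal sketch of this change (module names, the feature dimension, and the Student-t head below are illustrative assumptions, not the actual Times-NeoX/lag-GPT classes): the token embedding becomes a linear projection of real-valued inputs into the hidden size, and the vocabulary softmax becomes a head that parameterizes a predictive distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionInput(nn.Module):
    """Replaces the token embedding: projects real-valued features into the hidden size."""

    def __init__(self, num_features: int, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(num_features, hidden_size)

    def forward(self, x):                      # x: [batch, time, num_features]
        return self.proj(x)                    # -> [batch, time, hidden_size]


class DensityHead(nn.Module):
    """Replaces the softmax head: emits parameters of a Student-t predictive distribution."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.args = nn.Linear(hidden_size, 3)  # degrees of freedom, loc, scale

    def forward(self, h):                      # h: [batch, time, hidden_size]
        df, loc, scale = self.args(h).unbind(dim=-1)
        return torch.distributions.StudentT(
            df=2.0 + F.softplus(df),           # keep df > 2 so the variance is finite
            loc=loc,
            scale=F.softplus(scale) + 1e-6,    # strictly positive scale
        )
```

Training then maximizes the log-likelihood of the next value under the predicted distribution instead of a cross-entropy over tokens.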

## Training and inference
We adapted the GPT-NeoX training and inference scripts ***train.py*** and ***generate.py***; the new scripts carry the suffix "-times": ***train-times.py*** and ***generate-times.py***. The original GPT-NeoX scripts were renamed to ***trainGPT.py*** and ***generateGPT.py***. Please refer to the GPT-NeoX documentation for how to launch the scripts.

## Config files
Please see an example config file in the configs/times_configs folder.

## Options
The model uses most of the GPT-NeoX options (please see the GPT-NeoX documentation below), with the addition of the "times_args" arguments:

```json
"times_args": {
    "context_length": 1024,
    "prediction_length": 17,
    "scaling": "std",
    "shuffle_buffer_length": 1000,
    "padding_value": 0,
    "data_seed": 10,

    "inference": {
        "num_test_batches": 2,
        "file_name": "output.zarr",
        "chunk_size": 128
    },

    "datasets": {

        "train": [
            "airpassengers", "australian_electricity_demand", "car_parts_without_missing"
        ],
        "validation": [
            "cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate"
        ],
        "test": [
            "airpassengers", "australian_electricity_demand",
            "cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate"
        ],

        "augmentation": {
            "enabled": false,
            "prob": 0.5,
            "transforms": {
                "freq_mask": {
                    "weight": 0.0,
                    "options": {
                        "rate": 0.01
                    }
                },
                "freq_mix": {
                    "weight": 0.0,
                    "options": {
                        "rate": 0.01
                    }
                },
                "permutation": {
                    "weight": 0.0,
                    "options": {
                        "max_segments": 7,
                        "seg_mode": "random"
                    }
                },
                "rotation": {
                    "weight": 0.0
                },
                "magnitude_warp": {
                    "weight": 0.0,
                    "options": {
                        "sigma": 0.7,
                        "knot": 4
                    }
                },
                "time_warp": {
                    "weight": 0.0,
                    "options": {
                        "sigma": 0.7,
                        "knot": 4
                    }
                },
                "window_slice": {
                    "weight": 0.0,
                    "options": {
                        "reduce_ratio": 0.7
                    }
                },
                "window_warp": {
                    "weight": 1.0,
                    "options": {
                        "window_ratio": 0.2,
                        "scales": [0.5, 2.0]
                    }
                }
            }
        }
    }
}
```

### Model input
The dataloaders sample windows of length **context_length** + **prediction_length** from the dataset time series and pseudo-shuffle the samples with a buffer of size **shuffle_buffer_length**. **data_seed** sets the seed for the dataloaders. Unobserved values in the datasets are replaced with **padding_value**. During training of the autoregressive model, values in the context and prediction windows are treated the same; however, the input time series values are normalized by the **scaling** scaler, which is fitted on the context window only.
During inference, the input must be at least **context_length** long.
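
A minimal sketch of this sampling and scaling logic (a simplified NumPy illustration under assumed semantics; the actual pipeline is built from GluonTS transformations):

```python
import numpy as np


def sample_window(series, context_length, prediction_length, padding_value=0.0, rng=None):
    """Sample one training window of length context_length + prediction_length."""
    rng = rng or np.random.default_rng()
    total = context_length + prediction_length
    series = np.nan_to_num(np.asarray(series, dtype=float), nan=padding_value)  # replace unobserved values
    start = rng.integers(0, max(len(series) - total, 0) + 1)
    window = series[start:start + total]
    if len(window) < total:                                  # left-pad short series
        window = np.concatenate([np.full(total - len(window), padding_value), window])
    context = window[:context_length]
    mean, std = context.mean(), context.std() + 1e-8         # "std" scaler fitted on the context only
    return (window - mean) / std                             # whole window normalized with context statistics
```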

### Datasets
The **datasets** option sets the lists of training, validation, and test datasets from the GluonTS library, together with the list of possible augmentations (see below).
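
The dataset names refer to entries in the GluonTS dataset repository; for example (standalone GluonTS usage outside the training pipeline, assuming the `gluonts` package is installed):

```python
from gluonts.dataset.repository.datasets import get_dataset

# Each name in the lists above refers to a dataset in the GluonTS repository.
ds = get_dataset("electricity")            # downloaded and cached on first use
entry = next(iter(ds.train))               # one series: a dict with "start" and "target"
print(ds.metadata.freq, len(entry["target"]))
```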

### Augmentation
Time series augmentation is configured with the **augmentation** option. **prob** defines the probability of applying augmentation, and the probability of each individual transform is weighted by its **weight** option.
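
A sketch of one plausible reading of these options (assumed semantics for illustration, not the actual sampler): with probability **prob** a single transform is drawn with chances proportional to the **weight** values and applied with its **options**.

```python
import numpy as np


def maybe_augment(window, transforms, prob, rng=None):
    """transforms: {name: {"weight": w, "options": {...}, "fn": callable}} (hypothetical layout)."""
    rng = rng or np.random.default_rng()
    if rng.random() >= prob:
        return window                                        # leave this sample unchanged
    names = list(transforms)
    weights = np.array([transforms[n]["weight"] for n in names], dtype=float)
    if weights.sum() == 0:
        return window                                        # all transforms disabled
    chosen = rng.choice(names, p=weights / weights.sum())    # pick one transform by weight
    cfg = transforms[chosen]
    return cfg["fn"](window, **cfg.get("options", {}))       # e.g. window_warp(window, window_ratio=0.2, ...)
```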

### Inference

Use ***generate-times.py*** to predict time series for the GluonTS datasets in the **test** list. Specify the number of inference batches with **num_test_batches**. Results are saved to a file in the [zarr](https://zarr.readthedocs.io/en/stable/) format; each data-parallel partition writes to a separate group of the zarr file. Each group contains *ground_truth*, *past_target*, and *output* arrays: *past_target* holds the context windows, *ground_truth* holds the ground truth for the future window, and *output* holds the model output for the future window. Please see ***print_zarr.py*** in the [tools](/tools/) folder to plot series from a zarr file and save the figures to PDF.
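
A minimal sketch of inspecting the results file (the per-rank group layout is assumed from the description above):

```python
import zarr

root = zarr.open("output.zarr", mode="r")
print(root.tree())                             # one group per data-parallel partition

for name, group in root.groups():
    past = group["past_target"][:]             # context windows
    truth = group["ground_truth"][:]           # ground truth for the future window
    pred = group["output"][:]                  # model output for the future window
    print(name, past.shape, truth.shape, pred.shape)
```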


# GPT-NeoX
# README from GPT-NeoX

This repository records [EleutherAI](https://www.eleuther.ai)'s library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's [Megatron Language Model](https://github.com/NVIDIA/Megatron-LM) and has been augmented with techniques from [DeepSpeed](https://www.deepspeed.ai) as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.

217 changes: 217 additions & 0 deletions configs/times_configs/49M.yml
@@ -0,0 +1,217 @@
{
#"save": "save",
#"load": "save",
#"checkpoint_factor": 1000,
#"extra_save_iters": [10, 20, 30],
#"keep_last_n_checkpoints": 3,
#"checkpoint-scale": "linear",

"gradient_accumulation_steps": 1,

"checkpoint": {
"tag_validation":"Warn",
"load_universal":false,
"use_node_local_storage":false,
"parallel_write": {
"pipeline_stage": false
},
},

# For TFLOPS calculation
"seq_length": 1040,


"num_gpus": 2,
# parallelism settings
"pipe_parallel_size": 2,
"model_parallel_size": 1,

"times_args": {
"context_length": 1024,
"prediction_length": 10,
"scaling": "std",
"shuffle_buffer_length": 1000,
"padding_value": 0,
"data_seed": 10,

"inference": {
"num_test_batches": 1,
"file_name": "output.zarr",
"chunk_size": 128
},

"datasets":{
"train": [
"airpassengers", "australian_electricity_demand", "car_parts_without_missing",
"cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate",
"fred_md", "hospital", "kaggle_web_traffic_weekly", "kdd_cup_2018_without_missing",
"london_smart_meters_without_missing", "nn5_daily_with_missing", "nn5_weekly", "pedestrian_counts",
"rideshare_without_missing", "saugeenday", "solar-energy", "solar_10_minutes", "solar_weekly", "taxi_30min",
"temperature_rain_without_missing", "tourism_monthly", "uber_tlc_daily", "uber_tlc_hourly", "vehicle_trips_without_missing",
"weather", "wiki-rolling_nips", "m4_daily", "m4_hourly", "m4_monthly", "m4_quarterly", "m4_yearly", "wind_farms_without_missing"
],
"validation": [
"airpassengers", "australian_electricity_demand", "car_parts_without_missing",
"cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate",
"fred_md", "hospital", "kaggle_web_traffic_weekly", "kdd_cup_2018_without_missing",
"london_smart_meters_without_missing", "nn5_daily_with_missing", "nn5_weekly", "pedestrian_counts",
"rideshare_without_missing", "saugeenday", "solar-energy", "solar_10_minutes", "solar_weekly", "taxi_30min",
"temperature_rain_without_missing", "tourism_monthly", "uber_tlc_daily", "uber_tlc_hourly", "vehicle_trips_without_missing",
"weather", "wiki-rolling_nips", "m4_daily", "m4_hourly", "m4_monthly", "m4_quarterly", "m4_yearly", "wind_farms_without_missing"
],
"test":[
"airpassengers", "australian_electricity_demand",
],

"augmentation": {
"enabled": true,
"prob": 0.3,
"transforms": {
"freq_mask": {
"weight": 1.0,
"options": {
"rate": 0.01
}
},
"freq_mix": {
"weight": 1.0,
"options": {
"rate": 0.01
}
},
"permutation": {
"weight": 1.0,
"options": {
"max_segments": 7,
"seg_mode": "random"
}
},
"rotation": {
"weight": 1.0
},
"magnitude_warp": {
"weight": 1.0,
"options": {
"sigma": 0.7,
"knot": 4
}
},
"time_warp": {
"weight": 1.0,
"options": {
"sigma": 0.7,
"knot": 4
}
},
"window_slice": {
"weight": 1.0,
"options": {
"reduce_ratio": 0.7,
}
},
"window_warp": {
"weight": 1.0,
"options": {
"window_ratio": 0.2,
"scales": [0.5, 2.0],
}
}
}
},


}
},

# model settings
"num_layers": 10,
"hidden_size": 640,
"num_attention_heads": 10,
"max_position_embeddings": 2048,
"pos_emb": "rotary",
"rotary_pct": 0.25,
"gpt_j_residual": true,
"output_layer_parallelism": "column",

# these should provide some speedup but takes a while to build, set to true if desired
"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

# init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0008,
"betas": [0.9, 0.95],
"eps": 1.0e-8,
}
},
"min_lr": 0.00008,

# for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
"stage": 1,
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
},

"csv_monitor": {
"enabled": true,
"output_path": "logs",
"job_name": "debug_run",
},

# batch / data settings
"train_micro_batch_size_per_gpu": 32,
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

# activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

# regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

"precision": "fp32",

# precision settings
"fp16": {
"fp16": false,
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1,
},

# misc. training settings
"train_iters": 143000,
"lr_decay_iters": 143000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
#"eval_interval": 100000,
"eval_interval": 30,
"eval_iters": 10,

# logging
"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,
}
2 changes: 1 addition & 1 deletion deepy.py
@@ -33,7 +33,7 @@ def main():
    if wandb_token is not None:
        deepspeed.launcher.runner.EXPORT_ENVS.append("WANDB_API_KEY")
        os.environ["WANDB_API_KEY"] = wandb_token

    deepspeed.launcher.runner.main(deepspeed_main_args)


7 changes: 7 additions & 0 deletions generate-times.py
@@ -0,0 +1,7 @@
from megatron.laggpt.inference import inference, initialize
from torch.distributed import barrier

neox_args, model, times_envelope, data_iterator = initialize()
inference(neox_args, model, times_envelope, data_iterator)

barrier()
File renamed without changes.
1 change: 1 addition & 0 deletions images/TimesNeoX.svg
20 changes: 20 additions & 0 deletions inference_script.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# Runs Times-NeoX inference with the 49M config

# async I/O flags
export LDFLAGS="$LDFLAGS -L/usr/lib64/"
export CFLAGS="$CFLAGS -I/usr/include/"
# c++ libs
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/vgurev/.conda/envs/GPT/x86_64-conda-linux-gnu/lib/
export PATH=/data/vgurev/.conda/envs/GPT/bin/:$PATH

# use mpirun, not the pytorch launcher
export MPI=TRUE

GPUS_PER_NODE=2
NNODES=1
export WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

python ./deepy.py generate-times.py 49M.yml
