Initial commit of GPT-NeoX
Viatcheslav Gurev committed Aug 29, 2023
1 parent 303d7be commit 1ab9747
Showing 42 changed files with 4,398 additions and 14 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,6 +5,10 @@ __pycache__/

# C extensions
*.so
save

# zarr file output
*.zarr

# Distribution / packaging
.Python
416 changes: 416 additions & 0 deletions LICENSE


120 changes: 117 additions & 3 deletions README.md
@@ -1,7 +1,121 @@
[![GitHub issues](https://img.shields.io/github/issues/EleutherAI/gpt-neox)](https://github.com/EleutherAI/gpt-neox/issues)
[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Weights & Biases monitoring" height=20>](https://wandb.ai/eleutherai/neox)
# Times-NeoX

This repository is a fork of [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) that implements a model for univariate time series forecasting. The model is based on [lag-GPT](https://github.com/kashif/pytorch-transformer-ts/tree/main/lag-gpt) and [GluonTS](https://ts.gluon.ai/), with [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) as the training engine. To create the Times-NeoX model, we replaced the embedding and softmax layers of the GPT model with a projection layer and a density model head, respectively.

![GPT vs Times-NeoX](images/TimesNeoX.svg?raw=true "GPT vs Times")
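
For illustration, a minimal sketch of this change (module names, the feature dimension, and the Student-t head below are illustrative assumptions, not the actual Times-NeoX/lag-GPT classes): the token embedding becomes a linear projection of real-valued inputs into the hidden size, and the vocabulary softmax becomes a head that parameterizes a predictive distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionInput(nn.Module):
    """Replaces the token embedding: projects real-valued features into the hidden size."""

    def __init__(self, num_features: int, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(num_features, hidden_size)

    def forward(self, x):                      # x: [batch, time, num_features]
        return self.proj(x)                    # -> [batch, time, hidden_size]


class DensityHead(nn.Module):
    """Replaces the softmax head: emits parameters of a Student-t predictive distribution."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.args = nn.Linear(hidden_size, 3)  # degrees of freedom, loc, scale

    def forward(self, h):                      # h: [batch, time, hidden_size]
        df, loc, scale = self.args(h).unbind(dim=-1)
        return torch.distributions.StudentT(
            df=2.0 + F.softplus(df),           # keep df > 2 so the variance is finite
            loc=loc,
            scale=F.softplus(scale) + 1e-6,    # strictly positive scale
        )
```

Training then maximizes the log-likelihood of the next value under the predicted distribution instead of a cross-entropy over tokens.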

## Training and inference
We adapted the GPT-NeoX training and inference scripts ***train.py*** and ***generate.py***; the new scripts carry the suffix "-times": ***train-times.py*** and ***generate-times.py***. The original GPT-NeoX scripts were renamed to ***trainGPT.py*** and ***generateGPT.py***. Please refer to the GPT-NeoX documentation for how to launch the scripts.

## Config files
Please see an example config file in the configs/times_configs folder.

## Options
The model uses most of the GPT-NeoX options (please see the GPT-NeoX documentation below), with the addition of the "times_args" arguments:

```json
"times_args": {
    "context_length": 1024,
    "prediction_length": 17,
    "scaling": "std",
    "shuffle_buffer_length": 1000,
    "padding_value": 0,
    "data_seed": 10,

    "inference": {
        "num_test_batches": 2,
        "file_name": "output.zarr",
        "chunk_size": 128
    },

    "datasets": {

        "train": [
            "airpassengers", "australian_electricity_demand", "car_parts_without_missing"
        ],
        "validation": [
            "cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate"
        ],
        "test": [
            "airpassengers", "australian_electricity_demand",
            "cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate"
        ],

        "augmentation": {
            "enabled": false,
            "prob": 0.5,
            "transforms": {
                "freq_mask": {
                    "weight": 0.0,
                    "options": {
                        "rate": 0.01
                    }
                },
                "freq_mix": {
                    "weight": 0.0,
                    "options": {
                        "rate": 0.01
                    }
                },
                "permutation": {
                    "weight": 0.0,
                    "options": {
                        "max_segments": 7,
                        "seg_mode": "random"
                    }
                },
                "rotation": {
                    "weight": 0.0
                },
                "magnitude_warp": {
                    "weight": 0.0,
                    "options": {
                        "sigma": 0.7,
                        "knot": 4
                    }
                },
                "time_warp": {
                    "weight": 0.0,
                    "options": {
                        "sigma": 0.7,
                        "knot": 4
                    }
                },
                "window_slice": {
                    "weight": 0.0,
                    "options": {
                        "reduce_ratio": 0.7
                    }
                },
                "window_warp": {
                    "weight": 1.0,
                    "options": {
                        "window_ratio": 0.2,
                        "scales": [0.5, 2.0]
                    }
                }
            }
        }
    }
}
```

### Model input
The dataloaders sample windows of length **context_length** + **prediction_length** from the dataset time series and pseudo-shuffle the samples with a buffer of size **shuffle_buffer_length**. **data_seed** sets the seed for the dataloaders. Unobserved values in the datasets are replaced with **padding_value**. During training of the autoregressive model, values in the context and prediction windows are treated the same; however, the input time series values are normalized by the **scaling** scaler, which is fitted on the context window only.
During inference, the input must be at least **context_length** long.
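
A minimal sketch of this sampling and scaling logic (a simplified NumPy illustration under assumed semantics; the actual pipeline is built from GluonTS transformations):

```python
import numpy as np


def sample_window(series, context_length, prediction_length, padding_value=0.0, rng=None):
    """Sample one training window of length context_length + prediction_length."""
    rng = rng or np.random.default_rng()
    total = context_length + prediction_length
    series = np.nan_to_num(np.asarray(series, dtype=float), nan=padding_value)  # replace unobserved values
    start = rng.integers(0, max(len(series) - total, 0) + 1)
    window = series[start:start + total]
    if len(window) < total:                                  # left-pad short series
        window = np.concatenate([np.full(total - len(window), padding_value), window])
    context = window[:context_length]
    mean, std = context.mean(), context.std() + 1e-8         # "std" scaler fitted on the context only
    return (window - mean) / std                             # whole window normalized with context statistics
```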

### Datasets
The **datasets** option sets the lists of training, validation, and test datasets from the GluonTS library, together with the list of possible augmentations (see below).
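
The dataset names refer to entries in the GluonTS dataset repository; for example (standalone GluonTS usage outside the training pipeline, assuming the `gluonts` package is installed):

```python
from gluonts.dataset.repository.datasets import get_dataset

# Each name in the lists above refers to a dataset in the GluonTS repository.
ds = get_dataset("electricity")            # downloaded and cached on first use
entry = next(iter(ds.train))               # one series: a dict with "start" and "target"
print(ds.metadata.freq, len(entry["target"]))
```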

### Augmentation
Time series augmentation is configured with the **augmentation** option. **prob** defines the probability of applying augmentation, and the probability of each individual transform is weighted by its **weight** option.
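
A sketch of one plausible reading of these options (assumed semantics for illustration, not the actual sampler): with probability **prob** a single transform is drawn with chances proportional to the **weight** values and applied with its **options**.

```python
import numpy as np


def maybe_augment(window, transforms, prob, rng=None):
    """transforms: {name: {"weight": w, "options": {...}, "fn": callable}} (hypothetical layout)."""
    rng = rng or np.random.default_rng()
    if rng.random() >= prob:
        return window                                        # leave this sample unchanged
    names = list(transforms)
    weights = np.array([transforms[n]["weight"] for n in names], dtype=float)
    if weights.sum() == 0:
        return window                                        # all transforms disabled
    chosen = rng.choice(names, p=weights / weights.sum())    # pick one transform by weight
    cfg = transforms[chosen]
    return cfg["fn"](window, **cfg.get("options", {}))       # e.g. window_warp(window, window_ratio=0.2, ...)
```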

### Inference

Use ***generate-times.py*** to predict time series for the GluonTS datasets in the **test** list. Specify the number of inference batches with **num_test_batches**. Results are saved to a file in the [zarr](https://zarr.readthedocs.io/en/stable/) format; each data-parallel partition writes to a separate group of the zarr file. Each group contains *ground_truth*, *past_target*, and *output* arrays: *past_target* holds the context windows, *ground_truth* holds the ground truth for the future window, and *output* holds the model output for the future window. Please see ***print_zarr.py*** in the [tools](/tools/) folder to plot series from a zarr file and save the figures to PDF.
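
A minimal sketch of inspecting the results file (the per-rank group layout is assumed from the description above):

```python
import zarr

root = zarr.open("output.zarr", mode="r")
print(root.tree())                             # one group per data-parallel partition

for name, group in root.groups():
    past = group["past_target"][:]             # context windows
    truth = group["ground_truth"][:]           # ground truth for the future window
    pred = group["output"][:]                  # model output for the future window
    print(name, past.shape, truth.shape, pred.shape)
```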


# GPT-NeoX
# README from GPT-NeoX

This repository records [EleutherAI](https://www.eleuther.ai)'s library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's [Megatron Language Model](https://github.com/NVIDIA/Megatron-LM) and has been augmented with techniques from [DeepSpeed](https://www.deepspeed.ai) as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.

217 changes: 217 additions & 0 deletions configs/times_configs/49M.yml
@@ -0,0 +1,217 @@
{
#"save": "save",
#"load": "save",
#"checkpoint_factor": 1000,
#"extra_save_iters": [10, 20, 30],
#"keep_last_n_checkpoints": 3,
#"checkpoint-scale": "linear",

"gradient_accumulation_steps": 1,

"checkpoint": {
"tag_validation":"Warn",
"load_universal":false,
"use_node_local_storage":false,
"parallel_write": {
"pipeline_stage": false
},
},

# For TFLOPS calculation
"seq_length": 1040,


"num_gpus": 2,
# parallelism settings
"pipe_parallel_size": 2,
"model_parallel_size": 1,

"times_args": {
"context_length": 1024,
"prediction_length": 10,
"scaling": "std",
"shuffle_buffer_length": 1000,
"padding_value": 0,
"data_seed": 10,

"inference": {
"num_test_batches": 1,
"file_name": "output.zarr",
"chunk_size": 128
},

"datasets":{
"train": [
"airpassengers", "australian_electricity_demand", "car_parts_without_missing",
"cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate",
"fred_md", "hospital", "kaggle_web_traffic_weekly", "kdd_cup_2018_without_missing",
"london_smart_meters_without_missing", "nn5_daily_with_missing", "nn5_weekly", "pedestrian_counts",
"rideshare_without_missing", "saugeenday", "solar-energy", "solar_10_minutes", "solar_weekly", "taxi_30min",
"temperature_rain_without_missing", "tourism_monthly", "uber_tlc_daily", "uber_tlc_hourly", "vehicle_trips_without_missing",
"weather", "wiki-rolling_nips", "m4_daily", "m4_hourly", "m4_monthly", "m4_quarterly", "m4_yearly", "wind_farms_without_missing"
],
"validation": [
"airpassengers", "australian_electricity_demand", "car_parts_without_missing",
"cif_2016", "covid_deaths", "electricity", "electricity_weekly", "exchange_rate",
"fred_md", "hospital", "kaggle_web_traffic_weekly", "kdd_cup_2018_without_missing",
"london_smart_meters_without_missing", "nn5_daily_with_missing", "nn5_weekly", "pedestrian_counts",
"rideshare_without_missing", "saugeenday", "solar-energy", "solar_10_minutes", "solar_weekly", "taxi_30min",
"temperature_rain_without_missing", "tourism_monthly", "uber_tlc_daily", "uber_tlc_hourly", "vehicle_trips_without_missing",
"weather", "wiki-rolling_nips", "m4_daily", "m4_hourly", "m4_monthly", "m4_quarterly", "m4_yearly", "wind_farms_without_missing"
],
"test":[
"airpassengers", "australian_electricity_demand",
],

"augmentation": {
"enabled": true,
"prob": 0.3,
"transforms": {
"freq_mask": {
"weight": 1.0,
"options": {
"rate": 0.01
}
},
"freq_mix": {
"weight": 1.0,
"options": {
"rate": 0.01
}
},
"permutation": {
"weight": 1.0,
"options": {
"max_segments": 7,
"seg_mode": "random"
}
},
"rotation": {
"weight": 1.0
},
"magnitude_warp": {
"weight": 1.0,
"options": {
"sigma": 0.7,
"knot": 4
}
},
"time_warp": {
"weight": 1.0,
"options": {
"sigma": 0.7,
"knot": 4
}
},
"window_slice": {
"weight": 1.0,
"options": {
"reduce_ratio": 0.7,
}
},
"window_warp": {
"weight": 1.0,
"options": {
"window_ratio": 0.2,
"scales": [0.5, 2.0],
}
}
}
},


}
},

# model settings
"num_layers": 10,
"hidden_size": 640,
"num_attention_heads": 10,
"max_position_embeddings": 2048,
"pos_emb": "rotary",
"rotary_pct": 0.25,
"gpt_j_residual": true,
"output_layer_parallelism": "column",

# these should provide some speedup but takes a while to build, set to true if desired
"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

# init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0008,
"betas": [0.9, 0.95],
"eps": 1.0e-8,
}
},
"min_lr": 0.00008,

# for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
"stage": 1,
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
},

"csv_monitor": {
"enabled": true,
"output_path": "logs",
"job_name": "debug_run",
},

# batch / data settings
"train_micro_batch_size_per_gpu": 32,
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

# activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

# regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

"precision": "fp32",

# precision settings
"fp16": {
"fp16": false,
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1,
},

# misc. training settings
"train_iters": 143000,
"lr_decay_iters": 143000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
#"eval_interval": 100000,
"eval_interval": 30,
"eval_iters": 10,

# logging
"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,
}
2 changes: 1 addition & 1 deletion deepy.py
@@ -33,7 +33,7 @@ def main():
    if wandb_token is not None:
        deepspeed.launcher.runner.EXPORT_ENVS.append("WANDB_API_KEY")
        os.environ["WANDB_API_KEY"] = wandb_token

    deepspeed.launcher.runner.main(deepspeed_main_args)


7 changes: 7 additions & 0 deletions generate-times.py
@@ -0,0 +1,7 @@
from megatron.laggpt.inference import inference, initialize
from torch.distributed import barrier

neox_args, model, times_envelope, data_iterator = initialize()
inference(neox_args, model, times_envelope, data_iterator)

barrier()
File renamed without changes.
1 change: 1 addition & 0 deletions images/TimesNeoX.svg
20 changes: 20 additions & 0 deletions inference_script.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# Runs Times-NeoX inference with the 49M config

# async I/O flags
export LDFLAGS="$LDFLAGS -L/usr/lib64/"
export CFLAGS="$CFLAGS -I/usr/include/"
# c++ libs
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/vgurev/.conda/envs/GPT/x86_64-conda-linux-gnu/lib/
export PATH=/data/vgurev/.conda/envs/GPT/bin/:$PATH

# use mpirun, not the pytorch launcher
export MPI=TRUE

GPUS_PER_NODE=2
NNODES=1
export WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

python ./deepy.py generate-times.py 49M.yml
