git clone --recursive https://github.com/Eugene29/Megatron-DeepSpeed_ViT.git ## Clone module + submodule
cd Megatron-DeepSpeed_ViT
git submodule update --init --recursive ## Init & Update submodule
Only the base environment is needed on the Polaris cluster, while on Aurora we employ Sam's ezpz library. A suitable virtual environment (on the Flare file system) is activated automatically on Aurora.
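For reference, a minimal sketch of activating the base environment on Polaris (module names are assumptions based on the standard ALCF setup):
module load conda ## Load the cluster's conda module (assumed default on Polaris)
conda activate base ## The base environment is sufficient on Polaris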
The main entry scripts are mult_mds_aurora.sh and mult_mds_polaris.sh. You'll need to modify SCRIPT_DIR, which is essentially your working directory. You can take a deeper look at the other fixed ENV and MDS-related variables inside mult_launch.sh. The runtime variables below can be mixed and matched; a combined example follows the list.
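For example, a minimal smoke test might look like this (a sketch, assuming the variables documented below are picked up from the environment by the launch script):
DATA=TOY MBS=1 NUM_ITERS=10 bash mult_mds_polaris.sh ## quick toy-sized run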
USP_ulysses=1, SP= ## Turn on USP's Ulysses. Set the degree separately via SP=_
USP_ring=1, SP= ## Turn on USP's Ring attention. Set the degree separately via SP=_
USP_hybrid=(2,4) ## TBD
SIZE=int ## Number of GPUs (ONLY WORKS ON 1 NODE)
drop_last_batch_with_GBS=1 ## Fixes the data order as long as GBS matches
DATA={TOY, CIFAR} ## Choose the toy or CIFAR dataset
factor=int ## Ratio of image_dim to patch_dim; controls the sequence length
PROFILE={0,1} ## Enable the PyTorch profiler. The trace is saved in your LOG_DIR.
GBS=int ## Global batch size
MBS=int ## Micro batch size
POS_ENCODING={0,1} ## Use positional encoding instead of positional embedding
WANDB_MODE=disabled ## Disable WANDB
GLOBAL_MEAN_POOLING=1 ## Use global mean pooling instead of the classification (CLS) token
NUM_ITERS=int ## Number of training iterations
FA={0,1} ## Enable Flash Attention
ZERO={0,1,2,3} ## DeepSpeed ZeRO stage (0 by default)
ACT_CKPT={0,1} ## Enable activation checkpointing
VIT3D={0,1} ## Switch to 3D ViT. Must use the toy dataset (for now).
VIT=string ## ViT model size. Refer to mds_launch.sh for possible models.
TPSP={0,1} ## Upgrade from TP to TP-SP
LOG_RESULTS={0,1} ## Log results (TFLOPS, memory footprint, samples/sec) to a JSON file
MICS_SHARD_SIZE=int ## Size of your MiCS partition group
fp16=1 ## Enable fp16
bf16=1 ## Enable bf16
LOG_COMMS=1 ## Log/profile communication through DeepSpeed
PROF_FLOPS=1 ## Profile FLOP counts in detail through DeepSpeed
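As a combined example, a Ulysses run on the toy dataset might look like the following (a sketch; the variable names come from the list above, but the specific values are illustrative assumptions):
USP_ulysses=1 SP=4 FA=1 ZERO=1 MBS=2 NUM_ITERS=100 \
DATA=TOY factor=4 GLOBAL_MEAN_POOLING=1 \
bash mult_mds_aurora.sh ## or mult_mds_polaris.sh on Polaris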
################################ Notes ################################
1. Pass either GBS or MBS
2. Pass either fp16 or bf16
3. ZeRO stages 1, 2, and 3 produce a different loss than ZERO=0 (also observed with LLMs). Whether convergence is impacted needs to be tested.