Releases: NVIDIA/NeMo
NVIDIA Neural Modules 2.2.0rc2
Prerelease: NVIDIA Neural Modules 2.2.0rc2 (2025-02-17)
NVIDIA Neural Modules 2.2.0rc1
Prerelease: NVIDIA Neural Modules 2.2.0rc1 (2025-02-04)
NVIDIA Neural Modules 2.2.0rc0
Prerelease: NVIDIA Neural Modules 2.2.0rc0 (2025-02-02)
NVIDIA Neural Modules 2.1.0
Highlights
- Training
  - Fault Tolerance
    - Straggler Detection
    - Auto Relaunch
- LLM & MM
  - MM models
    - Llava-next
    - Llama 3.2
    - Sequence Model Parallel for NeVa
    - Enable Energon
    - SigLIP (NeMo 1.0 only)
  - LLM 2.0 migration
    - Starcoder2
    - Gemma 2
    - T5
    - Baichuan
    - BERT
    - Mamba
    - ChatGLM
  - DoRA support
- Export
  - NeMo 2.0 base model export path for NIM
  - PTQ in NeMo 2.0
- ASR
  - Timestamps with TDT decoder
  - Timestamps option with .transcribe() (see the sketch after this list)
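A minimal sketch of the timestamp option named above, assuming a local `audio.wav` and the `parakeet-tdt_ctc-110m` checkpoint added in this release; the exact output fields follow the current ASR docs and may shift between versions.

```python
# Hedged sketch: word/segment timestamps via .transcribe() (audio.wav is a placeholder file).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-110m")

# timestamps=True asks the decoder (including TDT) to return timing info alongside the text
hypotheses = asr_model.transcribe(["audio.wav"], timestamps=True)

hyp = hypotheses[0]
print(hyp.text)                  # transcription
print(hyp.timestamp["word"])     # per-word start/end offsets
print(hyp.timestamp["segment"])  # per-segment offsets
```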
Detailed Changelogs:
ASR
Changelog
- [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980
- Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994
- TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875
- ci: Switch to CPU only runner by @ko3n1g :: PR: #11035
- Fix timestamps tests by @monica-sekoyan :: PR: #11053
- ci: Pin release freeze by @ko3n1g :: PR: #11143
- Fix RNN-T loss memory usage by @artbataev :: PR: #11144
- Added deprecation notice by @Ssofja :: PR: #11133
- Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184
- add ipython import guard by @nithinraok :: PR: #11191
- Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206
- update the return type by @nithinraok :: PR: #11210
- Timestamps to transcribe by @nithinraok :: PR: #10950
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- update hypothesis when passed through cfg by @nithinraok :: PR: #11366
- Revert "update hypothesis when passed through cfg" by @pablo-garay :: PR: #11373
- Fix transcribe speech by @nithinraok :: PR: #11379
- Lhotse support for transcribe_speech_parallel by @nune-tadevosyan :: PR: #11249
- Sortformer Diarizer 4spk v1 model PR Part 1: models, modules and dataloaders by @tango4j :: PR: #11282
- Removing unnecessary lines by @nune-tadevosyan :: PR: #11408
- Support for initializing lhotse shar dataloader via field: list[path] mapping by @pzelasko :: PR: #11460
- New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations by @pzelasko :: PR: #11058
- Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom by @weiqingw4ng :: PR: #11636
TTS
Changelog
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Add T5TTS by @blisc :: PR: #11193
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- Add nvidia/low-frame-rate-speech-codec-22khz model on docs by @Edresson :: PR: #11457
NLP / NMT
Changelog
- Move collectiob.nlp imports inline for t5 by @marcromeyn :: PR: #10877
- Use a context-manager when opening files by @akoumpa :: PR: #10895
- Packed sequence bug fixes by @cuichenx :: PR: #10898
- ckpt convert bug fixes by @dimapihtar :: PR: #10878
- remove deprecated ci tests by @dimapihtar :: PR: #10922
- Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972
- Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962
- gpt3 175b cli by @malay-nagda :: PR: #10985
- Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920
- Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` by @ashors1 :: PR: #11016
- add deprecation note by @dimapihtar :: PR: #11024
- Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902
- switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948
- NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002
- Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037
- Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100
- Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109
- TP+MoE peft fix by @akoumpa :: PR: #11114
- GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119
- Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128
- update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138
- Call `ckpt_to_weights_subdir` from `MegatronCheckpointIO` by @ashors1 :: PR: #10897
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
- Use MegatronDataSampler in HfDatasetDataModule by @akoumpa :: PR: #11274
- Add T5TTS by @blisc :: PR: #11193
- ci: Exclude CPU machines from scan by @ko3n1g :: PR: #11300
- Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
- remove redundant docs by @sharathts :: PR: #11302
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Add `attention_bias` argument in transformer block and transformer layer modules, addressing change in MCore by @yaoyu-33 :: PR: #11289
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- Update T5 attention-mask shapes to be compatible with all attention-backend in new TE versions by @huvunvidia :: PR: #11059
- Add support for restoring from 2.0 checkpoint in 1.0 by @hemildesai :: PR: #11347
- Fix Gemma2 Attention Args by @suiyoubi :: PR: #11365
- mlm conversion & tiktokenizer support by @dimapihtar :: PR: #11349
- [Nemo1] Generate sharded optimizer state dicts only if needed for saving by @ananthsub :: PR: #11451
- add hindi tn/itn coverage by @mgrafu :: PR: #11382
- chore(beep boop 🤖): Bump `MCORE_TAG=67a50f2...` (2024-11-28) by @ko3n1g :: PR: #11427
- Handle exception when importing RetroGPTChunkDatasets by @guyueh1 :: PR: #11415
- Update restore from config for gpt type continual training in NeMo1 by @yaoyu-33 :: PR: #11471
- ci: Re-enable `L2_Megatron_LM_To_NeMo_Conversion` by @ko3n1g :: PR: #11484
- Apply packed sequence params change for fused rope compatibility by @ananthsub :: PR: #11506
- Huvu/tiktoken tokenizer update by @huvunvidia :: PR: #11494
Text Normalization / Inverse Text Normalization
Changelog
- Adding support for LightningDataModule inside Fabric-API by @marcromeyn :: PR: #10879
- Add registry to register all needed classes with artifacts in nemo.lightning.io by @hemildesai :: PR: #10861
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- add hindi tn/itn coverage by @mgrafu :: PR: #11382
Export
Changelog
- Update engine build step for TRT-LLM 0.13.0 by @janekl :: PR: #10880
- Nemo 2.0 ckpt support in TRT-LLM export by @oyilmaz-nvidia :: PR: #10891
- Fix TRTLLM parallel_embedding by @meatybobby :: PR: #10975
- Export & deploy updates (part I) by @janekl :: PR: #10941
- Add doc-strings to import & export + improve logging by @marcromeyn :: PR: #11078
- NeMo-UX: fix nemo-ux export path by @akoumpa :: PR: #11081
- Fix TRTLLM nemo2 activation parsing by @meatybobby :: PR: #11062
- Support exporting Nemotron-340B for TensorRT-LLM by @jinyangyuan-nvidia :: PR: #11015
- vLLM Hugging Face exporter by @oyilmaz-nvidia :: PR: #11124
- Fix export of configuration parameters to Weights and Biases by @soluwalana :: PR: #10995
- Change activation parsing in TRTLLM by @meatybobby :: PR: #11173
- Remove builder_opt param from trtllm-build for TensorRT-LLM >= 0.14.0 by @janekl :: PR: #11259
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
- fix(export): update API for disabling device reassignment in TRTLLM for Aligner by @terrykong :: PR: #10863
- Add openai-gelu in gated activation for TRTLLM export by @meatybobby :: PR: #11293
- Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
- Adding alinger export by @shanmugamr1992 :: PR: #11269
- Export & deploy updates (part II) by @janekl :: PR: #11344
- Introducing TensorRT lazy export and caching option with trt_compile() by @borisfom :: PR: #11266
- fix: export converts properly if no model_prefix by @terrykong :: PR: #11477
Bugfixes
Changelog
- Change default ckpt name by @maanug-nv :: PR: #11277
- Fix patching of NeMo tokenizers for correct Lambada evaluation by @janekl :: PR: #11326
Uncategorized:
Changelog
- ci: Use Slack group by @ko3n1g :: PR: #10866
- Bump `Dockerfile.ci` (2024-10-14) by @ko3n1g :: PR: #10871
- Fix peft resume by @cuichenx :: PR: #10887
- call post_init after altering config values by @akoumpa :: PR: #10885
- Late import prettytable by @maanug-nv :: PR: #10912
- Bump `Dockerfile.ci` (2024-10-17) by @ko3n1g :: PR: #10919
- Warning for missing FP8 checkpoint support for vLLM deployment by @janekl :: PR: #10906
- Fix artifact saving by @hemildesai :: PR: #10914
- Lora improvement by @cuichenx :: PR: #10918
- Huvu/t5 nemo2.0 peft by @huvunvidia :: PR: #10916
- perf recipes and Mcore DistOpt params by @malay-nagda :: PR: #10883
- ci: Fix cherry pick team by @ko3n1g :: PR: #10945
- Fix requirements for MacOS by @artbataev :: PR: #10930
- Fix nemo 2.0 recipes by @BoxiangW :: PR: #10915
- Akoumparouli/nemo ux fix dir or string artifact by @akoumpa :: PR: #10936
- Fix typo in docstring by @ashors1 :: PR: #10955
- [Nemo CICD] Remove deprecated tests by @pablo-garay :: PR: #10960
- Restore NeMo 2.0 T5 pretraining CICD test by @huvunvidia :: PR: #10952
- Convert perf plugin env vars to strings by @hemildesai :: PR: #10947
- disable ...
NVIDIA Neural Modules 2.1.0rc2
Prerelease: NVIDIA Neural Modules 2.1.0rc2 (2024-12-21)
NVIDIA Neural Modules 2.1.0rc1
Prerelease: NVIDIA Neural Modules 2.1.0rc1 (2024-12-20)
NVIDIA Neural Modules 2.1.0rc0
[🤠]: Howdy folks, let's release NeMo `r2.1.0`! (#11556)
NVIDIA Neural Modules 2.0.0
Highlights
Large language models & Multi modal
- Training
  - Long context recipe
  - PyTorch Native FSDP 1
- Models
  - Llama 3
  - Mixtral
  - Nemotron
- NeMo 1.0
Export
- TensorRT-LLM v0.12 integration (see the export sketch after this list)
- LoRA support for vLLM
- FP8 checkpoint
ASR
- Parakeet large (ASR with PnC model)
- Added Uzbek offline and Gregorian streaming models
- Optimization feature for efficient bucketing to improve batch-size utilization on GPUs
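A hedged sketch of the TensorRT-LLM export path referenced in the Export highlights above; the checkpoint path, engine directory, and model type are placeholders, and keyword names can differ between NeMo and TRT-LLM versions.

```python
# Hedged sketch: export a .nemo checkpoint to TensorRT-LLM engines (paths are placeholders).
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engines")   # engines get written here
exporter.export(
    nemo_checkpoint_path="/models/llama3-8b.nemo",        # placeholder checkpoint path
    model_type="llama",
)

# Quick smoke test of the freshly built engines (generation settings left at their defaults,
# since their argument names vary by version)
print(exporter.forward(["NVIDIA NeMo is"]))
```

The resulting engine directory can then be served in-framework, for example through the PyTriton deployment path sketched under the 2.0.0rc1 highlights below.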
Detailed Changelogs
ASR
Changelog
- add parakeet-tdt_ctc-110m model by @nithinraok :: PR: #10461
- fix asr finetune by @stevehuang52 :: PR: #10508
- replace unbiased with correction by @nithinraok :: PR: #10555
- Update Multi_Task_Adapters.ipynb by @pzelasko :: PR: #10600
- Fix asr warnings by @nithinraok :: PR: #10469
- Fix typo in ASR RNNT BPE model by @pzelasko :: PR: #10742
- TestEncDecMultiTaskModel for canary parallel by @karpnv :: PR: #10740
- fix chunked infer by @stevehuang52 :: PR: #10581
- training code for hybrid-autoregressive inference model by @hainan-xv :: PR: #10841
- remove stacking operation from batched functions by @lilithgrigoryan :: PR: #10524
- Add lhotse fixes for rnnt model training and WER hanging issue with f… by @nithinraok :: PR: #10821
- Fix ASR tests by @artbataev :: PR: #10794
- [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980
- [WIP] Add docs for NEST SSL by @stevehuang52 :: PR: #10804
- Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994
- TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875
- ci: Switch to CPU only runner by @ko3n1g :: PR: #11035
- Fix timestamps tests by @monica-sekoyan :: PR: #11053
- ci: Pin release freeze by @ko3n1g :: PR: #11143
- Fix RNN-T loss memory usage by @artbataev :: PR: #11144
- Added deprecation notice by @Ssofja :: PR: #11133
- Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184
- add ipython import guard by @nithinraok :: PR: #11191
- Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206
- update the return type by @nithinraok :: PR: #11210
- Timestamps to transcribe by @nithinraok :: PR: #10950
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903
TTS
Changelog
- Fix asr warnings by @nithinraok :: PR: #10469
- Make nemo text processing optional in TTS by @blisc :: PR: #10584
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
NLP / NMT
Changelog
- MCORE interface for TP-only FP8 AMAX reduction by @erhoo82 :: PR: #10437
- Remove Apex dependency if not using MixedFusedLayerNorm by @cuichenx :: PR: #10468
- Add missing import guards for causal_conv1d and mamba_ssm dependencies by @janekl :: PR: #10429
- Update doc for fp8 trt-llm export by @Laplasjan107 :: PR: #10444
- Remove running validating after finetuning by @huvunvidia :: PR: #10560
- Extending modelopt spec for TEDotProductAttention by @janekl :: PR: #10523
- Fix mb_calculator import in lora tutorial by @BoxiangW :: PR: #10624
- .nemo conversion bug fix by @dimapihtar :: PR: #10598
- Require setuptools>=70 and update deprecated api by @thomasdhc :: PR: #10659
- Akoumparouli/fix get tokenizer list by @akoumpa :: PR: #10596
- [McoreDistOptim] fix the naming to match apex.dist by @gdengk :: PR: #10707
- [fix] Ensures disabling exp_manager with exp_manager=null does not error by @terrykong :: PR: #10651
- [feat] Update get_model_parallel_src_rank to support tp-pp-dp ordering by @terrykong :: PR: #10652
- feat: Migrate GPTSession refit path in Nemo export to ModelRunner for Aligner by @terrykong :: PR: #10654
- [MCoreDistOptim] Add assertions for McoreDistOptim and fix fp8 arg specs by @gdengk :: PR: #10748
- Fix for crashes with tensorboard_logger=false and VP + LoRA by @vysarge :: PR: #10792
- Adding init_model_parallel to FabricMegatronStrategy by @marcromeyn :: PR: #10733
- Moving steps to MegatronParallel to improve UX for Fabric by @marcromeyn :: PR: #10732
- Adding setup_megatron_optimizer to FabricMegatronStrategy by @marcromeyn :: PR: #10833
- Make FabricMegatronMixedPrecision match MegatronMixedPrecision by @marcromeyn :: PR: #10835
- Fix VPP bug in MegatronStep by @marcromeyn :: PR: #10847
- Expose drop_last in MegatronDataSampler by @farhadrgh :: PR: #10837
- Move collectiob.nlp imports inline for t5 by @marcromeyn :: PR: #10877
- Use a context-manager when opening files by @akoumpa :: PR: #10895
- ckpt convert bug fixes by @dimapihtar :: PR: #10878
- remove deprecated ci tests by @dimapihtar :: PR: #10922
- Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972
- Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962
- gpt3 175b cli by @malay-nagda :: PR: #10985
- Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920
- Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` by @ashors1 :: PR: #11016
- add deprecation note by @dimapihtar :: PR: #11024
- Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902
- switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948
- NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002
- Update mcore parallelism initialization in nemo2 by @yaoyu-33 :: PR: #10643
- Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037
- Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100
- Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109
- TP+MoE peft fix by @akoumpa :: PR: #11114
- GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119
- Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128
- update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138
- Call `ckpt_to_weights_subdir` from `MegatronCheckpointIO` by @ashors1 :: PR: #10897
- fix typo by @dimapihtar :: PR: #11234
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
NVIDIA Neural Modules 2.0.0rc1
Highlights
Large language models
- PEFT: QLoRA support, LoRA/QLoRA for Mixture-of-Experts (MoE) dense layer
- State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
- Support Nemotron, Minitron, Gemma2, Qwen, RAG
- Custom Tokenizer training in NeMo
- Update the Auto-Configurator for EP, CP and FSDP
Multimodal
- NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP)
- Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA
ASR
- SpeechLM and SALM
- Adapters for Canary Customization
- Pytorch allocator in PyTorch 2.2 improves training speed up to 30% for all ASR models
- Cuda Graphs for Transducer Inference
- Replaced webdataset with Lhotse - gives up to 2x speedup
- Transcription Improvements - Speedup and QoL Changes
- ASR Prompt Formatter for multimodal Canary
Export & Deploy
- In-framework PyTriton deployment with backends (see the deployment sketch after this list):
  - PyTorch
  - vLLM
  - TRT-LLM update to 0.10
- TRT-LLM C++ runtime
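A hedged sketch of the in-framework PyTriton deployment named above, serving a previously exported TensorRT-LLM engine; the Triton model name, port, and engine directory are placeholders and constructor arguments may vary by release.

```python
# Hedged sketch: serve an exported TensorRT-LLM engine in-framework via PyTriton.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engines")  # reuse engines built earlier

nm = DeployPyTriton(
    model=exporter,
    triton_model_name="llama",  # placeholder Triton model name
    port=8000,
)
nm.deploy()  # register the model with the Triton server
nm.serve()   # block and serve inference requests
```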
Detailed Changelogs
ASR
Changelog
- Support dataloader as input to `audio` for transcription by @titu1994 :: PR: #9201
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- Fix Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9251
- Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. by @galv :: PR: #9347
- Revert "Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer." by @titu1994 :: PR: #9351
- Prompt formatter API and canary transcribe tensor input support by @pzelasko :: PR: #9206
- Fix prompt formatter's defaults=None case in multi-task model by @pzelasko :: PR: #9366
- move AED chunked infer script by @stevehuang52 :: PR: #9367
- Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. by @galv :: PR: #9198
- ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_C… by @ko3n1g :: PR: #9399
- Fix logging message for ASR by @titu1994 :: PR: #9469
- Add support to change Multi task model prompt by @titu1994 :: PR: #9542
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- Audio model collection by @anteju :: PR: #9263
- TitaNet Batch Verify Speaker by @monica-sekoyan :: PR: #9337
- Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- Canary Adapters tutorial (#9670) by @nithinraok :: PR: #9777
- typos and branch name update to r2.0.0rc1 by @nithinraok :: PR: #9846
- Fix RNNT alignments test by @artbataev :: PR: #9770
- By default trust remote code from HF Datasets by @nithinraok :: PR: #9886
- Temporarily disable cuda graph based RNN-T greedy inference for r2.0.0rc1 by @galv :: PR: #9904
- Enable CUDA graphs by default, but require CUDA 12.6 for full graphs by @artbataev :: PR: #9919
- update branch name for script by @nithinraok :: PR: #9936
- updte branch by @nithinraok :: PR: #9942
TTS
Changelog
LLM/Multimodal
Changelog
- Update nemo.export module for quantized models by @janekl :: PR: #9218
- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
- Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
- Revert rope fusion defaults by @cuichenx :: PR: #9237
- fix import by @akoumpa :: PR: #9240
- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
- sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262
- Alit/bert convert fix by @JRD971000 :: PR: #9285
- conv1d stable version by @JRD971000 :: PR: #9330
- Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293
- Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341
- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
- Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335
- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
- support null/None truncation field by @arendu :: PR: #9355
- NeVa token fusion by @paul-gibbons :: PR: #9245
- bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356
- Re-org export code by @oyilmaz-nvidia :: PR: #9353
- QLoRA by @cuichenx :: PR: #9340
- PeFT fix for distOpt by @akoumpa :: PR: #9392
- [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387
- cherry pick of #9266 by @dimapihtar :: PR: #9411
- Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423
- add support for new mcore ds features by @dimapihtar :: PR: #9388
- LoRA for MoE Layer by @cuichenx :: PR: #9396
- Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222
- Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414
- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
- In-framework deployment by @oyilmaz-nvidia :: PR: #9438
- Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458
- Hyena Operator by @guyjacob :: PR: #9264
- Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276
- move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382
- Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478
- Fix unwrap model by @cuichenx :: PR: #9480
- fix operator precedence by @akoumpa :: PR: #9403
- [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525
- update mcoreddp call by @akoumpa :: PR: #9345
- mcore distOpt restore fix by @akoumpa :: PR: #9421
- vLLM Export Support by @apanteleev :: PR: #9381
- PL: Delete precision if using plugin. TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535
- extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532
- fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530
- add reset learning rate functionality by @dimapihtar :: PR: #9372
- Use closed-formula to round by multiple by @akoumpa :: PR: #9307
- GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
- Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- PTQ refinements by @janekl :: PR: #9574
- Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326
- Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445
- Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455
- Make QLoRA more model-agnostic by @cuichenx :: PR: #9488
- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
- [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599
- Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354
- [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490
- Alit/mamba tmp by @JRD971000 :: PR: #9612
- Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505
- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
- fix ckpt load bug by @dimapihtar :: PR: #9621
- Alit/mamba by @JRD971000 :: PR: #9575
- Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622
- MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432
- [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592
- Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594
- Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459
- Adding support for mcore generate by @shanmugamr1992 :: PR: #9566
- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
- [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644
- Parametrize FPS group by @mikolajblaz :: PR: #9648
- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
- add documentation for reset_lr feature by @dimapihtar
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- Cherry pick: LITA Integration by @Slyne :: PR: #9684
- SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654
- Gemma 2 by @cuichenx :: PR: #9672
- Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550
- Alit/r2.0.0 by @JRD971000 :: PR: #9718
- Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737
- In framework export by @oyilmaz-nvidia :: PR: #9658
- T5 changes based on mcore changes by @pablo-garay :: PR: #9829
- [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801
- deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864
- Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837
- make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852
- add rampup bs documentation by @dimapihtar :: PR: #9884
- copy of #9576 by @dimapihtar :: PR: #9986
- Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897
-...
NVIDIA Neural Modules 2.0.0rc0
Highlights
LLM and MM
Models
- Megatron Core RETRO
  - Pre-training
  - Zero-shot Evaluation
- Pretraining, conversion, evaluation, SFT, and PEFT for:
  - Mixtral 8X22B
  - Llama 3
  - SpaceGemma
- Embedding Models Fine Tuning
  - Mistral
  - BERT
- BERT models
  - Context Parallel
  - Distributed checkpoint
- Video capabilities with NeVa
Performance
- Distributed Checkpointing
  - Torch native backend
  - Parallel read/write
  - Async write
- Multimodal LLM (LLAVA/NeVA)
  - Pipeline Parallelism support
  - Sequence packing support
Export
- Integration of Export & Deploy Modules into NeMo Framework container
- Upgrade to TRT-LLM 0.9
Speech (ASR & TTS)
Models
- AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
- Multimodal Domain - Speech LLM supporting SALM Model
- Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
- Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
- mel_codec_22khz_medium
- mel_codec_44khz_medium
Perf Improvements
- Transcribe() upgrade - Enables one-line transcription with files, tensors, and data loaders (see the sketch after this list)
- Frame looping algorithm for RNNT faster decoding - Improves Real Time Factor (RTF) by 2-3x
- Cuda Graphs + Label-Looping algorithm for RNN-T and TDT Decoding - Transducer Greedy decoding at over 1500x RTFx, on par with CTC Non-Autoregressive models
- Semi Sorted Batching support - External User contribution that speeds up training by 15-30%.
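As referenced in the Transcribe() item above, a minimal sketch of the one-line transcription call; the checkpoint matches the Parakeet model listed in these highlights, the file paths are placeholders, and array input assumes 16 kHz mono audio.

```python
# Hedged sketch: one-line transcription from file paths or an in-memory waveform.
import numpy as np
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

# Transcribe a batch of files (placeholder paths)
print(model.transcribe(["sample1.wav", "sample2.wav"], batch_size=2))

# Or pass a raw waveform directly (one second of silence at 16 kHz here); the set of
# accepted input types follows the transcribe() upgrade notes and may vary by version
waveform = np.zeros(16000, dtype=np.float32)
print(model.transcribe(waveform))
```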
Customization
- Context biasing for CTC word stamping - Improve accuracy for custom vocabulary and pronunciation
- Longform Inference
- Longform inference support for AED models
- Transcription of multi-channel audio for AED models
Misc
- Upgraded webdataset - Speech and LLM / Multimodal unified container
Detailed Changelogs
ASR
Changelog
- Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
- TDT confidence fix by @GNroy :: PR: #8982
- Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
- NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
- Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
- Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
- [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
- Add ASR latest news by @titu1994 :: PR: #9073
- Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
- PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
- RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
- Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
- Update branch for notebooks and ci in release by @ericharper :: PR: #9189
- Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
- rename paths2audiofiles to audio by @nithinraok :: PR: #9209
- Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
- Cherrypick: Support dataloader as input to `audio` for transcription (#9201) by @titu1994 :: PR: #9235
- Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
- Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
- Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
- Fix loading github raw images on notebook by @nithinraok :: PR: #9282
- typos by @nithinraok :: PR: #9314
- Re-enable cuda graphs in training modes. by @galv :: PR: #9338
- add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
- Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
- Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
- Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380
TTS
Changelog
LLM and MM
Changelog
- Rachitg/dpa by @rachitgarg91 :: PR: #8911
- Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908
- Huvu/mcore retro by @huvunvidia :: PR: #8861
- fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947
- Fix memory leak at loss func by @minitu :: PR: #8868
- change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965
- Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991
- [NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987
- Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905
- Gemma bug by @cuichenx :: PR: #8962
- [NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995
- Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859
- add geglu to mlp swap by @JRD971000 :: PR: #8999
- add timeout for new_group by @acphile :: PR: #8998
- Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941
- Added fusion for squared relu by @sanandaraj5597 :: PR: #8963
- Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026
- [NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011
- Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022
- docs and simplification of cmd args by @arendu :: PR: #8979
- [NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057
- Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957
- Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070
- unfused lora by @arendu :: PR: #9004
- Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075
- Alit/griffin by @JRD971000 :: PR: #9021
- Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016
- Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095
- HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060
- mcore ds updates by @dimapihtar :: PR: #8951
- Alit/griffin perf by @JRD971000 :: PR: #9107
- Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110
- Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869
- Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118
- Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034
- Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128
- scripts to convert HF lora to nemo by @arendu :: PR: #9102
- Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015
- add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142
- Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088
- Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145
- CUDA memory profile by @erhoo82 :: PR: #9096
- Fix missing func for T5 model by @gdengk :: PR: #9141
- Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125
- Revert rope fusion defaults by @cuichenx :: PR: #9238
- Update nemo.export module for quantized models by @janekl :: PR: #9250
- Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287
- neva media_type + text generation default fix by @paul-gibbons :: PR: #9257
- fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290
- add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208
- Fix P-tuning for Llama based models by @apanteleev :: PR: #9297
- add deprecation warnings by @pablo-garay :: PR: #9266
- move pooler under post_process by @dimapihtar :: PR: #9328
- add deprecation note for nmt by @dimapihtar :: PR: #9342
- Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204
- fix fp16 precision issue by @dimapihtar :: PR: #9376
- Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877
Export
Changelog
- Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873
- Mingyuanm/sdxl export by @Victor49152 :: PR: #8926
- Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866
- Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974
- TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863
General Improvements
Changelog
- Update package info by @ericharper :: PR: #8793
- [Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917
- Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895
- Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942
- Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883
- Update te install for jenkins by @ericharper :: PR: #8954
- [Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959
- Minor quantization...