
Fail to convert Llama3 Nemo 2.0 checkpoint to HF #11256

Closed
EthanLI24 opened this issue Nov 12, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@EthanLI24

Describe the bug

I use NeMo 2.0 to train my model and get a NeMo 2.0 checkpoint like this, with .distcp files:

model_name
├── context
│   ├── model_config.yaml
│   ├── io.json
│   └── tokenizer
└── weights
    ├── distributed checkpointing directories/files in torch_dist format
    ├── metadata.json
    └── common.pt

but I failed to use NeMo/scripts/checkpoint_converters/convert_llama_nemo_to_hf.py to export it to an HF file.

Expected behavior

Get HF files from the NeMo 2.0 checkpoint.

@EthanLI24 EthanLI24 added the bug Something isn't working label Nov 12, 2024
@EthanLI24
Author

Is there a way to save non-.distcp files or Hugging Face .bin files directly after training?
If not, how can I convert a NeMo 2.0 checkpoint to the common community format?

@hemildesai
Collaborator

Hi, the scripts/checkpoint_converters/convert_llama_nemo_to_hf.py script is only for NeMo 1.0 checkpoints. To export 2.0 checkpoints, you can use a custom script like

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/input_ckpt"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )
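
(If it helps: save this as a script, e.g. export.py, and run it with plain python in an environment where NeMo is installed; export_ckpt should write the HF-format files to output_path.)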

@EthanLI24
Author

EthanLI24 commented Nov 14, 2024

Hi, the scripts/checkpoint_converters/convert_llama_nemo_to_hf.py script is only for NeMo 1.0 checkpoints. To export 2.0 checkpoints, you can use a custom script like

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/input_ckpt"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )

Here is my folder structure:

nemo2_llama3
├── context
│   ├── model.yaml
│   ├── io.json
│   └── nemo_tokenizer
└── weights
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── metadata.json
    └── common.pt

and I followed your instructions, using

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("./nemo2_llama3"),
        target="hf",
        output_path=Path("./nemo2exporthf"),
    )

but it failed again. Could you help me with this?

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 15, 2024
@github-actions
Contributor

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Dec 23, 2024
@Zhihan1996

Zhihan1996 commented Feb 10, 2025

Hello @EthanLI24, could you let me know if you solved this issue? I have the same problem with a Mistral model. My model is in the same format as yours, and I used the same scripts. The following is my error message. It seems like a bug in NeMo's official code, as the trainer automatically saves the model.

Hey @hemildesai, could you please help with it?

[rank0]: Traceback (most recent call last):
[rank0]:   File "/pscratch/sd/z/zhihanz/GenomeOcean/export_ckpt.py", line 6, in <module>
[rank0]:     export_ckpt(
[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 663, in export_ckpt
[rank0]:     output = io.export_ckpt(path, target, output_path, overwrite, load_connector)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/api.py", line 229, in export_ckpt
[rank0]:     return exporter(overwrite=overwrite, output_path=_output_path)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/connector.py", line 99, in __call__
[rank0]:     to_return = self.apply(_output_path)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 202, in apply
[rank0]:     target = self.convert_state(source, target)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 220, in convert_state
[rank0]:     return io.apply_transforms(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/state.py", line 180, in apply_transforms
[rank0]:     assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
[rank0]: AssertionError: dtype mismatch between source and target state dicts. Left side is {}, Right side is {'model.embed_tokens.weight': torch.float32, 'model.layers.0.self_attn.q_proj.weight': torch.float32, 'model.layers.0.self_attn.k_proj.weight': torch.float32, 'model.layers.0.self_attn.v_proj.weight': torch.float32, 'model.layers.0.self_attn.o_proj.weight': torch.float32, 'model.layers.0.mlp.gate_proj.weight': torch.float32, 'model.layers ...

@akoumpa
Member

akoumpa commented Feb 11, 2025

Hi @Zhihan1996, thanks for reporting this. We'll have someone push a fix soon. Thank you for your patience.

@Zhihan1996

Thank you @akoumpa. I look forward to it. Please kindly let me know when the fix is pushed.

@aflah02

aflah02 commented Feb 14, 2025

Hi,
I'm also hitting this issue. Are there any fixes we can patch in before the official release?

@akoumpa
Member

akoumpa commented Feb 14, 2025

Hi,

I'm not able to reproduce this, so I would appreciate your help.

Here are the steps and commands/code I used:

  1. Download the Mistral checkpoint:
    akoumparouli@dev:hf_ckpt/$ huggingface-cli download mistralai/Mistral-7B-v0.3 --local-dir mistralai/Mistral-7B-v0.3 --cache-dir mistralai/Mistral-7B-v0.3

  2. Import to NeMo:

from nemo.collections import llm

if __name__ == "__main__":

    config = llm.MistralConfig7B()
    model = llm.MistralModel(config)
    ckpt_path = model.import_ckpt('hf_ckpt/mistralai/Mistral-7B-v0.3')
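
(Note: judging by the path used in step 3 below, import_ckpt seems to place the converted checkpoint under ~/.cache/nemo/models by default.)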

  3. Export to HF:

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/root/.cache/nemo/models/Mistral-7B-v0.3"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )

So I wonder how the checkpoint was generated in your case?

I just realized that while Mistral uses the correct dtype during export/import (e.g., here), it does not have a default dtype in the config. So unless the user specifies one, there's a chance it might not use bfloat16 (if initialized from a config).

I think there are two workarounds here:

  1. Cast _target to bfloat16 before reaching the assert in the export path (see the sketch below).
  2. Comment out the assert in the export path.
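
Here's a minimal sketch of the first workaround from the user side: forcing bf16 through the model config before import, so the exported weights already match. Note that params_dtype and bf16 are assumed here to be accepted by MistralConfig7B (they come from Megatron's TransformerConfig, which the NeMo config builds on); verify against your NeMo version:

import torch

from nemo.collections import llm

if __name__ == "__main__":
    # Force bf16 parameters so the imported/exported state dict and the HF
    # target model agree on dtype. params_dtype/bf16 are assumed fields
    # (inherited from Megatron's TransformerConfig); check your NeMo version.
    config = llm.MistralConfig7B(params_dtype=torch.bfloat16, bf16=True)
    model = llm.MistralModel(config)
    ckpt_path = model.import_ckpt('hf_ckpt/mistralai/Mistral-7B-v0.3')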

Edit: updating the default configs here.
Edit: I'm using ToT (top of tree).

@EthanLI24
Author

@Zhihan1996 Hi Zhihan,
That's because there is a dtype mismatch between your NeMo model and the HF target model, and dtype extraction is done in mixin.py.
There are two solutions:

  1. Manually convert your NeMo or HF checkpoint to the same format, e.g. bf16 or fp32.
  2. Modify your target HF model's config.json so its tensor dtype matches your NeMo config (see the sketch below).

Or maybe you can try modifying your io.json too. I think it always fails to reload the dtype in the HFExporter class, but I don't know why (I tried to modify the code there, but it did not work).
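
For solution 2, here's a minimal sketch, assuming your target HF checkpoint already has a config.json on disk with the standard torch_dtype field (the path is the one from my earlier attempt and is illustrative):

import json
from pathlib import Path

# Illustrative path from my earlier export attempt; adjust to your own folder.
cfg_path = Path("./nemo2exporthf/config.json")

cfg = json.loads(cfg_path.read_text())
cfg["torch_dtype"] = "bfloat16"  # set to whatever dtype your NeMo config uses
cfg_path.write_text(json.dumps(cfg, indent=2))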

@Zhihan1996

Thank you @EthanLI24 and @akoumpa, it is indeed a dtype mismatch problem. I was able to solve it by commenting out the line that asserts matching dtypes (assert target_orig_dtypes == extract_dtypes(_target.named_parameters())) in the export path (NeMo/nemo/lightning/io/state.py).
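
For reference, this is the assertion I commented out, copied from the traceback above (the surrounding code may differ across NeMo versions):

# nemo/lightning/io/state.py, inside apply_transforms():
# assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
#     "dtype mismatch between source and target state dicts. ..."
# )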

My checkpoint was automatically saved by the trainer after distributed training with bf16, and the bf16 argument is true in my model.yaml. So I suspect there are some small issues in the Mistral module's dtype handling.

@aflah02

aflah02 commented Feb 20, 2025

Hi @EthanLI24 @Zhihan1996,
Did you notice any issues with the quality of the exported model? In my case the model exports, but it keeps outputting the same token, while inference via NeMo seems to work well.

@Zhihan1996

My exported Mistral model works well. It has almost the same CLM loss on validation data as during training.

@aflah02

aflah02 commented Feb 20, 2025

Thanks @Zhihan1996.
Just to confirm, you used from nemo.collections.llm import export_ckpt for the export, right?

@Zhihan1996

Yes

@akoumpa akoumpa removed the stale label Feb 20, 2025