
Fail to convert Llama3 Nemo 2.0 checkpoint to HF #11256

Closed
EthanLI24 opened this issue Nov 12, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@EthanLI24

Describe the bug

I use NeMo 2.0 to train my model and get a NeMo 2.0 checkpoint like this, with .distcp files:

model_name
├── context
│   ├── model_config.yaml
│   ├── io.json
│   └── tokenizer
└── weights
    ├── distributed checkpointing directories/files in torch_dist format
    ├── metadata.json
    └── common.pt

but I failed to use NeMo/scripts/checkpoint_converters/convert_llama_nemo_to_hf.py to export it to an HF file.

Expected behavior

Get HF files from the NeMo 2.0 checkpoint.

@EthanLI24 EthanLI24 added the bug Something isn't working label Nov 12, 2024
@EthanLI24
Author

Is there a way to save non-.distcp files or Hugging Face .bin files directly after training?
If not, how can I convert a NeMo 2.0 checkpoint to the common community format?

@hemildesai
Collaborator

Hi, the scripts/checkpoint_converters/convert_llama_nemo_to_hf.py script is only for NeMo 1.0 checkpoints. To export 2.0 checkpoints, you can use a custom script like

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/input_ckpt"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )
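
(If it helps: save this as a script, e.g. export.py, and run it with plain python in an environment where NeMo is installed; export_ckpt should write the HF-format files to output_path.)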

@EthanLI24
Author

EthanLI24 commented Nov 14, 2024

Hi, the scripts/checkpoint_converters/convert_llama_nemo_to_hf.py script is only for NeMo 1.0 checkpoints. To export 2.0 checkpoints, you can use a custom script like

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/input_ckpt"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )

Here is my folder structure:

nemo2_llama3
├── context
│   ├── model.yaml
│   ├── io.json
│   └── nemo_tokenizer
└── weights
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── metadata.json
    └── common.pt

and I followed your instructions, using

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("./nemo2_llama3"),
        target="hf",
        output_path=Path("./nemo2exporthf"),
    )

but it failed again. Could you help me with this?

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 15, 2024
@github-actions
Contributor

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Dec 23, 2024
@Zhihan1996

Zhihan1996 commented Feb 10, 2025

Hello @EthanLI24, could you let me know if you solved this issue? I have the same problem with a Mistral model. My model is in the same format as yours, and I used the same scripts. The following is my error message. It seems like a bug in NeMo's official code, as the trainer automatically saves the model.

Hey @hemildesai, could you please help with it?

[rank0]: Traceback (most recent call last):
[rank0]:   File "/pscratch/sd/z/zhihanz/GenomeOcean/export_ckpt.py", line 6, in <module>
[rank0]:     export_ckpt(
[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 663, in export_ckpt
[rank0]:     output = io.export_ckpt(path, target, output_path, overwrite, load_connector)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/api.py", line 229, in export_ckpt
[rank0]:     return exporter(overwrite=overwrite, output_path=_output_path)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/connector.py", line 99, in __call__
[rank0]:     to_return = self.apply(_output_path)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 202, in apply
[rank0]:     target = self.convert_state(source, target)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 220, in convert_state
[rank0]:     return io.apply_transforms(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/state.py", line 180, in apply_transforms
[rank0]:     assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
[rank0]: AssertionError: dtype mismatch between source and target state dicts. Left side is {}, Right side is {'model.embed_tokens.weight': torch.float32, 'model.layers.0.self_attn.q_proj.weight': torch.float32, 'model.layers.0.self_attn.k_proj.weight': torch.float32, 'model.layers.0.self_attn.v_proj.weight': torch.float32, 'model.layers.0.self_attn.o_proj.weight': torch.float32, 'model.layers.0.mlp.gate_proj.weight': torch.float32, 'model.layers ...

@akoumpa
Member

akoumpa commented Feb 11, 2025

Hi @Zhihan1996, thanks for reporting this. We'll have someone push a fix soon. Thank you for your patience.

@Zhihan1996

Thank you @akoumpa. I look forward to it. Please kindly let me know when the fix is pushed.

@aflah02

aflah02 commented Feb 14, 2025

Hi,
I'm also hitting this issue. Are there any fixes we can patch in before the official release?

@akoumpa
Member

akoumpa commented Feb 14, 2025

Hi,

I'm not able to reproduce this, so I would appreciate your help.

Here are the steps and commands/code I used:

  1. Download the Mistral checkpoint:
    akoumparouli@dev:hf_ckpt/$ huggingface-cli download mistralai/Mistral-7B-v0.3 --local-dir mistralai/Mistral-7B-v0.3 --cache-dir mistralai/Mistral-7B-v0.3

  2. Import to NeMo:

from nemo.collections import llm

if __name__ == "__main__":

    config = llm.MistralConfig7B()
    model = llm.MistralModel(config)
    ckpt_path = model.import_ckpt('hf_ckpt/mistralai/Mistral-7B-v0.3')
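
(Note: judging by the path used in step 3 below, import_ckpt seems to place the converted checkpoint under ~/.cache/nemo/models by default.)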

  3. Export to HF:

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/root/.cache/nemo/models/Mistral-7B-v0.3"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )

So I wonder how the checkpoint was generated in your case?

I just realized that while Mistral uses the correct dtype during export/import (e.g., here), it does not have a default dtype in the config. So unless the user specifies one, there's a chance it might not use bfloat16 (if initialized from a config).

I think there are two workarounds here:

  1. Cast _target to bfloat16 before reaching the assert in the export path (see the sketch below).
  2. Comment out the assert in the export path.
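
Here's a minimal sketch of the first workaround from the user side: forcing bf16 through the model config before import, so the exported weights already match. Note that params_dtype and bf16 are assumed here to be accepted by MistralConfig7B (they come from Megatron's TransformerConfig, which the NeMo config builds on); verify against your NeMo version:

import torch

from nemo.collections import llm

if __name__ == "__main__":
    # Force bf16 parameters so the imported/exported state dict and the HF
    # target model agree on dtype. params_dtype/bf16 are assumed fields
    # (inherited from Megatron's TransformerConfig); check your NeMo version.
    config = llm.MistralConfig7B(params_dtype=torch.bfloat16, bf16=True)
    model = llm.MistralModel(config)
    ckpt_path = model.import_ckpt('hf_ckpt/mistralai/Mistral-7B-v0.3')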

Edit: updating the default configs here.
Edit: I'm using ToT (top of tree).

@EthanLI24
Author

@Zhihan1996 Hi Zhihan,
That's because there is a dtype mismatch between your NeMo model and the HF target model, and dtype extraction is done in mixin.py.
There are two solutions:

  1. Manually convert your NeMo or HF checkpoint to the same format, e.g. bf16 or fp32.
  2. Modify your target HF model's config.json so its tensor dtype matches your NeMo config (see the sketch below).

Or maybe you can try modifying your io.json too. I think it always fails to reload the dtype in the HFExporter class, but I don't know why (I tried to modify the code there, but it did not work).
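
For solution 2, here's a minimal sketch, assuming your target HF checkpoint already has a config.json on disk with the standard torch_dtype field (the path is the one from my earlier attempt and is illustrative):

import json
from pathlib import Path

# Illustrative path from my earlier export attempt; adjust to your own folder.
cfg_path = Path("./nemo2exporthf/config.json")

cfg = json.loads(cfg_path.read_text())
cfg["torch_dtype"] = "bfloat16"  # set to whatever dtype your NeMo config uses
cfg_path.write_text(json.dumps(cfg, indent=2))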

@Zhihan1996

Thank you @EthanLI24 and @akoumpa, it is indeed a dtype mismatch problem. I was able to solve it by commenting out the line that asserts matching dtypes (assert target_orig_dtypes == extract_dtypes(_target.named_parameters())) in the export path (NeMo/nemo/lightning/io/state.py).
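
For reference, this is the assertion I commented out, copied from the traceback above (the surrounding code may differ across NeMo versions):

# nemo/lightning/io/state.py, inside apply_transforms():
# assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
#     "dtype mismatch between source and target state dicts. ..."
# )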

My checkpoint was automatically saved by the trainer after distributed training with bf16, and the bf16 argument is true in my model.yaml. So I suspect there are some small issues in the Mistral module's dtype handling.

@aflah02

aflah02 commented Feb 20, 2025

Hi @EthanLI24 @Zhihan1996,
Did you notice any issues with the quality of the exported model? In my case the model exports, but it keeps outputting the same token, while inference via NeMo seems to work well.

@Zhihan1996

My exported Mistral model works well. It has almost the same CLM loss on validation data as during training.

@aflah02

aflah02 commented Feb 20, 2025

Thanks @Zhihan1996.
Just to confirm, you used from nemo.collections.llm import export_ckpt for the export, right?

@Zhihan1996

Yes

@akoumpa akoumpa removed the stale label Feb 20, 2025