Fail to convert Llama3 Nemo 2.0 checkpoint to HF #11256
Is there a way we can save non-.distcp files or Hugging Face .bin files directly after training?
Hi, the scripts/checkpoint_converters/convert_llama_nemo_to_hf.py script is only for NeMo 1.0 checkpoints. To export 2.0 checkpoints, you can use a custom script like:

```python
from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/input_ckpt"),
        target="hf",
        output_path=Path("/workspace/output_ckpt.hf"),
    )
```
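If the export succeeds, the output directory should be loadable like any other local HF checkpoint. A quick sanity check, as a sketch assuming the paths from the snippet above and that the exporter also wrote the tokenizer files:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the exported checkpoint back with Hugging Face transformers to
# confirm the conversion produced a usable model and tokenizer.
model = AutoModelForCausalLM.from_pretrained("/workspace/output_ckpt.hf")
tokenizer = AutoTokenizer.from_pretrained("/workspace/output_ckpt.hf")
print(model.num_parameters(), type(model).__name__)
```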
Here is my folder structure:

```
nemo2_llama3
├── context
│   ├── model.yaml
│   ├── io.json
│   └── nemo_tokenizer
├── weights
│   ├── __0_0.distcp
│   ├── __0_1.distcp
│   ├── metadata.json
│   └── common.pt
```

and I followed your instructions and used:

```python
from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("./nemo2_llama3"),
        target="hf",
        output_path=Path("./nemo2exporthf"),
    )
```

but it failed again. Could you help me with this?
Hello @EthanLI24, could you let me know if you solved this issue? I have the same problem with a Mistral model. My model is in the same format as yours, and I used the same script. The following is my error message. It seems like a bug in NeMo's official code, as the trainer automatically saves the model in this format. Hey @hemildesai, could you please help with it?
Hi @Zhihan1996, thanks for reporting this. We'll have someone push a fix soon. Thank you for your patience.
Thank you @akoumpa. I look forward to it. Please kindly let me know when the fix is pushed.
Hi
Hi, I'm not able to reproduce this, so I would appreciate your help. Here are the steps and commands/code I used:

So I wonder, in your case, how was the checkpoint generated? I realized just now that Mistral, while it uses the correct dtype during export/import (e.g., here), does not have a default dtype in its config. So unless the user specifies one, there's a chance it might not use bfloat16 (if initialized from a config). I think there are two workarounds here:

Edit: updating the default configs here
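For the first kind of workaround (specifying the dtype yourself), something like the following might work. This is a sketch only; it assumes MistralConfig7B accepts the params_dtype/bf16 fields it inherits from Megatron's TransformerConfig:

```python
import torch
from nemo.collections import llm

# Pin the dtype explicitly instead of relying on a (missing) config default.
# params_dtype / bf16 are inherited TransformerConfig fields; treat their
# availability here as an assumption.
config = llm.MistralConfig7B(params_dtype=torch.bfloat16, bf16=True)
model = llm.MistralModel(config)
```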
@Zhihan1996 Hi Zhihan,
or maybe you can try modifying your io.json too. And I think it always fails to reload the format type in the HFExporter class, but I don't know why (I tried to modify the code here but it did not work).
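In case it helps with the io.json route: a small hypothetical helper to see which dtype values the checkpoint context actually recorded (plain JSON walking, no NeMo APIs assumed):

```python
import json

def print_dtype_entries(node, path=""):
    # Recursively print every key in io.json that mentions "dtype".
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if "dtype" in str(key).lower():
                print(child, "=", value)
            print_dtype_entries(value, child)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            print_dtype_entries(value, f"{path}[{i}]")

with open("nemo2_llama3/context/io.json") as f:
    print_dtype_entries(json.load(f))
```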
Thank you @EthanLI24 and @akoumpa, it is indeed a dtype mismatch problem. I was able to solve it by commenting out the line that asserts the same dtype. My checkpoint was automatically saved by the trainer after distributed training.
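If anyone else hits this: after exporting with the assertion commented out, you can normalize the result to a single dtype afterwards. A sketch, assuming the export wrote a standard HF directory at the path from the earlier snippet:

```python
import torch
from transformers import AutoModelForCausalLM

# Re-save the exported model with a uniform dtype so downstream loads
# don't see mixed fp32/bf16 parameters.
model = AutoModelForCausalLM.from_pretrained("/workspace/output_ckpt.hf")
print({p.dtype for p in model.parameters()})  # inspect what was written
model.to(torch.bfloat16).save_pretrained("/workspace/output_ckpt_bf16.hf")
```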
Hi @EthanLI24 @Zhihan1996
My exported Mistral model works well. It has almost the same CLM loss on validation data as during training.
Thanks @Zhihan1996
Yes
Describe the bug
I used NeMo 2.0 to train my model and got a NeMo 2.0 checkpoint like this, with .distcp files:

```
model_name
├── context
│   ├── model_config.yaml
│   ├── io.json
│   └── tokenizer
├── weights
│   ├── distributed checkpointing directories/files in torch_dist format
│   ├── metadata.json
│   └── common.pt
```

but failed to use NeMo/scripts/checkpoint_converters/convert_llama_nemo_to_hf.py to export it to HF files.
Expected behavior
Get HF files from the NeMo 2.0 checkpoint.