Unable to save model after training with tensor parallel #36436

bursteratom · 2025-02-26T20:16:07Z

System Info

Currently, attempting to save a model after training with tensor parallelism raises the following error: RuntimeError: Attempted to access the data pointer on an invalid python storage. This happens because the state dict is not gathered from the sharded tensors before saving.

Fix here: #36434
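
To make the failure mode concrete, here is a toy sketch of why sharded parameters need to be gathered before serialization. It is not the code from the linked fix: plain Python lists stand in for tensors, and the names ShardedParam and gather_full_state_dict are hypothetical, introduced only for this illustration. The assumption is that each tensor-parallel rank holds a slice of every weight, so a saver that expects one complete tensor per key has nothing valid to read until the slices are reassembled.

```python
# Toy illustration only: plain lists stand in for tensors, and
# ShardedParam / gather_full_state_dict are hypothetical names,
# not part of transformers or PyTorch.
from dataclasses import dataclass
import json


@dataclass
class ShardedParam:
    """One parameter split row-wise across tensor-parallel ranks."""
    shards: list  # shards[rank] holds the rows owned by that rank


def gather_full_state_dict(sharded_state):
    """Concatenate each parameter's per-rank shards into one full parameter."""
    full = {}
    for name, param in sharded_state.items():
        rows = []
        for shard in param.shards:  # iterate in rank order
            rows.extend(shard)
        full[name] = rows
    return full


# Two ranks, each holding one row of a 2x2 weight.
sharded_state = {
    "linear.weight": ShardedParam(shards=[[[1.0, 2.0]], [[3.0, 4.0]]]),
}

# Gather first, then serialize the complete parameters.
full_state = gather_full_state_dict(sharded_state)
serialized = json.dumps(full_state)
```

In a real PyTorch setup the gather step would involve collective communication between ranks rather than a local loop; how the linked fix performs it is not stated in this report.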

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Train the model with tensor parallelism by passing tp_size >= 2 to the trainer, and specify output_dir as the directory where the model should be saved.

Expected behavior

Model is saved upon completion of training.

bursteratom added the bug label Feb 26, 2025

bursteratom mentioned this issue Feb 26, 2025

Fix model saving bug post training with tensor parallel #36434 (Open)
