System Info
Currently, attempting to save a model after training with tensor parallelism raises:
RuntimeError: Attempted to access the data pointer on an invalid python storage
This happens because the state dict is not gathered from the sharded tensors before saving.
Fix here: #36434
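To illustrate the cause, here is a minimal sketch of gathering a sharded state dict before saving. It assumes the tensor-parallel parameters are PyTorch DTensor objects, which expose a full_tensor() method; the helper name gather_full_state_dict is hypothetical, and this is not necessarily how #36434 implements the fix.

```python
def gather_full_state_dict(state_dict):
    """Return a copy of state_dict with sharded entries replaced by full tensors.

    Assumption: sharded entries expose a `full_tensor()` method, as
    torch.distributed.tensor.DTensor does. Everything else is passed through.
    """
    full = {}
    for name, value in state_dict.items():
        full_tensor = getattr(value, "full_tensor", None)
        if callable(full_tensor):
            # For a DTensor this is a collective call, so every rank
            # has to reach this line, not only the rank that saves.
            full[name] = full_tensor()
        else:
            # Regular tensors and non-tensor entries are kept unchanged.
            full[name] = value
    return full
```

With this sketch, every rank would call gather_full_state_dict(model.state_dict()), and only one rank would then write the result to output_dir.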
Who can help?
No response
Reproduction
Train the model with tensor parallelism by passing tp_size >= 2 to the trainer, and make sure to specify output_dir as the model saving directory.
Expected behavior
Model is saved upon completion of training.