CUDA OOM? #4
I changed the zero3.json as follows:
}
It works for me.
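The exact JSON edits are not shown above. Purely as a hedged illustration, a ZeRO-3 config with CPU offload is a common way to trade GPU memory for speed; the keys below are standard DeepSpeed options, but the values are illustrative and not taken from this thread:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

Offloading optimizer state and parameters to CPU is also consistent with the slowdown reported in the next comment, since every step then pays for host-device transfers.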
But the speed is too slow... on 8 H100s.
Hi there. Yes, the speed is slow. I've trained the 2B model on four cards; 1000 steps took 10 h. I've trained the 2B model successfully and am testing its results (and if you want to test the results on SII's cluster, you will run into some problems with lmms-eval).
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
(the warning above is printed eight times, once per rank)
Invalidate trace cache @ step 0 and module 1458: cache has only 0 modules
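The warning itself is harmless, but it can be silenced by turning the KV cache off before enabling gradient checkpointing. A minimal sketch; the model id and dtype are illustrative and not taken from the report:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Illustrative checkpoint; the report trains a Qwen2-VL 7B model.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False         # the KV cache is unused during training
model.gradient_checkpointing_enable()  # recompute activations to save memory
```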
[rank7]: Traceback (most recent call last):
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 183, in
[rank7]: main(script_args, training_args, model_args)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 172, in main
[rank7]: trainer.train()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2184, in train
[rank7]: return inner_training_loop(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank7]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 3640, in training_step
[rank7]: self.accelerator.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank7]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 261, in backward
[rank7]: self.engine.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank7]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank7]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank7]: scaled_loss.backward(retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank7]: torch.autograd.backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank7]: _engine_run_backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
[rank7]: self.reduce_ready_partitions_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank7]: self.reduce_independent_p_g_buckets_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
[rank7]: self.__reduce_and_partition_ipg_grads()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1315, in __reduce_and_partition_ipg_grads
[rank7]: grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1384, in __avg_scatter_grads
[rank7]: grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 128, in reduce_scatter_coalesced
[rank7]: _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
[rank7]: return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
[rank7]: return reduce_scatter_tensor(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
[rank7]: return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank7]: return fn(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 264, in reduce_scatter_tensor
[rank7]: return self.reduce_scatter_function(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4235, in reduce_scatter_tensor
[rank7]: work = group._reduce_scatter_base(output, input, opts)
[rank7]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Failed to CUDA calloc async 4864 bytes
I used the script in the repo to train Qwen2-VL-7B, also on 8 H100s, but I get this error; it looks like cuda:7 ran out of memory. How can I fix it?
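The traceback ends in a failed CUDA allocation inside NCCL's reduce-scatter during ZeRO-3 gradient partitioning, which usually means the GPUs are already near capacity. A hedged sketch of the generic memory-saving knobs in a Hugging Face TrainingArguments setup; the values and the config path are illustrative, not the repo's defaults:

```python
from transformers import TrainingArguments

# Illustrative settings only; the actual GRPO script and configs may differ.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,    # smallest micro-batch per GPU
    gradient_accumulation_steps=8,    # keep the effective batch size
    gradient_checkpointing=True,      # recompute activations in backward
    bf16=True,                        # half-precision activations and grads
    deepspeed="zero3.json",           # hypothetical path to the ZeRO-3 config
)
```

If the failure persists, the error message itself suggests rerunning with NCCL_DEBUG=INFO set in the environment to get a more detailed NCCL log.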