CUDA OOM? #4

Open
FanqingM opened this issue Feb 2, 2025 · 3 comments

Comments

@FanqingM

FanqingM commented Feb 2, 2025

use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
(the line above is repeated 8 times in the log)
Invalidate trace cache @ step 0 and module 1458: cache has only 0 modules
[rank7]: Traceback (most recent call last):
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 183, in
[rank7]: main(script_args, training_args, model_args)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 172, in main
[rank7]: trainer.train()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2184, in train
[rank7]: return inner_training_loop(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank7]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 3640, in training_step
[rank7]: self.accelerator.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank7]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 261, in backward
[rank7]: self.engine.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank7]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank7]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank7]: scaled_loss.backward(retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank7]: torch.autograd.backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank7]: _engine_run_backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
[rank7]: self.reduce_ready_partitions_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank7]: self.reduce_independent_p_g_buckets_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
[rank7]: self.__reduce_and_partition_ipg_grads()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1315, in __reduce_and_partition_ipg_grads
[rank7]: grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1384, in __avg_scatter_grads
[rank7]: grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 128, in reduce_scatter_coalesced
[rank7]: _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
[rank7]: return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
[rank7]: return reduce_scatter_tensor(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
[rank7]: return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank7]: return fn(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 264, in reduce_scatter_tensor
[rank7]: return self.reduce_scatter_function(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4235, in reduce_scatter_tensor
[rank7]: work = group._reduce_scatter_base(output, input, opts)
[rank7]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Failed to CUDA calloc async 4864 bytes

I used the script in the repo to train Qwen2-VL-7B, also on 8 H100s, but I get this error; it looks like cuda:7 ran out of memory. How can I fix it?
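For what it's worth, the trailing NCCL error ("Failed to CUDA calloc async 4864 bytes") usually just means the GPU was already out of memory when the communicator tried to allocate its own buffers, so the fix is the same as for any training OOM. A minimal sketch of the usual memory levers for a TRL GRPO run, assuming TRL's GRPOConfig (field names vary by version; the values and paths below are illustrative, not the repo's defaults):

# Hedged sketch: memory-related knobs for a GRPO run. Field names follow TRL's
# GRPOConfig at the time of writing; verify against the installed version.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",          # illustrative output path
    bf16=True,
    gradient_checkpointing=True,         # trade compute for activation memory
    per_device_train_batch_size=1,       # smallest micro-batch per GPU
    gradient_accumulation_steps=8,       # keep the effective batch size via accumulation
    num_generations=4,                   # fewer sampled completions per prompt
    max_completion_length=256,           # shorter rollouts shrink activations and the KV cache
)

Dropping num_generations and max_completion_length tends to have the largest effect, since GRPO keeps a whole group of sampled completions per prompt in memory at once (TRL also expects the global train batch to stay divisible by num_generations).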

@FanqingM
Author

FanqingM commented Feb 2, 2025

I changed the zero3.json as follows:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 32,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 5e8,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

This works for me.
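For anyone reproducing this, the modified file can be handed to the trainer through the standard Hugging Face DeepSpeed plumbing; a sketch assuming the usual HF/TRL wiring (the config path is illustrative, not the repo's actual layout):

# Hedged sketch: pointing the HF/TRL trainer at the modified zero3.json.
# TrainingArguments (and therefore GRPOConfig) accepts a DeepSpeed config path;
# the "auto" entries in the JSON are then resolved from the training arguments
# and the model config at launch time.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",
    deepspeed="local_scripts/zero3.json",   # illustrative path to the config shown above
    bf16=True,
)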

@FanqingM
Author

FanqingM commented Feb 2, 2025

But the speed is too slow...
{'loss': 0.0001, 'grad_norm': 57.799599938214996, 'learning_rate': 9.97920997920998e-07, 'completion_length': 100.03125, 'rewards/accuracy_reward': 0.203125, 'rewards/format_reward': 0.171875, 'reward': 0.375, 'reward_std': 0.4175198972225189, 'kl': 0.0015106201171875, 'epoch': 0.0}
0%|▏ | 2/962 [08:08<64:19:48, 241.24s/it]

On 8 H100s
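Most of the wall-clock time in GRPO goes into autoregressively generating the completions, so the generation path is the main thing to speed up. Recent TRL versions expose a vLLM-backed generation option; whether the multimodal fork supports it for Qwen2-VL is worth checking before relying on it. A hedged sketch, assuming a TRL version that has this option:

# Hedged sketch: generating rollouts with vLLM instead of model.generate().
# Assumes a TRL version that exposes use_vllm in GRPOConfig and that the
# multimodal fork has not disabled this path; option names may differ by version.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",
    use_vllm=True,                       # reserve one GPU for vLLM-based generation
    vllm_gpu_memory_utilization=0.7,     # fraction of that GPU's memory given to vLLM
    max_completion_length=256,           # shorter completions also cut generation time
)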

@baibizhe

baibizhe commented Feb 2, 2025

Hi there. Yes, the speed is slow. I've trained the 2B model on four cards; 1000 steps took about 10 hours. I trained the 2B model successfully and am now testing its results (and if you want to test the results on SII's cluster, you will face some problems with lmms-eval).
