CUDA OOM? #4

Open
FanqingM opened this issue Feb 2, 2025 · 3 comments

Comments

@FanqingM

FanqingM commented Feb 2, 2025

use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
(the line above is repeated 8 times in the log)
Invalidate trace cache @ step 0 and module 1458: cache has only 0 modules
[rank7]: Traceback (most recent call last):
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 183, in
[rank7]: main(script_args, training_args, model_args)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/mengfanqing/open-r1-multimodal/src/open_r1/grpo.py", line 172, in main
[rank7]: trainer.train()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2184, in train
[rank7]: return inner_training_loop(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank7]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/transformers/trainer.py", line 3640, in training_step
[rank7]: self.accelerator.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank7]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 261, in backward
[rank7]: self.engine.backward(loss, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank7]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank7]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank7]: scaled_loss.backward(retain_graph=retain_graph)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank7]: torch.autograd.backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank7]: _engine_run_backward(
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
[rank7]: self.reduce_ready_partitions_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank7]: self.reduce_independent_p_g_buckets_and_remove_grads(param)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
[rank7]: self.__reduce_and_partition_ipg_grads()
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1315, in __reduce_and_partition_ipg_grads
[rank7]: grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1384, in __avg_scatter_grads
[rank7]: grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 128, in reduce_scatter_coalesced
[rank7]: _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
[rank7]: return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
[rank7]: return reduce_scatter_tensor(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
[rank7]: return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank7]: return fn(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 264, in reduce_scatter_tensor
[rank7]: return self.reduce_scatter_function(output_tensor,
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/inspire/hdd/ws-c6f77a66-a5f5-45dc-a4ce-1e856fe7a7b4/project/shaowenqi-shaowenqi/anaconda3/envs/multimodelr1/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4235, in reduce_scatter_tensor
[rank7]: work = group._reduce_scatter_base(output, input, opts)
[rank7]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Failed to CUDA calloc async 4864 bytes

I used the script in the repo to train Qwen2-VL-7B, also on 8 H100s, but I get this error; it looks like cuda:7 ran out of memory. How can I fix it?
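For what it's worth, the trailing NCCL error ("Failed to CUDA calloc async 4864 bytes") usually just means the GPU was already out of memory when the communicator tried to allocate its own buffers, so the fix is the same as for any training OOM. A minimal sketch of the usual memory levers for a TRL GRPO run, assuming TRL's GRPOConfig (field names vary by version; the values and paths below are illustrative, not the repo's defaults):

# Hedged sketch: memory-related knobs for a GRPO run. Field names follow TRL's
# GRPOConfig at the time of writing; verify against the installed version.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",          # illustrative output path
    bf16=True,
    gradient_checkpointing=True,         # trade compute for activation memory
    per_device_train_batch_size=1,       # smallest micro-batch per GPU
    gradient_accumulation_steps=8,       # keep the effective batch size via accumulation
    num_generations=4,                   # fewer sampled completions per prompt
    max_completion_length=256,           # shorter rollouts shrink activations and the KV cache
)

Dropping num_generations and max_completion_length tends to have the largest effect, since GRPO keeps a whole group of sampled completions per prompt in memory at once (TRL also expects the global train batch to stay divisible by num_generations).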

@FanqingM
Author

FanqingM commented Feb 2, 2025

I changed the zero3.json as follows:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 32,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 5e8,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

This works for me.
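For anyone reproducing this, the modified file can be handed to the trainer through the standard Hugging Face DeepSpeed plumbing; a sketch assuming the usual HF/TRL wiring (the config path is illustrative, not the repo's actual layout):

# Hedged sketch: pointing the HF/TRL trainer at the modified zero3.json.
# TrainingArguments (and therefore GRPOConfig) accepts a DeepSpeed config path;
# the "auto" entries in the JSON are then resolved from the training arguments
# and the model config at launch time.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",
    deepspeed="local_scripts/zero3.json",   # illustrative path to the config shown above
    bf16=True,
)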

@FanqingM
Author

FanqingM commented Feb 2, 2025

But the speed is too slow...
{'loss': 0.0001, 'grad_norm': 57.799599938214996, 'learning_rate': 9.97920997920998e-07, 'completion_length': 100.03125, 'rewards/accuracy_reward': 0.203125, 'rewards/format_reward': 0.171875, 'reward': 0.375, 'reward_std': 0.4175198972225189, 'kl': 0.0015106201171875, 'epoch': 0.0}
0%|▏ | 2/962 [08:08<64:19:48, 241.24s/it]

On 8 H100s
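Most of the wall-clock time in GRPO goes into autoregressively generating the completions, so the generation path is the main thing to speed up. Recent TRL versions expose a vLLM-backed generation option; whether the multimodal fork supports it for Qwen2-VL is worth checking before relying on it. A hedged sketch, assuming a TRL version that has this option:

# Hedged sketch: generating rollouts with vLLM instead of model.generate().
# Assumes a TRL version that exposes use_vllm in GRPOConfig and that the
# multimodal fork has not disabled this path; option names may differ by version.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo",
    use_vllm=True,                       # reserve one GPU for vLLM-based generation
    vllm_gpu_memory_utilization=0.7,     # fraction of that GPU's memory given to vLLM
    max_completion_length=256,           # shorter completions also cut generation time
)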

@baibizhe

baibizhe commented Feb 2, 2025

Hi there. Yes, the speed is slow. I've trained the 2B model on four cards; 1000 steps took about 10 hours. I trained the 2B model successfully and am now testing its results (and if you want to test the results on SII's cluster, you will face some problems with lmms-eval).
