torch.cuda.OutOfMemoryError: CUDA out of memory #228

Closed · whk6688 opened this issue Dec 25, 2024 · 3 comments

whk6688 commented Dec 25, 2024

It's strange: I've already set the parameters to the minimum, yet GPU memory still overflows. I have 50 GB of VRAM, so in theory this shouldn't happen. The training set has only 2 samples.
```bash
python train.py \
    --model_name_or_path /home/wanghaikuan/code/qwen2.5-coder-3b-ins \
    --data_path processed/sft.jsonl.npy \
    --model_max_length 1024 \
    --output_dir out_dir \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 5e-5 \
    --weight_decay 0.0 \
    --warmup_steps 100 \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --bf16 True \
    --tf32 True \
    --truncate_source False
```


INFO:root:Namespace(model_name_or_path='/home/whk/code/qwen2.5-coder-3b-ins', data_path='processed/sft.jsonl.npy', output_dir='out_dir', overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, eval_delay=0, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.0, warmup_steps=100, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='out_dir/runs/Dec25_21-15-52_104-171-203-126', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=100, save_total_limit=100, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=0, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=None, dataloader_num_workers=0, dataloader_prefetch_factor=None, past_index=-1, run_name='out_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None), deepspeed=None, label_smoothing_factor=0.0, optim=<OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['tensorboard'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, eval_do_concat_batches=True, fp16_backend='auto', evaluation_strategy='no', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, include_num_input_tokens_seen=False, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, cache_dir=None, model_max_length=1024, truncate_source=False, distributed_state=Distributed 
environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=0), deepspeed_plugin=None)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.55it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:root:Loading data...
INFO:root:Completely Loading tokenized sentences...
Samples: 2 -> 2
[2024-12-25 21:15:56,407] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/whk/code/coder/Qwen2.5-Coder/finetuning/sft/train.py", line 162, in <module>
train()
File "/home/whk/code/coder/Qwen2.5-Coder/finetuning/sft/train.py", line 157, in train
trainer.train()
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
self.optimizer.step()
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/accelerate/optimizer.py", line 170, in step
self.optimizer.step(closure)
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
out = func(*args, **kwargs)
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
has_complex = self._init_group(
File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU
0%| | 0/6 [00:01<?, ?it/s]
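
The traceback shows the allocation failing inside AdamW's `_init_group` while it creates the `exp_avg_sq` buffer, i.e. the optimizer state for full fine-tuning no longer fits. A rough back-of-envelope sketch, assuming fp32 master weights, fp32 gradients, and two fp32 AdamW moments per parameter (the exact layout depends on how train.py loads the model):

```python
# Back-of-envelope VRAM estimate for full fine-tuning of a ~3B-parameter
# model with plain AdamW. Figures are approximations, not measurements.
params = 3.0e9          # Qwen2.5-Coder-3B parameter count (approximate)
bytes_weights = 4       # fp32 master weights (assumed; bf16 would halve this)
bytes_grads = 4         # fp32 gradients (assumed)
bytes_adam = 8          # AdamW exp_avg + exp_avg_sq, both fp32 (assumed)

total_gib = params * (bytes_weights + bytes_grads + bytes_adam) / 1024**3
print(f"~{total_gib:.0f} GiB before activations")  # ~45 GiB
```

Activations and CUDA workspace come on top of that, so ~50 GB is tight for full fine-tuning of a 3B model even at batch size 1; a tiny dataset and a small `model_max_length` do not shrink the optimizer-state term.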

whk6688 (Author) commented Dec 25, 2024

With torch 2.5, the error is:
File "/home/wang/anaconda3/envs/python311/lib/python3.11/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
Roughly the same as above.
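
Both tracebacks point at AdamW's moment buffers. If full fine-tuning is the goal, two standard `transformers` TrainingArguments switches target exactly that memory; whether train.py's argument parser forwards them is an assumption, so treat this as a sketch:

```python
# Sketch only: gradient_checkpointing and optim are standard fields of
# transformers.TrainingArguments; it is an assumption that train.py's
# HfArgumentParser setup exposes them as CLI flags.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out_dir",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,  # recompute activations instead of caching them
    optim="adamw_bnb_8bit",       # 8-bit AdamW states via bitsandbytes
)
```

Gradient checkpointing trades compute for activation memory, and the 8-bit optimizer shrinks the two moment buffers from ~8 bytes to ~2 bytes per parameter.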

whk6688 (Author) commented Dec 26, 2024

Is this configuration correct? target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]. After setting this, the model can now be saved normally.
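
That target_modules list names every linear projection in a Qwen2-style decoder block, which is the usual choice when applying LoRA to this architecture. A minimal sketch of where it plugs in with peft (the rank, alpha, and dropout values below are illustrative, not taken from this thread):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/wanghaikuan/code/qwen2.5-coder-3b-ins",  # path from the original command
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,            # illustrative rank
    lora_alpha=32,   # illustrative scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients
```

With only the low-rank adapters trainable, AdamW's exp_avg/exp_avg_sq buffers cover millions rather than billions of parameters, which is consistent with training and saving succeeding after this change.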

cyente (Collaborator) commented Feb 11, 2025

Seems that the problem has been solved :)

cyente closed this as completed Feb 11, 2025