Strange: the parameters are already set to the minimum, yet it still runs out of GPU memory. The card has 50 GB of VRAM, so in theory this should not happen, and the training set has only 2 samples.

python train.py \
    --model_name_or_path /home/wanghaikuan/code/qwen2.5-coder-3b-ins \
    --data_path processed/sft.jsonl.npy \
    --model_max_length 1024 \
    --output_dir out_dir \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 5e-5 \
    --weight_decay 0.0 \
    --warmup_steps 100 \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --bf16 True \
    --tf32 True \
    --truncate_source False
INFO:root:Namespace(model_name_or_path='/home/whk/code/qwen2.5-coder-3b-ins', data_path='processed/sft.jsonl.npy', output_dir='out_dir', overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, eval_delay=0, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.0, warmup_steps=100, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='out_dir/runs/Dec25_21-15-52_104-171-203-126', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=100, save_total_limit=100, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=0, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=None, dataloader_num_workers=0, dataloader_prefetch_factor=None, past_index=-1, run_name='out_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None), deepspeed=None, label_smoothing_factor=0.0, optim=<OptimizerNames.ADAMW_TORCH: 'adamw_torch'>, optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['tensorboard'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, eval_do_concat_batches=True, fp16_backend='auto', evaluation_strategy='no', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, include_num_input_tokens_seen=False, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, cache_dir=None, model_max_length=1024, truncate_source=False, distributed_state=Distributed 
environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=0), deepspeed_plugin=None)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.55it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:root:Loading data...
INFO:root:Completely Loading tokenized sentences...
Samples: 2 -> 2
[2024-12-25 21:15:56,407] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  0%|          | 0/6 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/whk/code/coder/Qwen2.5-Coder/finetuning/sft/train.py", line 162, in <module>
    train()
  File "/home/whk/code/coder/Qwen2.5-Coder/finetuning/sft/train.py", line 157, in train
    trainer.train()
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    self.optimizer.step()
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/accelerate/optimizer.py", line 170, in step
    self.optimizer.step(closure)
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
    has_complex = self._init_group(
  File "/home/whk/anaconda3/envs/python310/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
    state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU
  0%|          | 0/6 [00:01<?, ?it/s]
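For context on why this fails even with batch size 1 and only 2 samples: the traceback dies inside `_init_group` while allocating `exp_avg_sq`, i.e. while AdamW is creating its per-parameter optimizer state on the first step. That footprint scales with the number of trainable parameters, not with the batch or dataset size. A rough back-of-the-envelope estimate, assuming (this is not stated in the log) that the ~3B model ends up in fp32, which is the default when no `torch_dtype` is passed at load time, and is trained with plain `adamw_torch`:

```python
# Rough memory estimate for full-parameter fine-tuning with plain AdamW.
# Assumptions (not taken from the log): ~3.1e9 trainable parameters, with
# weights, gradients and both Adam moments (exp_avg, exp_avg_sq) all in fp32.
n_params = 3.1e9
bytes_per_param = 4 + 4 + 4 + 4  # weights + grads + exp_avg + exp_avg_sq
print(f"~{n_params * bytes_per_param / 1024**3:.0f} GiB")  # ~46 GiB, before activations and the CUDA context
```

That alone is right at the edge of a 50 GB card, so tipping over exactly when the second Adam moment is allocated is plausible. The usual ways out are loading the model in bf16, a memory-lighter optimizer state (e.g. `--optim adamw_bnb_8bit`), DeepSpeed/ZeRO optimizer offload, or parameter-efficient tuning as in the LoRA sketch further down.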
With torch 2.5 the error becomes: File "/home/wang/anaconda3/envs/python311/lib/python3.11/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw, exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs). Roughly the same as the one above.
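The `_multi_tensor_adamw` / `torch._foreach_sqrt` frames in that torch 2.5 traceback come from the multi-tensor ("foreach") AdamW implementation, which builds temporaries such as `exp_avg_sq_sqrt` across many parameters at once and therefore peaks higher than the per-tensor loop. Purely as an illustrative sketch (the repo's train.py lets the HF Trainer construct the optimizer, so this is not the thread's code), the single-tensor path can be forced when building AdamW by hand:

```python
import torch

# Illustrative sketch only: forcing the single-tensor AdamW path avoids the
# _foreach_* temporaries seen in the torch 2.5 traceback. It lowers the peak
# allocation during the step but does not shrink the persistent per-parameter
# state (exp_avg / exp_avg_sq), so it does not fix the underlying footprint.
model = torch.nn.Linear(8, 8)  # stand-in for the loaded Qwen2.5 model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    weight_decay=0.0,
    foreach=False,             # use the per-tensor implementation
)
```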
Is this setting correct? target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]. With this set, the model can now be saved normally.
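For reference, that `target_modules` list matches the argument of the same name in PEFT's `LoraConfig`, so the comment above presumably describes a LoRA setup. A minimal sketch of how such a config is typically wired up (assumed, since the thread does not show the surrounding code; r / lora_alpha / lora_dropout are placeholder values):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Minimal LoRA sketch; rank, alpha and dropout are placeholders, not values
# taken from this thread.
base = AutoModelForCausalLM.from_pretrained(
    "/home/wanghaikuan/code/qwen2.5-coder-3b-ins",  # local path from the issue
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights require grads
```

With only the adapters trainable, AdamW's per-parameter state shrinks from tens of GiB to a few hundred MiB, which would be consistent with training then fitting in memory and checkpoints saving normally.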
Seems that the problem has been solved :)