Could someone take a look? Fine-tuning fails with a dimension mismatch: RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0 #125
It sounds like the GPU isn't configured properly. Can you try whether this snippet runs for you:

import torch
from bitsandbytes.nn import LinearNF4

model = LinearNF4(10, 20).cuda()
x = torch.randn(2, 10).cuda()
out = model(x)
It runs fine inside a script; this is the output:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't updated to the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?
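For reference, here is a quick way to check those versions from inside the same environment. This is only a sketch; the SwissArmyTransformer distribution name is assumed here:

from importlib.metadata import version
import torch

# Print the package versions the reply above asks about; it expects bitsandbytes 0.39.0.
print("bitsandbytes:", version("bitsandbytes"))
print("SwissArmyTransformer:", version("SwissArmyTransformer"))
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())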
For the tokenizer problem you can refer to: #111 (comment)
The main thing is the problem that comes after that:
This should be a DeepSpeed configuration problem; there is a similar issue: #43. I looked up some possible solutions:
After I modified line 176 (#176) of finetune_visualglm.py, the error changed from RuntimeError: Error building extension 'fused_adam' to the dimension problem RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0. What does args.device refer to here, and why does this happen?
The problem is solved and training works now!! The root cause was that cusolverDn.h could not be found (cusolverDn.h: No such file or directory).
Has this been resolved? I mean the dimension-mismatch problem.
Did you manage to solve it, the dimension-mismatch problem?
That change didn't solve it. I reverted to the original code at line 176 (#176) and kept args.device = 'cpu'. After that, what remained was a CUDA environment problem, RuntimeError: Error building extension 'fused_adam', which I fixed by setting the environment variables mentioned above.
So, in that case, the dimension mismatch really is a GPU/CUDA configuration problem.
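For what it's worth, the numbers in the error fit that explanation: with --use_qlora the ChatGLM linear layers are replaced by 4-bit layers ("replacing chatglm linear layer with 4bit" in the log below), and 4-bit storage packs two values per byte, so a [12288, 4096] fp16 weight (ChatGLM-6B's combined query_key_value shape) would become exactly 25165824 packed entries. Copying the fp16 checkpoint tensor into an already-quantized parameter then fails. A minimal illustration with assumed shapes, not the repository's code:

import torch

# Assumed shapes: [12288, 4096] is ChatGLM-6B's query_key_value weight,
# and 12288 * 4096 // 2 = 25165824 packed 4-bit entries (two per byte).
checkpoint_weight = torch.empty(12288, 4096, dtype=torch.float16)        # fp16 tensor from mp_rank_00_model_states.pt
quantized_weight = torch.empty(12288 * 4096 // 2, 1, dtype=torch.uint8)  # already-packed 4-bit storage

# The shapes are not broadcast-compatible, so this raises a size-mismatch
# RuntimeError much like the one reported above.
quantized_weight.copy_(checkpoint_weight)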
My CUDA version is 12.0 and I'm hitting the same problem.
Where do I add those?
|
I ran vi ~/.bashrc and added export PATH=/usr/local/cuda/bin:$PATH at the end, but I still get this error: ...nsion.py", line 2112, in _run_ninja_build
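If the error persists after editing ~/.bashrc, it is worth checking what the process that launches DeepSpeed actually sees, since a container shell may never re-source that file. A small check, assuming the toolkit lives under /usr/local/cuda*:

import os, glob, shutil

# fused_adam is JIT-compiled by DeepSpeed and needs nvcc plus the CUDA headers
# (including cusolverDn.h), so all three of these should resolve to something.
print("CUDA_HOME    :", os.environ.get("CUDA_HOME"))
print("nvcc on PATH :", shutil.which("nvcc"))
print("cusolverDn.h :", glob.glob("/usr/local/cuda*/include/cusolverDn.h"))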
All issues are resolved and fine-tuning works now.
When running inference with the fine-tuned model weight files, I get:
Solved: download the model files mentioned in the error message to a local directory, then point the script you run at their local path.
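As a concrete (hypothetical) example of that workaround, if the error complains about a Hugging Face model it cannot download, the loader can be pointed at a local copy instead; the path below is a placeholder:

from transformers import AutoTokenizer

# Placeholder path: replace with wherever you downloaded the files named in the error.
local_dir = "/home/data/models/chatglm-6b"
tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)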
I ran into the same problem while fine-tuning llama3, but I solved it: 1. use the following package versions (the key one is peft 0.4.0); 2. the lora config below.
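The commenter's exact version list and config are not shown above; as a generic sketch, a peft 0.4.0 LoRA config for a LLaMA-style model usually looks like this (r, alpha, dropout, and target_modules are illustrative assumptions):

from peft import LoraConfig, TaskType, get_peft_model

# Generic example only; these hyperparameters are placeholders, not the commenter's values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
# model = get_peft_model(base_model, lora_config)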
(base) root@6633711ec9b0:/home/data/VisualGLM-6B# bash finetune/finetune_visualglm_qlora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
Setting ds_accelerator to cuda (auto detect)
[2023-06-12 06:33:33,961] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-12 06:33:34,032] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
Setting ds_accelerator to cuda (auto detect)
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-12 06:33:35,959] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-12 06:33:35,959] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-12 06:33:35,959] [INFO] [launch.py:163:main] dist_world_size=1
[2023-06-12 06:33:35,959] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/opt/conda/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get
CUDA error: invalid device function
errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.warn(msg)
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
[2023-06-12 06:33:39,415] [INFO] using world size: 1 and model-parallel size: 1
[2023-06-12 06:33:39,415] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
[2023-06-12 06:33:39,417] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-06-12 06:33:39,418] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-12 06:33:39,418] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-12 06:33:39,418] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-06-12 06:33:39,418] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-06-12 06:33:39,419] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-06-12 06:33:39,419] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/opt/conda/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
replacing chatglm linear layer with 4bit
[2023-06-12 06:34:26,500] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-12 06:34:30,185] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 180, in
model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 216, in from_pretrained
load_checkpoint(model, args, load_path=model_path, prefix=prefix)
File "/opt/conda/lib/python3.10/site-packages/sat/training/model_io.py", line 208, in load_checkpoint
missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1657, in load_state_dict
load(self, state_dict)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1639, in load
module._load_from_state_dict(
File "/home/data/VisualGLM-6B/lora_mixin.py", line 109, in _load_from_state_dict
self.original._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/data/VisualGLM-6B/lora_mixin.py", line 47, in load_from_state_dict
self.weight.data.copy(state_dict[prefix+'weight'])
RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0
[2023-06-12 06:34:36,019] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 8346
[2023-06-12 06:34:36,019] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '8', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1