
Could someone help take a look? Fine-tuning fails with a dimension mismatch: RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0 #125

Open
xxuyyuan opened this issue Jun 12, 2023 · 21 comments


@xxuyyuan

(base) root@6633711ec9b0:/home/data/VisualGLM-6B# bash finetune/finetune_visualglm_qlora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
Setting ds_accelerator to cuda (auto detect)
[2023-06-12 06:33:33,961] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-12 06:33:34,032] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
Setting ds_accelerator to cuda (auto detect)
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-06-12 06:33:35,959] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-12 06:33:35,959] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-12 06:33:35,959] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-12 06:33:35,959] [INFO] [launch.py:163:main] dist_world_size=1
[2023-06-12 06:33:35,959] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/opt/conda/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
[2023-06-12 06:33:39,415] [INFO] using world size: 1 and model-parallel size: 1
[2023-06-12 06:33:39,415] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
16666
[2023-06-12 06:33:39,417] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-06-12 06:33:39,418] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-12 06:33:39,418] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-12 06:33:39,418] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-06-12 06:33:39,418] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-06-12 06:33:39,419] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-06-12 06:33:39,419] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/opt/conda/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
replacing chatglm linear layer with 4bit
[2023-06-12 06:34:26,500] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-12 06:34:30,185] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 180, in
model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 216, in from_pretrained
load_checkpoint(model, args, load_path=model_path, prefix=prefix)
File "/opt/conda/lib/python3.10/site-packages/sat/training/model_io.py", line 208, in load_checkpoint
missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1657, in load_state_dict
load(self, state_dict)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load
load(child, child_state_dict, child_prefix)
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1639, in load
module._load_from_state_dict(
File "/home/data/VisualGLM-6B/lora_mixin.py", line 109, in _load_from_state_dict
self.original._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/data/VisualGLM-6B/lora_mixin.py", line 47, in load_from_state_dict
self.weight.data.copy_(state_dict[prefix+'weight'])
RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0
[2023-06-12 06:34:36,019] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 8346
[2023-06-12 06:34:36,019] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '8', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1

@1049451037
Member

1049451037 commented Jun 12, 2023

This looks like a GPU setup problem. Can you check whether this snippet runs for you:

from bitsandbytes.nn import LinearNF4
model = LinearNF4(10, 20).cuda()

import torch
x = torch.randn(2, 10).cuda()
out = model(x)

@xxuyyuan
Author

This looks like a GPU setup problem. Can you check whether this snippet runs for you:

from bitsandbytes.nn import LinearNF4
model = LinearNF4(10, 20).cuda()

import torch
x = torch.randn(2, 10).cuda()
out = model(x)

It runs fine here:
(base) root@6633711ec9b0:/home/data/VisualGLM-6B# python3
Python 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from bitsandbytes.nn import LinearNF4

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so.11.0'), PosixPath('/opt/conda/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...

>>> model = LinearNF4(10, 20).cuda()
>>> import torch
>>> x = torch.randn(2, 10).cuda()
>>> out = model(x)

@1049451037
Member

1049451037 commented Jun 12, 2023

It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't pulled the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?
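
(Not part of the original reply; a quick hedged check of the installed release, assuming a standard pip install of bitsandbytes:)

# Verify the installed bitsandbytes matches the version the maintainer expects.
import bitsandbytes
print(bitsandbytes.__version__)   # expected: 0.39.0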

@xxuyyuan
Author

xxuyyuan commented Jun 12, 2023

It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't pulled the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?

bitsandbytes is 0.39.0. I pulled the latest code and ran it again;
the first run failed with:
AttributeError: 'FakeTokenizer' object has no attribute 'encode'
Details:
[2023-06-12 08:15:11,713] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/opt/conda/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
replacing chatglm linear layer with 4bit
[2023-06-12 08:15:58,973] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-12 08:15:59,738] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-06-12 08:16:04,555] [INFO] [RANK 0] > successfully loaded /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-06-12 08:16:07,585] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[2023-06-12 08:32:23,056] [INFO] [RANK 0] Cannot find THUDM/chatglm-6b from Huggingface or sat. Creating a fake tokenizer...
Traceback (most recent call last):
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 195, in
training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
File "/opt/conda/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 197, in make_loaders
train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
File "/opt/conda/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 124, in make_dataset_full
d = create_dataset_function(p, args)
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 161, in create_dataset_function
dataset = FewShotDataset(path, image_processor, tokenizer, args)
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 119, in init
input0 = tokenizer.encode("", add_special_tokens=False)
AttributeError: 'FakeTokenizer' object has no attribute 'encode'
[2023-06-12 08:32:24,375] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 11313
[2023-06-12 08:32:24,375] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1

Running it again:
RuntimeError: Error building extension 'fused_adam'
Details:
6633711ec9b0:11642:11786 [0] NCCL INFO Connected all rings
6633711ec9b0:11642:11786 [0] NCCL INFO Connected all trees
6633711ec9b0:11642:11786 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
6633711ec9b0:11642:11786 [0] NCCL INFO comm 0xb1efe50 rank 0 nranks 1 cudaDev 0 busId 54000 - Init COMPLETE
[2023-06-12 08:35:47,241] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:13:0:
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 195, in
training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 98, in training_main
model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 161, in setup_model_untrainable_params_and_optimizer
model, optimizer, _, _ = deepspeed.initialize(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in init
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1236, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
6633711ec9b0:11642:11783 [0] NCCL INFO [Service thread] Connection closed by localRank 0
6633711ec9b0:11642:11642 [0] NCCL INFO comm 0xb1cab50 rank 0 nranks 1 cudaDev 0 busId 54000 - Abort COMPLETE
6633711ec9b0:11642:11787 [0] NCCL INFO [Service thread] Connection closed by localRank 0
6633711ec9b0:11642:11642 [0] NCCL INFO comm 0xb1efe50 rank 0 nranks 1 cudaDev 0 busId 54000 - Abort COMPLETE
[2023-06-12 08:35:49,912] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 11642
[2023-06-12 08:35:49,912] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1
So the two errors keep alternating:
AttributeError: 'FakeTokenizer' object has no attribute 'encode'
RuntimeError: Error building extension 'fused_adam'

Is this a CUDA environment problem? I keep going around in circles.

@1049451037
Member

For the tokenizer problem, see #111 (comment).
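
(Not from #111 itself; a hedged sketch of one common workaround, assuming the FakeTokenizer fallback was triggered because THUDM/chatglm-6b could not be downloaded: fetch the repo once and point the tokenizer at the local copy.)

# Sketch: load the chatglm-6b tokenizer from a local directory instead of the hub.
from transformers import AutoTokenizer

local_dir = "/path/to/chatglm-6b"  # hypothetical path holding the downloaded repo
tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
print(tokenizer.encode("你好", add_special_tokens=False))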

@xxuyyuan
Author

For the tokenizer problem, see #111 (comment).

The tokenizer works after re-running;
[image]

The main remaining problem is:
RuntimeError: Error building extension 'fused_adam', details above.

@1049451037
Member

This looks like a DeepSpeed configuration problem; there is a similar issue: #43

I also searched for possible fixes:

@xxuyyuan
Author

It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't pulled the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?

I changed line 176 of finetune_visualglm.py
from args.device = 'cpu' to args.device = 'cuda'.

The error then changed from RuntimeError: Error building extension 'fused_adam' to the dimension mismatch RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0.

What does args.device refer to here, and why does this happen?

@xxuyyuan
Author

/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>

Solved, training works now!! The root cause was cusolverDn.h: No such file or directory;
adding the environment variable export PATH=/usr/local/cuda/bin:$PATH fixed it.

@JumpingRain

Has this problem been solved? The dimension mismatch one, I mean.

@JumpingRain

It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't pulled the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?

I changed line 176 of finetune_visualglm.py from args.device = 'cpu' to args.device = 'cuda'.

The error then changed from RuntimeError: Error building extension 'fused_adam' to the dimension mismatch RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0.

What does args.device refer to here, and why does this happen?

Has the dimension mismatch problem been solved?

@xxuyyuan
Author

It runs fine on my side. Can you check whether your code matches the main branch of VisualGLM-6B? Maybe you haven't pulled the latest version, or you changed something locally? Also, is your bitsandbytes version 0.39.0?

I changed line 176 of finetune_visualglm.py from args.device = 'cpu' to args.device = 'cuda'.
The error then changed from RuntimeError: Error building extension 'fused_adam' to the dimension mismatch RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0.
What does args.device refer to here, and why does this happen?

Has the dimension mismatch problem been solved?

That one wasn't solved; I reverted to the original code and kept args.device = 'cpu' at line 176. Then it was back to the CUDA environment problem, RuntimeError: Error building extension 'fused_adam', which I fixed by setting the environment variable as described above.

@1049451037
Member

1049451037 commented Jun 15, 2023

bitsandbytes implements model quantization by overriding the .cuda() call, which means the model is quantized (its tensor shapes change) at the moment it is moved to the GPU. During fine-tuning, the pretrained weights being loaded are fp16, so args.device has to be 'cpu': the weights are loaded first, and .cuda() is called afterwards. Since this is how bitsandbytes is implemented, we cannot control it and can only adapt to it.

So the dimension mismatch means the GPU setup is broken: the .cuda() call failed.
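
(Not part of the original reply; a minimal sketch, assuming a CUDA-capable GPU and bitsandbytes >= 0.39, of why the load order matters:)

# LinearNF4 quantizes its weight inside .cuda(): the parameter is replaced by a
# packed 4-bit buffer whose shape and element count no longer match fp16 weights.
from bitsandbytes.nn import LinearNF4

layer = LinearNF4(16, 32, bias=False)
print(layer.weight.shape)   # full-precision weight, e.g. torch.Size([32, 16])

layer = layer.cuda()        # quantization happens here
print(layer.weight.shape)   # packed 4-bit storage with a different, smaller shape

# Copying an fp16 checkpoint into the already-quantized weight would now raise the
# size-mismatch RuntimeError from this issue, which is why args.device stays 'cpu'
# until the checkpoint is loaded, and only then is the model moved to the GPU.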

@chenchen333-dev

This looks like a DeepSpeed configuration problem; there is a similar issue: #43

I also searched for possible fixes:

My CUDA version is 12.0 and I hit the same problem.

@chenchen333-dev

/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>

Solved, training works now!! The root cause was cusolverDn.h: No such file or directory; adding the environment variable export PATH=/usr/local/cuda/bin:$PATH fixed it.

Where do I add that?

@chenchen333-dev

pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
If that doesn't work, also try:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip3 install .
If it still fails, post your error.

I ran:
pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
and still get this error:
File "/home/nbicc/data/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

@chenchen333-dev

/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>

Solved, training works now!! The root cause was cusolverDn.h: No such file or directory; adding the environment variable export PATH=/usr/local/cuda/bin:$PATH fixed it.

I ran vi ~/.bashrc and added export PATH=/usr/local/cuda/bin:$PATH at the bottom, but I still get the same error:
...cpp_extension.py", line 2112, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

@chenchen333-dev

For the tokenizer problem, see #111 (comment).

The tokenizer works after re-running; [image]

The main remaining problem is: RuntimeError: Error building extension 'fused_adam', details above.

Everything is resolved now; fine-tuning succeeded.

@chenchen333-dev

For the tokenizer problem, see #111 (comment).

The tokenizer works after re-running; [image]
The main remaining problem is: RuntimeError: Error building extension 'fused_adam', details above.

Everything is resolved now; fine-tuning succeeded.

When running inference with the fine-tuned weights I get:
File "/home/nbicc/data/anaconda3/envs/lm/lib/python3.8/site-packages/transformers/utils/hub.py", line 469, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like THUDM/chatglm-6b is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Has anyone else hit this problem?

@chenchen333-dev

For the tokenizer problem, see #111 (comment).

The tokenizer works after re-running; [image]
The main remaining problem is: RuntimeError: Error building extension 'fused_adam', details above.

Everything is resolved now; fine-tuning succeeded.

When running inference with the fine-tuned weights I get: File "/home/nbicc/data/anaconda3/envs/lm/lib/python3.8/site-packages/transformers/utils/hub.py", line 469, in cached_file raise EnvironmentError( OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like THUDM/chatglm-6b is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'. Has anyone else hit this problem?

Solved: download the model files named in the error message to a local directory, then point the script at that local path.
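
(Not part of the original comment; a hedged sketch, assuming huggingface_hub is installed, of pre-downloading THUDM/chatglm-6b so offline runs can use a local path:)

# Download the repo once on a machine with network access, then reuse the local path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("THUDM/chatglm-6b")
print(local_dir)  # pass this path wherever the script currently uses "THUDM/chatglm-6b"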

@huangheLee

I hit the same problem fine-tuning llama3, but solved it.

1. Use the following versions; the key one is peft 0.4.0:
accelerate==0.33.0
transformers==4.44.0
peft==0.4.0
bitsandbytes==0.43.3
loguru==0.7.0
jsonschema==4.23.0
tensorboard==2.14.0

2. LoRA config (a peft sketch follows below):
"lora_rank": 64,
"lora_alpha": 16,
"lora_dropout": 0.05,
