Misc. bug: failed to find ggml_backend_init in libggml-cuda.so and libggml-cpu.so #11700

furioussoul commented Feb 6, 2025

Name and Version

./llama-cli --version
load_backend: failed to find ggml_backend_init in ./libggml-cuda.so
load_backend: failed to find ggml_backend_init in ./libggml-cpu.so
version: 4603 (4a2b196)
built with gcc (GCC) 8.5.0 20210514 (TencentOS 8.5.0-18) for x86_64-redhat-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./llama-cli --n_gpu_layers 29 -m /path/to/qwen2.5-7b-it-Q4_K_M-LOT.gguf

Problem description & steps to reproduce

Run the following command:
./llama-cli --n_gpu_layers 29 -m /path/to/qwen2.5-7b-it-Q4_K_M-LOT.gguf

Then check the log. Everything seems fine except for the two `load_backend: failed to find ggml_backend_init` errors in the output below:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (Tesla V100-SXM2-32GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz)
load_backend: failed to find ggml_backend_init in /path/to/llama.cpp/out/build/linux/bin/libggml-cuda.so
load_backend: failed to find ggml_backend_init in /path/to/llama.cpp/out/build/linux/bin/libggml-cpu.so
build: 3 (f1614702) with gcc (GCC) 8.5.0 20210514 (TencentOS 8.5.0-18) for x86_64-redhat-linux (debug)
llama_model_load_from_file_impl: using device CUDA0 (Tesla V100-SXM2-32GB) - 29565 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /mnt/szj/llama.cpp/models/qwen2.5-7b-it-Q4_K_M-LOT.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 7B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-7B
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 28
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 28 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.95 GiB (5.59 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = Qwen2.5 7B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151643 '<|endoftext|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 292.36 MiB
load_tensors: CUDA0 model buffer size = 4781.23 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
llama_init_from_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.58 MiB
llama_init_from_model: CUDA0 compute buffer size = 492.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 23.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
你好!“jntm”是一句网络流行语,源自于2017年的一

The relevant snippet from ggml-backend-reg.cpp is as follows:

    ggml_backend_reg_t load_backend(const std::wstring & path, bool silent) {
        dl_handle_ptr handle { dl_load_library(path) };
        if (!handle) {
            if (!silent) {
                GGML_LOG_ERROR("%s: failed to load %s\n", __func__, utf16_to_utf8(path).c_str());
            }
            return nullptr;
        }

        auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
        if (score_fn && score_fn() == 0) {
            if (!silent) {
                GGML_LOG_INFO("%s: backend %s is not supported on this system\n", __func__, utf16_to_utf8(path).c_str());
            }
            return nullptr;
        }

        auto backend_init_fn = (ggml_backend_init_t) dl_get_sym(handle.get(), "ggml_backend_init");
        if (!backend_init_fn) {
            if (!silent) {
                GGML_LOG_ERROR("%s: failed to find ggml_backend_init in %s\n", __func__, utf16_to_utf8(path).c_str());
            }
            return nullptr;
        }

        ggml_backend_reg_t reg = backend_init_fn();
        if (!reg || reg->api_version != GGML_BACKEND_API_VERSION) {
            if (!silent) {
                if (!reg) {
                    GGML_LOG_ERROR("%s: failed to initialize backend from %s: ggml_backend_init returned NULL\n", __func__, utf16_to_utf8(path).c_str());
                } else {
                    GGML_LOG_ERROR("%s: failed to initialize backend from %s: incompatible API version (backend: %d, current: %d)\n",
                        __func__, utf16_to_utf8(path).c_str(), reg->api_version, GGML_BACKEND_API_VERSION);
                }
            }
            return nullptr;
        }

        GGML_LOG_INFO("%s: loaded %s backend from %s\n", __func__, ggml_backend_reg_name(reg), utf16_to_utf8(path).c_str());

        register_backend(reg, std::move(handle));

        return reg;
    }

I believe that when score_fn is nullptr, backend_init_fn will also be nullptr: a library that does not export ggml_backend_score will not export ggml_backend_init either.
The "failed to find ggml_backend_init" error is therefore reported exactly in the case where score_fn == nullptr and backend_init_fn is also nullptr.
To address the issue, the condition should be changed to !score_fn || score_fn() == 0, so that such a library is rejected earlier with the milder "not supported on this system" message instead.
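
For illustration, a minimal sketch of that change (only the score_fn condition differs; everything else in load_backend() stays as shown above):

    auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
    if (!score_fn || score_fn() == 0) {
        // A library that does not export ggml_backend_score (or reports a zero
        // score) is rejected here, so execution never reaches the misleading
        // "failed to find ggml_backend_init" error further down.
        if (!silent) {
            GGML_LOG_INFO("%s: backend %s is not supported on this system\n", __func__, utf16_to_utf8(path).c_str());
        }
        return nullptr;
    }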
