Misc. bug: llama-server --ctx-size is divided by --parallel and cannot be increased? #11681

Open
alasdairforsythe opened this issue Feb 5, 2025 · 2 comments

alasdairforsythe commented Feb 5, 2025

Name and Version

./llama-server --version
version: 4621 (6eecde3)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server \
    --model models/deepseek-v3-q2_k_xs-00001-of-00005.gguf --alias full \
    --host 0.0.0.0 \
    --port 55055 \
    --ctx-size 0 \
    --cache-type-k q5_0 \
    --slot-save-path "saved_slots.kvc" \
    --threads 94 \
    --threads-http 12 \
    --parallel 6 \
    --mirostat 2 \
    --mirostat-ent 5.7 \
    --mirostat-lr 0.14

Problem description & steps to reproduce

The context size is being divided by the number of parallel inferences allowed to run on llama-server, which does not make sense. Attempting to compensate by increasing --ctx-size results in an assertion failure. To avoid breaking previous behavior, I suggest adding a new command-line argument, --ctx-size-per-seq, to override this behavior.

If llama-server is run with --ctx-size 0 --parallel 6 or --ctx-size 163840 --parallel 6, it divides the total ctx by 6.
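
For reference, the n_ctx_per_seq figure in the log below is consistent with plain integer division of the total context by the slot count. A minimal illustrative check (not the server's exact code):

#include <cstdint>
#include <cstdio>

int main() {
    const int32_t n_ctx      = 163840; // --ctx-size 0 resolves to n_ctx_train
    const int32_t n_parallel = 6;      // --parallel 6
    const int32_t n_ctx_slot = n_ctx / n_parallel; // integer division, truncates
    std::printf("n_ctx_per_seq = %d\n", n_ctx_slot); // prints 27306
}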

print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 206.05 GiB (2.64 BPW)
[...]
print_info: arch             = deepseek2
print_info: n_ctx_train      = 163840
[...]
llama_init_from_model: n_seq_max     = 6
llama_init_from_model: n_ctx         = 163840
llama_init_from_model: n_ctx_per_seq = 27306
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (27306) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 163840, offload = 1, type_k = 'q5_0', type_v = 'f16', n_layer = 61, can_shift = 0

Okay, so I increase the ctx to 327680 so that each parallel slot gets more. That almost works, but then it fails on a GGML_ASSERT:

llama_init_from_model: n_seq_max     = 6
llama_init_from_model: n_ctx         = 327680
llama_init_from_model: n_ctx_per_seq = 54613
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (54613) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 327680, offload = 1, type_k = 'q5_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 275232.00 MiB
llama_init_from_model: KV self size  = 275232.00 MiB, K (q5_0): 150304.00 MiB, V (f16): 124928.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.96 MiB
/home/main/llama.cpp/ggml/src/ggml.c:1578: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed

So it is impossible to use a context size of more than 27306, even though I have 1.1 TB of RAM.
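
For scale, the same division means that giving each of the 6 slots the model's full 163840-token training context would require a total --ctx-size of 163840 × 6 = 983040; the failed attempt above (327680 / 6 = 54613 per sequence) is still only a third of the way there.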

First Bad Commit

No response

Relevant log output

./llama-server \
>     --model models/deepseek-v3-q2_k_xs-00001-of-00005.gguf --alias full \
>     --host 0.0.0.0 \
>     --port 55055 \
>     --ctx-size 327680 \
>     --cache-type-k q5_0 \
>     --slot-save-path "saved_slots.kvc" \
>     --threads 94 \
>     --threads-http 12 \
>     --parallel 6 \
>     --mirostat 2 \
>     --mirostat-ent 5.7 \
>     --mirostat-lr 0.14
build: 4621 (6eecde3c) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 94, n_threads_batch = 94, total_threads = 96

system_info: n_threads = 94 (n_threads_batch = 94) / 96 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 55055, http threads: 12
main: loading model
srv    load_model: loading model 'models/deepseek-v3-q2_k_xs-00001-of-00005.gguf'
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from models/deepseek-v3-q2_k_xs-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 BF16
llama_model_loader: - kv   3:                         general.size_label str              = 256x20B
llama_model_loader: - kv   4:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   5:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   6:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   7:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv   8:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   9:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  10:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  13:                          general.file_type u32              = 10
llama_model_loader: - kv  14:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  15:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  16:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  17:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  18:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  19:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  20:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  21:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  22:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  23:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  25:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  26:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  29: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  30: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  31:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  32:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  40:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                                   split.no u16              = 0
llama_model_loader: - kv  44:                                split.count u16              = 5
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1025
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q2_K:  662 tensors
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 206.05 GiB (2.64 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = DeepSeek V3 BF16
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 1 '<|end▁of▁sentence|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors:   CPU_Mapped model buffer size = 42690.54 MiB
load_tensors:   CPU_Mapped model buffer size = 42108.14 MiB
load_tensors:   CPU_Mapped model buffer size = 42049.55 MiB
load_tensors:   CPU_Mapped model buffer size = 42101.11 MiB
load_tensors:   CPU_Mapped model buffer size = 42049.55 MiB
llama_init_from_model: n_seq_max     = 6
llama_init_from_model: n_ctx         = 327680
llama_init_from_model: n_ctx_per_seq = 54613
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (54613) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 327680, offload = 1, type_k = 'q5_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 275232.00 MiB
llama_init_from_model: KV self size  = 275232.00 MiB, K (q5_0): 150304.00 MiB, V (f16): 124928.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.96 MiB
/home/main/llama.cpp/ggml/src/ggml.c:1578: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
ggerganov (Owner) commented

Could be an integer overflow somewhere. Could you try to trace it down?
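
One place a 32-bit overflow could plausibly occur, going only by the numbers in the log (a rough illustration of the magnitudes involved, not a confirmed diagnosis, and the real cache layout may differ): at kv_size = 327680 the per-layer K element count already exceeds INT32_MAX, so any intermediate stored in a 32-bit integer would wrap.

#include <cstdint>
#include <cstdio>

int main() {
    // Assumption: the K cache holds n_embd_k_gqa elements per position per layer.
    const int64_t n_embd_k_gqa = 24576;  // from print_info in the log
    const int64_t kv_size      = 327680; // --ctx-size 327680
    const int64_t n_elems      = n_embd_k_gqa * kv_size; // 8,053,063,680
    std::printf("K elements per layer: %lld\n", (long long) n_elems);
    std::printf("fits in int32? %s\n", n_elems <= INT32_MAX ? "yes" : "no"); // prints "no"
}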

alasdairforsythe (Author) commented

It's line 1902 in server.cpp:

const int32_t n_ctx_slot = n_ctx / params_base.n_parallel;

I will add a new parameter to optionally override this value and push it.
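
For illustration, the kind of override meant here might look roughly like the following; the helper, the n_ctx_per_seq_override value, and the --ctx-size-per-seq flag name are hypothetical, and the actual patch may differ:

#include <cstdint>

// Hypothetical sketch only -- not the actual patch.
// Use an explicit per-sequence context when one is given, otherwise fall back
// to the current behavior of dividing the total context across the slots.
static int32_t pick_n_ctx_slot(int32_t n_ctx, int32_t n_parallel, int32_t n_ctx_per_seq_override) {
    if (n_ctx_per_seq_override > 0) {
        return n_ctx_per_seq_override; // e.g. set by a new --ctx-size-per-seq argument
    }
    return n_ctx / n_parallel; // current behavior (the line quoted above)
}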
