Misc. bug: llama-server --ctx-size is divided by --parallel and cannot be increased? #11681
Name and Version
./llama-server --version
version: 4621 (6eecde3)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server \
  --model models/deepseek-v3-q2_k_xs-00001-of-00005.gguf \
  --alias full \
  --host 0.0.0.0 \
  --port 55055 \
  --ctx-size 0 \
  --cache-type-k q5_0 \
  --slot-save-path "saved_slots.kvc" \
  --threads 94 \
  --threads-http 12 \
  --parallel 6 \
  --mirostat 2 \
  --mirostat-ent 5.7 \
  --mirostat-lr 0.14
Problem description & steps to reproduce
The context size passed via --ctx-size is divided by the number of parallel inferences allowed to run on llama-server. This does not make sense. Attempting to compensate by increasing --ctx-size results in an assertion failure. To avoid breaking existing behavior, I suggest adding a new command-line argument, --ctx-size-per-seq, to override this behavior.
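For illustration only, the suggested flag might be invoked like this. This is purely hypothetical: --ctx-size-per-seq does not exist in llama.cpp today, and the name is just the one proposed above.

# Hypothetical usage of the suggested option: give each of the 6 slots
# its own full context window instead of splitting --ctx-size six ways.
./llama-server \
  --model models/deepseek-v3-q2_k_xs-00001-of-00005.gguf \
  --parallel 6 \
  --ctx-size-per-seq 163840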
If llama-server is run with --ctx-size 0 --parallel 6 or --ctx-size 163840 --parallel 6 (--ctx-size 0 falls back to the model's training context, which is 163840 for this model), it divides the total context by 6, leaving each slot with 163840 / 6 = 27306 tokens. Increasing the context to 327680 so that each parallel slot gets more almost works, but then fails on a GGML_ASSERT.
So it is impossible to use more than 27306 tokens of context per slot, even though I have 1.1 TB of RAM.
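As a minimal sketch of the arithmetic described above, assuming the server splits the total context evenly across slots (which matches the observed behavior); the variable names here are illustrative, not the actual llama.cpp source:

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative reconstruction of the observed behavior, not llama.cpp code.
    const int64_t total_ctx  = 163840; // --ctx-size (0 resolves to the model's training context)
    const int     n_parallel = 6;      // --parallel

    // Each slot appears to receive an even share of the total context.
    const int64_t ctx_per_slot = total_ctx / n_parallel; // 163840 / 6 = 27306
    std::printf("context per slot: %lld tokens\n", (long long) ctx_per_slot);

    // Doubling --ctx-size to 327680 would raise this to 54613 per slot,
    // but as reported above that trips a GGML_ASSERT instead of working.
    return 0;
}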
First Bad Commit
No response
Relevant log output