Eval bug: llama.cpp CPU bound while inferencing against DeepSeek-R1 GGUF #11635
Comments
What do top and nvidia-smi tell you about VRAM usage? If it's in use, the GPU is almost certainly being used for whatever part of the model is loaded into VRAM.
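For reference, a minimal way to watch this while the model is running is nvidia-smi's query mode (standard flags; the exact fields available can vary slightly by driver version):

# poll GPU utilization and memory use once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1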
You were right with the sanity test. I'm seeing a decrease in performance, but it's not appreciably faster with -ngl 8 or 16. Here's nvtop during inference:
29457 0 Compute 0% 44378MiB 96% 1119% 52072MiB build/bin/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-U
And here's nvidia-smi:
With -ngl 0 in 90 seconds, this is how far I get:
With -ngl 8 in 90 seconds:
With -ngl 16 in 90 seconds:
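As an aside, a more systematic way to compare offload settings than timing 90-second runs by hand could be llama-bench, which is built alongside llama-cli. This is only a sketch; the comma-separated sweep values are placeholders to tune for your setup:

# sweep GPU offload levels and thread counts in one run
./build/bin/llama-bench \
  -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  -ngl 0,8,16 \
  -t 16,32,64 \
  -p 512 -n 128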
Have you tried reducing --threads from 64 to, say, 16? I have a 64-core AMD CPU and found optimal performance around that point. If you have enough system memory and are OK with a longer startup time, you can add --no-mmap, which loads the model into system RAM rather than mapping it from disk.
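Putting those suggestions together with the original invocation, one possible adjusted command might look like the following; the thread count and offload level are starting points to experiment with, not known-good values for this hardware:

# reduce thread count and load the model fully into RAM (values are guesses to tune)
build/bin/llama-cli \
  --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 16 \
  --no-mmap \
  --n-gpu-layers 16 \
  --ctx-size 8192 \
  --temp 0.6 \
  -no-cnv \
  --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"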
Name and Version
$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
version: 4625 (5598f47)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel(R) Xeon(R) w5-3425 + NVIDIA L40S
Models
unsloth/DeepSeek-R1-GGUF
Problem description & steps to reproduce
When attempting to run inference with llama-cli, it becomes CPU bound and is painfully slow (less than one token per second). nvtop shows the GPU at 0% utilization (all CPU being used) despite 14 layers and 44 GB offloaded to VRAM. I'm following the instructions outlined in Unsloth's blog and running the following command:
!build/bin/llama-cli \
  --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 64 \
  --prio 2 \
  --temp 0.6 \
  --ctx-size 8192 \
  --seed 3407 \
  --n-gpu-layers 16 \
  -no-cnv \
  --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
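As a quick sanity check that the offload is actually taking effect, the startup log normally reports how many layers were placed on the GPU. A sketch of capturing that (the exact log wording can differ between llama.cpp versions, and the short test prompt is only illustrative):

# run a one-token generation and filter the load log for the offload summary
build/bin/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --n-gpu-layers 16 -p "test" -n 1 2>&1 | grep -i offloaded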
First Bad Commit
No response
Relevant log output