[Bug]: Unable to use fp8 kv cache with chunked prefill on ampere #7714
Comments
Also seeing the same thing on A100 GPUs with the same model.
Can you share the full output and command? I'm able to successfully run
Did you test calling the model? When I did, the model loads but fails as soon as you try to call it.
Yeah, same behavior here. I can load Llama 3.1-type models and vLLM reports that uvicorn is running, but the first completion request yields: AssertionError: fp8e4nv data type is not supported on CUDA arch < 89. I'm also not trying to use the Triton backend, and I would be surprised if vLLM used it automatically for this model.
@joe-schwartz-certara - thanks for reporting the issue. I don't believe the issue is due to the NM models; rather, I believe the root cause is:
We should be detecting this incompatibility and rejecting it in setup. Will work on making this fix. For now, can you try running with
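(For reference, a minimal sketch of the kind of setup-time check described above, assuming the sm_89 threshold from the assertion message quoted earlier; this is not vLLM's actual implementation, and the function name is hypothetical.)

```python
import torch

def check_fp8_kv_cache_support(kv_cache_dtype: str, enable_chunked_prefill: bool) -> None:
    """Reject the fp8 kv cache + chunked prefill combination on pre-Ada GPUs."""
    major, minor = torch.cuda.get_device_capability()
    compute_capability = major * 10 + minor
    if kv_cache_dtype.startswith("fp8") and enable_chunked_prefill and compute_capability < 89:
        raise ValueError(
            "fp8 kv cache with chunked prefill requires CUDA arch >= 89 (sm_89); "
            f"detected sm_{compute_capability}. Disable chunked prefill or use a "
            "non-fp8 kv cache dtype."
        )
```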
Confirmed I am able to reproduce the issue using just the unquantized reference Meta Llama 3.1 8B with an fp8 kv cache (
Yup, I absolutely agree with all of these thoughts. I can also confirm that lowering the max length "fixes" it, but we have many users who want to experiment with the huge context window. Thanks for the debugging efforts, vLLM team!!
Hold on, there's another flag you can set to enable both.
I think if you do just explicitly set
@joe-schwartz-certara can you let me know if this works for you? If not, I will dig a bit deeper into the code to remind myself.
With --enable_chunked-prefill=False I'm now getting an error that the kv cache doesn't have enough space to fit on my 80 GB A100. I can try with 2 A100s just to make sure it fixes it, but usually I can just barely fit Llama 3.1 70B AWQ on a single A100. Is there a reason why disabling chunked prefill would make the kv cache size go up? I really am right at the limit of VRAM usage on a single A100: I calculated that the kv cache would be around 30-40 GB without fp8 quantization, and the weights for the model are around 37 GB. I can't fit the default kv cache dtype on a single A100 without going right to 0.99 GPU memory utilization. Here's the whole command that reports there isn't enough space for the kv cache on a single 80 GB A100:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --tensor-parallel-size 1 --download-dir /data --max-model-len 128000 --quantization marlin --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8 --enable_chunked-prefill=False
Oh, perhaps another possibility is that disabling chunked prefill allocates GPU VRAM to some other kind of cache and thus chokes the VRAM available for the kv cache.
On 2 A100s it works, so the issue is definitely somewhere in the new parameter, but I'm still wondering why it doesn't let me fit it in 80 GB of VRAM.
Chunked prefill reduces the maximum size of the activations (since we only ever run the forward pass with the chunk size). You can reduce the max prefill length when chunked prefill is disabled.
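(For illustration, a sketch of that workaround using the LLM API that appears later in this thread; the model name is taken from the command above, while the reduced context length and memory fraction are placeholders, not tested or recommended values.)

```python
from vllm import LLM

# With chunked prefill disabled (the Ampere workaround discussed above), prefill
# activations are sized by the full prompt rather than a fixed chunk, so the context
# length is capped here to leave room for the kv cache. 32768 is a placeholder.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,
    max_model_len=32768,
    gpu_memory_utilization=0.95,
)
```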
reminder re: #8512 |
Same issue, but a lower max-model-len does not fix it, nor does disabling
CUDA version: 12.7
Ampere-generation GPUs (RTX 30 series, A100, etc.) hit this error; 40-series GPUs will be OK.
@dkkb @codexq123 With A100, same issue, but it isn't fixed by
For example, the following code will raise this issue:

```python
llm = LLM(
    model=model_name_or_path,
    dtype='bfloat16',
    quantization='fp8',
    kv_cache_dtype='fp8',
    enable_chunked_prefill=False,
    max_model_len=65536,
    enforce_eager=True,
    enable_prefix_caching=True,
    trust_remote_code=True,
)
```

We can "solve" this issue by either decreasing the value of
vllm: 0.6.4.post1
@scruel Sorry for going off topic, but how do you use fp8 on an Ampere A100? Does it actually work and give a speedup, or do you use it just for the smaller model size?
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I am running the nightly pip wheel, pulled at 9:00 PM EST on 8/20/24, under the 0.5.4 docker image. I also tested the same commands under regular 0.5.4.
This is the model I am using
https://huggingface.co/neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16
It also fails with
https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
However, all of these commands work.
It appears to be something to do with neuralmagic's quants and the fp8 cache: the fp8 cache seems to be set to a dtype that is not supported on Ampere. However, forcibly setting the dtype to fp8_e5m2 or fp8_e4m3 does not work either, likely due to the precomputed scales in the quants. Let me know if I should post this to neuralmagic, but I figure they're heavily involved here anyway and there might be a vLLM solution as well (I noticed you managed to get the neuralmagic quants running on Ampere in 0.5.3, thank you!).
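(For context, a minimal sketch of what "forcibly setting the dtype" could look like with the LLM API used elsewhere in this thread; the kv-cache dtype strings follow vLLM's documented choices, and the model name and context length are placeholders taken from the links above, not a verified reproduction.)

```python
from vllm import LLM

# Pin a specific fp8 kv-cache variant instead of the generic "fp8" auto-selection.
# Per the report above, this still fails for these checkpoints, likely because of
# the precomputed scales shipped with the quantized weights.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    kv_cache_dtype="fp8_e5m2",  # "fp8_e4m3" was also tried, per the report
    max_model_len=8192,         # placeholder context length for the sketch
)
```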