
[Bug]: Unable to use fp8 kv cache with chunked prefill on ampere #7714

Open
w013nad opened this issue Aug 21, 2024 · 20 comments
Labels: bug (Something isn't working)

@w013nad

w013nad commented Aug 21, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7742 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2250.000
CPU max MHz:                     2250.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4491.45
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity     NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   48-63,176-191    3
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   48-63,176-191    3
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   16-31,144-159    1
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   16-31,144-159    1
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   112-127,240-255  7
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   112-127,240-255  7
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   80-95,208-223    5
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   80-95,208-223    5
NIC0    PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC1    PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC2    SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS   SYS   SYS
NIC3    SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS   SYS   SYS
NIC4    SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS
NIC5    SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS
NIC6    SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS
NIC7    SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS
NIC8    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX
NIC9    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

🐛 Describe the bug

I am running the nightly pip wheel, pulled at 9:00 PM EST on 8/20/24, under the 0.5.4 Docker image. I also tested the same commands with a plain 0.5.4 install.

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 40000 --kv-cache-dtype fp8

This is the model I am using
https://huggingface.co/neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16

It also fails with
https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: at 110:17:
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  cur_kv_head * stride_k_cache_h +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  (offs_d[:, None] // x) * stride_k_cache_d +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  ((start_n + offs_n[None, :]) % block_size) *
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  stride_k_cache_bl +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  (offs_d[:, None] % x) * stride_k_cache_x)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]         # [N,D]
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]         off_v = (
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]             bn[:, None] * stride_v_cache_bs +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]             cur_kv_head * stride_v_cache_h +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]             offs_d[None, :] * stride_v_cache_d +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]         k_load = tl.load(K_cache + off_k,
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]                  ^, Traceback (most recent call last):
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 35, in wrapper
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1597, in load
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]     return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1037, in load
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]     return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1005, in _load_legacy
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]     other = cast(other, elt_ty, builder)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 759, in cast
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226]     assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] AssertionError: fp8e4nv data type is not supported on CUDA arch < 89

However, all of these commands work.

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 40000
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct-GPTQ-4-Bit -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8 --quantization fp8
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8

It appears to be related to Neural Magic's quants combined with the fp8 KV cache: the fp8 cache seems to be set to a dtype that is not supported on Ampere. However, forcibly setting the dtype to fp8_e5m2 or fp8_e4m3 does not work either, likely due to the precomputed scales in the quants. Let me know if I should post this to Neural Magic instead, but I figure they're heavily involved here anyway and there might be a vLLM-side solution as well (I noticed you managed to get the Neural Magic quants running on Ampere in 0.5.3, thank you!).
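
For reference, this is roughly the kind of variant I tried when forcing the dtype (same flags as above; shown only as an illustration, the exact invocation may have differed slightly):

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 40000 --kv-cache-dtype fp8_e5m2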

@w013nad w013nad added the bug Something isn't working label Aug 21, 2024
@w013nad w013nad changed the title from "[Bug]: Unable to use fp8 kv cache with GPTQ" to "[Bug]: Unable to use fp8 kv cache with neuralmagic quants on ampere" Aug 21, 2024
@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Aug 21, 2024

@mgoin @dsikka

@joe-schwartz-certara

Also seeing the same thing on A100 GPUs with the same model.

@mgoin
Member

mgoin commented Sep 6, 2024

Can you share the full output and command? I'm able to successfully run vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16 --kv-cache-dtype fp8. From the section of output you shared, it seems to be a Triton error (and Triton doesn't support fp8 types on A100), but I don't understand why Triton would be used here, especially for a non-MoE model. I tested both TP=1 and TP=8.

vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16 --kv-cache-dtype fp8
INFO 09-06 01:11:53 api_server.py:459] vLLM API server version 0.6.0
INFO 09-06 01:11:53 api_server.py:460] args: Namespace(model_tag='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f45c5b33b50>)
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.33k/1.33k [00:00<00:00, 14.9MB/s]
INFO 09-06 01:11:53 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 09-06 01:11:53 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/643e8439-bf08-4d82-8529-5243ce03fe37 for RPC Path.
INFO 09-06 01:11:53 api_server.py:176] Started engine process with PID 2504832
INFO 09-06 01:11:56 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 09-06 01:11:56 config.py:650] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 09-06 01:11:56 llm_engine.py:212] Initializing an LLM engine (v0.6.0) with config: model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51.0k/51.0k [00:00<00:00, 1.42MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 4.83MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 121/121 [00:00<00:00, 1.67MB/s]
INFO 09-06 01:11:58 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 09-06 01:11:58 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 09-06 01:11:58 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-06 01:11:59 model_runner.py:915] Starting to load model neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16...
INFO 09-06 01:11:59 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 09-06 01:11:59 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 09-06 01:11:59 selector.py:116] Using XFormers backend.
INFO 09-06 01:11:59 weight_utils.py:236] Using model weights format ['*.safetensors']
model-00001-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [02:43<00:00, 30.4MB/s]
model-00002-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.10G/4.10G [02:49<00:00, 24.2MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 78.5k/78.5k [00:00<00:00, 285MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.13it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.13it/s]

INFO 09-06 01:14:52 model_runner.py:926] Loading model weights took 8.4646 GB
INFO 09-06 01:14:53 gpu_executor.py:122] # GPU blocks: 62089, # CPU blocks: 4096
INFO 09-06 01:14:54 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-06 01:14:54 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-06 01:15:03 model_runner.py:1335] Graph capturing finished in 8 secs.
envs.VLLM_LOGGING_INTERVAL_SEC 5.0
INFO 09-06 01:15:03 api_server.py:224] vLLM to use /tmp/tmp1igkaduf as PROMETHEUS_MULTIPROC_DIR
WARNING 09-06 01:15:03 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-06 01:15:03 launcher.py:20] Available routes are:
INFO 09-06 01:15:03 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-06 01:15:03 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-06 01:15:03 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-06 01:15:03 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-06 01:15:03 launcher.py:28] Route: /health, Methods: GET
INFO 09-06 01:15:03 launcher.py:28] Route: /tokenize, Methods: POST
INFO 09-06 01:15:03 launcher.py:28] Route: /detokenize, Methods: POST
INFO 09-06 01:15:03 launcher.py:28] Route: /v1/models, Methods: GET
INFO 09-06 01:15:03 launcher.py:28] Route: /version, Methods: GET
INFO 09-06 01:15:03 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 09-06 01:15:03 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 09-06 01:15:03 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 09-06 01:15:03 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [2504719]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@w013nad
Author

w013nad commented Sep 6, 2024

Did you test calling the model? When I did, the model loaded but failed whenever I tried to call it.

@joe-schwartz-certara

Yeah, same behavior here. I can load Llama 3.1-type models and vLLM reports that Uvicorn is running, but the first completion request yields:

AssertionError: fp8e4nv data type is not supported on CUDA arch < 89

I'm also not explicitly selecting the Triton backend, and I would be surprised if vLLM used it automatically for this model.
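
As a quick way to check whether a given GPU falls on the failing side of that assertion (a minimal sketch, not part of vLLM; assumes a working PyTorch + CUDA install):

import torch

# Triton's fp8e4nv (FP8 E4M3) path requires compute capability >= 8.9 (Ada/Hopper).
# A100s report (8, 0), which is why the assertion fires on Ampere.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Triton fp8e4nv supported:", (major, minor) >= (8, 9))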

@robertgshaw2-redhat
Collaborator

@joe-schwartz-certara - thanks for reporting the issue. I don't believe this is due to the NM models; rather, I believe the root cause is:

  • Llama 3.1 has a long max sequence length
  • We enable chunked_prefill by default when max_model_len is >32k
  • With fp8 kv cache + chunked_prefill, we cannot use the flash_attention backend, so we fall back to our Triton implementation
  • This fails on Ampere for the reason listed above

We should detect this incompatibility and reject it during setup. I will work on a fix.

For now, can you try running with --max-model-len 4096 and let me know whether that works, to confirm my hypothesis?
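
Concretely, either of these variants of the original command should stay off the Triton prefix-prefill path (just a sketch based on the flags above; adjust the model path and other flags to your setup):

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --kv-cache-dtype fp8 --max-model-len 4096
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --kv-cache-dtype fp8 --enable-chunked-prefill=False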

@mgoin
Member

mgoin commented Sep 6, 2024

Confirmed: I am able to reproduce the issue using just the unquantized reference Meta Llama 3.1 8B with fp8 kv cache (vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --kv-cache-dtype fp8) after sending a single request. So this is likely just an issue with chunked prefill + fp8 kv cache on Ampere. Adding --max-model-len 4096 "fixes" the issue.
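
For reference, the working variant is just the same command with the length cap appended:

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --kv-cache-dtype fp8 --max-model-len 4096

The failing command and its full log follow: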

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --kv-cache-dtype fp8
INFO 09-06 14:48:07 api_server.py:459] vLLM API server version 0.6.0
INFO 09-06 14:48:07 api_server.py:460] args: Namespace(model_tag='meta-llama/Meta-Llama-3.1-8B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7effd47a7b50>)
INFO 09-06 14:48:07 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b1f12031-84d2-47eb-a0e1-1ac2e095788c for RPC Path.
INFO 09-06 14:48:07 api_server.py:176] Started engine process with PID 3048715
INFO 09-06 14:48:10 config.py:650] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 09-06 14:48:10 arg_utils.py:862] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 09-06 14:48:10 config.py:1001] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 09-06 14:48:10 llm_engine.py:212] Initializing an LLM engine (v0.6.0) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
INFO 09-06 14:48:10 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 09-06 14:48:10 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 09-06 14:48:10 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-06 14:48:11 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct...
INFO 09-06 14:48:11 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 09-06 14:48:11 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 09-06 14:48:11 selector.py:116] Using XFormers backend.
INFO 09-06 14:48:12 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.34it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.28s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.56s/it]

INFO 09-06 14:48:18 model_runner.py:926] Loading model weights took 14.9888 GB
INFO 09-06 14:48:19 gpu_executor.py:122] # GPU blocks: 56089, # CPU blocks: 4096
INFO 09-06 14:48:20 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-06 14:48:20 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-06 14:48:28 model_runner.py:1335] Graph capturing finished in 8 secs.
envs.VLLM_LOGGING_INTERVAL_SEC 5.0
INFO 09-06 14:48:29 api_server.py:224] vLLM to use /tmp/tmpfkai7e34 as PROMETHEUS_MULTIPROC_DIR
WARNING 09-06 14:48:29 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-06 14:48:29 launcher.py:20] Available routes are:
INFO 09-06 14:48:29 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-06 14:48:29 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-06 14:48:29 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-06 14:48:29 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-06 14:48:29 launcher.py:28] Route: /health, Methods: GET
INFO 09-06 14:48:29 launcher.py:28] Route: /tokenize, Methods: POST
INFO 09-06 14:48:29 launcher.py:28] Route: /detokenize, Methods: POST
INFO 09-06 14:48:29 launcher.py:28] Route: /v1/models, Methods: GET
INFO 09-06 14:48:29 launcher.py:28] Route: /version, Methods: GET
INFO 09-06 14:48:29 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 09-06 14:48:29 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 09-06 14:48:29 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 09-06 14:48:29 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [3048618]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 09-06 14:48:39 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-06 14:48:49 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-06 14:48:59 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-06 14:49:09 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:46242 - "GET /v1/models HTTP/1.1" 200 OK
INFO 09-06 14:49:16 logger.py:36] Received request chat-feb1ef32e2454b4c90c61b94ae0e1e3a: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of the USA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131029, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 279, 6864, 315, 279, 7427, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 09-06 14:49:16 async_llm_engine.py:206] Added request chat-feb1ef32e2454b4c90c61b94ae0e1e3a.
ERROR 09-06 14:49:17 async_llm_engine.py:63] Engine background task failed
ERROR 09-06 14:49:17 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 35, in wrapper
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return fn(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 1597, in load
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1037, in load
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1005, in _load_legacy
ERROR 09-06 14:49:17 async_llm_engine.py:63]     other = cast(other, elt_ty, builder)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 759, in cast
ERROR 09-06 14:49:17 async_llm_engine.py:63]     assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
ERROR 09-06 14:49:17 async_llm_engine.py:63] AssertionError: fp8e4nv data type is not supported on CUDA arch < 89
ERROR 09-06 14:49:17 async_llm_engine.py:63] 
ERROR 09-06 14:49:17 async_llm_engine.py:63] The above exception was the direct cause of the following exception:
ERROR 09-06 14:49:17 async_llm_engine.py:63] 
ERROR 09-06 14:49:17 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-06 14:49:17 async_llm_engine.py:63]     result = task.result()
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-06 14:49:17 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-06 14:49:17 async_llm_engine.py:63]     output = await self.model_executor.execute_model_async(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 179, in execute_model_async
ERROR 09-06 14:49:17 async_llm_engine.py:63]     output = await make_async(self.driver_worker.execute_model
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-06 14:49:17 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-06 14:49:17 async_llm_engine.py:63]     output = self.model_runner.execute_model(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 1450, in execute_model
ERROR 09-06 14:49:17 async_llm_engine.py:63]     hidden_or_intermediate_states = model_executable(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 429, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     hidden_states, residual = layer(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     hidden_states = self.self_attn(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 181, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/attention/layer.py", line 98, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return self.impl.forward(query,
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/attention/backends/xformers.py", line 608, in forward
ERROR 09-06 14:49:17 async_llm_engine.py:63]     out = PagedAttention.forward_prefix(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/attention/ops/paged_attn.py", line 211, in forward_prefix
ERROR 09-06 14:49:17 async_llm_engine.py:63]     context_attention_fwd(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/code/vllm/vllm/attention/ops/prefix_prefill.py", line 812, in context_attention_fwd
ERROR 09-06 14:49:17 async_llm_engine.py:63]     _fwd_kernel[grid](
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 662, in run
ERROR 09-06 14:49:17 async_llm_engine.py:63]     kernel = self.compile(
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 276, in compile
ERROR 09-06 14:49:17 async_llm_engine.py:63]     module = src.make_ir(options, codegen_fns, context)
ERROR 09-06 14:49:17 async_llm_engine.py:63]   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir
ERROR 09-06 14:49:17 async_llm_engine.py:63]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
ERROR 09-06 14:49:17 async_llm_engine.py:63] triton.compiler.errors.CompilationError: at 110:17:
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  cur_kv_head * stride_k_cache_h +
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  (offs_d[:, None] // x) * stride_k_cache_d +
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  ((start_n + offs_n[None, :]) % block_size) *
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  stride_k_cache_bl +
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  (offs_d[:, None] % x) * stride_k_cache_x)
ERROR 09-06 14:49:17 async_llm_engine.py:63]         # [N,D]
ERROR 09-06 14:49:17 async_llm_engine.py:63]         off_v = (
ERROR 09-06 14:49:17 async_llm_engine.py:63]             bn[:, None] * stride_v_cache_bs +
ERROR 09-06 14:49:17 async_llm_engine.py:63]             cur_kv_head * stride_v_cache_h +
ERROR 09-06 14:49:17 async_llm_engine.py:63]             offs_d[None, :] * stride_v_cache_d +
ERROR 09-06 14:49:17 async_llm_engine.py:63]             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
ERROR 09-06 14:49:17 async_llm_engine.py:63]         k_load = tl.load(K_cache + off_k,
ERROR 09-06 14:49:17 async_llm_engine.py:63]                  ^
Exception in callback functools.partial(<function _log_task_completion at 0x7f622cd92560>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f6229029bd0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f622cd92560>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f6229029bd0>>)>
Traceback (most recent call last):
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 1597, in load
    return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1037, in load
    return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1005, in _load_legacy
    other = cast(other, elt_ty, builder)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 759, in cast
    assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
AssertionError: fp8e4nv data type is not supported on CUDA arch < 89

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 345, in step_async
    output = await self.model_executor.execute_model_async(
  File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 179, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 1450, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 429, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 329, in forward
    hidden_states, residual = layer(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 251, in forward
    hidden_states = self.self_attn(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 181, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
  File "/home/mgoin/code/vllm/vllm/attention/backends/xformers.py", line 608, in forward
    out = PagedAttention.forward_prefix(
  File "/home/mgoin/code/vllm/vllm/attention/ops/paged_attn.py", line 211, in forward_prefix
    context_attention_fwd(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/attention/ops/prefix_prefill.py", line 812, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 662, in run
    kernel = self.compile(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 276, in compile
    module = src.make_ir(options, codegen_fns, context)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
triton.compiler.errors.CompilationError: at 110:17:
                 cur_kv_head * stride_k_cache_h +
                 (offs_d[:, None] // x) * stride_k_cache_d +
                 ((start_n + offs_n[None, :]) % block_size) *
                 stride_k_cache_bl +
                 (offs_d[:, None] % x) * stride_k_cache_x)
        # [N,D]
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k_load = tl.load(K_cache + off_k,
                 ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-06 14:49:17 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-06 14:49:17 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-06 14:49:17 client.py:412] Traceback (most recent call last):
ERROR 09-06 14:49:17 client.py:412]   File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-06 14:49:17 client.py:412]     await self.check_health(socket=socket)
ERROR 09-06 14:49:17 client.py:412]   File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-06 14:49:17 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-06 14:49:17 client.py:412]   File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-06 14:49:17 client.py:412]     raise response
ERROR 09-06 14:49:17 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
INFO:     127.0.0.1:46242 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/api_server.py", line 287, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/serving_chat.py", line 227, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/serving_chat.py", line 564, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/mgoin/code/vllm/vllm/utils.py", line 432, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/client.py", line 416, in generate
    raise request_output
triton.compiler.errors.CompilationError: at 110:17:
                 cur_kv_head * stride_k_cache_h +
                 (offs_d[:, None] // x) * stride_k_cache_d +
                 ((start_n + offs_n[None, :]) % block_size) *
                 stride_k_cache_bl +
                 (offs_d[:, None] % x) * stride_k_cache_x)
        # [N,D]
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k_load = tl.load(K_cache + off_k,
                 ^
INFO 09-06 14:49:18 logger.py:36] Received request chat-4e3e01b46e4646d28915a7262833e99e: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of the USA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131029, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 279, 6864, 315, 279, 7427, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
CRITICAL 09-06 14:49:18 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO:     127.0.0.1:46246 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [3048618]
INFO 09-06 14:49:18 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=CompilationError('def _fwd_kernel(\n    Q,\n    K,\n    V,\n    K_cache,\n    V_cache,\n    B_Loc,\n    sm_scale,\n    k_scale,\n    v_scale,\n    B_Start_Loc,\n    B_Seqlen,\n    B_Ctxlen,\n    block_size,\n    x,\n    Out,\n    stride_b_loc_b,\n    stride_b_loc_s,\n    stride_qbs,\n    stride_qh,\n    stride_qd,\n    stride_kbs,\n    stride_kh,\n    stride_kd,\n    stride_vbs,\n    stride_vh,\n    stride_vd,\n    stride_obs,\n    stride_oh,\n    stride_od,\n    stride_k_cache_bs,\n    stride_k_cache_h,\n    stride_k_cache_d,\n    stride_k_cache_bl,\n    stride_k_cache_x,\n    stride_v_cache_bs,\n    stride_v_cache_h,\n    stride_v_cache_d,\n    stride_v_cache_bl,\n    num_queries_per_kv: int,\n    BLOCK_M: tl.constexpr,\n    BLOCK_DMODEL: tl.constexpr,  # head size\n    BLOCK_DMODEL_PADDED: tl.constexpr,  # head size padded to a power of 2\n    BLOCK_N: tl.constexpr,\n    SLIDING_WINDOW: tl.constexpr,\n):\n    cur_batch = tl.program_id(0)\n    cur_head = tl.program_id(1)\n    start_m = tl.program_id(2)\n\n    cur_kv_head = cur_head // num_queries_per_kv\n\n    cur_batch_ctx_len = tl.load(B_Ctxlen + cur_batch)\n    cur_batch_seq_len = tl.load(B_Seqlen + cur_batch)\n    cur_batch_in_all_start_index = tl.load(B_Start_Loc + cur_batch)\n    cur_batch_query_len = cur_batch_seq_len - cur_batch_ctx_len\n\n    # start position inside of the query\n    # generally, N goes over kv, while M goes over query_len\n    block_start_loc = BLOCK_M * start_m\n\n    # initialize offsets\n    # [N]; starts at 0\n    offs_n = tl.arange(0, BLOCK_N)\n    # [D]; starts at 0\n    offs_d = tl.arange(0, BLOCK_DMODEL_PADDED)\n    # [M]; starts at current position in query\n    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)\n    # [M,D]\n    off_q = (\n        (cur_batch_in_all_start_index + offs_m[:, None]) * stride_qbs +\n        cur_head * stride_qh + offs_d[None, :] * stride_qd)\n\n    dim_mask = tl.where(\n        tl.arange(0, BLOCK_DMODEL_PADDED) < BLOCK_DMODEL, 1,\n        0).to(tl.int1)  # [D]\n\n    q = tl.load(Q + off_q,\n                mask=dim_mask[None, :] &\n                (offs_m[:, None] < cur_batch_query_len),\n                other=0.0)  # [M,D]\n\n    # initialize pointer to m and l\n    m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")  # [M]\n    l_i = tl.zeros([BLOCK_M], dtype=tl.float32)  # [M]\n    acc = tl.zeros([BLOCK_M, BLOCK_DMODEL_PADDED],\n                   dtype=tl.float32)  # [M,D]\n\n    # compute query against context (no causal mask here)\n    for start_n in range(0, cur_batch_ctx_len, BLOCK_N):\n        start_n = tl.multiple_of(start_n, BLOCK_N)\n        # -- compute qk ----\n        bn = tl.load(B_Loc + cur_batch * stride_b_loc_b +\n                     ((start_n + offs_n) // block_size) * stride_b_loc_s,\n                     mask=(start_n + offs_n) < cur_batch_ctx_len,\n                     other=0)  # [N]\n        # [D,N]\n        off_k = (bn[None, :] * stride_k_cache_bs +\n                 cur_kv_head * stride_k_cache_h +\n                 (offs_d[:, None] // x) * stride_k_cache_d +\n                 ((start_n + offs_n[None, :]) % block_size) *\n                 stride_k_cache_bl +\n                 (offs_d[:, None] % x) * stride_k_cache_x)\n        # [N,D]\n        off_v = (\n            bn[:, None] * stride_v_cache_bs +\n            cur_kv_head * stride_v_cache_h +\n            offs_d[None, :] * stride_v_cache_d +\n            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)\n        k_load = tl.load(K_cache + 
off_k,\n                         mask=dim_mask[:, None] &\n                         ((start_n + offs_n[None, :]) < cur_batch_ctx_len),\n                         other=0.0)  # [D,N]\n\n        if k_load.dtype.is_fp8():\n            k = (k_load.to(tl.float32) * k_scale).to(q.dtype)\n        else:\n            k = k_load\n\n        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]\n        qk += tl.dot(q, k)\n        qk = tl.where((start_n + offs_n[None, :]) < cur_batch_ctx_len, qk,\n                      float("-inf"))\n        qk *= sm_scale\n        if SLIDING_WINDOW > 0:\n            # (cur_batch_ctx_len + offs_m[:, None]) are the positions of\n            # Q entries in sequence\n            # (start_n + offs_n[None, :]) are the positions of\n            # KV entries in sequence\n            # So the condition makes sure each entry in Q only attends\n            # to KV entries not more than SLIDING_WINDOW away.\n            #\n            # We can\'t use -inf here, because the\n            # sliding window may lead to the entire row being masked.\n            # This then makes m_ij contain -inf, which causes NaNs in\n            # exp().\n            qk = tl.where((cur_batch_ctx_len + offs_m[:, None]) -\n                          (start_n + offs_n[None, :]) < SLIDING_WINDOW, qk,\n                          -10000)\n\n        # -- compute m_ij, p, l_ij\n        m_ij = tl.max(qk, 1)  # [M]\n        p = tl.exp(qk - m_ij[:, None])  # [M,N]\n        l_ij = tl.sum(p, 1)  # [M]\n        # -- update m_i and l_i\n        m_i_new = tl.maximum(m_i, m_ij)  # [M]\n        alpha = tl.exp(m_i - m_i_new)  # [M]\n        beta = tl.exp(m_ij - m_i_new)  # [M]\n        l_i_new = alpha * l_i + beta * l_ij  # [M]\n\n        # -- update output accumulator --\n        # scale p\n        p_scale = beta / l_i_new\n        p = p * p_scale[:, None]\n        # scale acc\n        acc_scale = l_i / l_i_new * alpha\n        acc = acc * acc_scale[:, None]\n        # update acc\n        v_load = tl.load(V_cache + off_v,\n                         mask=dim_mask[None, :] &\n                         ((start_n + offs_n[:, None]) < cur_batch_ctx_len),\n                         other=0.0)  # [N,D]\n        if v_load.dtype.is_fp8():\n            v = (v_load.to(tl.float32) * v_scale).to(q.dtype)\n        else:\n            v = v_load\n        p = p.to(v.dtype)\n\n        acc += tl.dot(p, v)\n        # # update m_i and l_i\n        l_i = l_i_new\n        m_i = m_i_new\n\n    off_k = (offs_n[None, :] * stride_kbs + cur_kv_head * stride_kh +\n             offs_d[:, None] * stride_kd)\n    off_v = (offs_n[:, None] * stride_vbs + cur_kv_head * stride_vh +\n             offs_d[None, :] * stride_vd)\n    k_ptrs = K + off_k\n    v_ptrs = V + off_v\n\n    # block_mask is 0 when we\'re already past the current query length\n    block_mask = tl.where(block_start_loc < cur_batch_query_len, 1, 0)\n\n    # compute query against itself (with causal mask)\n    for start_n in range(0, block_mask * (start_m + 1) * BLOCK_M, BLOCK_N):\n        start_n = tl.multiple_of(start_n, BLOCK_N)\n        # -- compute qk ----\n        k = tl.load(k_ptrs +\n                    (cur_batch_in_all_start_index + start_n) * stride_kbs,\n                    mask=dim_mask[:, None] &\n                    ((start_n + offs_n[None, :]) < cur_batch_query_len),\n                    other=0.0)\n\n        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)\n        qk += tl.dot(q, k)\n        qk *= sm_scale\n        # apply causal mask\n        qk = 
tl.where(offs_m[:, None] >= (start_n + offs_n[None, :]), qk,\n                      float("-inf"))\n        if SLIDING_WINDOW > 0:\n            qk = tl.where(\n                offs_m[:, None] -\n                (start_n + offs_n[None, :]) < SLIDING_WINDOW, qk, -10000)\n\n        # -- compute m_ij, p, l_ij\n        m_ij = tl.max(qk, 1)\n        p = tl.exp(qk - m_ij[:, None])\n        l_ij = tl.sum(p, 1)\n        # -- update m_i and l_i\n        m_i_new = tl.maximum(m_i, m_ij)\n        alpha = tl.exp(m_i - m_i_new)\n        beta = tl.exp(m_ij - m_i_new)\n        l_i_new = alpha * l_i + beta * l_ij\n        # -- update output accumulator --\n        # scale p\n        p_scale = beta / l_i_new\n        p = p * p_scale[:, None]\n        # scale acc\n        acc_scale = l_i / l_i_new * alpha\n        acc = acc * acc_scale[:, None]\n        # update acc\n        v = tl.load(v_ptrs +\n                    (cur_batch_in_all_start_index + start_n) * stride_vbs,\n                    mask=dim_mask[None, :] &\n                    ((start_n + offs_n[:, None]) < cur_batch_query_len),\n                    other=0.0)\n        p = p.to(v.dtype)\n\n        acc += tl.dot(p, v)\n        # update m_i and l_i\n        l_i = l_i_new\n        m_i = m_i_new\n    # initialize pointers to output\n    off_o = (\n        (cur_batch_in_all_start_index + offs_m[:, None]) * stride_obs +\n        cur_head * stride_oh + offs_d[None, :] * stride_od)\n    out_ptrs = Out + off_o\n    tl.store(out_ptrs,\n             acc,\n             mask=dim_mask[None, :] &\n             (offs_m[:, None] < cur_batch_query_len))\n    return\n', <ast.Call object at 0x7f613c4203a0>, None)>
Traceback (most recent call last):
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/core.py", line 1597, in load
    return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1037, in load
    return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 1005, in _load_legacy
    other = cast(other, elt_ty, builder)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/language/semantic.py", line 759, in cast
    assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
AssertionError: fp8e4nv data type is not supported on CUDA arch < 89

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mgoin/code/vllm/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 1073, in generate
    async for output in await self.add_request(
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 111, in generator
    raise result
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/home/mgoin/code/vllm/vllm/engine/async_llm_engine.py", line 345, in step_async
    output = await self.model_executor.execute_model_async(
  File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 179, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 1450, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 429, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 329, in forward
    hidden_states, residual = layer(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 251, in forward
    hidden_states = self.self_attn(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama.py", line 181, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
  File "/home/mgoin/code/vllm/vllm/attention/backends/xformers.py", line 608, in forward
    out = PagedAttention.forward_prefix(
  File "/home/mgoin/code/vllm/vllm/attention/ops/paged_attn.py", line 211, in forward_prefix
    context_attention_fwd(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mgoin/code/vllm/vllm/attention/ops/prefix_prefill.py", line 812, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/runtime/jit.py", line 662, in run
    kernel = self.compile(
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 276, in compile
    module = src.make_ir(options, codegen_fns, context)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
triton.compiler.errors.CompilationError: at 110:17:
                 cur_kv_head * stride_k_cache_h +
                 (offs_d[:, None] // x) * stride_k_cache_d +
                 ((start_n + offs_n[None, :]) % block_size) *
                 stride_k_cache_bl +
                 (offs_d[:, None] % x) * stride_k_cache_x)
        # [N,D]
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k_load = tl.load(K_cache + off_k,
                 ^

@mgoin changed the title from "[Bug]: Unable to use fp8 kv cache with neuralmagic quants on ampere" to "[Bug]: Unable to use fp8 kv cache with chunked prefill on ampere" on Sep 6, 2024
@joe-schwartz-certara

Yup, I absolutely agree with all of these thoughts. I can also confirm that lowering the max length "fixes" it, but we have many users who want to experiment with the huge context window. Thanks for the debugging efforts, vLLM team!

@robertgshaw2-redhat
Collaborator

Hold on, there's another flag you can set to enable both.

@robertgshaw2-redhat
Collaborator

I think if you just explicitly set --enable_chunked-prefill=False, it should work.

@robertgshaw2-redhat
Collaborator

@joe-schwartz-certara can you let me know if this works for you? If not, I'll dig a bit deeper into the code to remind myself.

@joe-schwartz-certara

I'm now getting an error that the KV cache doesn't have enough space to fit on my 80 GB A100 with --enable_chunked-prefill=False. I can try with 2 A100s just to make sure it fixes it, but usually I can just barely fit Llama 3.1 70B AWQ on a single A100. Is there a reason why disabling chunked prefill would make the KV cache requirements go up? I really am right at the limit of VRAM usage on a single A100; I calculated that the KV cache would be around 30-40 GB without fp8 quantization (a rough check of that estimate is sketched after the command below), and the weights for the model are around 37 GB. I can't fit the default KV cache dtype on a single A100 without going right to 0.99 GPU memory utilization.

Here's the whole command that says it doesn't have space for the KV cache on a single 80 GB A100 GPU:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --tensor-parallel-size 1 --download-dir /data --max-model-len 128000 --quantization marlin --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8 --enable_chunked-prefill=False
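
As a rough sanity check of the 30-40 GB estimate above, here is a back-of-the-envelope KV-cache calculation (a sketch; the 80-layer / 8-KV-head / 128-dim figures are the published Llama 3.1 70B config, not numbers taken from this run):

# Back-of-the-envelope KV-cache sizing for Llama 3.1 70B at 128k context.
# Assumed model config: 80 layers, 8 KV heads, head_dim 128.
num_layers, num_kv_heads, head_dim = 80, 8, 128
max_model_len = 128_000

def kv_cache_gib(bytes_per_elem: int) -> float:
    # Factor of 2 is for K and V; one element per token, layer, KV head, dim.
    elems = 2 * num_layers * num_kv_heads * head_dim * max_model_len
    return elems * bytes_per_elem / 1024**3

print(f"bf16/fp16 KV cache: {kv_cache_gib(2):5.1f} GiB")  # ~39 GiB
print(f"fp8 KV cache:       {kv_cache_gib(1):5.1f} GiB")  # ~19.5 GiB

With roughly 37 GB of AWQ weights on top of that, a 16-bit cache at the full 128k context leaves essentially no headroom on a single 80 GB card, which is consistent with the tight fit described above.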

@joe-schwartz-certara

Oh, perhaps another possibility is that disabling chunked prefill allocates GPU VRAM to some other kind of cache and thus chokes off the VRAM available for the KV cache.

@joe-schwartz-certara

On 2 A100s it works. So the issue is definitely somewhere around the new parameter, but I'm still wondering why it doesn't let me fit it in 80 GB of VRAM.

@robertgshaw2-redhat
Collaborator

Chunked prefill reduces the maximum size of the activations (since we only ever run a forward pass over at most one chunk of tokens). When chunked prefill is disabled, you can reduce the max prefill length instead.
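
As a concrete illustration of those two knobs, here is a minimal sketch using the offline LLM API (the parameter names are the standard vLLM engine args; the values are illustrative, not recommendations):

from vllm import LLM

MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

# Option A: chunked prefill ON. Prefills are processed in chunks of at most
# max_num_batched_tokens, so peak activation memory is bounded by the chunk
# size instead of by max_model_len. (This is the path that currently breaks
# with an fp8 KV cache on Ampere, which is this bug.)
chunked_args = dict(
    model=MODEL,
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    max_model_len=128_000,
)

# Option B: chunked prefill OFF. The whole prompt is prefilled in one forward
# pass, so activation memory scales with the longest allowed prompt; lowering
# max_model_len is then the lever that frees VRAM for the (fp8) KV cache.
unchunked_args = dict(
    model=MODEL,
    enable_chunked_prefill=False,
    max_model_len=32_768,
    kv_cache_dtype="fp8",
)

llm = LLM(**unchunked_args)  # pick whichever option the hardware allows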

@K-Mistele
Contributor

reminder re: #8512

@dkkb

dkkb commented Nov 16, 2024

Same issue, but lowering max-model-len does not fix it; disabling chunked prefill (enable-chunked-prefill) does fix it.

CUDA version: 12.7
vllm: 0.6.4

@codexq123

Ampere GPUs (RTX 30 series, A100, etc.) hit this error. Ada GPUs (RTX 40 series) are OK.
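
The Triton assertion in the traceback is keyed on compute capability rather than on the GPU's marketing name: fp8e4nv needs CUDA arch >= 89, i.e. Ada or newer, while Ampere is 8.0 (A100) or 8.6 (RTX 30xx). A quick way to check what a given GPU reports (a sketch using PyTorch):

import torch

# Triton's fp8e4nv (e4m3) support is gated on compute capability >= 8.9.
# A100 reports 8.0 and RTX 30xx reports 8.6, so both fail the assertion
# seen in the traceback above; RTX 40xx (8.9) and H100 (9.0) pass.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
print("Triton fp8e4nv supported:", (major, minor) >= (8, 9))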

@scruel

scruel commented Dec 4, 2024

@dkkb @codexq123 Same issue with an A100, but it is not fixed by enable_chunked_prefill=False.

For example, the following code triggers the error:

from vllm import LLM

llm = LLM(
    model=model_name_or_path,
    dtype='bfloat16',
    quantization='fp8',
    kv_cache_dtype='fp8',
    enable_chunked_prefill=False,
    max_model_len=65536,
    enforce_eager=True,
    enable_prefix_caching=True,
    trust_remote_code=True,
)

We can "solve" this issue via either decrease the value of max_model_len, or disable prefix caching will make infer work like a charm.

vllm: 0.6.4.post1
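
For reference, a minimal sketch of those two workarounds, reusing the arguments from the failing snippet (the model path is a placeholder, as in that snippet, and the max_model_len values are illustrative):

from vllm import LLM

model_name_or_path = "path/to/your/model"  # placeholder, as in the snippet above

common = dict(
    model=model_name_or_path,
    dtype="bfloat16",
    quantization="fp8",
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,
    enforce_eager=True,
    trust_remote_code=True,
)

# Workaround 1 (as reported): keep prefix caching but shrink the context window.
llm = LLM(**common, max_model_len=8192, enable_prefix_caching=True)

# Workaround 2 (as reported): keep the long context but turn prefix caching off.
# In this report, prefix caching is what still routes prefills through the
# Triton kernel that trips the fp8e4nv assertion on Ampere.
# llm = LLM(**common, max_model_len=65536, enable_prefix_caching=False)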

@HelenaSak


@scruel Sorry for the off-topic question, but how are you using fp8 on an Ampere A100? Does it simply not work, or does it work but give no acceleration? Or do you use it only to reduce the memory footprint of the model?
