
[Bug]: vLLM gets stuck with Qwen VL 7B #11899

Open
1 task done
engleccma opened this issue Jan 9, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@engleccma

engleccma commented Jan 9, 2025

Your current environment

I'm using vLLM v0.6.5.
When I launch Qwen VL 7B with 100% of the GPU (24 GB VRAM), it works fine.
However, even though the model is only 4 GB, when I reduce the GPU memory utilization slightly, the vLLM launch gets stuck, endlessly printing: 'INFO: 127.0.0.6:XXX - "GET /metrics HTTP/1.1" 200 OK'
I'm confused because I know I have enough space for the model.

      vllm serve Qwen/Qwen2-VL-7B-Instruct-AWQ --trust-remote-code --enable-chunked-prefill --max_model_len 4096 --quantization awq_marlin --gpu_memory_utilization=0.8 --max-num-batched-tokens 4097 --kv-cache-dtype fp8_e4m3

Model Input Dumps

No response

🐛 Describe the bug

INFO 01-09 06:23:59 api_server.py:651] vLLM API server version 0.6.5
INFO 01-09 06:23:59 api_server.py:652] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2-VL-7B-Instruct-AWQ', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2-VL-7B-Instruct-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8_e4m3', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=4097, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='awq_marlin', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7c62441bcea0>)
INFO 01-09 06:23:59 api_server.py:199] Started engine process with PID 38
INFO 01-09 06:24:07 config.py:478] This model supports multiple tasks: {'reward', 'classify', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 01-09 06:24:08 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 01-09 06:24:08 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-09 06:24:08 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=4097.
INFO 01-09 06:24:12 config.py:478] This model supports multiple tasks: {'classify', 'generate', 'reward', 'score', 'embed'}. Defaulting to 'generate'.
INFO 01-09 06:24:13 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 01-09 06:24:13 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-09 06:24:13 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=4097.
INFO 01-09 06:24:13 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='Qwen/Qwen2-VL-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=fp8_e4m3, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 01-09 06:24:14 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 01-09 06:24:14 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 01-09 06:24:14 selector.py:129] Using XFormers backend.
INFO 01-09 06:24:15 model_runner.py:1092] Starting to load model Qwen/Qwen2-VL-7B-Instruct-AWQ...
WARNING 01-09 06:24:15 utils.py:624] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
INFO 01-09 06:24:15 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.85s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.39s/it]
INFO 01-09 06:24:19 model_runner.py:1097] Loading model weights took 6.4651 GB
INFO 01-09 06:24:22 worker.py:241] Memory profiling takes 2.64 seconds
INFO 01-09 06:24:22 worker.py:241] the current vLLM instance can use total_gpu_memory (21.95GiB) x gpu_memory_utilization (0.80) = 17.56GiB
INFO 01-09 06:24:22 worker.py:241] model weights take 6.47GiB; non_torch_memory takes 0.33GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 10.02GiB.
INFO 01-09 06:24:22 gpu_executor.py:76] # GPU blocks: 23460, # CPU blocks: 9362
INFO 01-09 06:24:22 gpu_executor.py:80] Maximum concurrency for 4096 tokens per request: 91.64x
INFO 01-09 06:24:26 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-09 06:24:26 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 01-09 06:24:42 model_runner.py:1527] Graph capturing finished in 17 secs, took 0.42 GiB
INFO 01-09 06:24:42 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.34 seconds
INFO 01-09 06:24:43 api_server.py:586] Using supplied chat template:
INFO 01-09 06:24:43 api_server.py:586] None
INFO 01-09 06:24:43 launcher.py:19] Available routes are:
INFO 01-09 06:24:43 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /health, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /version, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /score, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.6:39795 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:52875 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:43741 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:39081 - "GET /metrics HTTP/1.1" 200 OK

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@engleccma engleccma added the bug Something isn't working label Jan 9, 2025
@DarkLight1337
Member

DarkLight1337 commented Jan 9, 2025

'INFO: 127.0.0.6:XXX - "GET /metrics HTTP/1.1" 200 OK'

This endpoint only outputs the production metrics (maybe you have a service that periodically pings it?).

How did you send your input to the model? It should be using the /v1/chat/completions endpoint.
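For reference, here is a minimal sketch of what a request to the /v1/chat/completions endpoint could look like for the server started above (Python with the requests package; the model name and port are taken from the logs, adjust as needed):

      # Minimal sketch: query the vLLM OpenAI-compatible chat endpoint.
      # Assumes the server from the logs is listening on localhost:8000
      # and that the `requests` package is installed.
      import requests

      payload = {
          "model": "Qwen/Qwen2-VL-7B-Instruct-AWQ",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64,
      }
      resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
      resp.raise_for_status()
      print(resp.json()["choices"][0]["message"]["content"])

If this request hangs while /metrics keeps returning 200, that would suggest the engine loop is stuck rather than the HTTP frontend.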

@engleccma
Author

What I have shared are the logs from the vLLM server. Usually, when a model is up, it starts printing something like:
INFO 01-09 23:14:25 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Here, when I set it to less than 100% of the GPU, these logs never appear and the model is never reachable.

What confuses me is that on the same GPU I have no problem running two LLMs with vLLM, sharing the GPU 50/50.

@DarkLight1337
Member

Can you run nvidia-smi (or equivalent) to check whether the model has been loaded yet? You can also follow the troubleshooting guide to find out where vLLM is getting stuck.
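As a quick way to watch this, here is a minimal sketch that polls the same numbers nvidia-smi reports (assumes the pynvml bindings are installed and a single GPU at index 0; this is only an illustration, not part of vLLM):

      # Minimal sketch: poll GPU memory usage to see whether the model
      # weights have been loaded while the server is starting up.
      # Assumes pynvml (pip install nvidia-ml-py) and GPU index 0.
      import time
      import pynvml

      pynvml.nvmlInit()
      handle = pynvml.nvmlDeviceGetHandleByIndex(0)
      for _ in range(12):
          mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
          print(f"used: {mem.used / 2**30:.2f} GiB / total: {mem.total / 2**30:.2f} GiB")
          time.sleep(5)
      pynvml.nvmlShutdown()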

@engleccma
Author

Thank you for your reply. I don't have access to the GPU right now; I will post the logs later.
