I'm using vLLM v0.6.5.
When I launch Qwen2-VL-7B with 100% of the GPU (24 GB VRAM), it works fine.
But even though the model is only about 4 GB, as soon as I lower gpu_memory_utilization a little, the launch gets stuck, endlessly printing: 'INFO: 127.0.0.6:XXX - "GET /metrics HTTP/1.1" 200 OK'
I'm confused because I know there is enough space for the model.
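For reproduction, this is roughly the command I launch with (a sketch reconstructed from the args dump below; I may be misremembering a flag or two):

# failing case: utilization lowered to 0.8; the same command with 100% of the GPU comes up fine
vllm serve Qwen/Qwen2-VL-7B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4097 \
  --gpu-memory-utilization 0.8 \
  --trust-remote-code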
INFO 01-09 06:23:59 api_server.py:651] vLLM API server version 0.6.5
INFO 01-09 06:23:59 api_server.py:652] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2-VL-7B-Instruct-AWQ', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2-VL-7B-Instruct-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8_e4m3', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=4097, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='awq_marlin', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7c62441bcea0>)
INFO 01-09 06:23:59 api_server.py:199] Started engine process with PID 38
INFO 01-09 06:24:07 config.py:478] This model supports multiple tasks: {'reward', 'classify', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 01-09 06:24:08 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 01-09 06:24:08 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-09 06:24:08 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=4097.
INFO 01-09 06:24:12 config.py:478] This model supports multiple tasks: {'classify', 'generate', 'reward', 'score', 'embed'}. Defaulting to 'generate'.
INFO 01-09 06:24:13 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 01-09 06:24:13 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-09 06:24:13 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=4097.
INFO 01-09 06:24:13 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='Qwen/Qwen2-VL-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=fp8_e4m3, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 01-09 06:24:14 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 01-09 06:24:14 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 01-09 06:24:14 selector.py:129] Using XFormers backend.
INFO 01-09 06:24:15 model_runner.py:1092] Starting to load model Qwen/Qwen2-VL-7B-Instruct-AWQ...
WARNING 01-09 06:24:15 utils.py:624] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
INFO 01-09 06:24:15 weight_utils.py:243] Using model weights format ['.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.85s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.39s/it]
INFO 01-09 06:24:19 model_runner.py:1097] Loading model weights took 6.4651 GB
INFO 01-09 06:24:22 worker.py:241] Memory profiling takes 2.64 seconds
INFO 01-09 06:24:22 worker.py:241] the current vLLM instance can use total_gpu_memory (21.95GiB) x gpu_memory_utilization (0.80) = 17.56GiB
INFO 01-09 06:24:22 worker.py:241] model weights take 6.47GiB; non_torch_memory takes 0.33GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 10.02GiB.
INFO 01-09 06:24:22 gpu_executor.py:76] # GPU blocks: 23460, # CPU blocks: 9362
INFO 01-09 06:24:22 gpu_executor.py:80] Maximum concurrency for 4096 tokens per request: 91.64x
INFO 01-09 06:24:26 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-09 06:24:26 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 01-09 06:24:42 model_runner.py:1527] Graph capturing finished in 17 secs, took 0.42 GiB
INFO 01-09 06:24:42 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.34 seconds
INFO 01-09 06:24:43 api_server.py:586] Using supplied chat template:
INFO 01-09 06:24:43 api_server.py:586] None
INFO 01-09 06:24:43 launcher.py:19] Available routes are:
INFO 01-09 06:24:43 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 01-09 06:24:43 launcher.py:27] Route: /health, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /version, Methods: GET
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /score, Methods: POST
INFO 01-09 06:24:43 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.6:39795 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:52875 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:43741 - "GET /metrics HTTP/1.1" 200 OK
INFO: 127.0.0.6:39081 - "GET /metrics HTTP/1.1" 200 OK
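As a rough sanity check on the numbers above (assuming the default block size of 16 tokens per block, since I did not set --block-size):

23460 GPU blocks x 16 tokens/block = 375,360 KV-cache tokens
375,360 / 4,096 tokens per request ≈ 91.64x, which matches the reported maximum concurrency

So the engine finishes initializing with plenty of KV-cache headroom; after startup, nothing but the /metrics polling ever shows up in the log.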
What I shared above are the logs from the vLLM server. Usually, once a model is up, it starts printing lines like:
INFO 01-09 23:14:25 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
When I launch it with less than 100% of the GPU, these lines never appear and the model is never reachable.
What confuses me is that on the same GPU I have no problem running two LLMs with vLLM, splitting the GPU 50/50.
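For reference, this is how I check whether the server is reachable (just curl against the endpoints listed in the route table above):

# should return 200 once the server is healthy
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# should list Qwen/Qwen2-VL-7B-Instruct-AWQ; in the stuck case the model never becomes reachable this way
curl -s http://localhost:8000/v1/models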
Can you run nvidia-smi (or equivalent) to check whether the model has been loaded yet? You can also follow the troubleshooting guide to find out where vLLM is getting stuck.
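For example, something like the following (the environment variables are the ones listed in the troubleshooting guide; double-check the guide for the current names):

# watch GPU memory while the server starts, to confirm the weights are actually resident
watch -n 1 nvidia-smi

# then relaunch with more verbose logging to see where it stalls
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export VLLM_TRACE_FUNCTION=1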