Propagate vLLM batch size controls #588

alvin319 · 2025-02-25T19:54:07Z

Description
This PR fixes issue #573 by propagating vLLM's batch size control parameters to VLLMModelConfig. For a more detailed explanation of these parameters, see vllm-project/vllm#2492.
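As a rough illustration of what "propagating" means here, the sketch below forwards optional batch size fields from a config object into the keyword arguments handed to the engine. It assumes the parameters in question are vLLM's `max_num_seqs` and `max_num_batched_tokens` (only `max_num_seqs` is exercised in the logs below). `BatchControlConfig` and `build_engine_kwargs` are hypothetical names for this example and are not the actual lighteval code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class BatchControlConfig:
    """Hypothetical stand-in for the batch size fields on a model config."""

    pretrained: str
    # None means "leave the engine's own default in place".
    max_num_seqs: Optional[int] = None
    max_num_batched_tokens: Optional[int] = None


def build_engine_kwargs(config: BatchControlConfig) -> Dict[str, Any]:
    """Collect the keyword arguments that would be forwarded to the engine."""
    kwargs: Dict[str, Any] = {"model": config.pretrained}
    for name in ("max_num_seqs", "max_num_batched_tokens"):
        value = getattr(config, name)
        if value is not None:
            # Cast defensively: values parsed from a CLI string arrive as str.
            kwargs[name] = int(value)
    return kwargs


if __name__ == "__main__":
    cfg = BatchControlConfig("HuggingFaceTB/SmolLM-1.7B-Instruct", max_num_seqs=1)
    print(build_engine_kwargs(cfg))
```

Leaving unset fields out of the dictionary means the engine keeps its own defaults unless the user overrides them explicitly.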

Testing

I ran the CLI commands below to launch a simple evaluation job on an AWS g6e.xlarge instance. The results show that max_num_seqs determines the batch size at the prefill stage and therefore the model's throughput: processing the prompts took 1m4s with max_num_seqs=256 (the default) versus 3m15s with max_num_seqs=1.

Default

> lighteval vllm "pretrained=HuggingFaceTB/SmolLM-1.7B-Instruct,revision=main,dtype=bfloat16" "leaderboard|truthfulqa:mc|0|0"

[2025-02-25 19:17:49,834] [    INFO]: PyTorch version 2.5.1 available. (config.py:54)
[2025-02-25 19:17:54,082] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-02-25 19:17:54,211] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-02-25 19:18:00,443] [    INFO]: This model supports multiple tasks: {'embed', 'reward', 'score', 'classify', 'generate'}. Defaulting to 'generate'. (config.py:549)
[2025-02-25 19:18:00,444] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='HuggingFaceTB/SmolLM-1.7B-Instruct', speculative_config=None, tokenizer='HuggingFaceTB/SmolLM-1.7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=HuggingFaceTB/SmolLM-1.7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-02-25 19:18:01,286] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-02-25 19:18:01,661] [    INFO]: Starting to load model HuggingFaceTB/SmolLM-1.7B-Instruct... (model_runner.py:1110)
[2025-02-25 19:18:01,814] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.42G/3.42G [04:28<00:00, 10.4MB/s]
[2025-02-25 19:22:30,510] [    INFO]: Time spent downloading weights for HuggingFaceTB/SmolLM-1.7B-Instruct: 268.695215 seconds (weight_utils.py:270)
[2025-02-25 19:22:30,545] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]

[2025-02-25 19:22:31,308] [    INFO]: Loading model weights took 3.1880 GB (model_runner.py:1115)
[2025-02-25 19:22:32,627] [    INFO]: Memory profiling takes 1.07 seconds
the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
model weights take 3.19GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.46GiB; the rest of the memory reserved for KV Cache is 36.16GiB. (worker.py:267)
[2025-02-25 19:22:32,867] [    INFO]: # cuda blocks: 12342, # CPU blocks: 1365 (executor_base.py:111)
[2025-02-25 19:22:32,867] [    INFO]: Maximum concurrency for 2048 tokens per request: 96.42x (executor_base.py:116)
[2025-02-25 19:22:37,286] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:19<00:00,  1.80it/s]
[2025-02-25 19:22:56,712] [    INFO]: Graph capturing finished in 19 secs, took 0.67 GiB (model_runner.py:1562)
[2025-02-25 19:22:56,713] [    INFO]: init engine (profile, create kv cache, warmup model) took 25.40 seconds (llm_engine.py:436)
[2025-02-25 19:22:56,773] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-02-25 19:22:56,774] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-02-25 19:22:56,776] [    INFO]: truthful_qa multiple_choice (lighteval_task.py:187)
[2025-02-25 19:22:56,776] [ WARNING]: Careful, the task leaderboard|truthfulqa:mc is using evaluation data to build the few shot examples. (lighteval_task.py:260)
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 32.0MB/s]
validation-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 70.8MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 171200.36 examples/s]
[2025-02-25 19:22:57,871] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-02-25 19:22:57,871] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-02-25 19:22:57,871] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:468)
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 9996/9996 [01:04<00:00, 154.40it/s, est. speed input: 30670.51 toks/s, output: 154.40 toks/s]
1it [01:06, 66.12s/it]
[2025-02-25 19:24:10,490] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:500)
[2025-02-25 19:24:10,622] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:542)
|           Task            |Version|    Metric    |Value |   |Stderr|
|---------------------------|------:|--------------|-----:|---|-----:|
|all                        |       |truthfulqa_mc1|0.2485|±  |0.0151|
|                           |       |truthfulqa_mc2|0.3969|±  |0.0144|
|leaderboard:truthfulqa:mc:0|      0|truthfulqa_mc1|0.2485|±  |0.0151|
|                           |       |truthfulqa_mc2|0.3969|±  |0.0144|

[2025-02-25 19:24:10,639] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:532)
[2025-02-25 19:24:10,639] [    INFO]: Saving experiment tracker (evaluation_tracker.py:180)
[2025-02-25 19:24:12,315] [    INFO]: Saving results to /home/ubuntu/lighteval/results/results/HuggingFaceTB/SmolLM-1.7B-Instruct/results_2025-02-25T19-24-10.639358.json (evaluation_tracker.py:234)

BS=1

> lighteval vllm "pretrained=HuggingFaceTB/SmolLM-1.7B-Instruct,revision=main,dtype=bfloat16,max_num_seqs=1" "leaderboard|truthfulqa:mc|0|0"
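The only change from the default command is the extra `max_num_seqs=1` entry in the model-arguments string. As a minimal sketch of how such a comma-separated `key=value` string could be turned into typed values (the helper `parse_model_args` and the `INT_KEYS` set are hypothetical and are not lighteval's actual parser):

```python
from typing import Dict, Union

# Keys treated as integers in this sketch (an assumption for illustration).
INT_KEYS = {"max_num_seqs", "max_num_batched_tokens"}


def parse_model_args(arg_string: str) -> Dict[str, Union[str, int]]:
    """Split 'k1=v1,k2=v2' into a dict, casting known integer keys to int."""
    parsed: Dict[str, Union[str, int]] = {}
    for item in arg_string.split(","):
        # partition splits on the first '=' only, so values may contain '='.
        key, _, value = item.partition("=")
        key, value = key.strip(), value.strip()
        parsed[key] = int(value) if key in INT_KEYS else value
    return parsed


if __name__ == "__main__":
    args = "pretrained=HuggingFaceTB/SmolLM-1.7B-Instruct,revision=main,dtype=bfloat16,max_num_seqs=1"
    print(parse_model_args(args))
```

Without the cast, `max_num_seqs` would reach the engine as the string `"1"` rather than the integer `1`.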

[2025-02-25 19:45:55,275] [    INFO]: PyTorch version 2.5.1 available. (config.py:54)
[2025-02-25 19:45:59,530] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-02-25 19:45:59,657] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-02-25 19:46:05,727] [    INFO]: This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'. (config.py:549)
[2025-02-25 19:46:05,728] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='HuggingFaceTB/SmolLM-1.7B-Instruct', speculative_config=None, tokenizer='HuggingFaceTB/SmolLM-1.7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=HuggingFaceTB/SmolLM-1.7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[1],"max_capture_size":1}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-02-25 19:46:06,576] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-02-25 19:46:06,950] [    INFO]: Starting to load model HuggingFaceTB/SmolLM-1.7B-Instruct... (model_runner.py:1110)
[2025-02-25 19:46:07,096] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
[2025-02-25 19:46:07,134] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.44it/s]

[2025-02-25 19:46:07,908] [    INFO]: Loading model weights took 3.1880 GB (model_runner.py:1115)
[2025-02-25 19:46:08,889] [    INFO]: Memory profiling takes 0.75 seconds
the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
model weights take 3.19GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.13GiB; the rest of the memory reserved for KV Cache is 36.49GiB. (worker.py:267)
[2025-02-25 19:46:09,120] [    INFO]: # cuda blocks: 12456, # CPU blocks: 1365 (executor_base.py:111)
[2025-02-25 19:46:09,121] [    INFO]: Maximum concurrency for 2048 tokens per request: 97.31x (executor_base.py:116)
[2025-02-25 19:46:12,896] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.84it/s]
[2025-02-25 19:46:13,445] [    INFO]: Graph capturing finished in 1 secs, took 0.06 GiB (model_runner.py:1562)
[2025-02-25 19:46:13,446] [    INFO]: init engine (profile, create kv cache, warmup model) took 5.54 seconds (llm_engine.py:436)
[2025-02-25 19:46:13,522] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-02-25 19:46:13,523] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-02-25 19:46:13,526] [    INFO]: truthful_qa multiple_choice (lighteval_task.py:187)
[2025-02-25 19:46:13,527] [ WARNING]: Careful, the task leaderboard|truthfulqa:mc is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-02-25 19:46:14,749] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-02-25 19:46:14,749] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-02-25 19:46:14,749] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:468)
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 9996/9996 [03:15<00:00, 51.24it/s, est. speed input: 10179.70 toks/s, output: 51.24 toks/s]
1it [03:16, 196.46s/it]
[2025-02-25 19:49:39,146] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:500)
[2025-02-25 19:49:39,277] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:542)
|           Task            |Version|    Metric    |Value |   |Stderr|
|---------------------------|------:|--------------|-----:|---|-----:|
|all                        |       |truthfulqa_mc1|0.2497|±  |0.0152|
|                           |       |truthfulqa_mc2|0.3966|±  |0.0144|
|leaderboard:truthfulqa:mc:0|      0|truthfulqa_mc1|0.2497|±  |0.0152|
|                           |       |truthfulqa_mc2|0.3966|±  |0.0144|

[2025-02-25 19:49:39,293] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:532)
[2025-02-25 19:49:39,293] [    INFO]: Saving experiment tracker (evaluation_tracker.py:180)
[2025-02-25 19:49:40,897] [    INFO]: Saving results to /home/ubuntu/lighteval/results/results/HuggingFaceTB/SmolLM-1.7B-Instruct/results_2025-02-25T19-49-39.294027.json (evaluation_tracker.py:234)
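To put the two runs side by side, the short script below recomputes the slowdown and checks whether the score difference is smaller than the reported standard error. All figures are copied from the logs above; the dictionary layout and variable names are just for this example.

```python
# Figures copied from the "Default" and "BS=1" runs above.
default_run = {"seconds": 64, "prompts_per_s": 154.40, "mc1": 0.2485, "mc1_stderr": 0.0151}
bs1_run = {"seconds": 195, "prompts_per_s": 51.24, "mc1": 0.2497, "mc1_stderr": 0.0152}

# Wall-clock slowdown of max_num_seqs=1 relative to the default.
slowdown = bs1_run["seconds"] / default_run["seconds"]

# Same comparison using the throughput reported by the progress bar.
throughput_ratio = default_run["prompts_per_s"] / bs1_run["prompts_per_s"]

# The scores should not depend on batch size beyond numerical noise.
mc1_delta = abs(bs1_run["mc1"] - default_run["mc1"])
within_noise = mc1_delta < min(default_run["mc1_stderr"], bs1_run["mc1_stderr"])

print(f"slowdown: {slowdown:.2f}x")
print(f"throughput ratio: {throughput_ratio:.2f}x")
print(f"mc1 delta: {mc1_delta:.4f} (within stderr: {within_noise})")
```

Both ratios come out at roughly 3x, while truthfulqa_mc1 differs by about 0.001, well below the reported standard error of about 0.015.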

alvin319 added 4 commits February 25, 2025 10:52

expose vLLM batch size control config

718cbec

comments

65379af

type casting

33fc419

bump

c63a5e2

alvin319 mentioned this pull request Feb 25, 2025

[FT] Propagate batch size control for vLLM backend #573

Open
