[Performance]: Performance regression for long prompt length since vLLM 0.6.4.post1 #11912
Labels: performance (Performance-related issues)
Proposal to improve performance
No response
Report of performance regression
I tested the speed of llama3.1-70b-instruct on an internal dataset, essentially summarizing long reports (average prompt length around 8,000 tokens), using benchmark_serving.py from vLLM: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. The model was llama3.1-70b-instruct quantized with fp8-dynamic, running on 8 H100 GPUs with tensor parallelism = 4 and two replicas. However, the mean time per output token is much larger with newer versions of vLLM (since 0.6.4.post1). For example, at QPS = 0.2 it increases from 15.7 ms to 25.7 ms. The difference is quite large, so I would like to check whether there are any insights on this.
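For reference, a rough sketch of how a deployment and benchmark like this can be launched is shown below. The model path, ports, prompt lengths, and request counts are illustrative assumptions based on the description above, not the exact commands used; the random dataset is used here only to approximate the ~8,000-token prompts of the internal dataset.

```bash
# Sketch only: paths, ports, and lengths are placeholders, not the actual setup.
# The fp8-dynamic quantization is baked into the checkpoint, so no extra
# quantization flag is assumed here.

# Two replicas on 8x H100, 4 GPUs (TP=4) each.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /models/llama3.1-70b-instruct-fp8-dynamic \
    --tensor-parallel-size 4 --max-model-len 16384 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve /models/llama3.1-70b-instruct-fp8-dynamic \
    --tensor-parallel-size 4 --max-model-len 16384 --port 8001 &

# Benchmark one replica at QPS = 0.2, approximating ~8,000-token prompts.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/llama3.1-70b-instruct-fp8-dynamic \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 512 \
    --request-rate 0.2 \
    --num-prompts 200 \
    --port 8000
```

Running the same commands against different vLLM versions (e.g. 0.6.3 vs. 0.6.4.post1 and later) and comparing the reported mean TPOT would reproduce the comparison described above.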
Other details of the deployment:
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...