[Performance]: Performance regression for long prompt lengths since vLLM 0.6.4.post1 #11912

Open

hustxiayang opened this issue Jan 10, 2025 · 0 comments
Labels: performance (Performance-related issues)
hustxiayang commented Jan 10, 2025

Proposal to improve performance

No response

Report of performance regression

I tested the efficiency/speed of llama3.1-70b-instruct on an internal dataset, essentially summarizing long reports (the average prompt length is around 8000 tokens), using benchmark_serving.py from vLLM: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. The model was llama3.1-70b-instruct quantized with fp8-dynamic, running on 8 H100s with tensor parallelism = 4 and two replicas. However, the mean time per output token is much larger with newer versions of vLLM (since 0.6.4.post1): for example, at QPS = 0.2 it increases from 15.7 ms to 25.7 ms. The difference is quite large, so I just want to check whether there are any insights on this. A sketch of the benchmark invocation follows.
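For reference, a minimal sketch of the benchmark invocation (the host, port, and dataset flags here are assumptions; the internal report dataset is not public, so vLLM's built-in random dataset with a matching input length stands in for it):

        # QPS=0.2 is the request rate at which the mean time per output
        # token grew from ~15.7 ms to ~25.7 ms across vLLM versions
        python benchmarks/benchmark_serving.py \
            --backend vllm \
            --model llama3.1-70b-instruct \
            --host <serving-host> --port 8000 \
            --dataset-name random \
            --random-input-len 8000 \
            --request-rate 0.2 \
            --num-prompts 200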
Other details of the deployment:

        - '--gpu-memory-utilization=0.96' 
        - '--tensor-parallel-size=2'
        - '--enable-chunked-prefill'
        - '--max-num-batched-tokens=4096'
        - '--enable-auto-tool-choice'
        - '--tool-call-parser=llama3_json'
        - '--chat-template=tool_chat_template_llama3.1_json.jinja'
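For reproduction, these container args correspond to a server launch roughly like the following (the model path is an assumption; quantization is baked into the fp8-dynamic checkpoint, so no extra flag is shown):

        # one replica; the same command is run twice for the two-replica setup
        python -m vllm.entrypoints.openai.api_server \
            --model llama3.1-70b-instruct \
            --gpu-memory-utilization 0.96 \
            --tensor-parallel-size 2 \
            --enable-chunked-prefill \
            --max-num-batched-tokens 4096 \
            --enable-auto-tool-choice \
            --tool-call-parser llama3_json \
            --chat-template tool_chat_template_llama3.1_json.jinja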
Thanks a lot!

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
hustxiayang added the performance (Performance-related issues) label on Jan 10, 2025