[Performance]: Performance regression for long prompt length since vLLM 0.6.4.post1 #11912
Labels: performance (Performance-related issues)
Proposal to improve performance
No response
Report of performance regression
I tested the speed of llama3.1-70b-instruct on an internal dataset, essentially summarizing long reports (average prompt length around 8,000 tokens), using benchmark_serving.py from vLLM: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. The model was llama3.1-70b-instruct quantized with fp8-dynamic, running on 8 H100 GPUs with tensor parallelism = 4 and two replicas. However, the mean time per output token is much larger with newer versions of vLLM (since 0.6.4.post1). For example, at QPS = 0.2 it increases from 15.7 ms to 25.7 ms. The difference is quite large, so I would like to check whether there are any insights on this.
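For reference, a rough sketch of how a deployment and benchmark like this can be launched is shown below. The model path, ports, prompt lengths, and request counts are illustrative assumptions based on the description above, not the exact commands used; the random dataset is used here only to approximate the ~8,000-token prompts of the internal dataset.

```bash
# Sketch only: paths, ports, and lengths are placeholders, not the actual setup.
# The fp8-dynamic quantization is baked into the checkpoint, so no extra
# quantization flag is assumed here.

# Two replicas on 8x H100, 4 GPUs (TP=4) each.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /models/llama3.1-70b-instruct-fp8-dynamic \
    --tensor-parallel-size 4 --max-model-len 16384 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve /models/llama3.1-70b-instruct-fp8-dynamic \
    --tensor-parallel-size 4 --max-model-len 16384 --port 8001 &

# Benchmark one replica at QPS = 0.2, approximating ~8,000-token prompts.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/llama3.1-70b-instruct-fp8-dynamic \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 512 \
    --request-rate 0.2 \
    --num-prompts 200 \
    --port 8000
```

Running the same commands against different vLLM versions (e.g. 0.6.3 vs. 0.6.4.post1 and later) and comparing the reported mean TPOT would reproduce the comparison described above.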
Other details of the deployment:
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...