Abnormal First Token Output on 910B GPU during Inference #46
Comments
Thanks for reporting the bug. vllm-ascend is still a work in progress; we'll make sure the issue is fixed before the first release. @ganyi1996ppo please take a look as well. Thanks.
@Jozenn Thanks for the report. Can you provide your Python script and environment info so we can take a look in detail?
Thanks for your reply. Here is the Python script:

```python
from vllm import LLM

# `args` and `content` come from my own argument parsing and prompt setup (not shown)
llm = LLM(model=args.model, generation_config="auto")

# Start from the model's default sampling parameters and override a few fields
sampling_params = llm.get_default_sampling_params()
sampling_params.temperature = args.temperature
sampling_params.top_p = args.top_p
sampling_params.max_tokens = args.max_tokens
sampling_params.seed = args.seed

requests = [{"role": "user", "content": content}]
responses = llm.chat(requests, sampling_params=sampling_params)
```

and some environment info.

Many thanks for your help.
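For reference, a minimal sketch of how the relevant package versions can be dumped; the distribution names below (in particular vllm-ascend and torch-npu) are assumptions and may differ depending on the install:

```python
# Print versions of the packages relevant to this report; missing packages are
# reported as "not installed" rather than raising an exception.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("vllm", "vllm-ascend", "torch", "torch-npu"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```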
We also encountered a similar issue. We tested the following prompts on both the 910B and the A100, and the results differed between the two. In particular, the generated response from the 910B often started with unusual tokens, such as '!' and '?!'. The environment is as follows:
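A minimal sketch (hypothetical model path and prompt) for inspecting the first generated token on each backend; running the same script on the 910B and the A100 makes the divergence directly comparable:

```python
from vllm import LLM, SamplingParams

# Hypothetical model path and prompt, for illustration only
llm = LLM(model="path/to/instruction-tuned-model")
params = SamplingParams(temperature=0, max_tokens=16)

outputs = llm.generate(["Please answer: what is 2 + 2?"], params)
first = outputs[0].outputs[0]
print("first token id:", first.token_ids[0])
print("decoded text:", repr(first.text))
```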
Thanks for the comments, we'll take a look.
When using vllm-ascend for inference on the 910B GPU, I've encountered an issue where the first output token is often abnormal. For example, when using an instruction-tuned model, the expected output should be "Answer: xxx", but instead I get outputs like "binAnswer: xxx" or "1Answer: xxx". The first token is frequently incorrect, with an abnormal rate as high as 50%.

To investigate further, I set `temperature=0` for a controlled comparison. Interestingly, this issue does not occur when using the `lmdeploy` framework under the same conditions. Additionally, when running the same inference on an A100 GPU using vllm, the problem does not appear either. Could you please provide some guidance or insights into why this might be happening on the 910B GPU? Any help would be greatly appreciated.
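In case it helps with reproduction, here is a minimal sketch of the controlled comparison described above: greedy decoding (`temperature=0`) over a handful of prompts, counting how often the reply fails to start with the expected "Answer:" prefix. The model path and prompts are placeholders rather than the ones from my actual runs:

```python
from vllm import LLM, SamplingParams

# Placeholder model path; substitute the instruction-tuned model under test
llm = LLM(model="path/to/instruction-tuned-model")
params = SamplingParams(temperature=0, max_tokens=64)  # greedy decoding

# Placeholder prompts that all ask for an "Answer: ..." style reply
prompts = [
    "Reply in the form 'Answer: <x>'. What is 2 + 2?",
    "Reply in the form 'Answer: <x>'. Name the capital of France.",
    "Reply in the form 'Answer: <x>'. What color is the sky on a clear day?",
]

abnormal = 0
for prompt in prompts:
    out = llm.chat([{"role": "user", "content": prompt}], sampling_params=params)
    text = out[0].outputs[0].text
    if not text.lstrip().startswith("Answer:"):
        abnormal += 1

print(f"replies with an abnormal prefix: {abnormal}/{len(prompts)}")
```

Running the same script under vllm-ascend on the 910B and under vllm on an A100 (or swapping in lmdeploy) should make the difference in behaviour easy to compare.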