
Abnormal First Token Output on 910B GPU during Inference #46

Open
Jozenn opened this issue Feb 11, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@Jozenn

Jozenn commented Feb 11, 2025

When using vllm-ascend for inference on the 910B GPU, I've encountered an issue where the first output token is often abnormal. For example, when using an instruction-tuned model, the expected output should be "Answer: xxx", but instead, I get outputs like "binAnswer: xxx" or "1Answer: xxx". The first token is frequently incorrect, with an abnormal rate as high as 50%.

To investigate further, I set temperature=0 for a controlled comparison. Interestingly, this issue does not occur when using the lmdeploy framework under the same conditions. Additionally, when running the same inference on an A100 GPU using vllm, the problem does not appear either.
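For a concrete picture, the comparison was roughly of the shape sketched below; the model path and prompt are placeholders rather than the exact ones used, and the script assumes the standard vLLM offline-inference API with the vllm-ascend plugin installed.

from vllm import LLM, SamplingParams

# Placeholder model path and prompt, purely for illustration.
llm = LLM(model="/path/to/instruction-tuned-model")
params = SamplingParams(temperature=0, max_tokens=64)  # greedy decoding for a deterministic comparison

messages = [{"role": "user", "content": "What is 2 + 2? Reply in the form 'Answer: x'."}]
outputs = llm.chat(messages, sampling_params=params)
# On the 910B the text sometimes begins with a stray leading token (e.g. "binAnswer: ..."),
# while the same script on an A100, or under lmdeploy, starts cleanly with "Answer: ...".
print(outputs[0].outputs[0].text)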

Could you please provide some guidance or insights into why this might be happening on the 910B GPU? Any help would be greatly appreciated.

@wangxiyuan
Collaborator

wangxiyuan commented Feb 11, 2025

Thanks for reporting the bug. vllm-ascend is still a work in progress; we'll make sure this issue is fixed before the first release. @ganyi1996ppo please take a look as well. Thanks.

@ganyi1996ppo
Collaborator

@Jozenn Thanks for the report. Could you share your Python script and environment info so we can look into it in detail?


@Jozenn
Author

Jozenn commented Feb 12, 2025

Thanks for your reply. Here is the Python script:

from vllm import LLM

# Build the engine and start from the model's default sampling parameters,
# then override them from the command-line arguments.
llm = LLM(model=args.model, generation_config="auto")
sampling_params = llm.get_default_sampling_params()
sampling_params.temperature = args.temperature
sampling_params.top_p = args.top_p
sampling_params.max_tokens = args.max_tokens
sampling_params.seed = args.seed

# Single-turn chat request; `content` holds the user prompt.
requests = [{"role": "user", "content": content}]
responses = llm.chat(requests, sampling_params=sampling_params)

and some environment info:

vllm                              0.1.dev1+g4ea48fb.d20250208.empty
vllm_ascend                       0.1.0a1                          

[pip3] numpy==1.26.4
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.1
[pip3] torch==2.4.0+cpu
[pip3] torch-npu==2.4.0
[pip3] transformers==4.48.3
[pip3] transformers-stream-generator==0.0.5
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pynvml                    11.5.3                   pypi_0    pypi
[conda] pyzmq                     26.2.1                   pypi_0    pypi
[conda] torch                     2.4.0+cpu                pypi_0    pypi
[conda] torch-npu                 2.4.0                    pypi_0    pypi
[conda] transformers              4.48.3                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi

Many thanks for your help.

@Alexzhuan

We also encountered a similar issue. We tested the following prompts on the Qwen2.5-7B-Instruct model with offline inference and temperature=0.0.

# prompt 1
Ryan has 3 red lava lamps and 3 blue lava lamps. He arranges them in a row on a shelf randomly, then turns 3 random lamps on. What is the probability that the leftmost lamp on the shelf is red, and the leftmost lamp which is turned on is also red?
# prompt 2
One sphere is centered at $(3,-5,7)$ with radius $5 \\sqrt{5}.$  A second sphere is centered at $(0,1,1)$ with radius $2 \\sqrt{17}.$  The two spheres intersect in a circle.  Find the radius of this circle.
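For reference, the offline run described above can be driven with roughly the sketch below; the model name follows the comment, while the script itself is an assumption about the setup rather than the exact code that was used.

from vllm import LLM, SamplingParams

# The two prompts quoted above, truncated here for brevity.
prompts = [
    "Ryan has 3 red lava lamps and 3 blue lava lamps. ...",
    "One sphere is centered at $(3,-5,7)$ with radius $5 \\sqrt{5}.$ ...",
]

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

# One single-turn conversation per prompt; LLM.chat accepts a batch of conversations.
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, sampling_params=params)
for out in outputs:
    print(repr(out.outputs[0].text[:80]))  # compare the leading tokens on 910B vs. A100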

The results showed different responses on the 910B and A100, such as the following case:

# prompt
Ryan has 3 red lava lamps and 3 blue lava lamps. He arranges them in a row on a shelf randomly, then turns 3 random lamps on. What is the probability that the leftmost lamp on the shelf is red, and the leftmost lamp which is turned on is also red?

# response on 910B
![](https://cdn,},\nTo solve this problem, we need to calculate the probability of two specific 

# response on A100
To solve this problem, we need to calculate the probability of two independent 

Notably, the responses generated on the 910B often started with unusual tokens, such as '!' and '?!'.

The environment is as follows:

[pip3] numpy==1.26.4
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1+cpu
[pip3] torch-npu==2.5.1rc1
[pip3] torchaudio==2.5.1+cpu
[pip3] torchvision==0.20.1+cpu
[pip3] transformers==4.48.3
[pip3] transformers-stream-generator==0.0.5

@ganyi1996ppo
Collaborator

Thanks for the comments, we'll take a look.

@Yikun added the bug (Something isn't working) label on Feb 17, 2025