
Abnormal First Token Output on 910B GPU during Inference #46

Open
Jozenn opened this issue Feb 11, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@Jozenn

Jozenn commented Feb 11, 2025

When using vllm-ascend for inference on the 910B GPU, I've encountered an issue where the first output token is often abnormal. For example, when using an instruction-tuned model, the expected output should be "Answer: xxx", but instead, I get outputs like "binAnswer: xxx" or "1Answer: xxx". The first token is frequently incorrect, with an abnormal rate as high as 50%.

To investigate further, I set temperature=0 for a controlled comparison. Interestingly, this issue does not occur when using the lmdeploy framework under the same conditions. Additionally, when running the same inference on an A100 GPU using vllm, the problem does not appear either.
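For a concrete picture, the comparison was roughly of the shape sketched below; the model path and prompt are placeholders rather than the exact ones used, and the script assumes the standard vLLM offline-inference API with the vllm-ascend plugin installed.

from vllm import LLM, SamplingParams

# Placeholder model path and prompt, purely for illustration.
llm = LLM(model="/path/to/instruction-tuned-model")
params = SamplingParams(temperature=0, max_tokens=64)  # greedy decoding for a deterministic comparison

messages = [{"role": "user", "content": "What is 2 + 2? Reply in the form 'Answer: x'."}]
outputs = llm.chat(messages, sampling_params=params)
# On the 910B the text sometimes begins with a stray leading token (e.g. "binAnswer: ..."),
# while the same script on an A100, or under lmdeploy, starts cleanly with "Answer: ...".
print(outputs[0].outputs[0].text)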

Could you please provide some guidance or insights into why this might be happening on the 910B GPU? Any help would be greatly appreciated.

@wangxiyuan
Collaborator

wangxiyuan commented Feb 11, 2025

Thanks for reporting the bug. vllm-ascend is still a work in progress; we'll make sure this issue is fixed before the first release. @ganyi1996ppo please take a look as well. Thanks.

@ganyi1996ppo
Collaborator

@Jozenn Thanks for the report. Could you share your Python script and environment info so we can look into it in detail?


@Jozenn
Author

Jozenn commented Feb 12, 2025

Thanks for your reply. Here is the Python script:

from vllm import LLM

# Build the engine and start from the model's default sampling parameters,
# then override them from the command-line arguments.
llm = LLM(model=args.model, generation_config="auto")
sampling_params = llm.get_default_sampling_params()
sampling_params.temperature = args.temperature
sampling_params.top_p = args.top_p
sampling_params.max_tokens = args.max_tokens
sampling_params.seed = args.seed

# Single-turn chat request; `content` holds the user prompt.
requests = [{"role": "user", "content": content}]
responses = llm.chat(requests, sampling_params=sampling_params)

and some environment info:

vllm                              0.1.dev1+g4ea48fb.d20250208.empty
vllm_ascend                       0.1.0a1                          

[pip3] numpy==1.26.4
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.1
[pip3] torch==2.4.0+cpu
[pip3] torch-npu==2.4.0
[pip3] transformers==4.48.3
[pip3] transformers-stream-generator==0.0.5
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pynvml                    11.5.3                   pypi_0    pypi
[conda] pyzmq                     26.2.1                   pypi_0    pypi
[conda] torch                     2.4.0+cpu                pypi_0    pypi
[conda] torch-npu                 2.4.0                    pypi_0    pypi
[conda] transformers              4.48.3                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi

Many thanks for your help.

@Alexzhuan

We also encountered a similar issue. We tested the following prompts on the Qwen2.5-7B-Instruct model with offline inference and temperature=0.0.

# prompt 1
Ryan has 3 red lava lamps and 3 blue lava lamps. He arranges them in a row on a shelf randomly, then turns 3 random lamps on. What is the probability that the leftmost lamp on the shelf is red, and the leftmost lamp which is turned on is also red?
# prompt 2
One sphere is centered at $(3,-5,7)$ with radius $5 \\sqrt{5}.$  A second sphere is centered at $(0,1,1)$ with radius $2 \\sqrt{17}.$  The two spheres intersect in a circle.  Find the radius of this circle.
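For reference, the offline run described above can be driven with roughly the sketch below; the model name follows the comment, while the script itself is an assumption about the setup rather than the exact code that was used.

from vllm import LLM, SamplingParams

# The two prompts quoted above, truncated here for brevity.
prompts = [
    "Ryan has 3 red lava lamps and 3 blue lava lamps. ...",
    "One sphere is centered at $(3,-5,7)$ with radius $5 \\sqrt{5}.$ ...",
]

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

# One single-turn conversation per prompt; LLM.chat accepts a batch of conversations.
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, sampling_params=params)
for out in outputs:
    print(repr(out.outputs[0].text[:80]))  # compare the leading tokens on 910B vs. A100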

The results showed different responses on the 910B and A100, such as the following case:

# prompt
Ryan has 3 red lava lamps and 3 blue lava lamps. He arranges them in a row on a shelf randomly, then turns 3 random lamps on. What is the probability that the leftmost lamp on the shelf is red, and the leftmost lamp which is turned on is also red?

# response on 910B
![](https://cdn,},\nTo solve this problem, we need to calculate the probability of two specific 

# response on A100
To solve this problem, we need to calculate the probability of two independent 

Notably, the responses generated on the 910B often started with unusual tokens, such as '!' and '?!'.

The environment is as follows:

[pip3] numpy==1.26.4
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1+cpu
[pip3] torch-npu==2.5.1rc1
[pip3] torchaudio==2.5.1+cpu
[pip3] torchvision==0.20.1+cpu
[pip3] transformers==4.48.3
[pip3] transformers-stream-generator==0.0.5

@ganyi1996ppo
Collaborator

Thanks for the comments, we'll take a look.

@Yikun added the bug (Something isn't working) label on Feb 17, 2025