Hi, great work! I think Zamba2 is a great hybrid SSM model for the whole community.

I'm trying to reproduce the time-to-first-token (TTFT) results presented in your tech report, but I couldn't match the TTFT reported for Zamba2. On the 1.2B and 2.7B models, I set the input prompt length to 2048, the number of output tokens to 1, and the batch size to 1. However, the TTFT of Zamba2 came out higher than that of attention-based models such as Phi2-2.7B, Qwen2-1.5B, and Qwen2.5-3B.

For example, on a single A100 (40 GB):

- Zamba2-2.7B vs. Phi2-2.7B: 150 ms vs. 90 ms
- Zamba2-1.2B vs. Qwen2-1.5B: 94 ms vs. 81 ms

For Zamba2 I used `mamba-ssm` and `causal-conv1d` (the same versions pinned in `setup.py`) to speed up inference, and for the attention-based LLMs I used FlashAttention-2.

I would like to know whether my reproduction process is correct and how you calculated the TTFT results shown in the tech report.

Looking forward to your reply, thanks!
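For reference, a minimal sketch of a TTFT measurement under the settings above (prompt length 2048, one generated token, batch size 1). The model ID, dtype, and timing harness here are assumptions, not the exact script from this issue, and it assumes a `transformers` build with Zamba2 support:

```python
# Hypothetical TTFT benchmark sketch: prompt length 2048, 1 new token, batch size 1.
# Model ID, dtype, and warmup count are illustrative assumptions.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Zyphra/Zamba2-2.7B"  # for attention baselines, e.g. "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    # For the attention-based baselines, additionally pass:
    # attn_implementation="flash_attention_2",
)

# Batch-size-1 prompt of exactly 2048 random token ids.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 2048), device="cuda")

# Warm up so kernel compilation/autotuning is excluded from the timing.
for _ in range(3):
    model.generate(input_ids, max_new_tokens=1, do_sample=False)

# Time the prefill + first generated token.
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(input_ids, max_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
ttft_ms = (time.perf_counter() - start) * 1000
print(f"TTFT: {ttft_ms:.1f} ms")
```

The `torch.cuda.synchronize()` calls before and after the timed region matter: CUDA kernel launches are asynchronous, so without them the measured interval would only cover launch overhead rather than the actual prefill work.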