Question on wrapped forward quantized module #251

Jonxjdong · 2025-02-06T10:31:54Z

https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/lifecycle/forward.py#L257
In the forward function linked above, the quantized Linear operation is computed as:
Y = (Sx*[X/Sx]) @ (Sw*[W/Sw])
That is, X and W are each quantized and dequantized separately, and the two resulting fp16 matrices are then multiplied.

Why not compute it as
Y = ([X/Sx] @ [W/Sw]) * (Sx @ Sw)
where [X/Sx] and [W/Sw] are kept in integer format, so that the matrix multiplication could use integer arithmetic, which would be faster on CUDA?
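
To make the comparison concrete, here is a minimal sketch of the two formulations in plain Python. It is not the code from compressed-tensors. It assumes per-tensor symmetric int8 quantization with scalar scales (so `Sx @ Sw` reduces to the product `Sx * Sw`), and the helper names `quantize`, `matmul`, and `scale_matrix` are hypothetical. Under those assumptions the two forms should agree up to floating-point rounding, so the difference between them is where the multiplication happens (float vs. integer), not the result.

```python
# Sketch only: per-tensor symmetric int8 quantization with scalar scales.
# Helper names are hypothetical and not part of compressed-tensors.

def quantize(matrix, scale, qmin=-128, qmax=127):
    """Compute the integer codes [M / S], clamped to the int8 range."""
    return [[max(qmin, min(qmax, round(v / scale))) for v in row]
            for row in matrix]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def scale_matrix(matrix, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[v * s for v in row] for row in matrix]

X = [[0.5, -1.25], [2.0, 0.75]]   # activations
W = [[0.1, -0.3], [0.4, 0.2]]     # weights
Sx = 2.0 / 127                    # assumed scale: max|X| / 127
Sw = 0.4 / 127                    # assumed scale: max|W| / 127

Xq = quantize(X, Sx)              # integer codes [X/Sx]
Wq = quantize(W, Sw)              # integer codes [W/Sw]

# Form 1: dequantize each operand, then multiply in floating point.
Y_fake = matmul(scale_matrix(Xq, Sx), scale_matrix(Wq, Sw))

# Form 2: multiply the integer codes, then rescale the accumulator once.
acc = matmul(Xq, Wq)              # stays integer-valued
Y_int = scale_matrix(acc, Sx * Sw)

max_diff = max(abs(a - b)
               for row_a, row_b in zip(Y_fake, Y_int)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

With per-channel or per-token scales the rescaling factor becomes an outer product of the two scale vectors rather than a single scalar, and with a non-zero zero point the integer form picks up extra correction terms, so this sketch does not cover those cases.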
