Performance regression on latest pytorch nightly when using `float8_dynamic_activation_float8_weight` with granularity == `PerTensor` #1609

vgoklani · 2025-01-23T19:21:49Z

Nothing fancy here, just running single-batch inference on Llama 3.1 8B with `float8_dynamic_activation_float8_weight` quantization with the granularity set to `PerTensor()`.

{'ttft': 0.01968639945983887, 'input_token_throughput': 2946.1964397462616, 'output_token_throughput': 73.3861714881309, 'bandwidth': '628.29 GB/s', 'peak_memory_usage': '22.80 GB', 'model_size': '8.58 GB', 'torch_version': '2.6.0a0+df5bbc09d1.nv24.12', 'torchao_version': '0.7.0'}

{'ttft': 0.050607105255126954, 'input_token_throughput': 1146.0841260847276, 'output_token_throughput': 58.74973548302016, 'bandwidth': '502.05 GB/s', 'peak_memory_usage': '22.68 GB', 'model_size': '8.58 GB', 'torch_version': '2.7.0.dev20250122+cu126', 'torchao_version': '0.7.0'}

The first result is from torch 2.6.0a0+df5bbc09d1.nv24.12; the second is from the latest PyTorch nightly (2.7.0.dev20250122+cu126) and uses the exact same code (no changes). This was run on SM89 hardware (NVIDIA 6000 Ada Lovelace).
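To put a number on the gap, here is a minimal sketch that computes the relative change for each metric from the two result dicts above. The names `baseline`, `nightly`, and `percent_change` are made up for illustration and are not part of the benchmark script; the bandwidth strings are entered as plain floats in GB/s.

```python
# Metrics copied from the two runs reported above.
# "baseline" = torch 2.6.0a0+df5bbc09d1.nv24.12, "nightly" = torch 2.7.0.dev20250122+cu126.
baseline = {
    "ttft": 0.01968639945983887,
    "input_token_throughput": 2946.1964397462616,
    "output_token_throughput": 73.3861714881309,
    "bandwidth_gb_s": 628.29,
}
nightly = {
    "ttft": 0.050607105255126954,
    "input_token_throughput": 1146.0841260847276,
    "output_token_throughput": 58.74973548302016,
    "bandwidth_gb_s": 502.05,
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (positive means the value grew)."""
    return (new - old) / old * 100.0


# Per-metric change going from the baseline run to the nightly run.
changes = {name: percent_change(baseline[name], nightly[name]) for name in baseline}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these numbers, TTFT is roughly 2.6x slower and input token throughput drops by about 61%, while output token throughput and bandwidth each drop by about 20%.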

Happy to help test if you have questions, thanks!


supriyar · 2025-01-23T20:33:36Z

cc @drisspg any ideas on why this is the case?

@HDCharles @jainapurva hopefully our upcoming benchmarking work can catch regressions like these.

drisspg · 2025-01-23T20:45:58Z

That is pretty massive. Is this the llama_eval script?
