Nothing fancy here, just running single-batch inference on Llama 3.1 8B with float8_dynamic_activation_float8_weight quantization with the granularity set to PerTensor().

First run (torch 2.6.0a0+df5bbc09d1.nv24.12):
{'ttft': 0.01968639945983887, 'input_token_throughput': 2946.1964397462616, 'output_token_throughput': 73.3861714881309, 'bandwidth': '628.29 GB/s', 'peak_memory_usage': '22.80 GB', 'model_size': '8.58 GB', 'torch_version': '2.6.0a0+df5bbc09d1.nv24.12', 'torchao_version': '0.7.0'}

Second run (torch 2.7.0.dev20250122+cu126):
{'ttft': 0.050607105255126954, 'input_token_throughput': 1146.0841260847276, 'output_token_throughput': 58.74973548302016, 'bandwidth': '502.05 GB/s', 'peak_memory_usage': '22.68 GB', 'model_size': '8.58 GB', 'torch_version': '2.7.0.dev20250122+cu126', 'torchao_version': '0.7.0'}
The second run was from the latest PyTorch nightly and uses exactly the same code (no changes). Both runs were on SM89 hardware (NVIDIA RTX 6000 Ada Lovelace).
Happy to help test if you have questions, thanks!
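For reference, a minimal sketch of the quantization setup described above, assuming the public torchao `quantize_` API. The checkpoint name, prompt, and compile settings are assumptions; this is not the exact benchmark script that produced the numbers above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight
from torchao.quantization.granularity import PerTensor

# Hypothetical checkpoint name; the actual model/loading code may differ.
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# float8 dynamic activation + float8 weight quantization with per-tensor scales,
# as described in the report.
quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerTensor()))

# Assumption: the model was compiled before benchmarking.
model = torch.compile(model, mode="max-autotune")

# Single-batch inference.
inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```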