
Numerical Fix. #3688

Closed
wants to merge 1 commit

Conversation

levendlee
Member

Summary:

  • Always allocate workspace.

    • Allocation is almost free with PyTorch sub-allocation.
    • Skipping the allocation can cause problems in multi-processing and CUDA graph capture.
  • Avoid using num_warps=8 for the FP8 kernel.

    • For an unknown reason, using 2 warp groups can produce NaN values.
    • We suspect the on-device TMA store; debugging with htyu.
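The num_warps workaround can be sketched as a config-pruning step (a hypothetical illustration, not the actual FBGEMM code; `Config` and `prune_fp8_configs` are invented names standing in for Triton autotune configs):

```python
from dataclasses import dataclass


# Hypothetical stand-in for a Triton autotune configuration.
@dataclass(frozen=True)
class Config:
    block_m: int
    block_n: int
    num_warps: int


def prune_fp8_configs(configs):
    """Drop num_warps=8 (i.e. two warp groups) configs for the FP8 kernel,
    since they were observed to produce NaN outputs."""
    return [c for c in configs if c.num_warps != 8]


configs = [Config(128, 128, 8), Config(128, 128, 4), Config(64, 128, 8)]
pruned = prune_fp8_configs(configs)
```

Pruning the search space, rather than hard-coding one config, keeps the remaining autotune candidates intact while sidestepping the suspect warp-group count.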

Differential Revision: D69602533

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69602533


netlify bot commented Feb 13, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 61b6665
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67afcbd8f9558d0008654158
😎 Deploy Preview https://deploy-preview-3688--pytorch-fbgemm-docs.netlify.app

levendlee added a commit to levendlee/FBGEMM that referenced this pull request Feb 14, 2025
Summary:

X-link: facebookresearch/FBGEMM#764

- Always allocate workspace.
  - Allocation is almost free with PyTorch sub-allocation.
  - Skipping the allocation can cause problems in multi-processing and CUDA graph capture.

- Disable TMA store for now.
  - Running into issues with the on-device TMA store.
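The "always allocate" change amounts to dropping the conditional around the workspace buffer. A minimal sketch of the idea, assuming a hypothetical `get_workspace` helper (in the real kernel this would be a `torch.empty(...)` call served by PyTorch's caching allocator):

```python
def get_workspace(num_bytes: int) -> bytearray:
    """Always allocate the kernel workspace buffer.

    With PyTorch's caching allocator, repeated allocations are nearly
    free (the memory is sub-allocated from a cached pool).  A
    conditionally-skipped allocation, by contrast, can break
    multi-processing and CUDA graph capture, where the captured graph
    expects the workspace pointer to exist on every replay.
    """
    # Note: no `if needs_workspace:` guard -- allocate unconditionally.
    return bytearray(num_bytes)
```

Here `bytearray` stands in for a device tensor purely so the sketch is self-contained; the point is the unconditional allocation, not the buffer type.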

Differential Revision: D69602533
Reviewed By: jiawenliu64, jwfromm

@facebook-github-bot
Contributor

This pull request has been merged in e024eb7.
