
Numerical Fix. #3688

Closed
wants to merge 1 commit

Conversation

levendlee
Member

Summary:

  • Always allocate workspace.

    • Allocation is almost free with PyTorch sub-allocation.
    • Skipping the allocation can cause problems in multi-processing and CUDA graph capture.
  • Avoid using num_warps=8 for the FP8 kernel.

    • For an unknown reason, using 2 warp groups can produce NaN values.
    • We suspect the on-device TMA store; debugging with htyu.
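The num_warps workaround can be sketched as a config-pruning step (a hypothetical illustration, not the actual FBGEMM code; `Config` and `prune_fp8_configs` are invented names standing in for Triton autotune configs):

```python
from dataclasses import dataclass


# Hypothetical stand-in for a Triton autotune configuration.
@dataclass(frozen=True)
class Config:
    block_m: int
    block_n: int
    num_warps: int


def prune_fp8_configs(configs):
    """Drop num_warps=8 (i.e. two warp groups) configs for the FP8 kernel,
    since they were observed to produce NaN outputs."""
    return [c for c in configs if c.num_warps != 8]


configs = [Config(128, 128, 8), Config(128, 128, 4), Config(64, 128, 8)]
pruned = prune_fp8_configs(configs)
```

Pruning the search space, rather than hard-coding one config, keeps the remaining autotune candidates intact while sidestepping the suspect warp-group count.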

Differential Revision: D69602533

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69602533


netlify bot commented Feb 13, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 61b6665
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67afcbd8f9558d0008654158
😎 Deploy Preview https://deploy-preview-3688--pytorch-fbgemm-docs.netlify.app

levendlee added a commit to levendlee/FBGEMM that referenced this pull request Feb 14, 2025
Summary:

X-link: facebookresearch/FBGEMM#764

- Always allocate workspace.
  - Allocation is almost free with PyTorch sub-allocation.
  - Skipping the allocation can cause problems in multi-processing and CUDA graph capture.

- Disable TMA store for now.
  - Running into issues with the on-device TMA store.
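The "always allocate" change amounts to dropping the conditional around the workspace buffer. A minimal sketch of the idea, assuming a hypothetical `get_workspace` helper (in the real kernel this would be a `torch.empty(...)` call served by PyTorch's caching allocator):

```python
def get_workspace(num_bytes: int) -> bytearray:
    """Always allocate the kernel workspace buffer.

    With PyTorch's caching allocator, repeated allocations are nearly
    free (the memory is sub-allocated from a cached pool).  A
    conditionally-skipped allocation, by contrast, can break
    multi-processing and CUDA graph capture, where the captured graph
    expects the workspace pointer to exist on every replay.
    """
    # Note: no `if needs_workspace:` guard -- allocate unconditionally.
    return bytearray(num_bytes)
```

Here `bytearray` stands in for a device tensor purely so the sketch is self-contained; the point is the unconditional allocation, not the buffer type.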

Differential Revision: D69602533
Reviewed By: jiawenliu64, jwfromm

@facebook-github-bot
Contributor

This pull request has been merged in e024eb7.
