
Update cpu check to have better warning #3757

Status: Draft · wants to merge 3 commits into base: main
Conversation

KuuCi
Contributor

@KuuCi KuuCi commented Jan 30, 2025

What does this PR do?

When using HF models with mixed initialization, we can mistakenly emit the warning below, which indicates that the model is not going to be trained on GPU. That is not actually the case, so we should not emit the warning when mixed init is in use.

The warning message below states that the model is on CPU instead of GPU, which is confusing for users who explicitly specified mixed init (for example, with HF models).

/usr/lib/python3/dist-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
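For context, the warning comes from a device check in the callback's `init`. A minimal sketch of that logic (simplified from the message above, not the verbatim composer source; the function name here is hypothetical):

```python
import warnings

def memory_monitor_device_check(device_type: str) -> bool:
    """Sketch of the MemoryMonitor device check: warn when the model's
    first parameter is not on a CUDA device. Returns True if the
    warning would be emitted."""
    if device_type != 'cuda':
        warnings.warn(
            'The memory monitor only works on CUDA devices, '
            f'but the model is on {device_type}.'
        )
        return True
    return False
```

Under mixed init, rank 0's parameters report a `cpu` device at init time, so a check of this shape fires even though training will actually run on GPU.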

Testing

Without changes, mcli logs embedding-ft-r3jodb --follow:
[screenshot: logs showing the CPU warning]

With changes, mcli logs embedding-ft-Lf9M5o --follow:
the warning `The memory monitor only works on CUDA devices, but the model is on {model_device.type}.` no longer appears in the logs.

https://databricks.atlassian.net/browse/GRT-3119

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@KuuCi KuuCi marked this pull request as ready for review January 30, 2025 21:30
@irenedea
Contributor

irenedea commented Feb 4, 2025

I think this will still give a warning for mixed init, right? Because "mixed" is a foundry-only concept meaning rank 0 gets initialized on CPU while the other ranks are on meta, so rank 0 will still emit this warning.
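The rank/device split described above can be sketched as follows (a toy model of foundry-style mixed init with hypothetical helper names, checked against the PR's narrowed `== 'cpu'` condition):

```python
def mixed_init_device(rank: int) -> str:
    # Toy model of foundry-style "mixed" init: rank 0 materializes the
    # weights on CPU, all other ranks hold meta tensors.
    return 'cpu' if rank == 0 else 'meta'

def pr_check_warns(device_type: str) -> bool:
    # The PR narrows the warning to fire only when the model is on CPU,
    # which is exactly what rank 0 reports under mixed init.
    return device_type == 'cpu'

# Rank 0 still warns; the meta ranks are silent.
warning_ranks = [rank for rank in range(4)
                 if pr_check_warns(mixed_init_device(rank))]
```

This illustrates why narrowing the check alone does not silence the warning for mixed init: rank 0 is still on `cpu` at init time.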

@irenedea
Contributor

irenedea commented Feb 4, 2025

I'm seeing

/usr/lib/python3/dist-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')

in mcli logs embedding-ft-Lf9M5o

@@ -133,7 +133,8 @@ def init(self, state: State, logger: Logger) -> None:
         # Not relying on `torch.cuda.is_available()` since the model could be on CPU.
         model_device = next(state.model.parameters()).device

-        if model_device.type not in ('cuda', 'meta'):
+        print('----UNGA BUNGA', model_device.type)
+        if model_device.type == 'cpu':
Can you try moving the original warning to after_load? At that point, I think the model should be on GPU only if using mixed init
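The suggestion above can be sketched as a callback that defers the device check to `after_load` (a sketch only: method names follow Composer's callback event hooks, and `state`/`logger` stand in for `composer.core.State` and the Composer logger; this is not the actual `MemoryMonitor` source):

```python
import warnings

class MemoryMonitorSketch:
    """Sketch: move the device check from init to after_load, by which
    point a mixed-init model should have been moved to GPU."""

    def init(self, state, logger) -> None:
        # No device check here: during init, rank 0 of a mixed-init run
        # is still on CPU, which would trigger a false warning.
        pass

    def after_load(self, state, logger) -> None:
        model_device = next(state.model.parameters()).device
        if model_device.type == 'cpu':
            warnings.warn(
                'The memory monitor only works on CUDA devices, '
                f'but the model is on {model_device.type}.'
            )
```

With this shape, a model that is genuinely on CPU still warns, while a mixed-init model that has moved to GPU by `after_load` does not.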

@KuuCi KuuCi marked this pull request as draft February 4, 2025 07:05