
Update cpu check to have better warning #3757

Status: Draft · wants to merge 3 commits into base: main
Conversation

KuuCi
Contributor

@KuuCi KuuCi commented Jan 30, 2025

What does this PR do?

When using HF models with mixed initialization, we can mistakenly emit the warning below, which indicates that the model is not going to be trained on GPU. That is not actually the case, so we should not emit the warning when mixed init is in use.

The warning message below states that the model is on CPU instead of GPU, which is confusing for users who explicitly specified mixed init (for example, with HF models).

/usr/lib/python3/dist-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
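For context, the warning comes from a device check in the callback's `init`. A minimal sketch of that logic (simplified from the message above, not the verbatim composer source; the function name here is hypothetical):

```python
import warnings

def memory_monitor_device_check(device_type: str) -> bool:
    """Sketch of the MemoryMonitor device check: warn when the model's
    first parameter is not on a CUDA device. Returns True if the
    warning would be emitted."""
    if device_type != 'cuda':
        warnings.warn(
            'The memory monitor only works on CUDA devices, '
            f'but the model is on {device_type}.'
        )
        return True
    return False
```

Under mixed init, rank 0's parameters report a `cpu` device at init time, so a check of this shape fires even though training will actually run on GPU.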

Testing

Without changes, mcli logs embedding-ft-r3jodb --follow:
[screenshot: logs showing the CPU warning]

With changes, mcli logs embedding-ft-Lf9M5o --follow:
the warning `The memory monitor only works on CUDA devices, but the model is on {model_device.type}.` no longer appears in the logs.

https://databricks.atlassian.net/browse/GRT-3119

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@KuuCi KuuCi marked this pull request as ready for review January 30, 2025 21:30
@irenedea
Contributor

irenedea commented Feb 4, 2025

I think this will still give a warning for mixed init, right? Because "mixed" is a foundry-only concept meaning rank 0 gets initialized on CPU while the other ranks are on meta, so rank 0 will still emit this warning.
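The rank/device split described above can be sketched as follows (a toy model of foundry-style mixed init with hypothetical helper names, checked against the PR's narrowed `== 'cpu'` condition):

```python
def mixed_init_device(rank: int) -> str:
    # Toy model of foundry-style "mixed" init: rank 0 materializes the
    # weights on CPU, all other ranks hold meta tensors.
    return 'cpu' if rank == 0 else 'meta'

def pr_check_warns(device_type: str) -> bool:
    # The PR narrows the warning to fire only when the model is on CPU,
    # which is exactly what rank 0 reports under mixed init.
    return device_type == 'cpu'

# Rank 0 still warns; the meta ranks are silent.
warning_ranks = [rank for rank in range(4)
                 if pr_check_warns(mixed_init_device(rank))]
```

This illustrates why narrowing the check alone does not silence the warning for mixed init: rank 0 is still on `cpu` at init time.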

@irenedea
Contributor

irenedea commented Feb 4, 2025

I'm seeing

/usr/lib/python3/dist-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')

in mcli logs embedding-ft-Lf9M5o

@@ -133,7 +133,8 @@ def init(self, state: State, logger: Logger) -> None:
         # Not relying on `torch.cuda.is_available()` since the model could be on CPU.
         model_device = next(state.model.parameters()).device

-        if model_device.type not in ('cuda', 'meta'):
+        print('----UNGA BUNGA', model_device.type)
+        if model_device.type == 'cpu':
Can you try moving the original warning to after_load? At that point, I think the model should be on GPU only if using mixed init
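The suggestion above can be sketched as a callback that defers the device check to `after_load` (a sketch only: method names follow Composer's callback event hooks, and `state`/`logger` stand in for `composer.core.State` and the Composer logger; this is not the actual `MemoryMonitor` source):

```python
import warnings

class MemoryMonitorSketch:
    """Sketch: move the device check from init to after_load, by which
    point a mixed-init model should have been moved to GPU."""

    def init(self, state, logger) -> None:
        # No device check here: during init, rank 0 of a mixed-init run
        # is still on CPU, which would trigger a false warning.
        pass

    def after_load(self, state, logger) -> None:
        model_device = next(state.model.parameters()).device
        if model_device.type == 'cpu':
            warnings.warn(
                'The memory monitor only works on CUDA devices, '
                f'but the model is on {model_device.type}.'
            )
```

With this shape, a model that is genuinely on CPU still warns, while a mixed-init model that has moved to GPU by `after_load` does not.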

@KuuCi KuuCi marked this pull request as draft February 4, 2025 07:05