Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify determination of physical memory size #1513

Merged
merged 1 commit into from
Feb 3, 2025

Conversation

jee7s
Copy link
Contributor

@jee7s jee7s commented Jan 29, 2025

Details

Change memory size determination to avoid linux default, which can result in max linker parallelism (16) and in turn cause out of memory crashes on systems with limited memory (<32GB)

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Modifies calculation of max memory in the build.

Why were the changes made?
Existing use of cgroups memory limit can cause linker to incorrectly use max parallelism because linux defaults the memory.limit_in_bytes to be the span of virtual memory rounded to page size (PAGE_COUNTER_MAX in the kernel). This means the platform erroneously reports 2^64-4096 available bytes rather than the platform physical memory. The problem was observed in WSL ubuntu 24.04.

How was the outcome achieved?
Modified cmake to determine physical memory from 'free'. Total physical memory in KB is the first number in the output from 'free'.

Additional Documentation:
Alternatively, we could provide a cmake option to set the linker parallelism at the command line.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@jee7s jee7s merged commit 7af21dd into ROCm:develop Feb 3, 2025
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants