
[STF] Improved cache mechanism for executable CUDA graphs #3768

Draft · caugonnet wants to merge 6 commits into main
Conversation

@caugonnet (Contributor) commented on Feb 11, 2025

Description

A CUDA graph consumes around 10 KB per kernel node, which quickly adds up when large CUDA graphs are kept in cache. This PR implements a mechanism to cap the resources held by cached executable CUDA graphs.

closes
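To make the capping idea concrete, here is a minimal, self-contained sketch; the cache_entry/capped_cache names and layout are hypothetical stand-ins, not the PR's actual types. The cache tracks its total footprint and evicts the least recently used entries once a byte limit is exceeded.

#include <cstddef>
#include <cstdio>
#include <list>

struct cache_entry {
  size_t footprint; // roughly 10 KB per kernel node in the cached graph
};

struct capped_cache {
  size_t limit = 512 * 1024 * 1024; // cap on cached executable graphs
  size_t total = 0;
  ::std::list<cache_entry> lru;     // front = most recently used

  void insert(cache_entry e) {
    total += e.footprint;
    lru.push_front(e);
    while (total > limit && !lru.empty()) { // reclaim oldest entries first
      total -= lru.back().footprint;
      lru.pop_back();
    }
  }
};

int main() {
  capped_cache c;
  c.insert({10 * 1024}); // a single-kernel graph, ~10 KB
  ::std::printf("cache footprint: %zu bytes\n", c.total);
}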

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet added the stf (Sequential Task Flow programming model) label on Feb 11, 2025
@caugonnet self-assigned this on Feb 11, 2025
copy-pr-bot (bot) commented on Feb 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fprintf(stderr,
        "Reclaimed %s in cache graph (asked %s remaining %s)\n",
        pretty_print_bytes(reclaimed).c_str(),
        pretty_print_bytes(to_reclaim).c_str(),
        pretty_print_bytes(total_cache_footprint).c_str());
}
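For context, a compilable sketch of the eviction path that would emit this message; the entry list and the local pretty_print_bytes stand-in are assumptions, not the PR's code.

#include <cstddef>
#include <cstdio>
#include <list>
#include <string>

// Stand-in for the STF pretty_print_bytes helper used above.
static ::std::string pretty_print_bytes(size_t b) {
  char buf[32];
  ::std::snprintf(buf, sizeof(buf), "%.1f KB", b / 1024.0);
  return buf;
}

int main() {
  struct entry { size_t footprint; };
  ::std::list<entry> lru{{10 * 1024}, {10 * 1024}, {10 * 1024}};
  size_t total = 30 * 1024, to_reclaim = 15 * 1024, reclaimed = 0;

  // Evict least recently used graphs until enough bytes were reclaimed;
  // the real code would also destroy each evicted cudaGraphExec_t.
  while (reclaimed < to_reclaim && !lru.empty()) {
    reclaimed += lru.back().footprint;
    total -= lru.back().footprint;
    lru.pop_back();
  }

  ::std::fprintf(stderr, "Reclaimed %s in cache graph (asked %s remaining %s)\n",
                 pretty_print_bytes(reclaimed).c_str(),
                 pretty_print_bytes(to_reclaim).c_str(),
                 pretty_print_bytes(total).c_str());
}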

// TODO we should not have to redefine this one again
@caugonnet: this should rely on the existing ::std::hash<::std::pair<...>> unless we have a cyclic dependency.
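Note that the standard library does not specialize std::hash for std::pair, so the "existing" specialization referred to here is presumably STF's own. For reference, a typical boost-style combiner for a (nnodes, nedges) key looks like this; the names are illustrative.

#include <cstddef>
#include <functional>
#include <utility>

static size_t hash_combine(size_t seed, size_t v) {
  // boost::hash_combine-style mixing
  return seed ^ (::std::hash<size_t>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

static size_t hash_pair(const ::std::pair<size_t, size_t>& p) {
  return hash_combine(::std::hash<size_t>{}(p.first), p.second);
}

int main() {
  return hash_pair({3, 2}) == hash_pair({3, 2}) ? 0 : 1; // deterministic
}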

}
};

// TODO per device !
@caugonnet: or have a cache per device in the async resource.
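A sketch of that alternative, assuming one cache per device ordinal; the per_device_caches wrapper is hypothetical. Keeping the caches independent means graphs cached for one device are never matched against another's.

#include <cuda_runtime.h>
#include <vector>

struct executable_graph_cache { /* ... cap, LRU bookkeeping ... */ };

struct per_device_caches {
  ::std::vector<executable_graph_cache> caches;

  per_device_caches() {
    int ndevs = 0;
    cudaGetDeviceCount(&ndevs);
    caches.resize(ndevs > 0 ? ndevs : 1); // one independent cache per device
  }

  executable_graph_cache& current() {
    int dev = 0;
    cudaGetDevice(&dev); // select the cache of the current device
    return caches[dev];
  }
};

int main() {
  per_device_caches c;
  (void)c.current();
}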

last_use = cache->index;
}

void update() {
@caugonnet: rename this to lru_refresh().
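The surrounding code suggests a monotonic-counter LRU: the cache owns a global index and each access stamps the entry with it, so the smallest stamp marks the eviction candidate. A sketch under that assumption (the increment placement is a guess; the snippet above reads the index without bumping it here):

#include <cstddef>

struct executable_graph_cache {
  size_t index = 0; // global access counter, bumped on each cache hit
};

struct cache_entry {
  executable_graph_cache* cache;
  size_t last_use = 0;

  void lru_refresh() {         // proposed name for update()
    last_use = ++cache->index; // larger stamp == more recently used
  }
};

int main() {
  executable_graph_cache c;
  cache_entry e{&c};
  e.lru_refresh(); // e is now the most recently used entry
}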


executable_graph_cache *cache;
::std::shared_ptr<cudaGraphExec_t> exec_g;
size_t use_cnt;
@caugonnet: remove use_cnt, and/or replace it with cache stats.
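A sketch of what such stats could look like, aggregated on the cache rather than kept per entry; the struct and its fields are a proposal, not the PR's code.

#include <cstddef>

struct cache_stats {
  size_t hits = 0;      // query() found a reusable executable graph
  size_t misses = 0;    // a new cudaGraphExec_t had to be instantiated
  size_t reclaimed = 0; // total bytes evicted to stay under the cap
};

int main() {
  cache_stats s;
  ++s.hits;
  return 0;
}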

// Check if there is a matching entry (and update it if necessary)
::std::pair<bool, ::std::shared_ptr<cudaGraphExec_t>> query(size_t nnodes, size_t nedges, ::std::shared_ptr<cudaGraph_t> g)
{
auto range = cached_graphs.equal_range({nnodes, nedges});
@caugonnet: TODO reorder by oldest?
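For reference, a compilable sketch of the lookup's shape, assuming cached_graphs is a multimap keyed by (nnodes, nedges): every candidate with the same node/edge counts is tried, and cudaGraphExecUpdate (CUDA 12 signature) decides whether a candidate's topology truly matches the new graph. The container and helper names are assumptions; the PR's query also wraps the result in the shared_ptr pair shown above.

#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>
#include <utility>

struct pair_hash {
  size_t operator()(const ::std::pair<size_t, size_t>& p) const {
    return p.first * 0x9e3779b97f4a7c15ull ^ p.second;
  }
};

using graph_map =
  ::std::unordered_multimap<::std::pair<size_t, size_t>, cudaGraphExec_t, pair_hash>;

// Try to reuse a cached executable graph whose topology matches g.
bool try_reuse(graph_map& cached_graphs, size_t nnodes, size_t nedges,
               cudaGraph_t g, cudaGraphExec_t& out) {
  auto range = cached_graphs.equal_range({nnodes, nedges});
  for (auto it = range.first; it != range.second; ++it) {
    cudaGraphExecUpdateResultInfo info;
    if (cudaGraphExecUpdate(it->second, g, &info) == cudaSuccess) {
      out = it->second; // updated in place, no re-instantiation needed
      return true;
    }
    cudaGetLastError(); // clear the non-fatal update failure and keep trying
  }
  return false; // caller instantiates (and caches) a fresh executable graph
}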

executable_graph_cache()
{
cache_size_limit = 512 * 1024 * 1024;
const char* str = getenv("CUDASTF_GRAPH_CACHE_SIZE_MB");
@caugonnet: define what happens for MB = 0 (infinite?), and perhaps we'd be better off making that the default...
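One possible resolution, sketched below: treat CUDASTF_GRAPH_CACHE_SIZE_MB=0 as "no cap". The 512 MB default matches the snippet above; the zero-means-infinite behavior is the open proposal here, not what the PR currently implements.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <limits>

static size_t graph_cache_limit() {
  size_t limit = 512 * 1024 * 1024; // default cap: 512 MB
  if (const char* str = ::std::getenv("CUDASTF_GRAPH_CACHE_SIZE_MB")) {
    size_t mb = ::std::strtoull(str, nullptr, 10);
    limit = (mb == 0) ? ::std::numeric_limits<size_t>::max() // 0 => uncapped
                      : mb * 1024 * 1024;
  }
  return limit;
}

int main() {
  ::std::printf("graph cache limit: %zu bytes\n", graph_cache_limit());
}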

Labels
stf Sequential Task Flow programming model
Projects
Status: In Progress