[STF] Improved cache mechanism for executable CUDA graphs #3768
base: main
Conversation
fprintf(stderr, "Reclaimed %s in cache graph (asked %s remaining %s)\n", pretty_print_bytes(reclaimed).c_str(), pretty_print_bytes(to_reclaim).c_str(), pretty_print_bytes(total_cache_footprint).c_str());
}

// TODO we should not have to redefine this one again
this should rely on the existing ::std::hash<pair<...>> specialization unless we have a cyclic dependency
}
};

// TODO per device !
or have a cache per device in the async resource
last_use = cache->index;
}

void update() {
rename this to lru_refresh()
executable_graph_cache *cache;
::std::shared_ptr<cudaGraphExec_t> exec_g;
size_t use_cnt;
remove, and/or replace with cache stats
// Check if there is a matching entry (and update it if necessary)
::std::pair<bool, ::std::shared_ptr<cudaGraphExec_t>> query(size_t nnodes, size_t nedges, ::std::shared_ptr<cudaGraph_t> g)
{
  auto range = cached_graphs.equal_range({nnodes, nedges});
TODO reorder by oldest ?
executable_graph_cache()
{
  cache_size_limit = 512 * 1024 * 1024;
  const char* str = getenv("CUDASTF_GRAPH_CACHE_SIZE_MB");
Define what happens for MB = 0 (infinite ?) and perhaps we'd better make this the default ...
Description
A CUDA graph consumes around 10 KB per kernel node, which quickly adds up when large CUDA graphs are kept in cache. This PR implements a mechanism to cap the resources consumed by cached executable CUDA graphs.
closes
Checklist