
[STF] Improved cache mechanism for executable CUDA graphs #3768

Draft · caugonnet wants to merge 6 commits into main
Conversation

@caugonnet (Contributor) commented on Feb 11, 2025

Description

A CUDA graph consumes around 10 KB per kernel node, which quickly adds up when large CUDA graphs are kept in cache. This PR implements a mechanism to cap the resources held by cached executable CUDA graphs.

closes
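To make the capping idea concrete, here is a minimal, self-contained sketch; the cache_entry/capped_cache names and layout are hypothetical stand-ins, not the PR's actual types. The cache tracks its total footprint and evicts the least recently used entries once a byte limit is exceeded.

#include <cstddef>
#include <cstdio>
#include <list>

struct cache_entry {
  size_t footprint; // roughly 10 KB per kernel node in the cached graph
};

struct capped_cache {
  size_t limit = 512 * 1024 * 1024; // cap on cached executable graphs
  size_t total = 0;
  ::std::list<cache_entry> lru;     // front = most recently used

  void insert(cache_entry e) {
    total += e.footprint;
    lru.push_front(e);
    while (total > limit && !lru.empty()) { // reclaim oldest entries first
      total -= lru.back().footprint;
      lru.pop_back();
    }
  }
};

int main() {
  capped_cache c;
  c.insert({10 * 1024}); // a single-kernel graph, ~10 KB
  ::std::printf("cache footprint: %zu bytes\n", c.total);
}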

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet added the stf (Sequential Task Flow programming model) label on Feb 11, 2025
@caugonnet self-assigned this on Feb 11, 2025
copy-pr-bot (bot) commented on Feb 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fprintf(stderr,
        "Reclaimed %s in cache graph (asked %s remaining %s)\n",
        pretty_print_bytes(reclaimed).c_str(),
        pretty_print_bytes(to_reclaim).c_str(),
        pretty_print_bytes(total_cache_footprint).c_str());
}
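For context, a compilable sketch of the eviction path that would emit this message; the entry list and the local pretty_print_bytes stand-in are assumptions, not the PR's code.

#include <cstddef>
#include <cstdio>
#include <list>
#include <string>

// Stand-in for the STF pretty_print_bytes helper used above.
static ::std::string pretty_print_bytes(size_t b) {
  char buf[32];
  ::std::snprintf(buf, sizeof(buf), "%.1f KB", b / 1024.0);
  return buf;
}

int main() {
  struct entry { size_t footprint; };
  ::std::list<entry> lru{{10 * 1024}, {10 * 1024}, {10 * 1024}};
  size_t total = 30 * 1024, to_reclaim = 15 * 1024, reclaimed = 0;

  // Evict least recently used graphs until enough bytes were reclaimed;
  // the real code would also destroy each evicted cudaGraphExec_t.
  while (reclaimed < to_reclaim && !lru.empty()) {
    reclaimed += lru.back().footprint;
    total -= lru.back().footprint;
    lru.pop_back();
  }

  ::std::fprintf(stderr, "Reclaimed %s in cache graph (asked %s remaining %s)\n",
                 pretty_print_bytes(reclaimed).c_str(),
                 pretty_print_bytes(to_reclaim).c_str(),
                 pretty_print_bytes(total).c_str());
}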

// TODO we should not have to redefine this one again
@caugonnet: this should rely on the existing ::std::hash<::std::pair<...>> unless we have a cyclic dependency.
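Note that the standard library does not specialize std::hash for std::pair, so the "existing" specialization referred to here is presumably STF's own. For reference, a typical boost-style combiner for a (nnodes, nedges) key looks like this; the names are illustrative.

#include <cstddef>
#include <functional>
#include <utility>

static size_t hash_combine(size_t seed, size_t v) {
  // boost::hash_combine-style mixing
  return seed ^ (::std::hash<size_t>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

static size_t hash_pair(const ::std::pair<size_t, size_t>& p) {
  return hash_combine(::std::hash<size_t>{}(p.first), p.second);
}

int main() {
  return hash_pair({3, 2}) == hash_pair({3, 2}) ? 0 : 1; // deterministic
}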

}
};

// TODO per device !
@caugonnet: or have a cache per device in the async resource.
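A sketch of that alternative, assuming one cache per device ordinal; the per_device_caches wrapper is hypothetical. Keeping the caches independent means graphs cached for one device are never matched against another's.

#include <cuda_runtime.h>
#include <vector>

struct executable_graph_cache { /* ... cap, LRU bookkeeping ... */ };

struct per_device_caches {
  ::std::vector<executable_graph_cache> caches;

  per_device_caches() {
    int ndevs = 0;
    cudaGetDeviceCount(&ndevs);
    caches.resize(ndevs > 0 ? ndevs : 1); // one independent cache per device
  }

  executable_graph_cache& current() {
    int dev = 0;
    cudaGetDevice(&dev); // select the cache of the current device
    return caches[dev];
  }
};

int main() {
  per_device_caches c;
  (void)c.current();
}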

last_use = cache->index;
}

void update() {
@caugonnet: rename this to lru_refresh().
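The surrounding code suggests a monotonic-counter LRU: the cache owns a global index and each access stamps the entry with it, so the smallest stamp marks the eviction candidate. A sketch under that assumption (the increment placement is a guess; the snippet above reads the index without bumping it here):

#include <cstddef>

struct executable_graph_cache {
  size_t index = 0; // global access counter, bumped on each cache hit
};

struct cache_entry {
  executable_graph_cache* cache;
  size_t last_use = 0;

  void lru_refresh() {         // proposed name for update()
    last_use = ++cache->index; // larger stamp == more recently used
  }
};

int main() {
  executable_graph_cache c;
  cache_entry e{&c};
  e.lru_refresh(); // e is now the most recently used entry
}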


executable_graph_cache *cache;
::std::shared_ptr<cudaGraphExec_t> exec_g;
size_t use_cnt;
@caugonnet: remove use_cnt, and/or replace it with cache stats.
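A sketch of what such stats could look like, aggregated on the cache rather than kept per entry; the struct and its fields are a proposal, not the PR's code.

#include <cstddef>

struct cache_stats {
  size_t hits = 0;      // query() found a reusable executable graph
  size_t misses = 0;    // a new cudaGraphExec_t had to be instantiated
  size_t reclaimed = 0; // total bytes evicted to stay under the cap
};

int main() {
  cache_stats s;
  ++s.hits;
  return 0;
}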

// Check if there is a matching entry (and update it if necessary)
::std::pair<bool, ::std::shared_ptr<cudaGraphExec_t>> query(size_t nnodes, size_t nedges, ::std::shared_ptr<cudaGraph_t> g)
{
auto range = cached_graphs.equal_range({nnodes, nedges});
@caugonnet: TODO reorder by oldest?
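For reference, a compilable sketch of the lookup's shape, assuming cached_graphs is a multimap keyed by (nnodes, nedges): every candidate with the same node/edge counts is tried, and cudaGraphExecUpdate (CUDA 12 signature) decides whether a candidate's topology truly matches the new graph. The container and helper names are assumptions; the PR's query also wraps the result in the shared_ptr pair shown above.

#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>
#include <utility>

struct pair_hash {
  size_t operator()(const ::std::pair<size_t, size_t>& p) const {
    return p.first * 0x9e3779b97f4a7c15ull ^ p.second;
  }
};

using graph_map =
  ::std::unordered_multimap<::std::pair<size_t, size_t>, cudaGraphExec_t, pair_hash>;

// Try to reuse a cached executable graph whose topology matches g.
bool try_reuse(graph_map& cached_graphs, size_t nnodes, size_t nedges,
               cudaGraph_t g, cudaGraphExec_t& out) {
  auto range = cached_graphs.equal_range({nnodes, nedges});
  for (auto it = range.first; it != range.second; ++it) {
    cudaGraphExecUpdateResultInfo info;
    if (cudaGraphExecUpdate(it->second, g, &info) == cudaSuccess) {
      out = it->second; // updated in place, no re-instantiation needed
      return true;
    }
    cudaGetLastError(); // clear the non-fatal update failure and keep trying
  }
  return false; // caller instantiates (and caches) a fresh executable graph
}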

executable_graph_cache()
{
cache_size_limit = 512 * 1024 * 1024;
const char* str = getenv("CUDASTF_GRAPH_CACHE_SIZE_MB");
@caugonnet: define what happens for MB = 0 (infinite?), and perhaps we'd be better off making that the default...
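One possible resolution, sketched below: treat CUDASTF_GRAPH_CACHE_SIZE_MB=0 as "no cap". The 512 MB default matches the snippet above; the zero-means-infinite behavior is the open proposal here, not what the PR currently implements.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <limits>

static size_t graph_cache_limit() {
  size_t limit = 512 * 1024 * 1024; // default cap: 512 MB
  if (const char* str = ::std::getenv("CUDASTF_GRAPH_CACHE_SIZE_MB")) {
    size_t mb = ::std::strtoull(str, nullptr, 10);
    limit = (mb == 0) ? ::std::numeric_limits<size_t>::max() // 0 => uncapped
                      : mb * 1024 * 1024;
  }
  return limit;
}

int main() {
  ::std::printf("graph cache limit: %zu bytes\n", graph_cache_limit());
}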

Labels
stf Sequential Task Flow programming model
Projects
Status: In Progress