Replies: 1 comment
Hi @tgmaxson, this sounds related to the PyTorch caching allocator, which often keeps memory allocated on the GPU (from the system/driver's perspective) even when it is sitting unused from the Python code's perspective. This is done to speed up repeated allocations of similarly sized buffers, which is the defining memory pattern of most ML workloads; there is a lot of discussion on the PyTorch forums about how to manage, understand, and clear this cache. I don't think there is anything to do here on the NequIP code side, since we rely entirely on PyTorch for memory management.
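For reference, a minimal sketch of the usual pattern for releasing GPU memory between models: drop all Python references to the old calculator, force garbage collection, and then ask PyTorch to release its cached blocks back to the driver. The helper name and filename below are placeholders, and the import path may differ between NequIP versions; only the `torch.cuda` calls are standard PyTorch API.

```python
import gc
import torch
from nequip.ase import NequIPCalculator  # import path is an assumption; adjust to your NequIP version

def load_calculator(model_filename, device="cuda"):
    # Hypothetical helper: build a fresh calculator for each deployed model file.
    return NequIPCalculator.from_deployed_model(model_filename, device=device)

calc = load_calculator("deployed_model.pth")  # placeholder filename
# ... run calculations with `calc` ...

# Release the old model before loading the next one:
del calc
gc.collect()                  # make sure Python has actually dropped the TorchScript module
torch.cuda.empty_cache()      # return cached, unused blocks to the driver

# Sanity check: bytes held by live tensors vs. bytes reserved by the caching allocator
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```

Note that `empty_cache()` only releases blocks that are no longer referenced by any tensor, so the `del` and `gc.collect()` steps matter if anything (an ASE `Atoms` object, a results cache, etc.) still holds a reference to the calculator.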
As for multiple GPUs: yes, the device string is passed directly to PyTorch, so "cuda:0" and "cuda:1" should work as expected. If not, please file an issue. Thanks!
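As an illustration, a hedged sketch of selecting specific GPUs by index, assuming the device string is forwarded to PyTorch as described above (filenames are placeholders):

```python
from nequip.ase import NequIPCalculator  # import path is an assumption; adjust to your NequIP version

# Device strings follow PyTorch conventions, so a GPU index can be selected directly.
calc_gpu0 = NequIPCalculator.from_deployed_model("deployed_model.pth", device="cuda:0")
calc_gpu1 = NequIPCalculator.from_deployed_model("deployed_model.pth", device="cuda:1")
```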
I am training and loading models using the ASE calculator many times during the course of my work, and it seems that this effectively leaks memory on the GPU, since the TorchScript model somehow stays resident there (even after Python loses its reference to the ASE calculator). I am currently using NequIPCalculator.from_deployed_model(model_filename, device=device) to load models, and I think I need to do one of two things, but am unsure how:
1. Replace the model in the calculator with a new model. Is it possible to simply overwrite an existing calculator rather than reading and creating a new one? The old ones must still be around somehow.
2. Clear the TorchScript module from the calculator / empty the GPU cache before losing the reference to the calculator.
Also, as a final, somewhat related question: I will be able to work with multiple GPUs in the near future. Does the device="cuda" directive pass directly to PyTorch? I am wondering whether "cuda:0" and "cuda:1" will be supported.