-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make EnergyModel
pickable
#126
Conversation
Thanks @felixmusil for the PR, that all looks good. Have you run explicit checks that Since it seems like you're dealing with pretty low-level stuff in the code, just wanted to extend that if you have any other questions, please feel free to reach out to us any time. Best, |
Hi @felixmusil, Thanks for the PR. Beyond Simon's point, while I'm happy to merge this, I want to be clear that we dropped the use of pickling for models / any commitment to support it as of 0.4.0. The reason for this is that it carries a lot of internal noise, unnecessarily constrains development, and introduces a strong dependence of saved models on the internals of library code even when nothing fundamental has changed. (This is particularly relevant to e3nn, whose internals often change even when the interfaces are stable.) The new supported approach is to rebuild the model from the configuration wherever you need it, and to save the |
Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well... (Another warning: the ability to pickle e3nn models is something I implemented in that library with a lot of hacks that are somewhat sensitive to what device you built the object on and what devices are available where you load it...) |
Thanks for the quick reply.
Running this snippet shows that the results are the same within machine precision. x = torch.linspace(-10, 10, 100000)
for func1,func2 in [(torch.nn.functional.silu, torch.nn.SiLU()),
(torch.nn.functional.softplus, torch.nn.Softplus()),
(torch.tanh, torch.nn.Tanh())]:
y1 = func1(x)
y2 = func2(x)
print(torch.allclose(y1,y1))
yes, if you run the failing example above with this fix, it works.
ok I see. I understand your point about saving and your new saving strategy is the way to go. However pickling is essential to work with multiprocessing applications, e.g. multi gpu training. |
Great, thanks for checking.
I'm not actually sure that this is true--- can you elaborate on your setup? We are beginning to look into multi-GPU training and so far it seems to me that the most natural approach is for each process to build it's own copy of the model and to then distribute rank 0's This should also involve less IPC and IO than full pickled models, because then you don't have to serialize/de-serialize TorchScript blobs for generated code from e3nn. (The only way I was able to make e3nn's dynamically codegen'd modules pickle-compatible was to have them serialize their compiled, optimized TorchScript cores using a |
Hi @felixmusil just curious if the above worked for you or if this is still an issue 🙂 |
Hi @felixmusil, it looks like you're making edits on this branch but the PR is still open. We do not wrap the atoms in NequIP since our neighborlists respect unwrapped positions and wrapping them would not allow us to do simulation in which we want to keep track of unwrapped positions, e.g. computing the diffusivity. Shall we close this PR? |
Hi @felixmusil , I'm seeing a few more commits to this PR— is this still something you are trying to do, or can we close this PR? (Was rebuilding the model in child processes sufficient for you?) I'm also seeing that you made this "remove empty dim" commit— what is its purpose, and is it related to some issue with shapes you may have found? Thanks! |
Hi @Linux-cpp-lisp, Sorry for disappearing on you. I reverted the unnecessary commits and updated the PR branch to the current
You are right, it's really tied with me using
I understand that and I removed this commit from the PR.
For some reason, I have a corner case where an additional empty dimension (of size 1) was present which was not compatible with the rest of the computations. |
@felixmusil sounds good, I'll take a look at this again soon here for merging. Re multi-GPU support, take a look at this recent issue: #210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our
Hm yeah this sounds like it could be a bug on our part... are you able to reproduce this still on the latest NequIP |
Thanks for the suggestion, I will have a look at how you handled the problem. I have a few other models working with
I will raise an issue/PR if I manage to reproduce it. |
Co-authored-by: Félix <[email protected]>
Hi @felixmusil , I came back to this again now after a while and I thought I would commit it (I've added a commit off of current However, for reasons I'm not sure of (call stack/module stack depth?) this has a non-negligible negative effect on speed, so I will not be merging it. Closing this PR for that reason, but the (See above for discussion of our supported |
Description
The current models created with
EnergyModel
are not pickable. For instance, adeepcopy
of such model will fail with the current master branch (see below for an example).The problem comes from using activation functions directly from
torch.nn.functional
(which are not pickable) instead oftorch.nn
(which aretorch.nn.Module
hence piclable).Motivation and Context
A piclable model can be saved easily and used in a multiprocessing context.
How Has This Been Tested?
The code above runs without errors and the tests are passing.
Types of changes
Checklist:
black
.docs/options
) has been updated with new or changed options.CHANGELOG.md
.