How to Simulate Large System using TensorNet + OpenMM-Torch? #347
Hi,
Thanks for using TensorNet. I suspect the problem comes from using a cutoff
of 10 Å. The model corresponding to the SPICE yaml was never used to run MD
on such large systems; the cutoff was set to 10 Å because of the presence of
dimers. However, I think people in the literature have trained other models
on SPICE with smaller cutoffs without problems, so I would try 5 Å. Also,
make sure that the static_shapes argument is set to False.
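In yaml terms, the two changes suggested above would be just this fragment (everything else stays as in the shipped SPICE yaml):

```yaml
cutoff_upper: 5.0     # instead of the 10.0 used to capture the SPICE dimers
static_shapes: false
```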
As a side note, a smaller and shallower model compared to the one you get with
the yaml might be enough for your purposes. Take into account also that the
hyperparameters for SPICE have not been fully optimized. It depends on the
accuracy/efficiency tradeoff and on your system.
Also, you can try to initialize a model (create_model) with your desired
config, even if it is not trained, deploy it in OpenMM, and perform a
simulation step to see if it fits in memory before training it.
Guillem
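That last suggestion can be scripted. Below is a rough, untested sketch of how one might probe peak GPU memory with an untrained model built via torchmd-net's `create_model`; the `hparams.yaml` path and the atom count are placeholders, and the yaml is assumed to contain the complete hyperparameter set that `create_model` expects:

```python
def probe_memory(hparams_path: str, num_atoms: int = 10000) -> float:
    """Build an untrained model from a hyperparameter yaml, run one
    energy/force evaluation on random coordinates, and return the peak
    GPU memory in GiB. Assumes torchmd-net, pyyaml, and a CUDA device
    are available; the yaml must hold a full config."""
    import yaml
    import torch
    from torchmdnet.models.model import create_model  # heavy imports deferred

    with open(hparams_path) as f:
        args = yaml.safe_load(f)
    model = create_model(args).to("cuda")

    # Random carbon atoms in a 50 A box -- only the memory footprint matters,
    # not the (untrained, meaningless) energies.
    z = torch.full((num_atoms,), 6, dtype=torch.long, device="cuda")
    pos = torch.rand(num_atoms, 3, device="cuda") * 50.0

    torch.cuda.reset_peak_memory_stats()
    energy, forces = model(z, pos)  # derivative=true -> forces are returned
    return torch.cuda.max_memory_allocated() / 1024**3

# Usage (requires a GPU and a trained-or-not config file):
#   peak = probe_memory("hparams.yaml", num_atoms=10000)
```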
…On Fri, 8 Nov 2024 at 00:52, kei0822kei ***@***.***> wrote:
Hi,
Thank you for maintaining this great package.
I want to simulate a relatively large system (~10000 atoms) using TensorNet.
After I finished training a model using TensorNet-SPICE.yaml
<https://github.com/torchmd/torchmd-net/blob/main/examples/TensorNet-SPICE.yaml>,
I tried to apply this model to an MD simulation of a larger system using
openmm-torch. When I simulated a system composed of ~4000 atoms, 80
GiB of GPU memory filled up. I found that calculating the forces
(the backpropagation phase) consumed most of the GPU memory and resulted
in an out-of-memory error.
Is there a possible way to avoid this?
I expect that calculating atomic energies with the 'representation_model'
(TensorNet) could be split into batches, avoiding the large GPU memory
usage. Is that possible?
6 layers is crazy, and 256 hidden dimensions is too much; it is normal that it
does not fit. The largest model I have ever used was 256 hidden dimensions and
3 layers, to achieve SOTA on QM9, but in all other cases I used 128 hidden
dimensions and 2 layers (1 layer is also good). I don't expect the model to
work much better with 6 layers than with 2 (I would expect the opposite, in
fact). 6 layers has never been tested, and the simulation/training will be
extremely slow.
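Translated into the yaml, the sizes mentioned above would read (fragment only; all other keys unchanged):

```yaml
embedding_dimension: 128   # instead of 256
num_layers: 2              # instead of 6; 1 or 3 are also reasonable
```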
Regarding the splitting, I am not sure I understood. But take into
account that for energy and force prediction, the output corresponding to
some atom depends on all other atoms found within a distance of
(L+1)*cutoff, where L is the number of layers you put in the yaml file. So
to get a meaningful atomic energy contribution you need that atom and all
the other ones. Also, take into account that the force on an atom is the
gradient of the total energy with respect to its position, so you would need
to accumulate and sum the gradients of all the U_i terms (per-atom energies)
that depend on that atom. It is possible, but not useful.
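To put numbers on the receptive-field argument above, here is a back-of-the-envelope sketch; the ~0.1 atoms/Å³ condensed-phase density is my assumption, not a value from the thread:

```python
import math

def receptive_field_radius(num_layers: int, cutoff: float) -> float:
    # Each interaction layer propagates information one cutoff further,
    # so an atom's output depends on all atoms within (L + 1) * cutoff.
    return (num_layers + 1) * cutoff

def atoms_in_sphere(radius: float, density: float) -> int:
    # Rough atom count inside the receptive field for a homogeneous system.
    return int(4.0 / 3.0 * math.pi * radius ** 3 * density)

# The revised config: 6 layers with a 5 A cutoff.
r = receptive_field_radius(6, 5.0)   # 35.0 A
n = atoms_in_sphere(r, 0.1)          # roughly 18k atoms at ~0.1 atoms/A^3
print(r, n)
```

With 6 layers, even a 5 Å cutoff makes each per-atom energy depend on nearly the whole ~10000-atom system, which is why per-atom batching would save little memory here.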
…On Fri, 8 Nov 2024 at 01:57, kei0822kei ***@***.***> wrote:
Thank you for your reply, and sorry for my insufficient description.
Actually, I revised the yaml file from the original one and used
cutoff_upper=5.0.
My hparams.yaml is as follows.
(I wrote an ASE dataset class in order to use the newest SPICE dataset;
please ignore those settings.)
load_model: null
conf: null
num_epochs: 100000
batch_size: 64
inference_batch_size: 64
lr: 0.0001
lr_patience: 15
lr_metric: val
lr_min: 1.0e-07
lr_factor: 0.8
lr_warmup_steps: 1000
early_stopping_patience: 30
reset_trainer: false
weight_decay: 0.0
ema_alpha_y: 1.0
ema_alpha_neg_dy: 1.0
ngpus: -1
num_nodes: 1
precision: 32
log_dir: .
splits: null
train_size: null
val_size: 0.05
test_size: 0.1
test_interval: 10
save_interval: 10
seed: 42
num_workers: 48
redirect: true
gradient_clipping: 40
remove_ref_energy: false
dataset: ASE
dataset_root: /data/spice/2.0.1/spice_with_charge
dataset_arg:
  periodic: false
  energy_key: formation_energy
  forces_key: forces
  partial_charges_key: charges
coord_files: null
embed_files: null
energy_files: null
force_files: null
dataset_preload_limit: 1024
y_weight: 0.05
neg_dy_weight: 0.95
train_loss: mse_loss
train_loss_arg: null
model: tensornet
output_model: Scalar
output_mlp_num_layers: 0
prior_model:
  ZBL:
    cutoff_distance: 3
    max_num_neighbors: 5
charge: false
spin: false
embedding_dimension: 256
num_layers: 6
num_rbf: 64
activation: silu
rbf_type: expnorm
trainable_rbf: false
neighbor_embedding: false
aggr: add
distance_influence: both
attn_activation: silu
num_heads: 8
vector_cutoff: false
equivariance_invariance_group: O(3)
box_vecs: null
static_shapes: false
check_errors: true
derivative: true
cutoff_lower: 0.0
cutoff_upper: 5.0
atom_filter: -1
max_z: 100
max_num_neighbors: 64
standardize: false
reduce_op: add
wandb_use: false
wandb_name: training
wandb_project: training_
wandb_resume_from_id: null
tensorboard_use: true
prior_args:
- cutoff_distance: 3
  max_num_neighbors: 5
  atomic_number: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                  17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
                  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
                  47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
                  62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
                  77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
                  92, 93, 94, 95, 96, 97, 98, 99]
  distance_scale: 1.0e-10
  energy_scale: 1.60218e-19
As you advised, I should check the accuracy/efficiency tradeoff, especially
the following settings.
embedding_dimension: 256
num_layers: 6
num_rbf: 64
If GPU memory shortage still occurs after parameter optimization,
I am going to try splitting the atomic energy calculation into batches.
If we do not take long-range interactions into account, the system energy
can be written as the sum of atomic energies,
$E = \sum_i E_i$
where each $E_i$ can be calculated from the information of its neighboring
atoms, and its force $\boldsymbol{F}_i$ can be calculated likewise.
Therefore, I expect I can split the atomic energy calculation into batches
and control the GPU memory usage.
Do you think this is possible?
I understand my model is terribly large. I should have read the paper more carefully.