
How to Simulate Large System using TensorNet + OpenMM-Torch? #347

Open
kei0822kei opened this issue Nov 7, 2024 · 4 comments

Comments

@kei0822kei

Hi,

Thank you for maintaining this great package.
I want to simulate a relatively large system (~10,000 atoms) using TensorNet.

After I finished training a model using TensorNet-SPICE.yaml, I tried to apply it to an MD simulation of a larger system using openmm-torch. When I simulated a system composed of ~4,000 atoms, the 80 GiB of GPU memory filled up. I found that the force calculation (the backpropagation phase) consumed most of the GPU memory and caused the out-of-memory error.

Is there a possible way to avoid this?

I expect that computing the atomic energies with the 'representation_model' (TensorNet) could be split into batches, which would avoid using so much GPU memory. Is that possible?
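
For context, this is roughly how I wrap the trained checkpoint for openmm-torch (a minimal sketch; the checkpoint path, atomic number list, and unit conversions are my own assumptions and may not match a recommended setup):

```python
import torch

from torchmdnet.models.model import load_model  # TorchMD-Net checkpoint loader


class TensorNetForce(torch.nn.Module):
    """Wrap the trained model so openmm-torch can evaluate it.

    openmm-torch passes positions in nm and expects an energy in kJ/mol;
    the Angstrom/eV conversions below assume my training units.
    """

    def __init__(self, checkpoint: str, atomic_numbers):
        super().__init__()
        # derivative=False: OpenMM obtains forces by backpropagating the energy.
        self.model = load_model(checkpoint, derivative=False)
        self.register_buffer("z", torch.tensor(atomic_numbers, dtype=torch.long))

    def forward(self, positions):
        pos = positions.to(torch.float32) * 10.0  # nm -> Angstrom
        energy, _ = self.model(self.z, pos)
        return energy.sum() * 96.4853             # eV -> kJ/mol (my assumption)


# module = torch.jit.script(TensorNetForce("my_checkpoint.ckpt", atomic_numbers))
# module.save("tensornet_force.pt")
# from openmmtorch import TorchForce
# system.addForce(TorchForce("tensornet_force.pt"))
```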

@guillemsimeon
Collaborator

guillemsimeon commented Nov 8, 2024 via email

@kei0822kei
Author

Thank you for your reply, and sorry for the lack of detail in my description.

Actually, I revised the YAML file from the original one and used cutoff_upper=5.0.
My hparams.yaml is as follows.
(I wrote an ASE dataset class in order to use the newest SPICE dataset; please ignore those settings.)

load_model: null
conf: null
num_epochs: 100000
batch_size: 64
inference_batch_size: 64
lr: 0.0001
lr_patience: 15
lr_metric: val
lr_min: 1.0e-07
lr_factor: 0.8
lr_warmup_steps: 1000
early_stopping_patience: 30
reset_trainer: false
weight_decay: 0.0
ema_alpha_y: 1.0
ema_alpha_neg_dy: 1.0
ngpus: -1
num_nodes: 1
precision: 32
log_dir: .
splits: null
train_size: null
val_size: 0.05
test_size: 0.1
test_interval: 10
save_interval: 10
seed: 42
num_workers: 48
redirect: true
gradient_clipping: 40
remove_ref_energy: false
dataset: ASE
dataset_root: /data/spice/2.0.1/spice_with_charge
dataset_arg:
  periodic: false
  energy_key: formation_energy
  forces_key: forces
  partial_charges_key: charges
coord_files: null
embed_files: null
energy_files: null
force_files: null
dataset_preload_limit: 1024
y_weight: 0.05
neg_dy_weight: 0.95
train_loss: mse_loss
train_loss_arg: null
model: tensornet
output_model: Scalar
output_mlp_num_layers: 0
prior_model:
  ZBL:
    cutoff_distance: 3
    max_num_neighbors: 5
charge: false
spin: false
embedding_dimension: 256
num_layers: 6
num_rbf: 64
activation: silu
rbf_type: expnorm
trainable_rbf: false
neighbor_embedding: false
aggr: add
distance_influence: both
attn_activation: silu
num_heads: 8
vector_cutoff: false
equivariance_invariance_group: O(3)
box_vecs: null
static_shapes: false
check_errors: true
derivative: true
cutoff_lower: 0.0
cutoff_upper: 5.0
atom_filter: -1
max_z: 100
max_num_neighbors: 64
standardize: false
reduce_op: add
wandb_use: false
wandb_name: training
wandb_project: training_
wandb_resume_from_id: null
tensorboard_use: true
prior_args:
- cutoff_distance: 3
  max_num_neighbors: 5
  atomic_number:
  - 0
  - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  - 8
  - 9
  - 10
  - 11
  - 12
  - 13
  - 14
  - 15
  - 16
  - 17
  - 18
  - 19
  - 20
  - 21
  - 22
  - 23
  - 24
  - 25
  - 26
  - 27
  - 28
  - 29
  - 30
  - 31
  - 32
  - 33
  - 34
  - 35
  - 36
  - 37
  - 38
  - 39
  - 40
  - 41
  - 42
  - 43
  - 44
  - 45
  - 46
  - 47
  - 48
  - 49
  - 50
  - 51
  - 52
  - 53
  - 54
  - 55
  - 56
  - 57
  - 58
  - 59
  - 60
  - 61
  - 62
  - 63
  - 64
  - 65
  - 66
  - 67
  - 68
  - 69
  - 70
  - 71
  - 72
  - 73
  - 74
  - 75
  - 76
  - 77
  - 78
  - 79
  - 80
  - 81
  - 82
  - 83
  - 84
  - 85
  - 86
  - 87
  - 88
  - 89
  - 90
  - 91
  - 92
  - 93
  - 94
  - 95
  - 96
  - 97
  - 98
  - 99
  distance_scale: 1.0e-10
  energy_scale: 1.60218e-19

As you advised me, I should check the accuracy/efficiency tradeoff, especially for the following settings.

embedding_dimension: 256
num_layers: 6
num_rbf: 64

If GPU memory is still insufficient after tuning these parameters,
I am going to try splitting the atomic energy calculation into batches.

If we do not take long-range interactions into account, the system energy can be written as the sum of atomic energies,
$E = \sum_i E_i$,
where each $E_i$ can be computed from the information of its neighboring atoms, and the corresponding force $\boldsymbol{F}_i = -\nabla_{\boldsymbol{r}_i} E$ can then be obtained.

Therefore, I expect I can split the calculation of the atomic energies into batches and keep the GPU memory usage under control.

Do you think this is possible?
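
Concretely, something like the following chunked evaluation is what I have in mind. This is only a toy sketch: `toy_atomic_energies` is a stand-in function I made up for illustration (it is not TensorNet or any TorchMD-Net API), and it ignores the fact that message passing enlarges each atom's receptive field beyond the cutoff.

```python
import torch


def toy_atomic_energies(pos, idx, cutoff=5.0):
    # Stand-in for a local atomic-energy model: E_i is a smooth sum over
    # pairs within the cutoff. It only illustrates the locality assumption.
    d = torch.cdist(pos[idx], pos)                    # (chunk, N) distances
    w = torch.exp(-d) * (d < cutoff) * (d > 0.0)
    return w.sum(dim=1)                               # one E_i per selected atom


def chunked_energy_and_forces(pos, chunk_size=1000):
    # Evaluate E = sum_i E_i over chunks of atoms and accumulate
    # F = -dE/dpos chunk by chunk, so only one chunk's autograd graph
    # is alive at a time, which bounds the peak GPU memory.
    pos = pos.detach().requires_grad_(True)
    forces = torch.zeros_like(pos)
    energy = 0.0
    for start in range(0, pos.shape[0], chunk_size):
        idx = torch.arange(start, min(start + chunk_size, pos.shape[0]),
                           device=pos.device)
        e = toy_atomic_energies(pos, idx).sum()
        (grad,) = torch.autograd.grad(e, pos)         # graph is freed here
        forces -= grad
        energy += e.item()
    return energy, forces


energy, forces = chunked_energy_and_forces(torch.rand(4000, 3) * 30.0, chunk_size=500)
```

In the real model, I suppose the chunking would have to happen on the per-atom energies before the `reduce_op` sum, and each chunk would need to carry its full message-passing neighborhood.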

@guillemsimeon
Collaborator

guillemsimeon commented Nov 8, 2024 via email

@kei0822kei
Author

I understand now that my model is terribly large. I should have read the paper more carefully.
Thank you so much!
