[BUG] Error during Running LAMMPS: PyTorch Backend JIT Error in forward_lower function #4530

Open
SchrodingersCattt opened this issue Jan 5, 2025 · 2 comments

@SchrodingersCattt

Bug summary

Hi, developers:
I encountered the following error when running my LAMMPS code with deepmd-kit:

WARNING: Energy due to 1 extra global DOFs will be included in minimizer energies
 (src/min.cpp:219)
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 59, in forward_lower

The details are below. Could you please take a look? Many thanks!

DeePMD-kit Version

3.0.0b4 (0abb67b)

Backend and its version

torch2.4.1.post302

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

Inputs are attached in the Steps to Reproduce section.

I uploaded the job to lbg with the image registry.dp.tech/dptech/deepmd-kit:3.0.0b4-cuda12.1.
Note that the newer image registry.dp.tech/dptech/deepmd-kit:3.0.0-cuda12.1 also reproduces the issue, whereas the job runs normally with the 2024Q1 version.

Steps to Reproduce

lammps_issue.zip

Further Information, Files, and Links

No response

@njzjz
Member

njzjz commented Jan 6, 2025

Running logs:

LAMMPS (29 Aug 2024)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Loaded 1 plugins from /root/deepmd-kit/lib/deepmd_lmp
Reading data file ...
  triclinic box = (0 0 0) to (6.6633637 6.4442215 9.0246555) with tilt (-0 0 0)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  32 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/deepmd-kit
  source:             
  source branch:      HEAD
  source commit:      c314f1b
  source commit at:   2024-12-23 16:45:06 -0800
  support model ver.: 1.1 
  build variant:      cpu
  build with tf inc:  /root/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/root/deepmd-kit/lib/python3.12/site-packages/tensorflow/../../../../include
  build with tf lib:  /root/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
  build with pt lib:  torch;torch_library;/root/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
  use deepmd-kit at:  /root/deepmd-kitload model from: frozen_model.pth to cpu 
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
  >>> Info of model(s):
  using   1 model(s): frozen_model.pth 
  rcut in model:      6
  ntypes in model:    118

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- Type Label Framework: https://doi.org/10.1021/acs.jpcb.3c08419
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 6903 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 2 2 3
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : metal
  Current step  : 0
WARNING: Energy due to 1 extra global DOFs will be included in minimizer energies
 (src/min.cpp:219)
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 59, in forward_lower
    comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
    _5 = (self).need_sorted_nlist_for_lower()
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _6 = (self).get_fitting_net()
    model_predict = annotate(Dict[str, Tensor], {})
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 221, in forward_common_lower
    cc_ext, _37, fp, ap, input_prec, = _36
    atomic_model = self.atomic_model
    atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _38 = (self).atomic_output_def()
    training = self.training
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 52, in forward_common_atomic
    ext_atom_mask = (self).make_atom_mask(extended_atype, )
    _3 = torch.where(ext_atom_mask, extended_atype, 0)
    ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
    _4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
  File "code/__torch__/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 95, in forward_atomic
      pass
    descriptor = self.descriptor
    _16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _16
    enable_eval_descriptor_hook = self.enable_eval_descriptor_hook
  File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 98, in forward
    repformers1 = self.repformers
    _17 = nlist_dict[_1(_16, (repformers1).get_nsel(), )]
    _18 = (repformers).forward(_17, extended_coord, extended_atype, g13, mapping0, comm_dict0, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    g14, g2, h2, rot_mat, sw, = _18
    concat_output_tebd = self.concat_output_tebd
  File "code/__torch__/deepmd/pt/model/descriptor/repformers.py", line 531, in forward
  _110 = "border_op is not available since customized PyTorch OP library is not built when freezing the model. See documentation for DPA-2 for details."
  _111 = uninitialized(Tensor)
  ops.prim.RaiseException(_110, "builtins.NotImplementedError")
  ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _111

Traceback of TorchScript, original code (most recent call last):
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/model/ener_model.py", line 108, in forward_lower
        comm_dict: Optional[dict[str, torch.Tensor]] = None,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/model/make_model.py", line 285, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.atomic_model.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/atomic_model/base_atomic_model.py", line 239, in forward_common_atomic
    
        ext_atom_mask = self.make_atom_mask(extended_atype)
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            torch.where(ext_atom_mask, extended_atype, 0),
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 231, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/descriptor/dpa2.py", line 794, in forward
            g1 = g1_ext
        # repformer
        g1, g2, h2, rot_mat, sw = self.repformers(
                                  ~~~~~~~~~~~~~~~ <--- HERE
            nlist_dict[
                get_multiple_nlist_key(
  File "/mnt/data_nas/guomingyu/SOFTWARE/deepmd-kit/deepmd/pt/model/descriptor/repformers.py", line 58, in forward
        argument8,
    ) -> torch.Tensor:
        raise NotImplementedError(
        ~~~~~~~~~~~~~~~~~~~~~~~~~~
            "border_op is not available since customized PyTorch OP library is not built when freezing the model. "
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            "See documentation for DPA-2 for details."
            ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        )
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model. See documentation for DPA-2 for details. (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1735001402594/work/source/lmp/pair_deepmd.cpp:220)
Last command: minimize        1.0e-10 1.0e-10 10000 100000
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

It seems to me that the error message above is clear, though it's unclear why your error message is truncated.
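
The final NotImplementedError says the model was frozen without the customized PyTorch OP library, so the stub message is baked into the serialized code. As a rough sketch only, the check below scans a frozen `.pth` file for that stub text before running LAMMPS. It assumes the file is a zip archive with TorchScript sources stored as `.py` entries under a `code/` directory, as the `code/__torch__/...` paths in the traceback suggest. The function name `find_border_op_stub` is hypothetical and not part of DeePMD-kit.

```python
import zipfile

# Marker text copied from the error message in the log above.
STUB_MARKER = b"border_op is not available"


def find_border_op_stub(model_path):
    """Return archive entries whose serialized code contains the stub message."""
    hits = []
    with zipfile.ZipFile(model_path) as archive:
        for name in archive.namelist():
            # Assumption: serialized sources are .py files under a code/ directory.
            if "code/" in name and name.endswith(".py"):
                if STUB_MARKER in archive.read(name):
                    hits.append(name)
    return hits
```

Calling `find_border_op_stub("frozen_model.pth")` and getting a non-empty list would suggest the model was frozen in an environment without the custom OP library. An empty list is not proof that the model is fine.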

@SchrodingersCattt
Author

Now the reason for the error is clear. Thank you very much!

My log was probably truncated because Bohrium hides the complete call stack for security reasons.
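
A quick way to tell whether a captured log still contains the root cause is to look for the final `builtins.<Exception>: ...` line that closes the TorchScript traceback. The helper below is a minimal sketch under that assumption. Its name `final_torchscript_error` is made up for illustration, and the pattern is based only on the line format seen in the full log in this thread.

```python
import re

# Matches a closing line such as
# "builtins.NotImplementedError: message (/path/to/file.cpp:220)".
FINAL_ERROR = re.compile(r"^builtins\.(\w+): (.*?)(?: \(\S+:\d+\))?$")


def final_torchscript_error(log_text):
    """Return (exception_name, message) from the last matching line, or None."""
    for line in reversed(log_text.splitlines()):
        match = FINAL_ERROR.match(line.strip())
        if match:
            return match.group(1), match.group(2)
    # No closing exception line found: the log was likely cut off.
    return None
```

On a log cut off at the first `forward_lower` frame, like the one in the bug summary, this returns `None`. On the full log it would return the exception name and the `border_op` message.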
