Multiple Training / Validation Datasets🌟 [FEATURE] #441

Open
tgmaxson opened this issue Jul 8, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@tgmaxson

tgmaxson commented Jul 8, 2024

Is your feature request related to a problem? Please describe.
A problem we often run into internally is that we want to train models both on partial datasets, which are kept in separate files, and on combinations of those datasets. For example, consider a simple case with three files:

  • water-only.traj
  • water-NaCl.traj
  • water-KCl.traj

Ideally, we should be able to read these files independently in NequIP and sample from them as if they were one file. Creating a separate combined file for every pairing quickly becomes unwieldy and expensive in terms of disk space. The cached datasets also have to be regenerated and stored for each combination.

Describe the solution you'd like
A simple extension to the dataloader syntax so that it accepts a list of filenames, not just a single filename. The data would then be lumped together and used as normal. From the perspective of the "ase" dataloader, this just involves appending the contents of multiple ASE files together. Alternatively, ASE itself could be extended to read multiple files from a specialized filename, but I suspect that would get pushback from the ASE developers (and would not result in proper caching on NequIP's end).

dataset_file_name: /mnt/public/tgmaxson/datasets/7-4-24/train.traj # Single filename

or

dataset_file_name: # Multiple filenames
  - /mnt/public/tgmaxson/datasets/7-4-24/train.traj
  - /mnt/public/tgmaxson/datasets/7-2-24/train.traj
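
As a rough illustration of the requested behavior (a sketch only, not NequIP's actual loader code), the option could be normalized so that a single string and a list of strings are handled the same way, with the frames from each file appended in order. The names `normalize_file_names`, `load_frames`, and `read_frames` are hypothetical and chosen for this example.

```python
from typing import Callable, List, Sequence, TypeVar, Union

Frame = TypeVar("Frame")


def normalize_file_names(value: Union[str, Sequence[str]]) -> List[str]:
    """Accept either a single filename or a list of filenames."""
    if isinstance(value, str):
        return [value]
    return list(value)


def load_frames(
    dataset_file_name: Union[str, Sequence[str]],
    read_frames: Callable[[str], Sequence[Frame]],
) -> List[Frame]:
    """Read every file with `read_frames` and append the results in order.

    `read_frames` is a hypothetical hook: any callable that maps a path
    to a sequence of frames.
    """
    frames: List[Frame] = []
    for path in normalize_file_names(dataset_file_name):
        frames.extend(read_frames(path))
    return frames
```

With ASE installed, `read_frames` could be `lambda path: ase.io.read(path, index=":")`, which returns every frame in the file as a list of `Atoms` objects. Caching of the combined dataset is not addressed by this sketch.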

tgmaxson added the enhancement label Jul 8, 2024
@Linux-cpp-lisp
Collaborator

Hi @tgmaxson,

The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run.

One change I'd be happy to make is to make providing these indices more user-friendly, for example by allowing them to be given as a filename in some format rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes, with complete freedom to choose them as you like.)
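
A minimal sketch of how such index lists might be generated, assuming the source files were concatenated in a known order and the number of frames in each is known. The helper names `file_index_ranges` and `split_indices` and the frame counts in the usage note are hypothetical; only `train_idcs` and `val_idcs` come from the discussion above.

```python
import random
from typing import Dict, List, Sequence, Tuple


def file_index_ranges(frame_counts: Dict[str, int]) -> Dict[str, range]:
    """Map each source file to the indices it occupies after concatenation.

    Relies on dict insertion order matching the concatenation order.
    """
    ranges: Dict[str, range] = {}
    start = 0
    for name, count in frame_counts.items():
        ranges[name] = range(start, start + count)
        start += count
    return ranges


def split_indices(
    ranges: Dict[str, range],
    selected: Sequence[str],
    val_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_idcs, val_idcs) drawn only from the selected files."""
    rng = random.Random(seed)
    pool = [i for name in selected for i in ranges[name]]
    rng.shuffle(pool)
    n_val = int(len(pool) * val_fraction)
    return sorted(pool[n_val:]), sorted(pool[:n_val])
```

For example, with made-up counts `{"water-only.traj": 100, "water-NaCl.traj": 50, "water-KCl.traj": 50}`, calling `split_indices(ranges, ["water-only.traj", "water-KCl.traj"], val_fraction=0.2)` yields indices drawn only from 0 to 99 and 150 to 199, which could then be supplied as `train_idcs` and `val_idcs` for that run.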

Side note: if pre-processing time is significant for you, consider trying the new, not yet enabled by default, support for matscipy neighborlists (the NEQUIP_MATSCIPY_NL=true environment variable), available in 0.6.0. On develop, you can also try vesin (NEQUIP_NL=vesin, ase, or matscipy).

@tgmaxson
Author

tgmaxson commented Jul 9, 2024 via email

@Linux-cpp-lisp
Collaborator

This is a good point, thanks--let me think a little more about how this could integrate into other fixes to the dataset architecture.
