Multiple Training / Validation Datasets🌟 [FEATURE] #441

tgmaxson · 2024-07-08T18:34:15Z

Is your feature request related to a problem? Please describe.
It is a common problem we run across internally that we wish to train models on partial datasets which are kept in separate files as well as combined models. For example, imagine a simple case.

water-only.traj
water-NaCl.traj
water-KCl.traj

Ideally, we should be able to read these files independently in Nequip and sample from them as if they were one file. Making the file pairs quickly becomes unwieldy and expensive (in terms of space). Additionally, the cached datasets then also have to be regenerated and stored as well.

Describe the solution you'd like
Simple extension to the dataloader syntax to accept a list of filenames, not just a single filename. The data would then be lumped together and used as normal. From the "ase" dataloader persepctive, this just involves appending multiple ase files together. As an alternative, ASE can also be extended to read multiple files potentially from a specialized filename, but I suspect that will get pushback from the devs (and not result in the proper caching on nequip's end).

dataset_file_name: /mnt/public/tgmaxson/datasets/7-4-24/train.traj # Single filename

or

dataset_file_name: # Multiple filenames
  - /mnt/public/tgmaxson/datasets/7-4-24/train.traj
  - /mnt/public/tgmaxson/datasets/7-2-24/train.traj

The text was updated successfully, but these errors were encountered:

Linux-cpp-lisp · 2024-07-09T06:07:30Z

Hi @tgmaxson ,

The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run.

One change I'd be happy to make is to make providing these indexes more user friendly, such as by allowing them to be provided as a filename in some format, say, rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes with complete freedom to choose them as you like.)

Side note: if pre-processing time is significant for you, you can consider trying the new but not yet enabled by default support for matscipy neighborlists (NEQUIP_MATSCIPY_NL=true environment variable) available in 0.6.0. Or on develop, you can also try vesin (NEQUIP_NL=vesin or ase or matscipy.)

tgmaxson · 2024-07-09T07:19:51Z

When working in an iterative manner, it is common to need to add files, making the dataset be repeatedly processed fully. The concatenation operation itself is a potentially messy process and the goal should be to make things as non-error prone as possible for users working with the code. I did not know about the train_idcs and val_idcs however. Reading these from a file would help greatly, but funny enough, my real solution would now be that a notation where you supply multiple filenames + multiple idc files would be most flexible and effective. This allows knowledge of each split to be directly represented without data duplication.

…

On Mon, Jul 8, 2024 at 11:07 PM Alby M. ***@***.***> wrote: Hi @tgmaxson <https://github.com/tgmaxson> , The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run. One change I'd be happy to make is to make providing these indexes more user friendly, such as by allowing them to be provided as a filename in some format, say, rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes with complete freedom to choose them as you like.) Side note: if pre-processing time is significant for you, you can consider trying the new but not yet enabled by default support for matscipy neighborlists (NEQUIP_MATSCIPY_NL=true environment variable) available in 0.6.0. Or on develop, you can also try vesin (NEQUIP_NL=vesin or ase or matscipy.) — Reply to this email directly, view it on GitHub <#441 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABSBTYENBRH4IBUKWHJTRELZLN43PAVCNFSM6AAAAABKRLX3M6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJWGY2TCNBYGE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Linux-cpp-lisp · 2024-07-09T12:43:16Z

This is a good point, thanks--let me think a little more about how this could integrate into other fixes to the dataset architecture.

tgmaxson added the enhancement New feature or request label Jul 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple Training / Validation Datasets🌟 [FEATURE] #441

Multiple Training / Validation Datasets🌟 [FEATURE] #441

tgmaxson commented Jul 8, 2024

Linux-cpp-lisp commented Jul 9, 2024

tgmaxson commented Jul 9, 2024 via email

Linux-cpp-lisp commented Jul 9, 2024

Multiple Training / Validation Datasets🌟 [FEATURE] #441

Multiple Training / Validation Datasets🌟 [FEATURE] #441

Comments

tgmaxson commented Jul 8, 2024

Linux-cpp-lisp commented Jul 9, 2024

tgmaxson commented Jul 9, 2024 via email

Linux-cpp-lisp commented Jul 9, 2024