🌟 [FEATURE] Multiple Training / Validation Datasets #441
Hi @tgmaxson,

The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run.

One change I'd be happy to make is to make providing these indexes more user friendly, such as by allowing them to be provided as a filename in some format, say, rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes, with complete freedom to choose them as you like.)

Side note: if pre-processing time is significant for you, you can consider trying the new, but not yet enabled by default, support for matscipy neighborlists (the NEQUIP_MATSCIPY_NL=true environment variable) available in 0.6.0. Or, on develop, you can also try vesin (NEQUIP_NL=vesin, or ase, or matscipy).
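As a concrete reference point, here is a minimal sketch of the current single-dataset scheme in the training YAML. The dataset keys follow the usual NequIP ASE-dataset layout, but the file name and index values are made up, and exact key spellings may differ between versions:

```yaml
# One comprehensive dataset, preprocessed a single time.
dataset: ase
dataset_file_name: ./combined_frames.xyz   # hypothetical combined file

# Explicit indices into that dataset, edited run-to-run to change the split.
train_idcs: [0, 1, 2, 5, 8, 13]
val_idcs: [3, 4, 6]
```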
When working in an iterative manner, it is common to need to add files, which forces the full dataset to be reprocessed each time. The concatenation operation itself is also a potentially messy process, and the goal should be to make things as error-proof as possible for users working with the code.

I did not know about train_idcs and val_idcs, however. Reading these from a file would help greatly, but funnily enough, my preferred solution would now be a notation where you supply multiple filenames plus multiple index files; a sketch of what that might look like follows below. This would be the most flexible and effective option, since it lets the knowledge of each split be represented directly, without duplicating any data.
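A purely hypothetical sketch of that notation; none of these keys exist in NequIP, and the file names are invented for illustration:

```yaml
# Hypothetical syntax only -- not supported by NequIP at the time of this discussion.
dataset: ase
dataset_file_names:     # invented key: several source files, no manual concatenation
  - bulk.xyz
  - surfaces.xyz
train_idcs_files:       # invented key: one training-index file per source file
  - bulk_train.idx
  - surfaces_train.idx
val_idcs_files:         # invented key: one validation-index file per source file
  - bulk_val.idx
  - surfaces_val.idx
```

Each index file would record which frames of its dataset belong to the split, so a given split is represented exactly once, with no duplicated frames on disk.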
This is a good point, thanks--let me think a little more about how this could integrate into other fixes to the dataset architecture.
Is your feature request related to a problem? Please describe.
It is a common problem we run across internally that we wish to train models on partial datasets kept in separate files, as well as on the combined data. For example, imagine a simple case: two datasets kept in separate files, where we want a model trained on each file individually and another trained on both together.

Ideally, we should be able to read these files independently in NequIP and sample from them as if they were one file. Creating every combined file by hand quickly becomes unwieldy and expensive (in terms of disk space), and the cached, preprocessed datasets then also have to be regenerated and stored for each combination.
Describe the solution you'd like
A simple extension to the dataloader syntax to accept a list of filenames, not just a single filename. The data would then be lumped together and used as normal. From the "ase" dataloader's perspective, this just involves appending the frames of multiple ASE files together. As an alternative, ASE itself could be extended to read multiple files, potentially via a specialized filename, but I suspect that would get pushback from the ASE devs (and it would not result in proper caching on NequIP's end).
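In the meantime, the maintainer's concatenate-once workflow (and the couple of lines of Python for generating split indices) can be scripted with ASE; all file names here are hypothetical:

```python
import random

from ase.io import read, write

# Concatenate every frame from several ASE-readable files into one combined
# file that the existing single-file dataloader can preprocess once.
source_files = ["bulk.xyz", "surfaces.xyz", "clusters.xyz"]  # hypothetical names
frames = []
for fname in source_files:
    frames.extend(read(fname, index=":"))  # ":" reads all frames in a file
write("combined_frames.xyz", frames)

# The "one or two lines of Python" for a per-run split: shuffle frame indices
# and carve off 10% for validation.
idcs = list(range(len(frames)))
random.shuffle(idcs)
n_val = len(idcs) // 10
val_idcs, train_idcs = sorted(idcs[:n_val]), sorted(idcs[n_val:])
```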