Dataset internal jobs, external files, & server-side view creation #878
Merged
Conversation
I'm going to skip a lot of testing for the view creation stuff right now; it's been tested manually. Soon I want to add some features related to ingesting pre-computed data, which will make these kinds of dataset tests much easier, so there's no need to waste time working around that right now. We'll also need to set up a test S3 bucket.
Description
This one has been a long time coming. This fairly big PR has three main components:
1. Internal Jobs
Cleanup and expansion of server internal jobs. These are jobs that are run server side. Until now, they have been relegated to periodic maintenance tasks. This PR makes those more usable for long-running single tasks as well.
2. External file capability
Adds the ability to store files in an S3 bucket. I have not tried it on actual S3 yet, just tested locally using MinIO, which we will also likely run locally in the near term.
For now, the only place these external files are used is with dataset attachments. Datasets can now have these files as attachments, and when downloading them they come directly from the S3 bucket.
3. Server-side view creation
Combining those additions, we can now create "views" using internal jobs and then store them in an S3 bucket. These views are essentially pre-rendered cache files which you can download and then use either as a starting point for the local dataset cache or as standalone, read-only dataset views completely disconnected from the server.
The main functions are `create_view`, `download_view`, `list_views`, `use_view_cache`, and `preload_cache`, which are all part of the dataset interface. The `create_view` function will add an internal job on the server, which runs in the background and doesn't require you to be constantly connected to the server (as these can take a while).

Some initial benchmarks on a copy of our ML server: roughly 30 minutes to create a view (including outputs) for 124,000 singlepoints (dataset 413, "SPICE PubChem Set 10 Single Points Dataset v1.0"). The resulting file was 8.4 GB.
Dataset 422, "SSPICE Amino Acid Ligand OpenFF v1.0", with 388,532 singlepoints took 80 minutes; the resulting file was 21.7 GB.
Testing
I've tested this locally and with a copy of the ML instance. I am currently thinking of how to test this with GHA - it's tricky because of the requirement for an S3 server.
Future work
I hope to get to a documentation sprint next week. There is a lot to document here (and everywhere else, too).
I want to have the ability to do more with these internal jobs, including large dataset submission.
Status