
Dataset internal jobs, external files, & server-side view creation #878

Merged: 26 commits into main from create_view, Jan 14, 2025

Conversation


@bennybp bennybp commented Jan 10, 2025

Description

This one has been a long time coming. This fairly big PR has three main components:

1. Internal Jobs

Cleanup and expansion of server internal jobs, i.e. jobs that run on the server side. Until now, these have been relegated to periodic maintenance tasks; this PR makes them usable for long-running one-off tasks as well.
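To make the idea concrete, here is a minimal sketch of the general pattern: jobs are queued and executed by a server-side worker, so a client never has to stay connected while they run. This is only an illustration of the pattern; the class and function names are hypothetical, not QCFractal's actual implementation.

```python
import queue
import threading
import time


def long_running_task(name: str) -> str:
    """Stand-in for a long-running internal job (hypothetical)."""
    time.sleep(0.01)
    return f"{name}: done"


class InternalJobRunner:
    """Minimal background job runner: jobs are queued and executed
    server-side by a worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._results = {}
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, job_id: str, func, *args):
        # Record the job as waiting before it becomes visible to the worker
        self._results[job_id] = "waiting"
        self._jobs.put((job_id, func, args))

    def _run(self):
        while True:
            job_id, func, args = self._jobs.get()
            self._results[job_id] = "running"
            self._results[job_id] = func(*args)
            self._jobs.task_done()

    def wait_all(self):
        """Block until every queued job has finished."""
        self._jobs.join()

    def status(self, job_id: str):
        return self._results[job_id]
```

In the real server, the client would disconnect after submitting and query the job's status later instead of blocking on `wait_all`.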

2. External file capability

Adds the ability to store files in an S3 bucket. I have not tried it against actual S3 yet, only tested locally with MinIO, which we will also likely run locally in the near term.

For now, the only place these external files are used is dataset attachments. Datasets can now have such files attached, and downloads are served directly from the S3 bucket.
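The key property here is that the API server hands the client a signed URL and the bucket serves the bytes, so large downloads never pass through the server itself. Real S3/MinIO presigning uses AWS Signature Version 4 (e.g. boto3's `generate_presigned_url`); the sketch below is a deliberately simplified HMAC version of the same idea, with a hypothetical secret key and host name, just to show the sign-then-verify flow.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"server-side-secret"  # hypothetical signing key


def presign(bucket: str, key: str, expires_in: int = 3600, now=None) -> str:
    """Sign an object path and expiry so the storage backend can serve
    the file directly. Simplified stand-in for AWS SigV4 presigning."""
    expiry = int(now if now is not None else time.time()) + expires_in
    msg = f"{bucket}/{key}:{expiry}".encode()
    sig = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expiry, "signature": sig})
    return f"https://minio.example.org/{bucket}/{key}?{query}"


def verify(bucket: str, key: str, expiry: int, signature: str, now=None) -> bool:
    """Check the signature and expiry, as the storage backend would."""
    if int(now if now is not None else time.time()) > expiry:
        return False
    msg = f"{bucket}/{key}:{expiry}".encode()
    expected = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

A tampered object key or an expired URL fails verification, which is what makes it safe to hand these URLs straight to clients.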

3. Server-side view creation

Combining those additions, we can now create "views" via internal jobs and store them in an S3 bucket. These views are essentially pre-rendered cache files that you can download and use either as a starting point for the local dataset cache or as standalone, read-only dataset views completely disconnected from the server.

The main functions are create_view, download_view, list_views, use_view_cache, and preload_cache, which are all part of the dataset interface.

The create_view function will add an internal job on the server, which runs in the background and doesn't require you to be constantly connected to the server (as these can take a while).
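Since the job runs in the background, the client-side pattern is submit-then-poll. The helper below sketches that polling loop; the status strings ("waiting", "running", "complete", "error") and the helper itself are illustrative assumptions, not the actual client API (only `create_view` and friends come from this PR).

```python
import time


def wait_for_internal_job(get_status, poll_interval: float = 0.0,
                          timeout: float = 10.0) -> str:
    """Poll a job-status callable until the internal job finishes.

    `get_status` stands in for whatever call reports the job's state on
    the server. Because the job runs server-side, the client can also
    just disconnect and poll again later.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("complete", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("internal job did not finish in time")
```

In practice you would pass a larger `poll_interval` (seconds to minutes) for view-creation jobs that run for half an hour or more.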

Some initial benchmarks on a copy of our ML server: roughly 30 minutes to create a view (including outputs) for 124,000 singlepoints (dataset 413, "SPICE PubChem Set 10 Single Points Dataset v1.0"); the resulting file was 8.4 GB.

Dataset 422, "SSPICE Amino Acid Ligand OpenFF v1.0", with 388,532 singlepoints took 80 minutes; the resulting file was 21.7 GB.
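For a rough sense of scale, the two benchmarks above work out to fairly consistent per-record numbers:

```python
# Back-of-the-envelope throughput from the two benchmarks above.
# (name, records, minutes, size in GB)
benchmarks = [
    ("dataset 413", 124_000, 30, 8.4),
    ("dataset 422", 388_532, 80, 21.7),
]

for name, records, minutes, size_gb in benchmarks:
    per_second = records / (minutes * 60)
    kb_per_record = size_gb * 1e9 / records / 1e3
    print(f"{name}: ~{per_second:.0f} records/s, ~{kb_per_record:.0f} kB/record")
```

That is roughly 69 and 81 records per second, at about 68 and 56 kB of view file per record, so view-creation time and file size both scale close to linearly with record count.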

Testing

I've tested this locally and with a copy of the ML instance. I am currently thinking about how to test this in GitHub Actions (GHA); it's tricky because of the requirement for an S3 server.

Future work

I hope to get to a documentation sprint next week. There is a lot to document here (and everywhere else, too).

I want to have the ability to do more with these internal jobs, including large dataset submission.

Status

  • Code base linted
  • Ready to go


bennybp commented Jan 14, 2025

I'm going to skip a lot of testing for the view creation stuff right now. It's been tested manually.

Soon I want to add some features for ingesting pre-computed data, which will make these kinds of dataset tests much easier, so there is no need to spend time working around that right now.

We will also need to set up a test S3 bucket.

@bennybp bennybp merged commit 4f552af into main Jan 14, 2025
15 checks passed
@bennybp bennybp deleted the create_view branch January 14, 2025 18:54