
Dataset internal jobs, external files, & server-side view creation #878

Merged: 26 commits into main from create_view, Jan 14, 2025

Conversation


@bennybp bennybp commented Jan 10, 2025

Description

This one has been a long time coming. This fairly big PR has three main components:

1. Internal Jobs

Cleanup and expansion of server internal jobs, i.e. jobs that run on the server side. Until now, these have been relegated to periodic maintenance tasks; this PR makes them usable for long-running one-off tasks as well.
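To make the idea concrete, here is a minimal sketch of the general pattern: jobs are queued and executed by a server-side worker, so a client never has to stay connected while they run. This is only an illustration of the pattern; the class and function names are hypothetical, not QCFractal's actual implementation.

```python
import queue
import threading
import time


def long_running_task(name: str) -> str:
    """Stand-in for a long-running internal job (hypothetical)."""
    time.sleep(0.01)
    return f"{name}: done"


class InternalJobRunner:
    """Minimal background job runner: jobs are queued and executed
    server-side by a worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._results = {}
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, job_id: str, func, *args):
        # Record the job as waiting before it becomes visible to the worker
        self._results[job_id] = "waiting"
        self._jobs.put((job_id, func, args))

    def _run(self):
        while True:
            job_id, func, args = self._jobs.get()
            self._results[job_id] = "running"
            self._results[job_id] = func(*args)
            self._jobs.task_done()

    def wait_all(self):
        """Block until every queued job has finished."""
        self._jobs.join()

    def status(self, job_id: str):
        return self._results[job_id]
```

In the real server, the client would disconnect after submitting and query the job's status later instead of blocking on `wait_all`.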

2. External file capability

Adds the ability to store files in an S3 bucket. I have not tried it against actual S3 yet, only tested locally with MinIO, which we will also likely run locally in the near term.

For now, the only place these external files are used is dataset attachments. Datasets can now have such files attached, and downloads are served directly from the S3 bucket.
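The key property here is that the API server hands the client a signed URL and the bucket serves the bytes, so large downloads never pass through the server itself. Real S3/MinIO presigning uses AWS Signature Version 4 (e.g. boto3's `generate_presigned_url`); the sketch below is a deliberately simplified HMAC version of the same idea, with a hypothetical secret key and host name, just to show the sign-then-verify flow.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"server-side-secret"  # hypothetical signing key


def presign(bucket: str, key: str, expires_in: int = 3600, now=None) -> str:
    """Sign an object path and expiry so the storage backend can serve
    the file directly. Simplified stand-in for AWS SigV4 presigning."""
    expiry = int(now if now is not None else time.time()) + expires_in
    msg = f"{bucket}/{key}:{expiry}".encode()
    sig = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expiry, "signature": sig})
    return f"https://minio.example.org/{bucket}/{key}?{query}"


def verify(bucket: str, key: str, expiry: int, signature: str, now=None) -> bool:
    """Check the signature and expiry, as the storage backend would."""
    if int(now if now is not None else time.time()) > expiry:
        return False
    msg = f"{bucket}/{key}:{expiry}".encode()
    expected = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

A tampered object key or an expired URL fails verification, which is what makes it safe to hand these URLs straight to clients.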

3. Server-side view creation

Combining those additions, we can now create "views" via internal jobs and store them in an S3 bucket. These views are essentially pre-rendered cache files that you can download and use either as a starting point for the local dataset cache or as standalone, read-only dataset views completely disconnected from the server.

The main functions are create_view, download_view, list_views, use_view_cache, and preload_cache, which are all part of the dataset interface.

The create_view function will add an internal job on the server, which runs in the background and doesn't require you to be constantly connected to the server (as these can take a while).
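Since the job runs in the background, the client-side pattern is submit-then-poll. The helper below sketches that polling loop; the status strings ("waiting", "running", "complete", "error") and the helper itself are illustrative assumptions, not the actual client API (only `create_view` and friends come from this PR).

```python
import time


def wait_for_internal_job(get_status, poll_interval: float = 0.0,
                          timeout: float = 10.0) -> str:
    """Poll a job-status callable until the internal job finishes.

    `get_status` stands in for whatever call reports the job's state on
    the server. Because the job runs server-side, the client can also
    just disconnect and poll again later.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("complete", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("internal job did not finish in time")
```

In practice you would pass a larger `poll_interval` (seconds to minutes) for view-creation jobs that run for half an hour or more.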

Some initial benchmarks on a copy of our ML server: roughly 30 minutes to create a view (including outputs) for 124,000 singlepoints (dataset 413, "SPICE PubChem Set 10 Single Points Dataset v1.0"); the resulting file was 8.4 GB.

Dataset 422, "SSPICE Amino Acid Ligand OpenFF v1.0", with 388,532 singlepoints took 80 minutes; the resulting file was 21.7 GB.
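For a rough sense of scale, the two benchmarks above work out to fairly consistent per-record numbers:

```python
# Back-of-the-envelope throughput from the two benchmarks above.
# (name, records, minutes, size in GB)
benchmarks = [
    ("dataset 413", 124_000, 30, 8.4),
    ("dataset 422", 388_532, 80, 21.7),
]

for name, records, minutes, size_gb in benchmarks:
    per_second = records / (minutes * 60)
    kb_per_record = size_gb * 1e9 / records / 1e3
    print(f"{name}: ~{per_second:.0f} records/s, ~{kb_per_record:.0f} kB/record")
```

That is roughly 69 and 81 records per second, at about 68 and 56 kB of view file per record, so view-creation time and file size both scale close to linearly with record count.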

Testing

I've tested this locally and with a copy of the ML instance. I am currently thinking about how to test this in GitHub Actions (GHA); it's tricky because of the requirement for an S3 server.

Future work

I hope to get to a documentation sprint next week. There is a lot to document here (and everywhere else, too).

I want to have the ability to do more with these internal jobs, including large dataset submission.

Status

  • Code base linted
  • Ready to go


bennybp commented Jan 14, 2025

I'm going to skip a lot of testing for the view creation stuff right now. It's been tested manually.

Soon I want to add some features for ingesting pre-computed data, which will make these kinds of dataset tests much easier, so there is no need to spend time working around that right now.

We will also need to set up a test S3 bucket.

@bennybp bennybp merged commit 4f552af into main Jan 14, 2025
15 checks passed
@bennybp bennybp deleted the create_view branch January 14, 2025 18:54