First pass at bringing in code from clade_data_utils #17

Merged

bsweger merged 3 commits into main from bsweger/streamline-sequence-retrieval

Sep 6, 2024

Collaborator

bsweger commented Sep 5, 2024 • edited

Partially addresses #13

Add functionality to download the latest GenBank genome metadata from nextstrain.

Some of the logic in clade_data_utils is broken up into smaller functions here, with the goal of re-using that code for creating the target data.

Nothing is wired up yet; this PR adds some foundational building blocks for use by the separate "get clade list" and "create target data" tasks.

bsweger added 2 commits

September 5, 2024 16:39


          First pass at bringing in code from clade_data_utils

f409dc1

Add functionality to download the latest GenBank genome metadata
from nextstrain. Some of the logic in clade_data_utils is broken
up into smaller functions here, with the goal of re-using that
code for creating the target data.


          commit the test data files and include them in the docker build

6c24b24

bsweger commented

View reviewed changes

pyproject.toml

@@ -42,6 +45,7 @@ build-backend = "setuptools.build_meta"
               tmp_path_retention_policy = "none"
               filterwarnings = [
                   "ignore::DeprecationWarning",
+                  'ignore:polars found a filename',

Collaborator Author

bsweger Sep 6, 2024

suppress this specific polars warning when running the test suite

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

               from virus_clade_utils.util.session import check_response, get_session
               logger = structlog.get_logger()
               def get_covid_genome_data(released_since_date: str, base_url: str, filename: str):
-                  """Download genome data package from NCBI."""
+                  """
+                  Download genome data package from NCBI.

Collaborator Author

bsweger Sep 6, 2024

Currently, the target data code uses the NCBI API to download sequences. A future PR will update this function and grab the sequence fasta file directly from Nextstrain (to simplify and ensure reproducibility).

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

		@@ -54,6 +62,74 @@ def get_covid_genome_data(released_since_date: str, base_url: str, filename: str
		logger.info("NCBI API call completed", elapsed=elapsed)


		def download_covid_genome_metadata(url: str, data_path: Path) -> Path:

Collaborator Author

bsweger Sep 6, 2024

Ported from https://github.com/rogersbw/clade_data_utils/blob/main/update_clades_list.py#L11 with a few modifications:

- put the download in a small function so it can also be used by the target data process
- used the requests package for downloading to take advantage of its retry capabilities
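The retry capability mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `get_session` helper: the function name `make_retrying_session`, the retry count, the backoff factor, and the list of status codes are all assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total_retries: int = 3) -> requests.Session:
    """Build a requests session that retries transient HTTP failures.

    Hypothetical helper: the retry settings below are illustrative
    assumptions, not the values used in virus_clade_utils.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=1,  # wait longer between successive attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Route both schemes through the retrying adapter.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A download function can then call `session.get(url, stream=True)` on the returned session and get retries without any extra loop in its own body.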

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

		return result[0]


		def get_covid_genome_metadata(metadata_path: Path, num_rows: int | None = None) -> pl.LazyFrame:

Collaborator Author

bsweger Sep 6, 2024

De-coupled the metadata download from the process of reading it into polars, mostly for more straightforward unit testing.

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

+                  if (compression_type := metadata_path.suffix) == ".zst":
+                      metadata = pl.scan_csv(metadata_path, separator="\t", n_rows=num_rows)
+                  elif compression_type == ".xz":

Collaborator Author

bsweger Sep 6, 2024 • edited

The full metadata file that we need is .zst (ZSTD compression), which works fine with Polars' scan_csv operation.

However, some of the smaller metadata sample files use a different compression method (annoying!), so this function handles both.

Either way, this function will return a LazyFrame: https://docs.pola.rs/user-guide/concepts/lazy-vs-eager/

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

		return metadata


		def filter_covid_genome_metadata(metadata: pl.LazyFrame, cols: list = []) -> pl.LazyFrame:

Collaborator Author

bsweger Sep 6, 2024

This applies most of clade_data_utils's data_prep function, with a few changes:

- include GenBank accession columns so the target data process can join metadata to sequences
- exclude the counts_dat code because it's specific to the "get clade list" process (will add it there in a future PR)

bsweger requested review from elray1 and rogersbw

September 6, 2024 13:31

bsweger commented

View reviewed changes

src/virus_clade_utils/util/sequence.py

+                  with session.get(url, stream=True) as result:
+                      result.raise_for_status()
+                      with open(filename, "wb") as f:
+                          for chunk in result.iter_content(chunk_size=None):

Collaborator Author

bsweger Sep 6, 2024

Streaming the downloaded metadata to disk in chunks isn't strictly necessary at the moment, but it's a hedge against an eventual file size that's too large for memory.
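The diff excerpt above stops at the `for chunk` line. One possible way to finish the loop is sketched below, using only the standard library. The helper name `write_chunks` is hypothetical; in the PR the chunks would come from `result.iter_content(chunk_size=None)`, while here any iterable of bytes stands in for the response.

```python
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], filename: Path) -> int:
    """Write byte chunks to disk one at a time and return the bytes written.

    Only one chunk is held in memory at any point, so the full payload
    never has to fit in RAM.
    """
    total = 0
    with open(filename, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                total += f.write(chunk)
    return total
```

Keeping the write loop separate from the HTTP call also makes it easy to unit test with an in-memory list of chunks.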

elray1 previously approved these changes

View reviewed changes

Collaborator

elray1 left a comment

lgtm!


          oops!

891b91f

bsweger dismissed elray1’s stale review via

891b91f

September 6, 2024 14:16

Collaborator Author

bsweger commented Sep 6, 2024

> lgtm!

Thanks for the quick review 🙏

elray1 approved these changes

View reviewed changes

bsweger merged commit 971c45f into main

1 check passed

bsweger deleted the bsweger/streamline-sequence-retrieval branch

September 6, 2024 16:53
