
First pass at bringing in code from clade_data_utils #17

Merged
bsweger merged 3 commits into main from bsweger/streamline-sequence-retrieval on Sep 6, 2024

Conversation

bsweger (Collaborator) commented on Sep 5, 2024

Partially addresses #13

Add functionality to download the latest GenBank genome metadata from Nextstrain.

Some of the logic in clade_data_utils is broken up into smaller functions here, with the goal of re-using that code for creating the target data.

Nothing is wired up yet; this PR adds some foundational building blocks for use by the separate "get clade list" and "create target data" tasks.

@@ -42,6 +45,7 @@ build-backend = "setuptools.build_meta"
tmp_path_retention_policy = "none"
filterwarnings = [
"ignore::DeprecationWarning",
'ignore:polars found a filename',
bsweger (Collaborator Author):

suppress this specific polars warning when running the test suite

from virus_clade_utils.util.session import check_response, get_session

logger = structlog.get_logger()


def get_covid_genome_data(released_since_date: str, base_url: str, filename: str):
    """
    Download genome data package from NCBI.
bsweger (Collaborator Author):

Currently, the target data code uses the NCBI API to download sequences. A future PR will update this function and grab the sequence fasta file directly from Nextstrain (to simplify and ensure reproducibility).
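
For orientation, here's a minimal sketch of what this function does today, assuming the NCBI endpoint accepts a POST with a release-date filter and streams back a zipped data package; the payload keys and timeout below are placeholders, not the real API contract:

```python
import time

import requests
import structlog

logger = structlog.get_logger()


def get_covid_genome_data(released_since_date: str, base_url: str, filename: str):
    """Download genome data package from NCBI (sketch only)."""
    start = time.perf_counter()

    # Hypothetical filter payload: restrict to sequences released since the given date.
    payload = {"released_since": released_since_date}

    with requests.post(base_url, json=payload, stream=True, timeout=300) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=None):
                f.write(chunk)

    elapsed = time.perf_counter() - start
    logger.info("NCBI API call completed", elapsed=elapsed)
```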

@@ -54,6 +62,74 @@ def get_covid_genome_data(released_since_date: str, base_url: str, filename: str
logger.info("NCBI API call completed", elapsed=elapsed)


def download_covid_genome_metadata(url: str, data_path: Path) -> Path:
bsweger (Collaborator Author):

Ported from https://github.com/rogersbw/clade_data_utils/blob/main/update_clades_list.py#L11 with a few modifications:

  • put the download in a small function so it can also be used by the target data process
  • used the requests package for downloading to take advantage of its retry capabilities
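
The retry behavior comes from the get_session helper imported at the top of the module. A minimal sketch of that idea, assuming it layers urllib3's Retry onto a requests.Session (the retry counts and status codes below are illustrative, not necessarily what virus_clade_utils.util.session actually configures):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_session() -> requests.Session:
    """Return a requests session with basic retry behavior (sketch only)."""
    session = requests.Session()
    retries = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"],
    )
    adapter = HTTPAdapter(max_retries=retries)
    # Retry transient failures on both http and https requests.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```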

return result[0]


def get_covid_genome_metadata(metadata_path: Path, num_rows: int | None = None) -> pl.LazyFrame:
bsweger (Collaborator Author):

De-coupled the metadata download from the process of reading it into polars, mostly for more straightforward unit testing.
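
For example, the decoupling means a unit test can point the reader at a small local fixture without any network access. A hypothetical test (the module path and fixture location are assumptions):

```python
from pathlib import Path

import polars as pl

# Assumed import path; adjust to wherever the function actually lives.
from virus_clade_utils.util.sequence import get_covid_genome_metadata


def test_get_covid_genome_metadata_returns_lazyframe():
    # Hypothetical fixture: a small compressed metadata sample checked into the repo.
    fixture = Path("test/data/metadata_sample.tsv.zst")

    metadata = get_covid_genome_metadata(fixture, num_rows=10)

    # The reader should return a LazyFrame rather than loading everything eagerly.
    assert isinstance(metadata, pl.LazyFrame)
```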


    if (compression_type := metadata_path.suffix) == ".zst":
        metadata = pl.scan_csv(metadata_path, separator="\t", n_rows=num_rows)
    elif compression_type == ".xz":
bsweger (Collaborator Author) commented on Sep 6, 2024:

The full metadata file that we need is .zst (ZSTD compression), which works fine with Polars scan_csv operation.

However, some of the smaller metadata sample files use a different compression method (annoying!), so this function handles both.

Either way, this function will return a LazyFrame: https://docs.pola.rs/user-guide/concepts/lazy-vs-eager/
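
To make the two branches concrete, here is a sketch of how the full function might read, assuming the .xz sample files are decompressed with the standard library's lzma module and read eagerly before being converted back to a LazyFrame (the actual implementation may differ):

```python
import lzma
from pathlib import Path

import polars as pl


def get_covid_genome_metadata(metadata_path: Path, num_rows: int | None = None) -> pl.LazyFrame:
    """Read GenBank genome metadata into a Polars LazyFrame (sketch only)."""
    if (compression_type := metadata_path.suffix) == ".zst":
        # The full .zst metadata file can be scanned lazily.
        metadata = pl.scan_csv(metadata_path, separator="\t", n_rows=num_rows)
    elif compression_type == ".xz":
        # Smaller .xz sample files: decompress, read eagerly, then return a LazyFrame.
        with lzma.open(metadata_path) as f:
            metadata = pl.read_csv(f, separator="\t", n_rows=num_rows).lazy()

    return metadata
```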

return metadata


def filter_covid_genome_metadata(metadata: pl.LazyFrame, cols: list = []) -> pl.LazyFrame:
bsweger (Collaborator Author):

This applies most of the clade_data_utils data_prep function, with a few changes (a rough sketch follows below):

  1. Include the GenBank accession columns so the target data process can join metadata to sequences
  2. Exclude the counts_dat code because it's specific to the "get clade list" process (it will be added there in a future PR)
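
For orientation, a rough sketch of the shape of this function; the column names and filter values below are placeholders standing in for data_prep's actual cleanup, not the real ones:

```python
import polars as pl

# Placeholder column list; the real one includes the GenBank accession so the
# target data process can join metadata to sequences.
DEFAULT_COLS = ["genbank_accession", "country", "date", "host", "clade_nextstrain"]


def filter_covid_genome_metadata(metadata: pl.LazyFrame, cols: list = []) -> pl.LazyFrame:
    """Apply a standard set of filters to the GenBank genome metadata (sketch only)."""
    cols = cols or DEFAULT_COLS

    filtered_metadata = metadata.select(cols).filter(
        # Placeholder filters: keep human-host sequences that have a usable date.
        (pl.col("host") == "Homo sapiens") & (pl.col("date").is_not_null())
    )

    return filtered_metadata
```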

bsweger requested review from elray1 and rogersbw on September 6, 2024 at 13:31
    with session.get(url, stream=True) as result:
        result.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in result.iter_content(chunk_size=None):
                f.write(chunk)
bsweger (Collaborator Author):

Reading the downloaded metadata in chunks isn't strictly necessary at the moment, but it's a hedge against an eventual file size that's too large for memory.

elray1 previously approved these changes on Sep 6, 2024

elray1 (Collaborator) left a comment:

lgtm!

bsweger (Collaborator Author) commented on Sep 6, 2024:

> lgtm!

Thanks for the quick review 🙏

bsweger merged commit 971c45f into main on Sep 6, 2024
1 check passed
bsweger deleted the bsweger/streamline-sequence-retrieval branch on September 6, 2024 at 16:53