First pass at bringing in code from clade_data_utils #17
Add functionality to download the latest GenBank genome metadata from Nextstrain. Some of the logic in clade_data_utils is broken up into smaller functions here, with the goal of re-using that code for creating the target data.
@@ -42,6 +45,7 @@ build-backend = "setuptools.build_meta"
tmp_path_retention_policy = "none"
filterwarnings = [
    "ignore::DeprecationWarning",
    'ignore:polars found a filename',
suppress this specific polars warning when running the test suite
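For context, pytest's `filterwarnings` entries use the same `action:message` form as Python's own warning filters, where the message part is matched against the start of the warning text. A minimal sketch with the standard `warnings` module follows; the full warning text is made up for illustration, and only the `polars found a filename` prefix comes from the config above.

```python
import warnings


def emit_and_capture(message: str) -> list:
    """Emit one UserWarning and return the messages that survive the filters."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from "show everything"
        # Rough equivalent of the 'ignore:polars found a filename' entry:
        # the message argument is a regex matched at the start of the text.
        warnings.filterwarnings("ignore", message="polars found a filename")
        warnings.warn(message, UserWarning)
    return [str(w.message) for w in caught]


# Hypothetical warning text; only the prefix matters for the filter.
suppressed = emit_and_capture("polars found a filename, did you mean read_csv?")
kept = emit_and_capture("some other warning")
```

Because the filter keys on the message prefix rather than on a warning category, unrelated warnings still surface in the test run.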
from virus_clade_utils.util.session import check_response, get_session

logger = structlog.get_logger()


def get_covid_genome_data(released_since_date: str, base_url: str, filename: str):
    """Download genome data package from NCBI."""
    """
    Download genome data package from NCBI.
Currently, the target data code uses the NCBI API to download sequences. A future PR will update this function and grab the sequence fasta file directly from Nextstrain (to simplify and ensure reproducibility).
@@ -54,6 +62,74 @@ def get_covid_genome_data(released_since_date: str, base_url: str, filename: str
    logger.info("NCBI API call completed", elapsed=elapsed)


def download_covid_genome_metadata(url: str, data_path: Path) -> Path:
Ported from https://github.com/rogersbw/clade_data_utils/blob/main/update_clades_list.py#L11 with a few modifications:
- put the download in a small function so it can also be used by the target data process
- used the `requests` package for downloading to take advantage of its retry capabilities
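As a rough illustration of the retry capabilities mentioned above, here is a minimal sketch of a `requests` session that retries transient failures. `build_retry_session` and its defaults are hypothetical; this is not the `get_session` helper imported in the diff, whose implementation is not shown here.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retry_session(total_retries: int = 5, backoff: float = 0.5) -> requests.Session:
    """Return a Session that retries on common transient HTTP errors."""
    # Assumption: these status codes and the backoff factor are illustrative.
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the retry policy to both schemes.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Mounting the adapter on the session keeps the retry policy in one place, so any function that receives the session inherits it.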
    return result[0]


def get_covid_genome_metadata(metadata_path: Path, num_rows: int | None = None) -> pl.LazyFrame:
De-coupled the metadata download from the process of reading it into polars, mostly for more straightforward unit testing.

    if (compression_type := metadata_path.suffix) == ".zst":
        metadata = pl.scan_csv(metadata_path, separator="\t", n_rows=num_rows)
    elif compression_type == ".xz":
The full metadata file that we need is .zst (ZSTD compression), which works fine with Polars' `scan_csv` operation.
However, some of the smaller metadata sample files use a different compression method, .xz (annoying!), so this function handles both.
Either way, this function will return a LazyFrame: https://docs.pola.rs/user-guide/concepts/lazy-vs-eager/
    return metadata


def filter_covid_genome_metadata(metadata: pl.LazyFrame, cols: list = []) -> pl.LazyFrame:
This applies most of clade_data_utils' `data_prep` function, with a few changes:
- include GenBank accession columns so the target data process can join metadata to sequences
- exclude the `counts_dat` code because it's specific to the "get clade list" process (will add it there in a future PR)
    with session.get(url, stream=True) as result:
        result.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in result.iter_content(chunk_size=None):
Reading the downloaded metadata in chunks isn't strictly necessary at the moment, but it's a hedge against an eventual file size that's too large for memory.
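A self-contained sketch of the chunked write, for illustration. `stream_to_file` is a hypothetical helper that accepts any `requests.Session`-like object; the explicit `chunk_size` is an assumption, whereas the diff passes `chunk_size=None`, which with `stream=True` yields data in whatever sizes it arrives.

```python
from pathlib import Path


def stream_to_file(session, url: str, filename: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream a response body to disk and return the number of bytes written."""
    written = 0
    # stream=True defers reading the body until iter_content is consumed.
    with session.get(url, stream=True) as result:
        result.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in result.iter_content(chunk_size=chunk_size):
                # Only one chunk is held in memory at a time.
                written += f.write(chunk)
    return written
```

Passing the session in as an argument also makes the function easy to exercise in unit tests with a stub session and no network access.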
lgtm!
Thanks for the quick review 🙏
Partially addresses #13
Nothing is wired up yet; this PR adds some foundational building blocks for use by the separate "get clade list" and "create target data" tasks.