diff --git a/README.md b/README.md
index 09052d9..f86b6e8 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,16 @@
-Run krust on the test data, searching for kmers of length 5 across all sequences, like this:
+`krust` is a [k-mer](https://en.wikipedia.org/wiki/K-mer) counter written in Rust and run from the command line that will output canonical k-mers and their frequency across the records in a fasta file.

-    $ cargo run --release 5 cerevisae.pan.fa > output.tsv
+`krust` prints to `stdout`, writing, on alternate lines:
+```>{frequency}```
+```{canonical k-mer}```

-or, searching for kmers of length 21:
+`krust` uses [`rust-bio`](https://docs.rs/bio/0.38.0/bio/), [`rayon`](https://docs.rs/rayon/1.5.1/rayon/), and [`dashmap`](https://docs.rs/crate/dashmap/4.0.2).

-    $ cargo run --release 21 cerevisae.pan.fa > output.tsv
+Run `krust` on the test data in the [`krust` Github repo](https://github.com/suchapalaver/krust), searching for kmers of length 5, like this:
+```$ cargo run --release 5 cerevisae.pan.fa > output.tsv```
+or, searching for kmers of length 21:
+```$ cargo run --release 21 cerevisae.pan.fa > output.tsv```

+Future:
+A function like fn single_sequence_canonical_kmers(filepath: String, k: usize) {}
+Would returns k-mer counts for individual sequences in a fasta file.
diff --git a/src/lib.rs b/src/lib.rs
index 20895f7..1cce0ab 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -1,28 +1,21 @@
-//! # Krust
+//! # krust
 //!
-//! Krust is a k-mer counter written in Rust and run from the command line that will output canonical k-mers and their frequency across the records in a fasta file.
+//! `krust` is a [k-mer](https://en.wikipedia.org/wiki/K-mer) counter written in Rust and run from the command line that will output canonical k-mers and their frequency across the records in a fasta file.
 //!
-//! Krust prints to `stdout`, writing, on alternate lines, for example, to a .tsv file:
+//! `krust` prints to `stdout`, writing, on alternate lines:
+//! ```>{frequency}```
+//! ```{canonical k-mer}```
 //!
-//! `>{frequency across fasta file for both canonical k-mer and its reverse complement}`
+//! `krust` uses [`rust-bio`](https://docs.rs/bio/0.38.0/bio/), [`rayon`](https://docs.rs/rayon/1.5.1/rayon/), and [`dashmap`](https://docs.rs/crate/dashmap/4.0.2).
 //!
-//! `{canonical k-mer}`
+//! Run `krust` on the test data in the [`krust` Github repo](https://github.com/suchapalaver/krust), searching for kmers of length 5, like this:
+//! ```$ cargo run --release 5 cerevisae.pan.fa > output.tsv```
+//! or, searching for kmers of length 21:
+//! ```$ cargo run --release 21 cerevisae.pan.fa > output.tsv```
 //!
-//! `krust` uses the [`rust-bio`](https://docs.rs/bio/0.38.0/bio/), [`rayon`](https://docs.rs/rayon/1.5.1/rayon/), and [`dashmap`](https://docs.rs/dashmap/4.0.2/dashmap/struct.DashMap.html) crates.
-//!
-//! Run krust on the test data in the `krust` [Github repo](https://github.com/suchapalaver/krust), searching for kmers of length 5, like this:
-//!
-//! ```$ cargo run --release 5 cerevisae.pan.fa > output.tsv```
-//!
-//! or, searching for kmers of length 21:
-//!
-//! ```$ cargo run --release 21 cerevisae.pan.fa > output.tsv```
-//!
-//! Future:
-//!
-//! fn single_sequence_canonical_kmers(filepath: String, k: usize) {}
-//!
-//! Returns k-mer counts for individual sequences in a fasta file
+//! Future:
+//! A function like fn single_sequence_canonical_kmers(filepath: String, k: usize) {}
+//! Would returns k-mer counts for individual sequences in a fasta file.

 use bio::{alphabets::dna::revcomp, io::fasta};
 use dashmap::DashMap;
@@ -48,11 +41,12 @@ impl Config {
     }
 }

-/// Reads sequences from fasta records in parallel using `rayon` (crate).
-/// Ignores substrings containing `N`.
-/// Canonicalizes by lexicographically smaller of k-mer/reverse-complement
-/// Returns a `DashMap` of canonical k-mers (keys) and their frequency in the data (values).
-pub fn canonicalize_kmers(
+/// Reads sequences from fasta records in parallel using [`rayon`](https://docs.rs/rayon/1.5.1/rayon/).
+/// Using [`Dashmap`](https://docs.rs/dashmap/4.0.2/dashmap/struct.DashMap.html) allows updating single hashmap in parallel.
+/// Ignores substrings containing `N`.
+/// Canonicalizes by lexicographically smaller of k-mer/reverse-complement.
+/// Returns a hashmap of canonical k-mers (keys) and their frequency in the data (values).
+pub fn canonicalize_kmer(
     filepath: String,
     k: usize,
 ) -> Result, u64>, &'static str> {