Usage examples

Johan Nystrom-Persson edited this page Apr 13, 2023 · 12 revisions

Below are usage examples for both command line (discount.sh) and the Scala API. The Scala API can be used from the Spark shell, from Zeppelin notebooks, or from your own applications and scripts.

Note that for the Spark shell, the following sequence should be used to set up the environment:

implicit val sp = spark
import com.jnpersson.discount.spark._

It may be helpful to also view the API docs while reading the examples below.

K-mer statistics from a sequence file

Command:

discount.sh -k 35 10M.fasta stats

Example output:

==== Overall statistics ====
Number of buckets                    167,106
Distinct k-mers                  615,449,528
Unique k-mers                    578,469,365
Total abundance                  667,419,413
Mean abundance                         1.084
Max abundance                         16,881
Superkmer count                   53,609,198
Mean superkmer length                 12.450
==== Per bucket (minimizer) statistics ====
                          Mean  Min     Max     Std.dev
k-mers                    3682.989      1       23039   2680.032
abundance                 3993.988      1       24461   2908.654
superkmers                320.810       1       1841    182.905

The output can be written to a file in the following way:

discount.sh -k 35 10M.fasta stats -o 10M_stats.txt

Additional help:

discount.sh --help

discount.sh stats --help

Scala

Discount(35).index("10M.fasta").showStats()

To save the stats to a file:

Discount(35).index("10M.fasta").showStats(Some("10M_stats.txt"))

Create a k-mer index

Command:

discount.sh -k 35 10M.fasta store -o 10M_35

K-mer indexes are stored as Parquet files; in this case, a directory called 10M_35 is created. The number of partitions (files) can be controlled with the -p flag (200 partitions by default). A good rule of thumb is to aim for 64-128 MB per file for large datasets. The statistics from the stats command above are displayed when an index is created, but they can also be shown for an existing index manually:

discount.sh -i 10M_35 stats

Here 10M_35 is the name of the directory that was created by the previous command.
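As a sketch, an index with a custom partition count could be created along the following lines. The position of -p on the command line and the value 50 are illustrative assumptions; consult discount.sh store --help for the authoritative syntax.

```shell
# Sketch: store an index with 50 partitions instead of the default 200.
# The placement of -p here is an assumption; see the built-in help.
discount.sh -k 35 -p 50 10M.fasta store -o 10M_35
```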

Additional help:

discount.sh --help

discount.sh store --help

Scala

Discount(k=35).index("10M.fasta").write("10M_35")

If the index is only needed for a temporary session and does not have to be stored permanently, the final write(...) call can be omitted and the index used directly.
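For instance, a purely in-memory session might look like the following sketch, which combines only calls shown elsewhere on this page (the output path kmers_out is illustrative):

```scala
// Spark shell setup, as described at the top of this page
implicit val sp = spark
import com.jnpersson.discount.spark._

// Build the index in memory and query it directly,
// without ever writing it to disk.
val index = Discount(k = 35).index("10M.fasta")
index.showStats()                        // inspect statistics
index.counted().writeFasta("kmers_out")  // export counted k-mers
```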

Stats:

Index.read("10M_35").showStats()

Export counted k-mers

Command:

discount.sh -k 35 10M.fasta count -o counted_kmers

Or, if exporting from a k-mer index in the directory 10M_35:

discount.sh -i 10M_35 count -o counted_kmers

Example output, spread across a number of FASTA files (one per partition). The header line of each record gives the count (abundance) of the k-mer that follows:

>1
GCCAGGCTAAACGGGAAATATCTTATTTTCACCAC
>1
CCAGGCTAAACGGGAAATATCTTATTTTCACCACT
>2
TACCGATATACTCTACAAGAGAATGCCAGGCTAAA
>2
ACCGATATACTCTACAAGAGAATGCCAGGCTAAAA
>2
CCGATATACTCTACAAGAGAATGCCAGGCTAAAAG
>2
CGATATACTCTACAAGAGAATGCCAGGCTAAAAGC

Scala

Discount(k=35).index("10M.fasta").counted().writeFasta("counted_kmers")

Or, from an index:

Index.read("10M_35").counted().writeFasta("counted_kmers")

Additional help:

discount.sh --help

discount.sh count --help

Combine two indexes (intersection, union, subtraction)

For indexes to be combined, they must be compatible: they have to share the same values of k and m, and the same minimizer ordering. Suppose that an index 10M_35 has been created as in the example above. A compatible index can then be created using:

discount.sh 10M_2.fasta store -c 10M_35 -o 10M_35_2

Here the index settings are copied from the existing index at the path 10M_35 (via the -c flag), and the k-mers from the new file 10M_2.fasta are stored in a new index at the path 10M_35_2. Note that -k is not needed here, since the value of k is taken from the existing index.

Now that the indexes are compatible (this is verified when necessary by Discount), they can be combined, for example by intersection:

discount.sh -i 10M_35 intersect -i 10M_35_2 -r max -o 35_int

This intersects the two indexes, keeping the larger of the two counts for each k-mer present in both (-r max), and stores the new index at the path 35_int. The result can now be used as an input to other commands, such as stats and count above.
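For example, the intersected index behaves like any other index and can be queried or exported with the same commands as before (the output path 35_int_kmers is illustrative):

```shell
# Show statistics for the intersected index
discount.sh -i 35_int stats

# Export the intersected k-mers with their counts
discount.sh -i 35_int count -o 35_int_kmers
```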

Scala

To write the indexes and their intersection to disk, as the command-line interface does:

val d = Discount(k=35)
val i1 = d.index("10M.fasta")
i1.write("10M_35")
val i2 = i1.newCompatible(d, "10M_2.fasta")
i2.write("10M_35_2")
val i3 = i1.intersectMax(i2)
i3.write("35_int")

Or, equivalently, using pre-existing indexes:

val d = Discount(k=35)
val i1 = Index.read("10M_35")
val i2 = Index.read("10M_35_2")
val i3 = i1.intersectMax(i2)
i3.write("35_int")

Or, to operate on the indexes as temporary Spark RDDs:

val d = Discount(k=35)
val i1 = d.index("10M.fasta")
val i2 = i1.newCompatible(d, "10M_2.fasta")
i1.intersectMax(i2).showStats()

Spark RDDs are computed purely in memory when possible, and spilled to disk when they grow too large. In either case, they are much more efficient than writing and reading indexes (which use Parquet) when an index only needs to be used once.