Usage examples
Below are usage examples for both the command line (discount.sh) and the Scala API. The Scala API can be used from the Spark shell, from Zeppelin notebooks, or from your own applications and scripts.
Note that for the Spark shell, the following sequence should be used to set up the environment:
implicit val sp = spark
import com.jnpersson.discount.spark._
It may be helpful to also view the API docs while reading the examples below.
Command:
discount.sh -k 35 10M.fasta stats
Example output:
==== Overall statistics ====
Number of buckets 167,106
Distinct k-mers 615,449,528
Unique k-mers 578,469,365
Total abundance 667,419,413
Mean abundance 1.084
Max abundance 16,881
Superkmer count 53,609,198
Mean superkmer length 12.450
==== Per bucket (minimizer) statistics ====
            Mean      Min  Max    Std.dev
k-mers      3682.989  1    23039  2680.032
abundance   3993.988  1    24461  2908.654
superkmers  320.810   1    1841   182.905
The output can be written to a file in the following way:
discount.sh -k 35 10M.fasta stats -o 10M_stats.txt
Additional help:
discount.sh --help
discount.sh stats --help
The equivalent using the Scala API:
Discount(k=35).index("10M.fasta").showStats()
To save the stats to a file:
Discount(35).index("10M.fasta").showStats(Some("10M_stats.txt"))
Command:
discount.sh -k 35 10M.fasta store -o 10M_35
K-mer indexes are stored as Parquet files, in this case creating a directory called 10M_35. The number of partitions (files) can be controlled with the -p flag; by default, 200 partitions are used. A good rule of thumb is to aim for 64-128 MB per file for large datasets.
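For example, to store the same index with 100 partitions (the flag placement shown here is an assumption; see discount.sh store --help for the exact syntax):
discount.sh -k 35 -p 100 10M.fasta store -o 10M_35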
The statistics from the stats command above will be displayed when creating an index. It is also possible to display statistics for an existing index manually:
discount.sh -i 10M_35 stats
Here 10M_35 is the name of the directory that was created by the previous command.
Additional help:
discount.sh --help
discount.sh store --help
The Scala API equivalent:
Discount(k=35).index("10M.fasta").write("10M_35")
If the index is only to be used for a temporary session, and doesn't need to be stored permanently, then the final write(...) can be omitted and the index can be used as it is.
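As a minimal sketch (using only methods shown on this page), the index can be built and queried in a single session without touching disk:
val index = Discount(k=35).index("10M.fasta")
index.showStats()  // the index is used directly; nothing is written to disk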
To show statistics for an existing index:
Index.read("10M_35").showStats()
Command:
discount.sh -k 35 10M.fasta count -o counted_kmers
Or, if exporting from a k-mer index in the directory 10M_35:
discount.sh -i 10M_35 count -o counted_kmers
Example output (in a number of fasta files corresponding to the number of partitions):
>1
GCCAGGCTAAACGGGAAATATCTTATTTTCACCAC
>1
CCAGGCTAAACGGGAAATATCTTATTTTCACCACT
>2
TACCGATATACTCTACAAGAGAATGCCAGGCTAAA
>2
ACCGATATACTCTACAAGAGAATGCCAGGCTAAAA
>2
CCGATATACTCTACAAGAGAATGCCAGGCTAAAAG
>2
CGATATACTCTACAAGAGAATGCCAGGCTAAAAGC
With the Scala API:
Discount(k=35).index("10M.fasta").counted().writeFasta("counted_kmers")
Or, from an index:
Index.read("10M_35").counted().writeFasta("counted_kmers")
Additional help:
discount.sh --help
discount.sh count --help
For indexes to be combined, they have to be compatible (share the same values of k and m, and the same minimizer ordering). Suppose that an index 10M_35 has been created as in the example above. Then a compatible index can be created using:
discount.sh 10M_2.fasta store -c 10M_35 -o 10M_35_2
Here the index settings will be copied from the index that already exists at the path 10M_35, and the new file 10M_2.fasta will be stored in a new index at the path 10M_35_2.
Now that the indexes are compatible (this is verified when necessary by Discount), they can be combined, for example by intersection:
discount.sh -i 10M_35 intersect -i 10M_35_2 -r max -o 35_int
This stores the new index at the path 35_int. It can now be used as an input to other commands, like stats and count above.
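For example, to show statistics for the intersected index:
discount.sh -i 35_int stats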
To write the indexes and their intersection to disk, as the command-line interface does:
val d = Discount(k=35)
val i1 = d.index("10M.fasta")
i1.write("10M_35")
val i2 = i1.newCompatible(d, "10M_2.fasta")
i2.write("10M_35_2")
val i3 = i1.intersectMax(i2)
i3.write("35_int")
Or, equivalently but for pre-existing indexes:
val d = Discount(k=35)
val i1 = Index.read("10M_35")
val i2 = Index.read("10M_35_2")
val i3 = i1.intersectMax(i2)
i3.write("35_int")
Or, to operate on the indexes as temporary Spark RDDs:
val d = Discount(k=35)
val i1 = d.index("10M.fasta")
val i2 = i1.newCompatible(d, "10M_2.fasta")
i1.intersectMax(i2).showStats()
Spark RDDs are computed purely in memory when possible, and spilled to disk when they grow too large. In either case, this is much more efficient than writing and reading indexes (which uses Parquet) when an index only needs to be used once.