Releases: pangenome/pggb
pggb 0.4.0 - Pasticcino
This introduces:
- temporary directory management (#197);
- improvements of the alignments (#195, #198);
- `samtools`, `fastix`, `igraph`, `pycairo`, and `pafplot` in the docker/singularity image (#204);
- better management of the `-r/--resume` flag (#205);
- great improvement in graph normalization by replacing abPOA with SPOA, running it in local mode (#207). This resolves the SNV arrays introduced during the smoothing;
- abPOA/SPOA selection (#209);
- variant decomposition with `vcfbub` and `vcfwave` (#211).
pggb 0.3.1 - Pasticcione
This:
- updates how tools are compiled/built to ensure greater inter-system compatibility;
- handles `-n` differences between `pggb` and `wfmash`;
- outputs only one final graph with the `final` suffix.
pggb 0.3.0 - Esplorazione
This introduces major changes in `pggb`'s interface and alignment step for supporting high-divergence genomes and compressed graph representations:
- new wfmash version for aligning highly divergent sequences;
- new default values;
- simpler interface, with a few changes:
  - drop `-U`/`--normalize`, `-I`/`--block-id-min`, and `-M`/`--no-merge-segments`;
  - merge `-v` and `-L` in `-v`/`--skip-viz`;
  - replace `-F` with `-M` for requesting the MAF output;
- make 1D and 2D visualizations by default (`-v`/`--skip-viz` for disabling this);
- mandatory normalization with GFAffix;
- do not keep intermediate files by default (`-A` to keep them);
- updated seqwish, smoothxg, odgi, gfaffix, vg;
- new (still WIP) documentation at https://pggb.readthedocs.io/, with tutorials for PanSN-spec naming and sequence clustering;
- added bcftools statistics in the MultiQC report;
- added PAF file format input (`-a`/`--input-paf`) to skip the alignment step;
- emit 2D graph layouts in TSV too for visualizing them with gfaestus;
- timer for gfaffix execution;
- use `bgzip` for compressing VCF files.
pggb 0.2.0 - Pushing
This introduces a series of bug fixes and changes in how several steps are performed:
- alignment (`wfmash`):
  - the memory consumption during the alignment has been drastically reduced
  - fixed memory leaks
  - ignore uncalled bases (Ns) during the approximate mapping
  - improved alignment of sequences shorter than 50 kbp
  - improved the resolution of the alignment boundaries
- normalization (`smoothxg`):
  - use unseeded abPOA mode. This requires more memory than the seeded mode, but avoids alignment failure in repetitive and low-complexity regions
  - the POA blocks are padded more, to better resolve their boundaries, but not too much in too-deep blocks (depth > 100 times the number of haplotypes in input), to avoid computational burden in big, highly repetitive regions (e.g. human chr16)
- improved documentation
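The deep-block rule in the normalization notes above can be read as a simple threshold. The sketch below is illustrative only: the function name is hypothetical, and how smoothxg measures block depth internally is not stated in these notes.

```python
def is_too_deep(block_depth, n_haplotypes, factor=100):
    """Return True when a POA block exceeds the depth threshold.

    Assumption: "too deep" means depth > factor * number of input
    haplotypes, with factor = 100 as stated in the release notes.
    """
    return block_depth > factor * n_haplotypes


# Hypothetical example: 10 input haplotypes give a threshold of 1000.
print(is_too_deep(1500, 10))  # a deep block, where padding would be limited
print(is_too_deep(800, 10))   # an ordinary block
```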
super fast, super good
- Use seeded abPOA mode (in smoothxg). This requires much less memory than the unseeded mode, and allows us to explore much larger POA target lengths (`-G`).
- Set a lower POA overlap by default (`-O 0.001`).
- Automatically compute `smoothxg -w` as `-G * -n`. A specific number of haplotypes can be given via `-H`, should this differ from `-n`, in which case the window size is set as `-G * -H`.
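As a rough illustration of the window-size rule above, the following sketch computes the value passed to `smoothxg -w`. The function and parameter names are hypothetical and the numbers are made up for the example; only the arithmetic (`-G * -n`, or `-G * -H` when `-H` is given) comes from the notes.

```python
def smoothxg_window(poa_target_length, n_mappings, n_haplotypes=None):
    """Window size for smoothxg -w, following the rule in the release notes.

    poa_target_length: value of -G
    n_mappings:        value of -n
    n_haplotypes:      value of -H, or None when -H is not given
    """
    # -H takes precedence over -n when it is supplied.
    multiplier = n_mappings if n_haplotypes is None else n_haplotypes
    return poa_target_length * multiplier


# Illustrative values only.
print(smoothxg_window(10_000, 9))      # -G * -n
print(smoothxg_window(10_000, 9, 12))  # -G * -H
```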
Suggested use: set `pggb -k` larger than SINE elements or other short repeats in the genome. For human, we use `-k 311`. A long segment length for mapping is also recommended (in human we use `-s 100000`). These aren't in the defaults yet, but subsequent releases will have preset settings for different genome lengths and divergences.
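A minimal sketch of how the suggested human settings might be assembled into a command line. The helper name, the input and output paths, and the haplotype count are hypothetical, and depending on the release other options may also be required; the command is only built and printed here, not executed.

```python
import shlex


def build_pggb_command(fasta, out_dir, n_haplotypes,
                       min_match_len=311, segment_len=100_000):
    """Assemble a pggb argument list using the suggested -k and -s values."""
    return [
        "pggb",
        "-i", fasta,               # input sequences (hypothetical path)
        "-o", out_dir,             # output directory (hypothetical path)
        "-n", str(n_haplotypes),
        "-k", str(min_match_len),  # longer than SINEs and other short repeats
        "-s", str(segment_len),    # long segment length for mapping
    ]


cmd = build_pggb_command("input.fa.gz", "output", 9)
print(shlex.join(cmd))
```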
integrate gfaffix and pad the POA problems
- Correct VCF output.
- Clean up the output graph with GFAffix.
- Pad the POA problems to localize them slightly. This is configurable, but set to 1% of the longest problem length.
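To make the padding rule concrete, here is a small sketch that derives a padding length from a set of POA problem lengths. The function name, the rounding, and the example lengths are assumptions; the notes only state that the padding is configurable and set to 1% of the longest problem length.

```python
def poa_padding(problem_lengths, fraction=0.01):
    """Padding length as a fraction of the longest POA problem.

    Assumption: the result is rounded to the nearest whole base; the
    rounding smoothxg actually applies is not stated in the notes.
    """
    if not problem_lengths:
        return 0
    return round(fraction * max(problem_lengths))


# Illustrative problem lengths in base pairs.
print(poa_padding([500, 3_000, 12_000]))
```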
improving output quality
This checkpoints ongoing work to improve the variant description accuracy and parsimony of the output graphs.
a reliable pangenome graph builder
`pggb`'s development has taken place over much of the last year. It's seen a number of twists and changes to the components of the process and the best parameters for typical problems. Now, as its components are reaching a high level of refinement, it's time to mark a release.
There is nothing special about this particular checkpoint. In the future, we're sure to do better. But, at this point, for the problems we're giving `pggb`, it's giving us clean, reasonable solutions that reflect simple models of the underlying variation in the sequences we're aligning.
In particular, the quality of the process has been driven by major improvements to the initial mapping phase. These improvements were the fruit of a shakedown and reformulation of core features of the `wfmash` ultralong sequence aligner. Thanks to @AndreaGuarracino and @urbanslug for their work on this.
Now, `wfmash`'s precise, global alignments of whole chromosomes support the generation of clean, comprehensive pangenome models based on the direct relation of all sequences in our input.