Create a folder for the upcoming analysis:
cd /cbio/projects/037/$USER
mkdir -p haplotype_deconvolution
cd /cbio/projects/037/$USER/haplotype_deconvolution
Quicker! If you are at the workshop, just symlink the data (sequencing reads, reference, and pangenome for chr6) into your working directory:
cd /cbio/projects/037/$USER/haplotype_deconvolution
ln -s /cbio/projects/037/erikg/haplotype_deconvolution/sequencing_reads .
ln -s /cbio/projects/037/erikg/haplotype_deconvolution/reference .
ln -s /cbio/projects/037/erikg/haplotype_deconvolution/pangenome .
(and you can skip the downloads)
We also need to install gfainject and gafpack:
module load rust
cargo install --git https://github.com/pangenome/gafpack
cargo install --git https://github.com/chfi/gfainject
export PATH=~/.cargo/bin/:$PATH
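If the installs succeeded, both binaries should now be on your PATH. As a quick sanity check (assuming both tools print a usage message when given --help):
# Check that both tools are callable; each should print its usage message
gfainject --help
gafpack --help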
We are going to use reads that were pre-aligned against the human reference genome hg38 and stored in CRAM format. We download 3 samples from the 1000 Genomes Project (1000G).
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p sequencing_reads
cd sequencing_reads
# Get list of 1000G samples
wget https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/additional_698_related/1000G_698_related_high_coverage.sequence.index
# Get 3 samples
grep 'HG00438\|HG00673\|HG00735' 1000G_698_related_high_coverage.sequence.index | cut -f 1 > 1000G.selected.txt
# Also add the links for the CRAM indexes (write to a temporary file first; appending to a file while sed is still reading it is unsafe)
sed 's/$/.crai/' 1000G.selected.txt > 1000G.selected.crai.txt
cat 1000G.selected.crai.txt >> 1000G.selected.txt
# Download CRAM and index files
wget -i 1000G.selected.txt
cd ..
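Optionally, sanity-check the downloads before moving on (this assumes samtools is already available on your PATH; quickcheck verifies that each CRAM has an intact header and end-of-file marker):
# Optional: verify that each downloaded CRAM file is intact
samtools quickcheck -v sequencing_reads/*.cram && echo 'all CRAM files look OK'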
Since the sequencing reads are stored in CRAM format, the reference genome is needed to decode the reads from the original alignments:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p reference
cd reference
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
cd ..
We will use year 1 assemblies from the Human Pangenome Reference Consortium. For simplicity, we will download the PGGB pangenome graph of chromosome 6 and get the contigs in FASTA format with ODGI:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p pangenome
cd pangenome
# Download chromosome 6 pangenome graph
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
# GFA -> FASTA
module load pggb
odgi paths -i chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -f -t 16 -P > chr6.pan.fa
samtools faidx chr6.pan.fa
cd ..
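The extracted contig names follow the PanSN convention (sample#haplotype#contig), which is what lets later steps group sequences by haplotype. A quick look at the FASTA index confirms this (the exact names depend on the HPRC release):
# Inspect the first few sequence names and lengths (PanSN: sample#haplotype#contig)
head -n 5 pangenome/chr6.pan.fa.fai | cut -f 1,2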
We will genotype the samples in the C4 region. For that, we need to find the corresponding region on all the assemblies in the pangenome. An easy way to do this is to first align the pangenome against the reference:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p wfmash
wfmash \
reference/GRCh38_full_analysis_set_plus_decoy_hla.fa pangenome/chr6.pan.fa \
-s 10k -p 95 \
-t 16 \
> wfmash/pangenome_to_reference.paf
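Each PAF line records one approximate mapping of a pangenome contig (query) onto the reference (target). If you like, peek at a few columns to see which reference chromosome each contig landed on:
# Show query name, query length, target name and target start for the first few mappings
head -n 5 wfmash/pangenome_to_reference.paf | cut -f 1,2,6,8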
and then project the alignments of the C4 region by using IMPG:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p impg
# Load the impg module
module load impg
# Download the BED file containing the C4 region coordinates on hg38's chromosome 6
wget https://raw.githubusercontent.com/pangenome/HumanPangenomeBYOD2024/refs/heads/main/data/hap-deconv.region-of-interest.bed
impg \
-p wfmash/pangenome_to_reference.paf \
-b hap-deconv.region-of-interest.bed \
-x \
> impg/projected.bedpe
Let's now extract the C4 regions from the pangenome:
# Merge regions
bedtools sort -i impg/projected.bedpe | \
bedtools merge -d 100000 -i - \
> impg/merged.bed
samtools faidx \
-r <(awk -v OFS='\t' '{print($1":"$2+1"-"$3)}' impg/merged.bed) \
pangenome/chr6.pan.fa \
> impg/extracted.fasta
samtools faidx impg/extracted.fasta
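It is worth counting how many sequences were pulled out, since the -n value passed to pggb below should roughly reflect the number of extracted haplotype sequences:
# Count the extracted sequences (should roughly match the 36 haplotypes mentioned below)
wc -l < impg/extracted.fasta.fai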
Now we build a C4 region pangenome graph with PGGB (n.b. there are 36 haplotypes):
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p pggb
pggb -i impg/extracted.fasta -o pggb -t 16 -n 36
# Rename the final ODGI graph to a more human-friendly name
mv pggb/*smooth.final.og pggb/final.og
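A quick summary of the resulting graph (total sequence length and the number of nodes, edges and paths) is a useful sanity check before moving on:
# Summarize the C4 region graph: length, nodes, edges, paths
odgi stats -i pggb/final.og -S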
Let's chop the graph and get a haplotype coverage matrix:
mkdir -p odgi
odgi chop \
-i pggb/final.og \
-c 32 \
-o odgi/chopped.og
odgi paths \
-i odgi/chopped.og \
-H | \
cut -f 1,4- | \
gzip > odgi/paths_matrix.tsv.gz
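The matrix has one row per pangenome path (haplotype) and one column per chopped node, with each value giving how many times that path covers the node. Peek at a corner of it if you like:
# First rows and columns of the haplotype coverage matrix
zcat odgi/paths_matrix.tsv.gz | head -n 3 | cut -f 1-6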
Let's index the C4 region pangenome and align sequencing reads to it with BWA MEM:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p alignment
module load bwa
bwa index impg/extracted.fasta
ls sequencing_reads/*cram | while read CRAM; do
echo $CRAM
NAME=$(basename $CRAM .cram)
# Extract reads covering the C4 region and then align them against the pangenome
samtools view \
-T reference/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-L hap-deconv.region-of-interest.bed \
-M \
-b \
$CRAM | \
samtools fasta | \
bwa mem -t 14 impg/extracted.fasta - | \
samtools view -b -F 4 -@ 2 - \
> alignment/$NAME.reads_vs_extracted.bam
done
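Before injecting these alignments into the graph, it can help to check how many reads actually mapped to the extracted C4 sequences:
# Quick mapping summary for each sample's BAM
for BAM in alignment/*.reads_vs_extracted.bam; do
    echo $BAM
    samtools flagstat $BAM
done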
Inject the read-to-pangenome alignments into the pangenome graph:
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p odgi
odgi view \
-i odgi/chopped.og \
-g > odgi/chopped.gfa
ls sequencing_reads/*cram | while read CRAM; do
echo $CRAM
NAME=$(basename $CRAM .cram)
gfainject \
--gfa odgi/chopped.gfa \
--bam alignment/$NAME.reads_vs_extracted.bam \
> alignment/$NAME.injected.gaf
done
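gfainject rewrites each BAM record as a GAF line whose path field (column 6) lists the graph nodes the read traverses. A quick look at the outputs (column numbers per the GAF specification):
# Show the read name and graph path of the first few injected alignments per sample
for GAF in alignment/*.injected.gaf; do
    echo $GAF
    cut -f 1,6 $GAF | head -n 3
done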
Let's get the sample coverage vectors:
ls sequencing_reads/*cram | while read CRAM; do
echo $CRAM
NAME=$(basename $CRAM .cram)
gafpack \
-g odgi/chopped.gfa \
-a alignment/$NAME.injected.gaf \
--len-scale | \
gzip > alignment/$NAME.coverage.gafpack.gz
done
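Each gafpack output is a per-sample coverage vector over the graph nodes; cosigt will compare these vectors against the haplotype matrix built earlier. Assuming gafpack writes a header row of node identifiers followed by one row of coverage values, you can peek at a few columns:
# Show the first few node-coverage columns for each sample
for GZ in alignment/*.coverage.gafpack.gz; do
    echo $GZ
    zcat $GZ | cut -f 1-6
done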
IMPORTANT: this workflow is currently compatible with cosigt v0.1.0 (commit 92622f6c095a1b43b0e13fef2893c74b9bfee887).
# setup
module load R
# First install required R packages in user's home directory
R --vanilla <<EOF
# Create the user library first, so that .libPaths() keeps it (non-existent directories are silently dropped)
dir.create("~/R/library", recursive=TRUE, showWarnings=FALSE)
.libPaths(c("~/R/library", .libPaths()))
install.packages(c("reshape2", "NbClust", "rjson", "dendextend", "ggplot2", "data.table"),
                 lib="~/R/library",
                 repos="https://cloud.r-project.org")
EOF
Now the actual genotyping step.
cd /cbio/projects/037/$USER/haplotype_deconvolution
mkdir -p cosigt
module load cosigt
# First generate the similarity matrix using odgi
odgi similarity \
-i odgi/chopped.og \
> odgi/similarity.tsv
# Download the clustering script from cosigt repository
wget https://raw.githubusercontent.com/davidebolo1993/cosigt/16b18815cf9fdfcbf2afbf588a02740c27941ee3/cosigt_smk/workflow/scripts/cluster.r
chmod +x cluster.r
# Create a wrapper script that sets the library path before running cluster.r
cat > run_cluster.r <<'EOF'
#!/usr/bin/env Rscript
.libPaths(c("~/R/library", .libPaths()))
source("cluster.r")
EOF
chmod +x run_cluster.r
# Run the clustering script to generate the JSON
./run_cluster.r odgi/similarity.tsv cosigt/clusters.json
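clusters.json groups haplotypes whose paths through the C4 graph are highly similar. The exact structure depends on the cluster.r version, but a quick look at the file shows the mapping:
# Inspect the haplotype clusters (JSON produced by cluster.r)
head -c 500 cosigt/clusters.json; echo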
ls sequencing_reads/*cram | while read CRAM; do
echo $CRAM
NAME=$(basename $CRAM .cram)
cosigt \
-i $NAME \
-p odgi/paths_matrix.tsv.gz \
-g alignment/$NAME.coverage.gafpack.gz \
-o cosigt
mv cosigt/cosigt_genotype.tsv cosigt/$NAME.cosigt_genotype.tsv
mv cosigt/sorted_combos.tsv cosigt/$NAME.sorted_combos.tsv
done
Check the output:
grep '^alignment' cosigt/*.cosigt_genotype.tsv | column -t