Affinity Maturation of B-cell receptors (AffMB)
pandas, numpy, matplotlib, scipy, Bio, logomaker, newworkx, igraph>=0.10.4, pycairo or cairocffi
After installing the above requirements:
pip3 install -i https://test.pypi.org/simple/ affmb==1.2.1
The input of AffMB is an BCR repertoire annotation file typically generated by running V(D)J annotation tools such as igblast. The minimum set of columns required for AffMB include: barcode, locus, v_call, j_call, c_call, cdr1, cdr2, cdr3, shm. Note the three cdr columns must be nucleotide (nt) sequences. All columns in the input file will be preserved in the output file.
***** The default column names are barcode, locus, v_call, j_call, c_call, cdr1, cdr2, cdr3, shm; if your input columns have other names (e.g., cell_id, chain, v_gene, v_identity, etc.), you can specify the column names in the parameters (see Parameters) *****
***** AffMB assumes all input sequences to be productive and at least the V and C genes are called. Errors will be thrown if content of the v_call, c_call or any cdr columns is missing. The igblast_parser.py contains commands to filter non-productive input. However, if your input is not generated by the script, please remember to do the filtering before running AffMB *****
A typical starting material is the assembled contig sequence file (.fasta format). Here we provide two example contig fasta files under the example folder. The example.filtered_contig.fasta is from single-cell sequencing data, as part of the 10X cellranger pipeline output; the bulk_contig13m.fasta is from bulk sequencing data.
To ensure repeatable output, please follow the instructions below to call igblast (external software to be installed separately) and prepare the input for AffMB by parsing the igblast output using the python script igblast_parser.py we provided. The preparation steps can apply to both single-cell and bulk data.
Instructions on igblast install: https://ncbi.github.io/igblast/cook/How-to-set-up.html
After igblast installation, go to the working directory of igblast which should have the following file structure:
bin
database
internal_data
optional_file
Download the human vdj and c gene databases for igblast that are used in the prepare python script:
# run under igblast working directory
wget https://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/airr/airr_c_human.tar
mkdir -p airr_c_human
tar -xvf airr_c_human.tar -C airr_c_human/
wget https://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/ncbi_human_c_genes.tar
mkdir -p database/
tar -xvf ncbi_human_c_genes.tar -C database/
Export experimental variables and run igblastn:
export PATH=$PATH:/path/to/ncbi-igblast-1.x.x/bin
export IGDATA=/path/to/ncbi-igblast-1.x.x
igblastn -germline_db_V $IGDATA/airr_c_human/airr_c_human_ig.V -germline_db_J $IGDATA/airr_c_human/airr_c_human_ig.J -germline_db_D $IGDATA/airr_c_human/airr_c_human_igh.D -c_region_db $IGDATA/database/ncbi_human_c_genes -organism human -auxiliary_data $IGDATA/optional_file/human_gl.aux -query example.filtered_contig.fasta -ig_seqtype Ig -outfmt 19 > example.filtered_contig.igblast.airr.tsv
igblastn -germline_db_V $IGDATA/airr_c_human/airr_c_human_ig.V -germline_db_J $IGDATA/airr_c_human/airr_c_human_ig.J -germline_db_D $IGDATA/airr_c_human/airr_c_human_igh.D -c_region_db $IGDATA/database/ncbi_human_c_genes -organism human -auxiliary_data $IGDATA/optional_file/human_gl.aux -query bulk_contig13m.fasta -ig_seqtype Ig -outfmt 19 > bulk_contig13m.igblast.airr.tsv
Run the igblast_parser.py to parse igblast output and generate the input annotation file for AffMB:
# python igblast_parser.py <igblast_output> <outfile>
python igblast_parser.py example.filtered_contig.igblast.airr.tsv example.filtered_contig.annotation.csv
python igblast_parser.py bulk_contig13m.igblast.airr.tsv bulk_contig13m.annotation.csv
The igblast_parser.py calculates SHM rate as the fraction of mutated positions between the V region sequence and the inferred germline V region sequence. Alternatively, one may use 'v_identity' to approximate SHM rate (see Parameters).
For single-cell input:
from affmb import affmb
import subprocess
# single-cell (paired-chain) input
outdir = 'test_paired'
subprocess.run('mkdir -p '+outdir, shell=True)
infile = 'example/example.filtered_contig.annotation.csv'
affmb.paired_repertoire_analysis(infile,outdir,depth_filter=2)
#Alternatively, if you want to analyze and visualize only the IGH chain in a single-cell data
#affmb.IGH_repertoire_analysis(infile,outdir,depth_filter=2)
For bulk input:
# bulk (IGH) input
outdir = 'test_bulk'
subprocess.run('mkdir -p '+outdir, shell=True)
infile = 'example/bulk_contig13m.annotation.csv'
affmb.IGH_repertoire_analysis(infile,outdir,depth_filter=2)
affmb.paired_repertoire_analysis(infile,outdir,cell_id='barcode',chain='locus',v_gene='v_call',j_gene='j_call',c_gene='c_call',cdr1_nt='cdr1',cdr2_nt='cdr2',cdr3_nt='cdr3',shm='shm',sep=',',depth_filter=2,method='inherent',logo=True,logo_dist_cutoff=0.1,logo_depth_cutoff=2,logo_thres=5,vertex_color='red')
affmb.IGH_repertoire_analysis(infile,outdir,cell_id='barcode',chain='locus',v_gene='v_call',j_gene='j_call',c_gene='c_call',cdr1_nt='cdr1',cdr2_nt='cdr2',cdr3_nt='cdr3',shm='shm',sep=',',depth_filter=2,method='inherent',logo=True,logo_dist_cutoff=0.1,logo_depth_cutoff=2,logo_thres=5,vertex_color='red')
infile: input annotation file name
outdir: output directory name, must be an existing directory
cell_id: the column name for the cell ID (or sequence ID in bulk data) [default: 'barcode']
chain: the column name for the chain type (IGH/IGL/IGK) information [default: 'locus']
v_gene: the column name for the V gene [default: 'v_call']
j_gene: the column name for the J gene [default: 'j_call']
c_gene: the column name for the C gene [default: 'c_call']
cdr1_nt: the column name for the cdr1 nucleotide (nt) sequence [default: 'cdr1']
cdr2_nt: the column name for the cdr2 nucleotide (nt) sequence [default: 'cdr2']
cdr3_nt: the column name for the cdr3 nucleotide (nt) sequence [default: 'cdr3']
shm: the column name for the SHM rate; set shm='v_identity' if you want to approximate SHM rate with 'v_identity' [default: 'shm']
sep: delimiter of the input file [default: ',']
depth_filter: lineage depth threshold under which lineage trees will not be plot, reduces the number of generated tree figures; set depth_filter=0 to stop the filter [default: 2]
method: algorithms for tree construction, choose within {'inherent','prims'} [default: 'inherent']
logo: whether to detect branches of interest and generate logo plots [default: True]
logo_dist_cutoff: determines the largest possible pairwise hamming distance between nodes in a branch of interest [default: 0.1]
logo_depth_cutoff: determines the minimal depth of a branch of interest [default: 2]
logo_thres: determines the minimal size of a branch of interest [default: 5]
vertex_color: the color of the nodes in the lineage tree figures [default: 'red']
The figure below is a graphical summary of AffMB output. The output of running AffMB on the example.filtered_contig.annotation.csv is shown in the test_paired folder. A typical AffMB output directory contains multiple lineage tree figures (one tree per figure) with depth_filter applied (in this example depth_filter=2 is applied) and a csv table file which provides cell-to-node mapping to the lineage tree figures. If branches of interest are detected, logo plots at amino acid and nucleotide levels will also be generated. A summary plot of lineage statistics are also generated for the depth of the lineages, the dominated Ig type of the lineages, the extent of clonal expansion of the lineages, and the lineage size (i.e., number of unique genotypes in lineage).