Contig Annotation

For each nucleotide sequence, the script collects annotation by searching nucleotide, protein and conserved domain (CDD) databases. ORFs are predicted for each nucleotide sequence and then translated to protein sequences before the protein and domain searches.
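To illustrate the translation step, here is a minimal sketch of turning one predicted ORF into a protein sequence. It is not the script's own implementation (which depends on glimmer for ORF prediction). The function name `translate_orf` is hypothetical, and the sketch assumes the ORF is already on the forward strand, starts in frame, and uses the standard genetic code.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(nucleotides):
    """Translate an in-frame ORF (hypothetical helper, not part of the script).

    Translation stops at the first stop codon. Codons containing
    ambiguous bases (e.g. N) are reported as 'X'. Trailing bases that
    do not form a full codon are ignored.
    """
    seq = nucleotides.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_orf("ATGGCCTAA"))
```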

Here is an example command to run the script:

python contigAnnotation.py -i ex_contig.fa -o ex_contig_annotation.dat -d databases.ini

Input

Input files for the script are:

FASTA (nucleotide) file with contigs
configuration file with paths to the databases

The configuration file is in INI format:

[Taxonomy]
viral_protein = /media/THING1/sminot/timecourse/4AnnotateContigs/4.12TaxonomicFamily/4.12.1ViralFamilyProteinsDB/

[CDD]
skip=TRUE
cdd = /media/THING1/dryga/PhageDynamics/CDD/cdd/little_endian 
rpsbproc_ini = ./rpsbproc.ini

[ProteinDB]
integrase = /media/THING1/local/genomeIndexes/blast/UniprotPhageIntegrase.fasta
aclame = /media/THING1/local/genomeIndexes/blast/ACLAME/aclame_proteins_viruses_prophages_0.4.fasta
vfdb = /media/THING1/local/genomeIndexes/blast/VFDB/VFs.faa

[NucleotideDB]
viral = /media/THING1/local/genomeIndexes/blast/viral.1.1.genomic.fna
nt = /media/THING1/local/genomeIndexes/blast_nt/nt
bacteria = /media/THING1/local/genomeIndexes/blast/BacterialGenomes/ncbi_bacteria.fa

The configuration file has 4 sections: Taxonomy, CDD, ProteinDB and NucleotideDB.

The Taxonomy section has only one key/value pair: the key must be viral\_protein and the value is the path to the blastp protein database used for viral family assignment.

The CDD section has 3 key/value pairs: cdd, whose value is the path to the CDD database; rpsbproc\_ini, whose value is the path to the INI file required by the rpsbproc utility; and skip, which controls whether the CDD search is run. CDD searches can be disabled to save running time:

skip=TRUE

To actually run the CDD search, use:

skip=FALSE

The ProteinDB section has an arbitrary number of key/value pairs: each key is the name of a protein BLAST database and each value is the path to that database.

The NucleotideDB section has an arbitrary number of key/value pairs: each key is the name of a nucleotide BLAST database and each value is the path to that database.
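The section layout described above can be read with Python's standard `configparser` module. The following is a minimal sketch, not the script's own configuration code: the function name `load_databases`, the returned dictionary keys, and the `/path/to/...` values are all hypothetical, and treating a missing skip key as FALSE is an assumption.

```python
import configparser

# Placeholder paths for illustration only; replace with real database locations.
EXAMPLE_INI = """
[Taxonomy]
viral_protein = /path/to/viral_family_proteins/

[CDD]
skip = TRUE
cdd = /path/to/cdd/little_endian
rpsbproc_ini = ./rpsbproc.ini

[ProteinDB]
integrase = /path/to/UniprotPhageIntegrase.fasta

[NucleotideDB]
viral = /path/to/viral.1.1.genomic.fna
"""

REQUIRED_SECTIONS = ("Taxonomy", "CDD", "ProteinDB", "NucleotideDB")


def load_databases(text):
    """Parse INI text and return the settings as a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    return {
        "viral_protein": parser.get("Taxonomy", "viral_protein"),
        # getboolean accepts TRUE/FALSE in any letter case.
        # Assumption: CDD is run when the skip key is absent.
        "skip_cdd": parser.getboolean("CDD", "skip", fallback=False),
        "cdd": parser.get("CDD", "cdd"),
        "rpsbproc_ini": parser.get("CDD", "rpsbproc_ini"),
        # Arbitrary number of name -> path pairs in these two sections.
        "protein_dbs": dict(parser["ProteinDB"]),
        "nucleotide_dbs": dict(parser["NucleotideDB"]),
    }


config = load_databases(EXAMPLE_INI)
```

Note that `configparser` lowercases key names by default, so database names are best written in lowercase, as in the example configuration.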

The rpsbproc utility used for the CDD search needs the file rpsbproc.ini.

CDD data from NCBI can be found at: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd

Output

The script creates a tab-delimited file in which the first line is a header and each remaining line holds the annotation for one sequence from the FASTA input file.

For each sequence it reports the length, circularity, number of predicted ORFs, number of ORFs that match the viral protein database, putative viral family based on the viral matches, number of CDD domains, number of hits to each protein database (given in the ProteinDB section of the INI file), and the top hit for each nucleotide database (given in the NucleotideDB section of the INI file).

The current format is:

contig_name	length	circular	nORFs	nViralORFs	family	nDomains	integrase   ... 	aclame	viral	...	bacteria

Here '...' stands for the remaining databases listed in the ProteinDB (or NucleotideDB) section of the INI file.
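Because the output is a tab-delimited file with a header line, it can be loaded with the standard `csv` module. The sketch below is only an illustration: the two data rows are invented, the function names are hypothetical, and the way the script encodes individual values (for example the circular column, shown here as NA) is not specified above, so every field is kept as a string unless explicitly converted.

```python
import csv
import io

# Invented example rows; the column names follow the header format shown above.
EXAMPLE_OUTPUT = (
    "contig_name\tlength\tcircular\tnORFs\tnViralORFs\tfamily\tnDomains"
    "\tintegrase\taclame\tviral\tbacteria\n"
    "contig_1\t5386\tNA\t11\t9\tMicroviridae\t0\t0\t7\tNA\tNA\n"
    "contig_2\t1200\tNA\t2\t0\tNA\t0\t0\t0\tNA\tNA\n"
)


def read_annotation(handle):
    """Return one dictionary per contig, keyed by the header column names."""
    return list(csv.DictReader(handle, delimiter="\t"))


def contigs_with_viral_orfs(rows):
    """Names of contigs with at least one ORF matching the viral protein db."""
    return [row["contig_name"] for row in rows if int(row["nViralORFs"]) > 0]


rows = read_annotation(io.StringIO(EXAMPLE_OUTPUT))
viral_contigs = contigs_with_viral_orfs(rows)
```

For a real run, open the file passed to -o (for example ex_contig_annotation.dat) and hand the file object to `read_annotation` instead of the in-memory string.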

Dependencies

The script uses several external programs and databases to annotate sequences.

Programs

blastn
blastp
rpsblast
rpsbproc
glimmer
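Before a long run it can help to confirm that the programs above are on the PATH. This is a minimal sketch using `shutil.which`. The function name `missing_programs` is hypothetical, and the executable names are taken directly from the list above, which is an assumption: the names installed on a given system may differ.

```python
import shutil

# Names copied from the list above; installed executable names may differ.
PROGRAMS = ("blastn", "blastp", "rpsblast", "rpsbproc", "glimmer")


def missing_programs(names, which=shutil.which):
    """Return the names that cannot be located on the PATH.

    The lookup function is a parameter so it can be replaced in tests.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_programs(PROGRAMS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All programs found.")
```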

Databases

BLAST databases are required for annotating sequences with BLAST hits. The CDD search requires the CDD database and the additional files referenced in rpsbproc.ini.
