Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Virulence genes and plasmids #28

Closed
pavlo888 opened this issue Sep 9, 2020 · 18 comments
Closed

Virulence genes and plasmids #28

pavlo888 opened this issue Sep 9, 2020 · 18 comments
Labels
custom databases enhancement New feature or request

Comments

@pavlo888
Copy link

pavlo888 commented Sep 9, 2020

Hi Narciso,

The new release looks really good!!! I can't wait to try it out. However I was wondering if it would be possible to supply a (fasta) list of virulence genes of interest and a plasmid sequences of interest for the pipeline to focus on?

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

Yes absolutely! I was am planning to add the option of making TORMES work with user-customized databases for the next version, but at this time I can show you how to incorporate your sequences to the already existing virulence and plasmid databases. Is it that OK for you?

If so, please just send me the output of which tormes (after activating the environment)

Best,
Narciso

@nmquijada nmquijada added enhancement New feature or request custom databases labels Sep 10, 2020
@pavlo888
Copy link
Author

Hi Narciso,

This is what I get after typing

which tormes

/home/sam/anaconda3/envs/tormes-1.2.0/bin/tormes

What next?

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

Thanks for sharing. This way I can write to you the exact paths.
For the genes screening, TORMES uses Abricate, that has instructions in how to create a custom database in its github page: https://github.com/tseemann/abricate#making-your-own-database
As we are going to make it compatible with TORMES, I will guide you through this process:

What we are going to do is to add your sequences to the already available VFDB and PlasmidFinder databases.
So, imagine you have your new genes in two different files, new-virulence.fasta and new-plasmids.fasta, for virulence and plasmid genes, respectively. The first thing that has to be done is to modify the header of each gene to match the following pattern co it can be recognized by Abricate:

>DBname~~~gene~~~gene_accession~~ description of the gene

For the new-virulence.fasta, modify "DBname" by "vfdb" and for the new-plasmids.fasta, modify "DBname" by "plasmidfinder".
In case you don't have a gene accession number, please write "NA" instead of "gene_accession".
The description of the gene can be an entire sentence. Write "NA" if you don't have a description. As you are including new genes in the original VFDB and PlasmidFinder, maybe some description might make it easier to identify. Please mind the space between the last "~" and the beginning of the description.
As an example, this is how a gene header must look like:

>vfdb~~~aafB~~~AAD27809 (aafB) invasin homolog AafB [AAFs (VF0214)] [Escherichia coli str. 042]

Once you have the headers of all genes from each fasta file already modified, type:

cat new-virulence.fasta >> /home/sam/anaconda3/envs/tormes-1.2.0/db/vfdb/sequences
cat new-plasmids.fasta >> /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences

The next step is to format the new databases. Do:

makeblastdb -in /home/sam/anaconda3/envs/tormes-1.2.0/db/vfdb/sequences -title vfdb -dbtype nucl -hash_index
makeblastdb -in /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences -title plasmidfinder -dbtype nucl -hash_index

And that's it. The next time you are running TORMES, the new genes will be included in the "Virulence genes" and the "Plasmid replicons" sections of the tormes-report.
Let me know if you have any issue.
Narciso

@pavlo888
Copy link
Author

pavlo888 commented Sep 15, 2020

Hi Narciso,

I will definitely try this! Thank you for your clear explanation. However, I still have two questions:

(1) The final nomenclature for the virulence genes should be

vfdb~~~aafB~~~AAD27809 (aafB) invasin homolog AafB [AAFs (VF0214)] [Escherichia coli str. 042]

or

vfdb~~~aafB~~~AAD27809~~ (aafB) invasin homolog AafB [AAFs (VF0214)] [Escherichia coli str. 042]?

(2) The plasmid sequences should be provided in one multi-fasta file?

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

(1) Both headers will work (remember the ">" at the beginning of each header). The last ~~ look for the "resistance" in antimicrobial resistance datasets, but that field is not included in virulence or plasmid genes in tormes-report.

(2) This depends on how many sequences do you have. The software is going to read only one file (multi-fasta). Therefore, if you have all sequences in the same fasta file you can follow the steps I told you before. On the other hand, if you have one sequence per file you can concatenate those fasta files into a single one first. For instance:

cat file1.fasta file2.fasta file3.fasta >> new-plasmids.fasta

And then just follow the other steps.

I hope this helps,
Narciso

@pavlo888
Copy link
Author

Hi Narciso,

Thanks a lot for the great explanation!!!

I have tried following and then running the tormes pipeline but then I got an error :(

`Quitting from lines 243-252 (tormes_report.Rmd)
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
Calls: ... withCallingHandlers -> withVisible -> eval -> eval -> colnames<-

Execution halted
`

Any idea what is going on?

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

The report generation failed when performing the action between lines 243 and 252 of the tormes_report.Rmd. Did the other analyses run well?
I might need the "report_files.tgz" to inspect the error. Can you please share that file with me?
If you think you have sensitive information, send it to my email: nmartinquijada at gmail dot com

Best,
Narciso

@nmquijada
Copy link
Owner

Hi Pablo,

Thanks for sharing the files.
The report failed beacuse your bacteria don't have an MLST scheme and so that analysis "failed". You can solve it by running again tormes by including the --no_mlst option, as explained here: #27 (comment).

Additionally, I've see that you don't have any output from Kraken2. Please perform this step before running tormes again:
#30 (comment)

Let me know if it worked!
Narciso

@pavlo888
Copy link
Author

Hi Narciso,

It worked perfectly!!!! Thanks a lot for your support!!!

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

Good that it worked!

Best,
Narciso

@pavlo888
Copy link
Author

Hi Narciso,

I was trying to add a new set of plasmid sequences to the DB and I got this error when I tried to run this command:

makeblastdb -in /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences -title plasmidfinder -dbtype nucl -hash_index

Building a new DB, current time: 10/13/2020 22:03:26
New DB name:   /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences
New DB title:  plasmidfinder
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences
Keep MBits: T
Maximum file size: 1000000000B
Error: (803.7) [makeblastdb] Blast-def-line-set.E.title
Bad char [0xCA] in string at byte 56
Plasmidfinder~~~pRi2659~~~EU186381 Agrobacterium�rhizogenes plasmid pRi2659, completegenome [Agrobacterium rhizogenes]
Error: (803.7) [makeblastdb] Blast-def-line-set.E.title
Bad char [0xCA] in string at byte 67
Plasmidfinder~~~pRi1724~~~AP002086 Agrobacterium rhizogenes�plasmid pRi1724 DNA, complete sequence  [Agrobacterium rhizogenes]
Error: (803.7) [makeblastdb] Blast-def-line-set.E.title
Bad char [0xCA] in string at byte 56
Plasmidfinder~~~pRi2659~~~EU186381 Agrobacterium�rhizogenes plasmid pRi2659, completegenome [Agrobacterium rhizogenes]
Error: (803.7) [makeblastdb] Blast-def-line-set.E.title
Bad char [0xCA] in string at byte 67
Plasmidfinder~~~pRi1724~~~AP002086 Agrobacterium rhizogenes�plasmid pRi1724 DNA, complete sequence  [Agrobacterium rhizogenes]
Adding sequences from FASTA; added 464 sequences in 0.019649 seconds.

Is it ok? Or is there something wrong?

Cheers,
Pablo

@biobrad
Copy link
Collaborator

biobrad commented Oct 13, 2020 via email

@nmquijada
Copy link
Owner

Hi @pavlo888

As @biobrad said, it seems that you have some problematic characters that are making makeblastdb to fail.
Try to remove then manually or either try:

sed -i "s/Agrobacterium.*pRi2659/Agrobacterium rhizogenes plasmid pRi2659/" /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences
sed -i "s/Agrobacterium.*pRi1724/Agrobacterium rhizogenes plasmid pRi1724/" /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences

Additionally, mind the capital characters when writing the database name. In your makeblastdb command, the database is called plasmidfinder whereas in the fasta file the headers are called Plasmidfinder.

Please run this command for solving this issue:
sed -i "s/^Plasmidfinder/plasmidfinder/" /home/sam/anaconda3/envs/tormes-1.2.0/db/plasmidfinder/sequences

Best,
Narciso

@pavlo888
Copy link
Author

pavlo888 commented Nov 6, 2020

Hi @nmquijada and @biobrad,

Thanks a lot for your input! Indeed there was a weird character in my sequences. But I deleted it manually and it is now all running smoothly! thank you for your help.

Another quick question. Is there a way to visualize the bootstrap values in the phylogenetic trees obtained as a result? Or is there any other way to check statistically for the topology of the tree?

Cheers,
Pablo

@nmquijada
Copy link
Owner

Hi Pablo,

Yes, there are! However, it requires some manipulation of the tormes_report.Rmd file (or doing it by yourself in R environment) contained in report_files.tgz.
For the visualization of the phylogenetic trees, TORMES uses the ggtree (https://bioconductor.org/packages/release/bioc/html/ggtree.html) and treeio (https://bioconductor.org/packages/release/bioc/html/ggtree.html) R packages and by default, an unrooted tree is generated. However, this tree can be subjected to further modifications in order to change its visualization.

Let's modify the visualization of your pangenome tree based on the presence/absence of accessory genes. If you open the tormes_report.Rmd file and navigate to the "Pangenome analysis" section, you might find something like this:

#### Pangenome tree based on presence/absence of accesory genes representation {.tabset .tabset-fade .tabset-pills}
##### Rectangular (phylogram)
{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height= 15}
tree=read.newick("accessory_binary_genes.fa.newick")
treelim=20*mean(tree$edge.length)
ggtree(tree) + geom_tiplab(size=3) + xlim_tree(treelim)

The last line contains the options to be parsed to the ggtree command to visualize your tree. So, for instance, if you would like to include the tree scale (evolution distance), you can modify that line by:

ggtree(tree) + geom_tiplab(size=3) + xlim_tree(treelim) + geom_treescale()

If you would like to add the branch support values, you can do:

ggtree(tree) + geom_tiplab(size=3) + xlim_tree(treelim) + geom_treescale() + geom_nodelab(aes(x=branch), vjust=-0.3, size=2)

After you modified your tormes_report.Rmd file, you can easily re-run you report with the new changes by using the "render_report.sh" script as explained here: https://github.com/nmquijada/tormes#rendering-customized-reports

There's lot of further customizations possible. I encourage you to go through the ggtree guide: http://yulab-smu.top/treedata-book/index.html

Let me know if you have any question!

I think this is a useful discussion and maybe other users would be interested in doing so too. Thus I would try to upload a small tutorial to the wiki pages.

Best,
Narciso

@nmquijada
Copy link
Owner

Hi Pablo,

We have just released the new TORMES version 1.3.0.
This new version allows the possibility to create custom nucleotide databases by using the --custom_genes_db option.
Please visit the Wiki for further instructions on how to create and use your own databases!

I will close the issue now.
Thanks a lot,
Narciso

@shlomobl
Copy link

Hi,
Is it possible to use a custom aa database?
Thanks!

@nmquijada
Copy link
Owner

Hi @shlomobl

I just replied here: #56

Cheers!
Narciso

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
custom databases enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants