Problems with GTDB taxonomy assignments to MARdb sequences #42
Replies: 4 comments
-
Hi there, one of the issues seems to be that MARdb uses an NCBI taxonomy tool to classify genomes instead of GTDB-Tk. That explains the taxonomy observed in the final output, since we were using the MARdb taxonomy database. Doing some research, I found that the Ocean Microbiomics Database used GTDB-Tk to classify their genome collection (paper here). So we can use their database to assign taxonomy to MARdb genomes, at least to the ones included in their genome collection. I'll expand the codebase so that either NCBI or GTDB taxonomy can be assigned to MARdb sequences (those whose genome has an assigned taxonomy, that is). This will solve one of the issues.

The lack of MMP identifiers is not related to the code per se, but to the MARdb preprocessing. We will have to dig in to figure out what's happening. I have opened issue #41 to address this problem.
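As a rough illustration of the planned option, here is a minimal sketch of a lookup that lets the caller pick the taxonomy source. Everything in it is an assumption made for the example: the function name `assign_taxonomy`, the `source` and `fallback` parameters, and the placeholder genome IDs and lineages are hypothetical and are not taken from relabeltree.py, MARdb, or the Ocean Microbiomics Database.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup tables: genome ID -> lineage string.
# The IDs and lineages are placeholders, not real database records.
NCBI_TAXONOMY: Dict[str, str] = {
    "genome_A": "Bacteria;Proteobacteria;Betaproteobacteria",
    "genome_B": "Bacteria;Actinobacteria;Actinobacteria",
}
GTDB_TAXONOMY: Dict[str, str] = {
    "genome_A": "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
}


def assign_taxonomy(
    genome_id: str, source: str = "gtdb", fallback: bool = False
) -> Optional[Tuple[str, str]]:
    """Return (lineage, source_used), or None if no taxonomy is available.

    With fallback=True the other source is tried when the requested
    one has no entry for the genome.
    """
    tables = {"ncbi": NCBI_TAXONOMY, "gtdb": GTDB_TAXONOMY}
    if source not in tables:
        raise ValueError(f"unknown taxonomy source: {source!r}")

    lineage = tables[source].get(genome_id)
    if lineage is not None:
        return lineage, source

    if fallback:
        other = "ncbi" if source == "gtdb" else "gtdb"
        lineage = tables[other].get(genome_id)
        if lineage is not None:
            return lineage, other

    return None
```

Returning the source alongside the lineage keeps it visible in the tree labels which genomes carry a GTDB classification and which fell back to NCBI.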
-
After asking Jose and showing him these two examples:
Check Cellulophaga_baltica_18, which is missing the ID.
This is a tree where the cyanos with IDs and taxonomy assigned by relabeltree.py have a green square and those that are missing the ID do not have one.
Here is his answer:
So, I first need to figure out the duplicates, and then we will see what can be done with the IDs...
-
After a meeting over Skype, we have decided to solve this problem as follows:
We might need to decide how to resolve conflicts when removing duplicates, but we will first see how many conflicts arise during preprocessing.
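To get a feel for how many conflicts might show up, one possible approach is to count them while deduplicating. The sketch below assumes that a "duplicate" means an identical sequence (ignoring case) and that a "conflict" means the same sequence appearing under more than one label; both definitions, and the function name `deduplicate`, are assumptions for illustration and may not match what the preprocessing actually does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def deduplicate(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Collapse identical sequences and report label conflicts.

    records: (label, sequence) pairs.
    Returns (unique, conflicts) where
      unique    maps each distinct sequence to the first label seen, and
      conflicts maps each sequence seen under several labels to those labels.
    """
    labels_by_seq: Dict[str, List[str]] = defaultdict(list)
    for label, seq in records:
        key = seq.upper()  # treat upper and lower case as the same sequence
        if label not in labels_by_seq[key]:
            labels_by_seq[key].append(label)

    unique: Dict[str, str] = {}
    conflicts: Dict[str, List[str]] = {}
    for seq, labels in labels_by_seq.items():
        unique[seq] = labels[0]  # placeholder policy: keep the first label
        if len(labels) > 1:
            conflicts[seq] = labels
    return unique, conflicts
```

Keeping the first label is only a placeholder policy; the size of `conflicts` on the real data would show whether a more careful rule is worth defining.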
-
I have encountered two problems when using the relabeltree.py --taxonomy option to assign taxonomy to MARdb sequences:
Some genomes from the initial MARdb files that Jose sent are missing the MARdb ID, and therefore those sequences cannot be classified by their ID. To solve this, we need to:
a) First, figure out if all teams are working with the same files (Complete and Partial genomes, QC by Jose)
b) Check with José whether the MAR IDs were removed during the QC he did. I will do that once we figure out the first part.
For those sequences with MAR IDs, the script is assigning the NCBI taxonomy and not the GTDB one. For example:
One of the genera in my tree is classified according to NCBI (Beta), but in GTDB it is Gamma.
This happens with several Betas that have been reclassified as Gammas, for example in the tree:
The label color is the GTDB web classification and the square on the right is the taxonomy applied by the script. Green: Zproteobacteria, Pink: Gamma, Orange: Beta
I have another example with a Streptomyces that has the NCBI taxonomy in the tree label rather than the GTDB one.
Are we using an older release?
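Both problems could be quantified with a small check like the one sketched here. It is not part of relabeltree.py: the helper names are hypothetical, the ID pattern (`MMP` followed by digits) is an assumption about how the identifiers appear in the labels, and the inputs to `class_disagreements` are assumed to be plain dictionaries mapping an ID to a class name.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed identifier pattern: "MMP" followed by digits.
# Adjust this to the real ID format used in the files.
MMP_ID = re.compile(r"MMP\d+")


def split_by_id(labels: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate labels that carry an ID from labels that do not."""
    with_id: Dict[str, str] = {}
    without_id: List[str] = []
    for label in labels:
        match = MMP_ID.search(label)
        if match:
            with_id[label] = match.group(0)
        else:
            without_id.append(label)
    return with_id, without_id


def class_disagreements(
    ncbi: Dict[str, str], gtdb: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """List (id, ncbi_class, gtdb_class) for IDs whose class differs.

    Only IDs present in both tables are compared.
    """
    shared = sorted(ncbi.keys() & gtdb.keys())
    return [(gid, ncbi[gid], gtdb[gid]) for gid in shared if ncbi[gid] != gtdb[gid]]
```

Running `split_by_id` over the tree labels gives the list of sequences to take back to the QC step, and `class_disagreements` gives the genomes where the tree would change if GTDB were used instead of NCBI.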