Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

suffix trees for finding largest common substring #3

Open
AbyssalRemark opened this issue Apr 5, 2024 · 5 comments
Open

suffix trees for finding largest common substring #3

AbyssalRemark opened this issue Apr 5, 2024 · 5 comments
Labels
question Further information is requested

Comments

@AbyssalRemark
Copy link
Collaborator

Suffix trees are a clever way of holding strings that allows us to do really fast string operations such as, longest common sub-string. As well as others. It might make sense to construct our data into suffix trees to allow us to construct the graph more efficiently.

Here is a picture from wikipedia

Suffix_tree_BANANA svg
(lets see if that works)

I read a paper a while back talking about relaxing the definition of suffix trees to allow it to be even faster for the express purpose of mRNA seq (should be true for DNA too)

Will add that paper once I find it.

@AbyssalRemark AbyssalRemark added the question Further information is requested label Apr 5, 2024
@AbyssalRemark
Copy link
Collaborator Author

This might be it.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929406/
This talks about a de novo which is assembling without a scaffold to build from. Which is in part what were doing.

@TW-Starbuyer
Copy link
Collaborator

Are you proposing we use Suffix Trees for the assembly? It seems the main paper in the repo proposes using de Bruijn Graphs so I'm just curious what the pros/cons would be with each method.

https://www.pnas.org/doi/10.1073/pnas.171285098

@AbyssalRemark
Copy link
Collaborator Author

Are you proposing we use Suffix Trees for the assembly? It seems the main paper in the repo proposes using de Bruijn Graphs so I'm just curious what the pros/cons would be with each method.

https://www.pnas.org/doi/10.1073/pnas.171285098

So. Yea thats what this paper is talking about. Thats the relaxed part of things. Its an option sure. But im more interested in them for there general ease of string operations.

If we're checking forwards and backward and inversed (and inversed backwards) then the cost of getting our data into a suffix tree (I hope) makes the string operations less time consuming. Upfront data storage cost to save on the countless comparisons per data points.

But, we should totally look into things more.

@AbyssalRemark
Copy link
Collaborator Author

Update. According to Nancy in an email she sent out today:
Our data should all be pointing the same direction according to our extraction methods.
If we have DNA then we still could have the inverse (I think), but if we do have RNA (which is only one strand) then we need only check once in which case we get no benefit from suffix trees. (I still might make em for funzies because there kinda cool)

@learner-long-life
Copy link
Contributor

Nice find for suffix trees, I'm learning about them now.
Is longest-common-substring (LCS) finding using them useful for both de novo assembly and read alignment? It seems so to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants