-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster edit distance for short strings #144
Comments
Hi @lesshaste, thank you for making the suggestion :)! You are right, edlib could certainly be optimized for shorter strings. There are two main reasons why it is slower for shorter strings, while it is so fast for longer ones:
I think that should be it! So basically, we need one simple (but optimized) algorithm in python, one in C++, figuring out when to use them, and making sure they support all the features Edlib supports. |
Thanks for the reply! The paper Iinked discusses when Hyrrö's method is faster than Myers' in some detail. See e.g. page 18. But obviously it doesn't have to worry about Python/C++ overheads. The basic dynamic programming algorithm is never the fastest as far as I can tell. The fast codegolf.se answers in the link I pasted above, for example, all use Hyrrö's method because their strings are short (< 32 length). Of course there may be people using the edlib library who are not calling it from Python so maybe it would be worth having a fast implementation of Hyrrö's method in C++ no matter what. |
Aha, ok that is great, I did not know about Hyrro's method! Well I am certainly up for implementing it, we could do it both in C++ and in Python. |
@Martinsos In language processing we usually have short strings (average size of a word 6-8 characters) and larger alphabets. I implemented the Hyrrö optimisation for LLCS, LCS, Levenshtein-Distance & Alignment, and Damerau in Perl, C99 and JavaScript. They are the fastest. In the moment I'm struggling with strings longer 64 in the Lev, because Hyrrö doesn't describe how to take care of the carry bit. Myers does. That's why I looked in your code. In Perl I also use prefix and suffix optimisation. This is possible in comparing two strings (1:1). Comparing one string to many (1:N, brute force dictionary lookup) is faster, ~double speed. Lev-Distance is significantly slower than LLCS in Perl. Now implemented it quick and dirty in C99 and Lev-Distance is as fast as LLCS in C: 70 million comparisons per second, double speed 1:N. Working directly on UTF-8 instead of ASCII slows LLCS down to 11 million/sec. Speed in pure Python would be comparable to Perl. JavaScript I don't know, but maybe faster than Perl or Python. I use JS only in web-frontends to display the alignment of user input. Perl code is published on CPAN and available on github too: This is my Gold Standard for benchmarks, a typical OCR error: [Chrerrplzon] [Choerephon] Will benchmark your code in the next days. |
Benchmarked edlib now in simple loop with two short strings. Needs 11 seconds for 1 million iterations, that's a rate of 0.1 M/sec. That's 700 times slower than Hyrrö in pure C optimised for short ASCII-strings (m < 64). Here is the code of my simple benchmark:
Output:
|
I think you might want to specify that your task is edit distance
calculation only. I thought the default config would provide the full
alignment.
…On Sat, Sep 5, 2020, 12:27 Helmut Wollmersdorfer ***@***.***> wrote:
Benchmarked edlib now in simple loop with two short strings. Needs 11
seconds for 1 million iterations, that's a rate of 0.1 M/sec. That's 700
times slower than Hyrrö in pure C optimised for short ASCII-strings (m <
64).
Here is the code of my simple benchmark:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include "edlib.h"
int main (void) {
clock_t tic;
clock_t toc;
double elapsed;
double total = 0;
double rate;
uint64_t count;
uint64_t megacount;
uint32_t iters = 1000000;
uint32_t megaiters = 1;
// m=10, n=11, d=4
char str1[] = "Choerephon";
char str2[] = "Chrerrplzon";
uint32_t len1 = strlen(str1);
uint32_t len2 = strlen(str2);
int distance;
if (1) {
tic = clock();
for (megacount = 0; megacount < megaiters; megacount++) {
for (count = 0; count < iters; count++) {
distance = edlibAlign(str1, len1, str2, len2, edlibDefaultAlignConfig()).editDistance;
}
}
toc = clock();
elapsed = (double)(toc - tic) / (double)CLOCKS_PER_SEC;
total += elapsed;
rate = (double)megaiters / (double)elapsed;
printf("[dist] iters: %u M Elapsed: %f s Rate: %.1f (M/sec) %u\n",
megaiters, elapsed, rate, distance);
}
printf("Total: %f seconds\n", total);
return 0;
}
Output:
$ ./edtest
[dist] iters: 1 M Elapsed: 11.253367 s Rate: 0.1 (M/sec) 4
Total: 11.253367 seconds
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#144 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABDQEKD33HDHWFW6JWWIB3SEIHBLANCNFSM4LAOHD2A>
.
|
@ekg Yes, it's about edit distance only. That's exactly what the original poster of this issue meant and compared. Calculating the alignment is ~50% slower in my Perl implementations, because it needs to record more data in the core loop and has the backtracking loop. Calculating the alignment first and then the distance from alignment is slower than calculating alignment only. There are some principles in writing fast code. One is advantage by restriction. This means having extra functions for distance and alignment. And have extra ones for strings shorter 64 and strings longer. For short strings and small alphabets we can use a static array in C like this for
This avoids expensive alloc/free, but needs initialisation with 0. Only comparing 1:N needs an instance of I just commented here to estimate which speed is possible and if it's worth to support the use case of distance only in a faster way. In my field of interest (OCR correction) distance only is heavily used and alignment not so often (or in special variants with costs or split/merge edit operations. |
Hey @wollmers , thanks a lot for sharing the results of your comparisons! They give good insights into how much faster Edlib could be for certain cases. Results are what I expected -> for shorter sequences, current implementation of Myeres just introduces too much overhead and is effectively slower than even a simple dynamic solution. Well, I already explained the reasons before here #144 (comment) so I won't repeat them. I certainly agree with the approach you mention, and I believe it to be the same concept as I describe above -> we need "specializations", in order to achieve the best speed in different situations. I mentioned that for Edlib we should probably Hyrro in C/C++ to cover the case of shorter strings, and then also look into how to speed up Python version, maybe by even avoiding calling C++ for shorter strings and instead just do it all in Python. Then there comes 1:N, for which there is another issue open. @wollmers Is this something you would be interested in being involved with? Maybe contributing to Edlib by implementing Hyrro in C++? I would love to do it myself, but I don't think I will be able to make the time, at least not for a couple more months, so help would be great! |
@Martinsos Sure, I could send pull request with improvements for edlib, but my time is also limited. I can share my solutions which would be the same for the basic, general algorithm. For the special requirements of DNA analysis I don't have enough knowledge. Would you publish an essential speed gain in a journal? IMHO a factor 2-5 isn't completely unrealistic for the DNA specific use cases. From theory as Hyrrö describes in his paper the solution of Myers is a little bit more complicated, but has nearly the same speed. Hyrrö doesn't publish source or pseudocode for sequences > width (=64 bit on most current architectures). Specialisations are exactly what I'm doing. That's ugly, because of some code code duplication and inlining.
Then some restrictions known in advance can be used: E.g.: Even if the potential alphabet of Unicode is 2**21 (21 bits), the used alphabet is <= length(pattern). I. e. a pattern of length 10 can only contain 10 different characters. This allows memory pre-allocation and sequential search, which is much faster than growing hashes (indexed arrays). I doubt that pure Python can be faster in any case. But it's trivial to implement as fallback for those having problems to use C/C++. Some rates of my latest benchmark after implementing Hyrrö in C with Perl interface:
|
@wollmers thanks, if you could share some of the solutions, that could help for whoever is going to be implementing this, be it me or smb else. |
@Martinsos My code (ASCII, m < 64, distance only) is on github: https://github.com/wollmers/Text-Levenshtein-BVXS/blob/master/levbv.c |
Have you compared the speed of your new code to the code at I assume it should be roughly the same? |
@lesshaste No, I didn't compare it to for 2 reasons:
|
Has the same second problem:
A perfect benchmark would look like the scientists do it since decades: Generate cases randomly along the parameters m, n, alphabet size and distance. |
It would be really good to have a "standard" test for these kind of problems, or maybe just edit distance specifically, and as you said it would be as you described, thoroughly testing for different values of parameters, so that we can say precisely for which cases algorithm works and how fast it is. On one hand it is not hard to generate those yourself, but still, everybody seems to be doing smth different, and even for Edlib I haven't found yet the time to make tests more "full" -> I am measuring speed in automated manner only for some cases. I was thinking of defining this "standard" test myself + instructions and scripts but didn't get to it yet. |
@wollmers the implementation for strings lengths > word size are described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf section 5. It requires you to store the carrybit for |
@maxbachmann Thanks for link to the paper. Will compare, if I still read this (Yes, already had this paper). Solving the problem I had 9 papers open, instrumented my Perl prototype to compare each vector with the vector compiled from the simple method. I solved it predicting carry. Perl implementation: https://github.com/wollmers/Text-Levenshtein-BV/blob/master/lib/Text/Levenshtein/BV.pm LCS had the same problem. My working Perl implementation: https://github.com/wollmers/LCS-BV/blob/master/lib/LCS/BV.pm Will have a closer look on your |
I have an implementation for blockwise LCS here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/LCSseq.impl#L127 It requires you to store the carry for
For the backtracing of the alignment in LCS/Levenshtein we both use There are currently a couple of things I would like to implement but did not finish or failed:
|
That has a reason. A LUT (LookUp Table) with 256 entries is slower than on of 128. Seems to trigger more cache misses. Also it needs to be initialised with zeroes. Benchmark it. My implementation focuses on words and lines of text. In European languages ASCII is the most common case > 90%. Average words have a length of ~6 characters in English, and ~7 e.g. in German or French. Same for prefix and postfix optimisation, which is cheap. If the domain is OCR and error rates of modern OCR is 1-3%, then many lines have 0 or 1 error and don't need the full algorithm.
Did you try to compile with
That's e.g. a use case for comparing 1:N, which can allow 8 words of length < 8 characters at once. See above about the average length of words. Same for a line as sequence of words: average line length 60 / (average word length 7 + 1 space) =< 8 words/line. Trying SIMD for the simple algorithm is still on my list. This can be competitive if the 2-D array is arranged in the right way.
What's OSA? Anyway, congratulations, you did great work. Seems we share the same addictions (sequence matching, tuning). If you want to compare: In pure C on ASCII my implementation works at 25 Mio. word pairs of length 10 per second, which is a rate of ~250 Mio. character pairs per second. https://wollmers-perl.blogspot.com/2022/05/make-levenshtein-distance-fast.html. LLCS is nearly 3-fold faster. I will use your Python solution. Maybe after benchmarks your C++ implementation, because I have a target completely written in C++ and uses vectors of I will finish my JavaSript version, which I only used for talks so far, but is worth to use in e.g. web-interfaces for smart spelling correction. For further communication we should IMHO communicate via email. I have a large collection (> 1k) of all important papers about similarity. |
I will contact you via e-mail |
…so that start position of barcode is more refined, this change alters the output compared to the previous version, ii) search for exact match of barcode sequence against known barcodes (much faster for low error reads as this is just a hash look up, ii) switch barcode alignment again to known list from edlib to custom dynamic programming function (faster by ~30%, with exactly the same output). see Martinsos/edlib#144
…so that start position of barcode is more refined, this change alters the output compared to the previous version, ii) search for exact match of barcode sequence against known barcodes (much faster for low error reads as this is just a hash look up, ii) switch barcode alignment again to known list from edlib to custom dynamic programming function (faster by ~30%, with exactly the same output). see Martinsos/edlib#144
Describe the bug
This is not a bug report, just a suggestion for how to speed up computation for short strings.
To Reproduce
In the benchmarks there is:
Here Levenshtein is quite a lot faster than edlib.
I wondered if the method of https://users.dcc.uchile.cl/~gnavarro/ps/jea06.pdf might speed up edlib to at least match that of Levenshtein for short strings. I also noticed this series of answers https://codegolf.stackexchange.com/a/197716/9207 that seems to contain optimized code in Java, Python and Rust. For example:
Maybe one of those implementations can be translated to C++?
The text was updated successfully, but these errors were encountered: