Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Implemented python binding for Edlib, fully working.
Showing
13 changed files
with
318 additions
and
9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
build/ | ||
dist/ | ||
*.egg-info/ | ||
edlib/ | ||
edlib.c | ||
edlib.*.so |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
include edlib/include/edlib.h |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
default: build | ||
|
||
FILES=edlib *.pyx *.pxd setup.py MANIFEST.in README.rst | ||
|
||
edlib: ../../edlib | ||
	cp -R ../../edlib . | ||
|
||
build: ${FILES} | ||
	OPT="" python setup.py build_ext -i | ||
|
||
sdist: ${FILES} | ||
	cp -R ../../edlib . | ||
	python setup.py sdist | ||
|
||
publish: sdist | ||
	twine upload dist/* | ||
|
||
clean: | ||
	rm -rf edlib dist edlib.egg-info build | ||
	rm -f edlib.c edlib.*.so |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
===== | ||
Edlib | ||
===== | ||
|
||
Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance. | ||
|
||
.. code:: python | ||
|
||
    edlib.align("hello", "world") | ||
|
||
Edlib is actually a C/C++ library, and this package is its wrapper for Python. | ||
Python Edlib has mostly the same API as C/C++ Edlib, so make sure to check out `C/C++ Edlib docs <http://github.com/Martinsos/edlib>`_ for more code examples, details on the API, and how Edlib works. | ||
|
||
-------- | ||
Features | ||
-------- | ||
|
||
* Calculates **edit distance**. | ||
* It can find the **optimal alignment path** (instructions for transforming the first sequence into the second). | ||
* It can find just the **start and/or end locations of the alignment path**, which is useful when speed matters more than having the exact alignment path. | ||
* Supports **multiple alignment methods**: global (**NW**), prefix (**SHW**) and infix (**HW**), each of them useful for different scenarios. | ||
* It can easily handle small or **very large** sequences, even when finding the alignment path. | ||
* **Super fast** thanks to Myers's bit-vector algorithm. | ||
|
||
------------ | ||
Installation | ||
------------ | ||
:: | ||
|
||
    pip install edlib | ||
|
||
--- | ||
API | ||
--- | ||
|
||
Edlib has only one function: | ||
|
||
.. code:: python | ||
|
||
    align(query, target, [mode], [task], [k]) | ||
|
||
To learn more about it, type :code:`help(edlib.align)` in your Python interpreter. | ||
|
||
----- | ||
Usage | ||
----- | ||
.. code:: python | ||
|
||
    import edlib | ||
|
||
    result = edlib.align("elephant", "telephone") | ||
    print(result["editDistance"]) # 3 | ||
    print(result["alphabetLength"]) # 8 | ||
    print(result["locations"]) # [(None, 8)] | ||
    print(result["cigar"]) # None | ||
|
||
    result = edlib.align("elephant", "telephone", mode="HW", task="path") | ||
    print(result["editDistance"]) # 2 | ||
    print(result["alphabetLength"]) # 8 | ||
    print(result["locations"]) # [(1, 7), (1, 8)] | ||
    print(result["cigar"]) # "5=1X1=1I" | ||
|
||
--------- | ||
Benchmark | ||
--------- | ||
|
||
I ran a simple benchmark on 7 Feb 2017 (using timeit, on Python 3) to get a feel for how Edlib compares to other Python libraries: `editdistance <https://pypi.python.org/pypi/editdistance>`_ and `python-Levenshtein <https://pypi.python.org/pypi/python-Levenshtein>`_. | ||
|
||
As input data I used pairs of DNA sequences of different lengths, where each pair has about 90% similarity. | ||
|
||
:: | ||
|
||
    #1: query length: 30, target length: 30 | ||
    edlib.align(query, target): 1.88µs | ||
    editdistance.eval(query, target): 1.26µs | ||
    Levenshtein.distance(query, target): 0.43µs | ||
|
||
    #2: query length: 100, target length: 100 | ||
    edlib.align(query, target): 3.64µs | ||
    editdistance.eval(query, target): 3.86µs | ||
    Levenshtein.distance(query, target): 14.1µs | ||
|
||
    #3: query length: 1000, target length: 1000 | ||
    edlib.align(query, target): 0.047ms | ||
    editdistance.eval(query, target): 5.4ms | ||
    Levenshtein.distance(query, target): 1.9ms | ||
|
||
    #4: query length: 10000, target length: 10000 | ||
    edlib.align(query, target): 0.0021s | ||
    editdistance.eval(query, target): 0.56s | ||
    Levenshtein.distance(query, target): 0.2s | ||
|
||
    #5: query length: 50000, target length: 50000 | ||
    edlib.align(query, target): 0.031s | ||
    editdistance.eval(query, target): 13.8s | ||
    Levenshtein.distance(query, target): 5.0s | ||
|
||
---- | ||
More | ||
---- | ||
|
||
Check out `C/C++ Edlib docs <http://github.com/Martinsos/edlib>`_ for more information about Edlib! | ||
|
||
----------- | ||
Development | ||
----------- | ||
|
||
Run :code:`make build` to generate an extension module as a .so file. You can then test it by importing it from the Python interpreter with :code:`import edlib` and running :code:`edlib.align(...)` (you have to be positioned in the directory where the .so was built). You can also run :code:`sudo pip install -e .` from that directory, which makes an editable install, so that edlib is available globally. Use these methods for testing. | ||
|
||
Run :code:`make sdist` to create a source distribution without publishing it; the result is a tarball in dist/. Use this to check that the tarball is well structured and contains all needed files. | ||
|
||
Run :code:`make publish` to create a source distribution and publish it to PyPI. Use this to publish a new version of the package. | ||
|
||
:code:`make clean` removes all generated files. |
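The three alignment modes named in the README (NW, SHW, HW) and the `k` limit can be illustrated with a plain dynamic-programming sketch. This is not how Edlib computes the result (the README credits Myers's bit-vector algorithm); it is a slow reference that only assumes the semantics described above. The function name `edit_distance` is hypothetical and not part of the Edlib API.

```python
def edit_distance(query, target, mode="NW", k=-1):
    """Reference edit distance for the three modes, O(len(query) * len(target)).

    Hypothetical helper for illustration only. Returns -1 when the distance
    exceeds k (a negative k means no limit), following the README's
    description of editDistance.
    """
    if mode not in ("NW", "SHW", "HW"):
        raise ValueError("mode must be 'NW', 'SHW' or 'HW'")

    # prev[j] is the cost of aligning the query prefix processed so far
    # against target[:j]. In HW mode the alignment may start anywhere in
    # target, so the first row is all zeros instead of 0, 1, 2, ...
    prev = [0 if mode == "HW" else j for j in range(len(target) + 1)]
    for i, qc in enumerate(query, start=1):
        curr = [i]  # the whole query prefix is unmatched against an empty target
        for j, tc in enumerate(target, start=1):
            curr.append(min(prev[j] + 1,                 # query char has no counterpart
                            curr[j - 1] + 1,             # target char has no counterpart
                            prev[j - 1] + (qc != tc)))   # match or mismatch
        prev = curr

    # NW must consume the whole target; SHW and HW may end at any target position.
    distance = prev[-1] if mode == "NW" else min(prev)
    return -1 if 0 <= k < distance else distance


print(edit_distance("elephant", "telephone"))             # global (NW)
print(edit_distance("elephant", "telephone", mode="HW"))  # infix (HW)
```

On the README's own example this sketch gives 3 for NW and 2 for HW, matching the values shown in the Usage section.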
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
cdef extern from "edlib.h": | ||
|
||
    ctypedef enum EdlibAlignMode: EDLIB_MODE_NW, EDLIB_MODE_SHW, EDLIB_MODE_HW | ||
    ctypedef enum EdlibAlignTask: EDLIB_TASK_DISTANCE, EDLIB_TASK_LOC, EDLIB_TASK_PATH | ||
    ctypedef enum EdlibCigarFormat: EDLIB_CIGAR_STANDARD, EDLIB_CIGAR_EXTENDED | ||
|
||
    ctypedef struct EdlibAlignConfig: | ||
        int k | ||
        EdlibAlignMode mode | ||
        EdlibAlignTask task | ||
|
||
    EdlibAlignConfig edlibNewAlignConfig(int k, EdlibAlignMode mode, EdlibAlignTask task) | ||
    EdlibAlignConfig edlibDefaultAlignConfig() | ||
|
||
    ctypedef struct EdlibAlignResult: | ||
        int editDistance | ||
        int* endLocations | ||
        int* startLocations | ||
        int numLocations | ||
        unsigned char* alignment | ||
        int alignmentLength | ||
        int alphabetLength | ||
|
||
    void edlibFreeAlignResult(EdlibAlignResult result) | ||
|
||
    EdlibAlignResult edlibAlign(const char* query, int queryLength, const char* target, int targetLength, const EdlibAlignConfig config) | ||
|
||
    char* edlibAlignmentToCigar(const unsigned char* alignment, int alignmentLength, EdlibCigarFormat cigarFormat) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
cimport cedlib | ||
from libc.stdlib cimport free | ||
|
||
def align(query, target, mode="NW", task="distance", k=-1): | ||
    """ Align query with target using edit distance. | ||
    @param {string} query | ||
    @param {string} target | ||
    @param {string} mode Optional. Alignment method to be used. Possible values are: | ||
        - 'NW' for global (default) | ||
        - 'HW' for infix | ||
        - 'SHW' for prefix. | ||
    @param {string} task Optional. Tells edlib what to calculate. The less there is to calculate, | ||
        the faster it is. Possible values are (from fastest to slowest): | ||
        - 'distance' - find edit distance and end locations in target. Default. | ||
        - 'locations' - find edit distance, end locations and start locations. | ||
        - 'path' - find edit distance, start and end locations and alignment path. | ||
    @param {int} k Optional. Max edit distance to search for - the lower this value, | ||
        the faster the calculation. Set to -1 (default) to have no limit on edit distance. | ||
    @return Dictionary with following fields: | ||
        {int} editDistance -1 if it is larger than k. | ||
        {int} alphabetLength | ||
        {[(int, int)]} locations List of locations, in format [(start, end)]. | ||
        {string} cigar Cigar is a standard format for alignment path. | ||
            Here we are using extended cigar format, which uses following symbols: | ||
            Match: '=', Insertion to target: 'I', Deletion from target: 'D', Mismatch: 'X'. | ||
            e.g. cigar of "5=1X1=1I" means "5 matches, 1 mismatch, 1 match, 1 insertion (to target)". | ||
    """ | ||
    # Transform Python strings into C strings. | ||
    cdef bytes query_bytes = query.encode() | ||
    cdef char* cquery = query_bytes | ||
    cdef bytes target_bytes = target.encode() | ||
    cdef char* ctarget = target_bytes | ||
|
||
    # Build an edlib config object based on given parameters. | ||
    cconfig = cedlib.edlibDefaultAlignConfig() | ||
    if k is not None: cconfig.k = k | ||
    if mode == 'NW': cconfig.mode = cedlib.EDLIB_MODE_NW | ||
    if mode == 'HW': cconfig.mode = cedlib.EDLIB_MODE_HW | ||
    if mode == 'SHW': cconfig.mode = cedlib.EDLIB_MODE_SHW | ||
    if task == 'distance': cconfig.task = cedlib.EDLIB_TASK_DISTANCE | ||
    if task == 'locations': cconfig.task = cedlib.EDLIB_TASK_LOC | ||
    if task == 'path': cconfig.task = cedlib.EDLIB_TASK_PATH | ||
|
||
    # Run alignment. Lengths come from the encoded bytes, because encoding | ||
    # can change the length of strings that contain non-ASCII characters. | ||
    cresult = cedlib.edlibAlign(cquery, len(query_bytes), ctarget, len(target_bytes), cconfig) | ||
|
||
    # Build python dictionary with results from result object that edlib returned. | ||
    locations = [] | ||
    if cresult.numLocations >= 0: | ||
        for i in range(cresult.numLocations): | ||
            locations.append((cresult.startLocations[i] if cresult.startLocations else None, | ||
                              cresult.endLocations[i] if cresult.endLocations else None)) | ||
    cigar = None | ||
    if cresult.alignment: | ||
        ccigar = cedlib.edlibAlignmentToCigar(cresult.alignment, cresult.alignmentLength, | ||
                                              cedlib.EDLIB_CIGAR_EXTENDED) | ||
        cigar = <bytes> ccigar | ||
        cigar = cigar.decode('UTF-8') | ||
        # The cigar string is allocated by edlib and must be freed by the caller. | ||
        free(ccigar) | ||
    result = { | ||
        'editDistance': cresult.editDistance, | ||
        'alphabetLength': cresult.alphabetLength, | ||
        'locations': locations, | ||
        'cigar': cigar | ||
    } | ||
    cedlib.edlibFreeAlignResult(cresult) | ||
|
||
    return result |
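The docstring above describes the extended cigar format with the symbols `=`, `X`, `I` and `D`. As a rough sketch of how a caller might consume that string, the following pure-Python helpers split it into (count, symbol) pairs and total the operations. The names `parse_cigar` and `summarize_cigar` are hypothetical and are not provided by the binding.

```python
import re


def parse_cigar(cigar):
    """Split an extended cigar string such as "5=1X1=1I" into (count, op) pairs.

    Hypothetical helper; only the four symbols listed in the docstring
    ('=', 'X', 'I', 'D') are accepted.
    """
    pairs = re.findall(r"(\d+)([=XID])", cigar)
    # Reject input that contains anything other than count/symbol pairs.
    if "".join(count + op for count, op in pairs) != cigar:
        raise ValueError("unexpected cigar content: %r" % cigar)
    return [(int(count), op) for count, op in pairs]


def summarize_cigar(cigar):
    """Total the number of positions covered by each operation."""
    totals = {"=": 0, "X": 0, "I": 0, "D": 0}
    for count, op in parse_cigar(cigar):
        totals[op] += count
    return totals


print(parse_cigar("5=1X1=1I"))
print(summarize_cigar("5=1X1=1I"))
```

For the docstring's example, mismatches, insertions and deletions add up to 2, which agrees with the edit distance the README reports for the same alignment.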
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
#!/usr/bin/env python | ||
|
||
import timeit | ||
|
||
import edlib | ||
import editdistance | ||
import Levenshtein | ||
|
||
with open('../../test_data/Enterobacteria_Phage_1/mutated_90_perc_oneline.fasta', 'r') as f: | ||
    queryFull = f.readline() | ||
print('Read query: ', len(queryFull) ,' characters.') | ||
|
||
with open('../../test_data/Enterobacteria_Phage_1/Enterobacteria_phage_1_oneline.fa', 'r') as f: | ||
    targetFull = f.readline() | ||
print('Read target: ', len(targetFull) ,' characters.') | ||
|
||
for seqLen in [30, 100, 1000, 10000, 50000]: | ||
    query = queryFull[:seqLen] | ||
    target = targetFull[:seqLen] | ||
    numRuns = max(1000000000 // (seqLen**2), 1) | ||
|
||
    print('Sequence length: ', seqLen) | ||
|
||
    edlibTime = timeit.timeit(stmt="edlib.align(query, target)", | ||
                              number=numRuns, globals=globals()) / numRuns | ||
    print('Edlib: ', edlibTime) | ||
    print(edlib.align(query, target)) | ||
|
||
    editdistanceTime = timeit.timeit(stmt="editdistance.eval(query, target)", | ||
                                     number=numRuns, globals=globals()) / numRuns | ||
    print('editdistance: ', editdistanceTime) | ||
|
||
    levenshteinTime = timeit.timeit(stmt="Levenshtein.distance(query, target)", | ||
                                    number=numRuns, globals=globals()) / numRuns | ||
    print('levenshtein: ', levenshteinTime) | ||
|
||
    print('edlib is %f times faster than editdistance.' % (editdistanceTime / edlibTime)) | ||
    print('edlib is %f times faster than Levenshtein.' % (levenshteinTime / edlibTime)) |
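The benchmark script scales the number of timed runs inversely with the square of the sequence length, so that short sequences are repeated many times and long ones only once. A small sketch of that rule, together with a hypothetical formatter for per-call times in the units used by the README's benchmark table, might look like this. The names `num_runs` and `format_seconds` are assumptions for illustration and do not appear in the script.

```python
def num_runs(seq_len, budget=1000000000):
    """Number of timeit repetitions for a given sequence length.

    Same rule as the script: roughly budget / seq_len**2, but at least one run.
    """
    return max(budget // (seq_len ** 2), 1)


def format_seconds(seconds):
    """Render a per-call time with a unit (µs, ms or s). Hypothetical helper."""
    if seconds < 1e-3:
        return "%.3gµs" % (seconds * 1e6)
    if seconds < 1:
        return "%.3gms" % (seconds * 1e3)
    return "%.3gs" % seconds


for seq_len in [30, 100, 1000, 10000, 50000]:
    print(seq_len, num_runs(seq_len))
```

With the script's budget of 1000000000, the longest sequence (50000) falls back to a single run, so its timing is a single sample rather than an average.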
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
from setuptools import setup, Extension | ||
import Cython.Build | ||
from codecs import open | ||
from os import path | ||
|
||
here = path.abspath(path.dirname(__file__)) | ||
with open(path.join(here, 'README.rst'), encoding='utf-8') as f: | ||
    long_description = f.read() | ||
|
||
setup( | ||
    # Information | ||
    name = "edlib", | ||
    description = "Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.", | ||
    long_description = long_description, | ||
    version = "1.1.2", | ||
    url = "https://github.com/Martinsos/edlib", | ||
    author = "Martin Sosic", | ||
    author_email = "[email protected]", | ||
    license = "MIT", | ||
    keywords = "edit distance levenshtein align sequence bioinformatics", | ||
    # Build instructions | ||
    ext_modules = [Extension("edlib", | ||
                             ["edlib.pyx", "edlib/src/edlib.cpp"], | ||
                             include_dirs=["edlib/include"], | ||
                             depends=["edlib/include/edlib.h"], | ||
                             extra_compile_args=["-O3"])], | ||
    setup_requires = ['cython (>=0.25)'], | ||
    cmdclass = {'build_ext': Cython.Build.build_ext} | ||
) |
3 changes: 3 additions & 0 deletions
3
test_data/Enterobacteria_Phage_1/mutated_90_perc_oneline.fasta
Git LFS file not shown