Implemented python binding for Edlib, fully working.
Martinsos committed Feb 7, 2017
1 parent 75c50ce commit 8a9007d
Showing 13 changed files with 318 additions and 9 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -16,6 +16,15 @@ Join the chat at [![Join the chat at https://gitter.im/Martinsos/edlib](https://
[Doxygen documentation](http://martinsos.github.io/edlib)
---
### Wrappers for other languages
Edlib for Python: [![PyPI version](https://badge.fury.io/py/edlib.svg)](https://badge.fury.io/py/edlib)
Edlib for Node.js: [![npm version](https://badge.fury.io/js/node-edlib.svg)](https://badge.fury.io/js/node-edlib)
---
@@ -238,13 +247,6 @@ In [test_data/](test_data) directory there are different genome sequences, rangi
---
### Nodejs
For those who want to use edlib in nodejs there is a nodejs addon, [node-edlib](https://www.npmjs.com/package/node-edlib)!
---
### Development
Feel free to send pull requests and raise issues.
6 changes: 6 additions & 0 deletions bindings/python/.gitignore
@@ -0,0 +1,6 @@
build/
dist/
*.egg-info/
edlib/
edlib.c
edlib.*.so
1 change: 1 addition & 0 deletions bindings/python/MANIFEST.in
@@ -0,0 +1 @@
include edlib/include/edlib.h
20 changes: 20 additions & 0 deletions bindings/python/Makefile
@@ -0,0 +1,20 @@
default: build

FILES=edlib *.pyx *.pxd setup.py MANIFEST.in README.rst

edlib: ../../edlib
	cp -R ../../edlib .

build: ${FILES}
	OPT="" python setup.py build_ext -i

sdist: ${FILES}
	cp -R ../../edlib .
	python setup.py sdist

publish: sdist
	twine upload dist/*

clean:
	rm -rf edlib dist edlib.egg-info build
	rm -f edlib.c edlib.*.so
114 changes: 114 additions & 0 deletions bindings/python/README.rst
@@ -0,0 +1,114 @@
=====
Edlib
=====

Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.

.. code:: python

    edlib.align("hello", "world")

Edlib is actually a C/C++ library, and this package is its wrapper for Python.
Python Edlib has mostly the same API as C/C++ Edlib, so make sure to check out `C/C++ Edlib docs <http://github.com/Martinsos/edlib>`_ for more code examples, details on API and how Edlib works.

--------
Features
--------

* Calculates **edit distance**.
* It can find **optimal alignment path** (instructions on how to transform the first sequence into the second sequence).
* It can find just the **start and/or end locations of alignment path**, which is useful when speed is more important than having the exact alignment path.
* Supports **multiple alignment methods**: global(**NW**), prefix(**SHW**) and infix(**HW**), each of them useful for different scenarios.
* It can easily handle small or **very large** sequences, even when finding alignment path.
* **Super fast** thanks to Myers's bit-vector algorithm.
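
To make the three modes concrete, here is a minimal pure-Python reference built on the textbook dynamic-programming recurrence. It only illustrates what NW, SHW and HW compute; it is not how Edlib computes them, and the name ``reference_distance`` is made up for this sketch.

```python
def reference_distance(query, target, mode="NW"):
    """Edit distance via the classic O(len(query) * len(target)) DP table.

    Hypothetical helper for illustration only; it is not part of edlib.
    """
    if mode not in ("NW", "SHW", "HW"):
        raise ValueError("mode must be 'NW', 'SHW' or 'HW'")
    n = len(target)
    # Row 0: cost of aligning an empty query with target[:j].
    # HW lets the alignment start anywhere in target, so skipping a prefix is free.
    prev = [0] * (n + 1) if mode == "HW" else list(range(n + 1))
    for i, q in enumerate(query, start=1):
        curr = [i] + [0] * n
        for j, t in enumerate(target, start=1):
            curr[j] = min(prev[j] + 1,             # q has no counterpart in target
                          curr[j - 1] + 1,         # t has no counterpart in query
                          prev[j - 1] + (q != t))  # match or mismatch
        prev = curr
    # NW must consume the whole target; SHW and HW may stop at any target position.
    return prev[n] if mode == "NW" else min(prev)


print(reference_distance("elephant", "telephone", mode="NW"))  # 3
print(reference_distance("elephant", "telephone", mode="HW"))  # 2
```

The two results agree with the edit distances shown in the Usage section for the same pair of strings. The quadratic table is fine for short inputs but is exactly the cost that a bit-vector implementation avoids.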

------------
Installation
------------
::

    pip install edlib

---
API
---

Edlib has only one function:

.. code:: python

    align(query, target, [mode], [task], [k])

To learn more about it, type :code:`help(edlib.align)` in your Python interpreter.

-----
Usage
-----
.. code:: python

    import edlib

    result = edlib.align("elephant", "telephone")
    print(result["editDistance"])  # 3
    print(result["alphabetLength"])  # 8
    print(result["locations"])  # [(None, 8)]
    print(result["cigar"])  # None

    result = edlib.align("elephant", "telephone", mode="HW", task="path")
    print(result["editDistance"])  # 2
    print(result["alphabetLength"])  # 8
    print(result["locations"])  # [(1, 7), (1, 8)]
    print(result["cigar"])  # "5=1X1=1I"

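
The ``cigar`` string is compact but not directly convenient to work with. A small helper can split it into counted operations; the sketch below assumes only the four symbols documented for the extended format (``=``, ``X``, ``I``, ``D``), and ``parse_cigar`` is a hypothetical name, not something the package provides.

```python
import re


def parse_cigar(cigar):
    """Split an extended CIGAR string into (count, op) pairs.

    Hypothetical helper, not part of edlib. Accepts only the symbols
    '=', 'X', 'I' and 'D'.
    """
    pairs = re.findall(r"(\d+)([=XID])", cigar)
    # Reject strings containing anything the pattern did not consume.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError("not an extended CIGAR string: %r" % cigar)
    return [(int(n), op) for n, op in pairs]


ops = parse_cigar("5=1X1=1I")
print(ops)  # [(5, '='), (1, 'X'), (1, '='), (1, 'I')]
# Every operation other than a match costs one edit.
print(sum(n for n, op in ops if op != "="))  # 2
```

Summing the non-match counts recovers the edit distance of 2 reported for the same alignment.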
---------
Benchmark
---------

I ran a simple benchmark on 7 Feb 2017 (using timeit, on Python 3) to get a feel for how Edlib compares to other Python libraries: `editdistance <https://pypi.python.org/pypi/editdistance>`_ and `python-Levenshtein <https://pypi.python.org/pypi/python-Levenshtein>`_.

As input data I used pairs of DNA sequences of different lengths, where each pair has about 90% similarity.

::

    #1: query length: 30, target length: 30
    edlib.align(query, target): 1.88µs
    editdistance.eval(query, target): 1.26µs
    Levenshtein.distance(query, target): 0.43µs

    #2: query length: 100, target length: 100
    edlib.align(query, target): 3.64µs
    editdistance.eval(query, target): 3.86µs
    Levenshtein.distance(query, target): 14.1µs

    #3: query length: 1000, target length: 1000
    edlib.align(query, target): 0.047ms
    editdistance.eval(query, target): 5.4ms
    Levenshtein.distance(query, target): 1.9ms

    #4: query length: 10000, target length: 10000
    edlib.align(query, target): 0.0021s
    editdistance.eval(query, target): 0.56s
    Levenshtein.distance(query, target): 0.2s

    #5: query length: 50000, target length: 50000
    edlib.align(query, target): 0.031s
    editdistance.eval(query, target): 13.8s
    Levenshtein.distance(query, target): 5.0s
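
The gap that widens with sequence length is plausibly explained by the bit-vector approach named under Features: a whole column of the dynamic-programming table is updated with a handful of word-level operations instead of cell by cell. The sketch below is a single-word variant (Hyyrö's formulation of Myers's algorithm) that leans on Python's arbitrary-size integers and handles only the global (NW) case. It is an illustration of the idea, not Edlib's implementation, and ``bitvector_distance`` is a name invented here.

```python
def bitvector_distance(query, target):
    """Global edit distance using bit-parallel column updates.

    Illustrative sketch only (hypothetical helper, not part of edlib).
    Bit i of each vector refers to row i + 1 of the DP column.
    """
    m = len(query)
    if m == 0:
        return len(target)
    all_ones = (1 << m) - 1
    last_row = 1 << (m - 1)
    # peq[c] has bit i set where query[i] == c.
    peq = {}
    for i, c in enumerate(query):
        peq[c] = peq.get(c, 0) | (1 << i)
    vp, vn = all_ones, 0  # rows where the vertical delta is +1 / -1
    score = m             # bottom cell of column 0
    for c in target:
        eq = peq.get(c, 0)
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & all_ones
        hp = (vn | ~(d0 | vp)) & all_ones  # rows where the horizontal delta is +1
        hn = d0 & vp                       # rows where the horizontal delta is -1
        if hp & last_row:
            score += 1
        elif hn & last_row:
            score -= 1
        # Shift in a 1: in NW the top boundary grows by one per target character.
        hp = ((hp << 1) | 1) & all_ones
        hn = (hn << 1) & all_ones
        vp = (hn | ~(d0 | hp)) & all_ones
        vn = hp & d0
    return score


print(bitvector_distance("elephant", "telephone"))
```

Each target character costs a fixed number of integer operations regardless of how the cells in that column differ, which is where the speed-up over a per-cell loop comes from.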

----
More
----

Check out `C/C++ Edlib docs <http://github.com/Martinsos/edlib>`_ for more information about Edlib!

-----------
Development
-----------

Run :code:`make build` to generate the extension module as a .so file. You can then test it by running :code:`import edlib` in the Python interpreter and calling :code:`edlib.align(...)` (you have to be in the directory where the .so was built). You can also run :code:`sudo pip install -e .` from that directory, which makes an editable install so that edlib is available globally. Use these methods for testing.

Run :code:`make sdist` to create a source distribution without publishing it; the result is a tarball in dist/. Use this to check that the tarball is well structured and contains all needed files.

Run :code:`make publish` to create a source distribution and publish it to PyPI. Use this to publish a new version of the package.

:code:`make clean` removes all generated files.
28 changes: 28 additions & 0 deletions bindings/python/cedlib.pxd
@@ -0,0 +1,28 @@
cdef extern from "edlib.h":

    ctypedef enum EdlibAlignMode: EDLIB_MODE_NW, EDLIB_MODE_SHW, EDLIB_MODE_HW
    ctypedef enum EdlibAlignTask: EDLIB_TASK_DISTANCE, EDLIB_TASK_LOC, EDLIB_TASK_PATH
    ctypedef enum EdlibCigarFormat: EDLIB_CIGAR_STANDARD, EDLIB_CIGAR_EXTENDED

    ctypedef struct EdlibAlignConfig:
        int k
        EdlibAlignMode mode
        EdlibAlignTask task

    EdlibAlignConfig edlibNewAlignConfig(int k, EdlibAlignMode mode, EdlibAlignTask task)
    EdlibAlignConfig edlibDefaultAlignConfig()

    ctypedef struct EdlibAlignResult:
        int editDistance
        int* endLocations
        int* startLocations
        int numLocations
        unsigned char* alignment
        int alignmentLength
        int alphabetLength

    void edlibFreeAlignResult(EdlibAlignResult result)

    EdlibAlignResult edlibAlign(const char* query, int queryLength, const char* target, int targetLength, const EdlibAlignConfig config)

    char* edlibAlignmentToCigar(const unsigned char* alignment, int alignmentLength, EdlibCigarFormat cigarFormat)
66 changes: 66 additions & 0 deletions bindings/python/edlib.pyx
@@ -0,0 +1,66 @@
cimport cedlib
from libc.stdlib cimport free

def align(query, target, mode="NW", task="distance", k=-1):
    """ Align query with target using edit distance.
    @param {string} query
    @param {string} target
    @param {string} mode  Optional. Alignment method to be used. Possible values are:
            - 'NW' for global (default)
            - 'HW' for infix
            - 'SHW' for prefix.
    @param {string} task  Optional. Tells edlib what to calculate. The less there is to calculate,
            the faster it is. Possible values are (from fastest to slowest):
            - 'distance' - find edit distance and end locations in target. Default.
            - 'locations' - find edit distance, end locations and start locations.
            - 'path' - find edit distance, start and end locations and alignment path.
    @param {int} k  Optional. Max edit distance to search for - the lower this value,
            the faster the calculation. Set to -1 (default) to have no limit on edit distance.
    @return Dictionary with following fields:
            {int} editDistance  -1 if it is larger than k.
            {int} alphabetLength
            {[(int, int)]} locations  List of locations, in format [(start, end)].
            {string} cigar  Cigar is a standard format for alignment path.
                Here we are using extended cigar format, which uses following symbols:
                Match: '=', Insertion to target: 'I', Deletion from target: 'D', Mismatch: 'X'.
                e.g. cigar of "5=1X1=1I" means "5 matches, 1 mismatch, 1 match, 1 insertion (to target)".
    """
    # Transform Python strings into C strings.
    cdef bytes query_bytes = query.encode()
    cdef char* cquery = query_bytes
    cdef bytes target_bytes = target.encode()
    cdef char* ctarget = target_bytes
    cdef char* ccigar

    # Build an edlib config object based on given parameters.
    cconfig = cedlib.edlibDefaultAlignConfig()
    if k is not None: cconfig.k = k
    if mode == 'NW': cconfig.mode = cedlib.EDLIB_MODE_NW
    if mode == 'HW': cconfig.mode = cedlib.EDLIB_MODE_HW
    if mode == 'SHW': cconfig.mode = cedlib.EDLIB_MODE_SHW
    if task == 'distance': cconfig.task = cedlib.EDLIB_TASK_DISTANCE
    if task == 'locations': cconfig.task = cedlib.EDLIB_TASK_LOC
    if task == 'path': cconfig.task = cedlib.EDLIB_TASK_PATH

    # Run alignment. Lengths come from the encoded bytes, because that is what edlib reads;
    # for non-ASCII input len(query) can differ from the number of encoded bytes.
    cresult = cedlib.edlibAlign(cquery, len(query_bytes), ctarget, len(target_bytes), cconfig)

    # Build Python dictionary with results from result object that edlib returned.
    locations = []
    if cresult.numLocations >= 0:
        for i in range(cresult.numLocations):
            locations.append((cresult.startLocations[i] if cresult.startLocations else None,
                              cresult.endLocations[i] if cresult.endLocations else None))
    cigar = None
    if cresult.alignment:
        ccigar = cedlib.edlibAlignmentToCigar(cresult.alignment, cresult.alignmentLength,
                                              cedlib.EDLIB_CIGAR_EXTENDED)
        cigar = <bytes> ccigar
        cigar = cigar.decode('UTF-8')
        # The cigar string is allocated by edlib and has to be released by the caller.
        free(ccigar)
    result = {
        'editDistance': cresult.editDistance,
        'alphabetLength': cresult.alphabetLength,
        'locations': locations,
        'cigar': cigar
    }
    cedlib.edlibFreeAlignResult(cresult)

    return result
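
As written, `align` keeps the default whenever `mode` or `task` is not one of the recognized strings, so a typo such as `mode="hw"` runs a global alignment without complaint. A stricter front end could look like the plain-Python sketch below; `checked_options`, the two tuples and the error messages are hypothetical and not part of this commit.

```python
VALID_MODES = ("NW", "HW", "SHW")
VALID_TASKS = ("distance", "locations", "path")


def checked_options(mode="NW", task="distance", k=-1):
    """Validate align()-style options and return them unchanged.

    Hypothetical helper: it raises instead of silently falling back to defaults.
    """
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s, got %r" % (VALID_MODES, mode))
    if task not in VALID_TASKS:
        raise ValueError("task must be one of %s, got %r" % (VALID_TASKS, task))
    if not isinstance(k, int):
        raise TypeError("k must be an int, got %r" % (k,))
    return mode, task, k


print(checked_options("HW", "path", 3))  # ('HW', 'path', 3)
```

Calling such a check at the top of `align` would turn silent fallbacks into immediate errors, at the cost of rejecting inputs that the current code tolerates.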
38 changes: 38 additions & 0 deletions bindings/python/performance.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python

import timeit

import edlib
import editdistance
import Levenshtein

with open('../../test_data/Enterobacteria_Phage_1/mutated_90_perc_oneline.fasta', 'r') as f:
    queryFull = f.readline()
print('Read query: ', len(queryFull), ' characters.')

with open('../../test_data/Enterobacteria_Phage_1/Enterobacteria_phage_1_oneline.fa', 'r') as f:
    targetFull = f.readline()
print('Read target: ', len(targetFull), ' characters.')

for seqLen in [30, 100, 1000, 10000, 50000]:
    query = queryFull[:seqLen]
    target = targetFull[:seqLen]
    numRuns = max(1000000000 // (seqLen**2), 1)

    print('Sequence length: ', seqLen)

    edlibTime = timeit.timeit(stmt="edlib.align(query, target)",
                              number=numRuns, globals=globals()) / numRuns
    print('Edlib: ', edlibTime)
    print(edlib.align(query, target))

    editdistanceTime = timeit.timeit(stmt="editdistance.eval(query, target)",
                                     number=numRuns, globals=globals()) / numRuns
    print('editdistance: ', editdistanceTime)

    levenshteinTime = timeit.timeit(stmt="Levenshtein.distance(query, target)",
                                    number=numRuns, globals=globals()) / numRuns
    print('levenshtein: ', levenshteinTime)

    print('edlib is %f times faster than editdistance.' % (editdistanceTime / edlibTime))
    print('edlib is %f times faster than Levenshtein.' % (levenshteinTime / edlibTime))
29 changes: 29 additions & 0 deletions bindings/python/setup.py
@@ -0,0 +1,29 @@
from setuptools import setup, Extension
import Cython.Build
from codecs import open
from os import path

here = path.abspath(path.dirname(__file__))
with open(path.join(here, 'README.rst'), encoding='utf-8') as f:
    long_description = f.read()

setup(
    # Information
    name = "edlib",
    description = "Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.",
    long_description = long_description,
    version = "1.1.2",
    url = "https://github.com/Martinsos/edlib",
    author = "Martin Sosic",
    author_email = "[email protected]",
    license = "MIT",
    keywords = "edit distance levenshtein align sequence bioinformatics",
    # Build instructions
    ext_modules = [Extension("edlib",
                             ["edlib.pyx", "edlib/src/edlib.cpp"],
                             include_dirs=["edlib/include"],
                             depends=["edlib/include/edlib.h"],
                             extra_compile_args=["-O3"])],
    setup_requires = ['cython (>=0.25)'],
    cmdclass = {'build_ext': Cython.Build.build_ext}
)
2 changes: 1 addition & 1 deletion edlib/include/edlib.h
@@ -112,7 +112,7 @@ typedef enum {
* @return Default configuration object, with following defaults:
* k = -1, mode = EDLIB_MODE_NW, task = EDLIB_TASK_DISTANCE.
*/
EdlibAlignConfig edlibDefaultAlignConfig();
EdlibAlignConfig edlibDefaultAlignConfig(void);


/**
2 changes: 1 addition & 1 deletion edlib/src/edlib.cpp
@@ -1392,7 +1392,7 @@ EdlibAlignConfig edlibNewAlignConfig(int k, EdlibAlignMode mode, EdlibAlignTask
return config;
}

EdlibAlignConfig edlibDefaultAlignConfig() {
EdlibAlignConfig edlibDefaultAlignConfig(void) {
return edlibNewAlignConfig(-1, EDLIB_MODE_NW, EDLIB_TASK_DISTANCE);
}

2 changes: 2 additions & 0 deletions test_data/Enterobacteria_Phage_1/mutated_90_perc.fasta.out

Large diffs are not rendered by default.

Git LFS file not shown
