forked from explosion/spaCy
Commit
Clean up of char classes, few tokenizer fixes and faster default French tokenizer (explosion#3293)

* splitting up latin unicode interval
* removing hyphen as infix for French
* adding failing test for issue 1235
* test for issue explosion#3002 which now works
* partial fix for issue explosion#2070
* keep the hyphen as infix for French (as it was)
* restore french expressions with hyphen as infix (as it was)
* added succeeding unit test for Issue explosion#2656
* Fix issue explosion#2822 with custom Italian exception
* Fix issue explosion#2926 by allowing numbers right before infix /
* remove duplicate
* remove xfail for Issue explosion#2179 fixed by Matt
* adjust documentation and remove reference to regex lib
Showing 15 changed files with 277 additions and 37 deletions.
Large diffs are not rendered by default.
New file, 9 additions (the custom Italian tokenizer exception for "po'"):

```python
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA

_exc = {
    "po'": [{ORTH: "po'", LEMMA: 'poco'}]
}

TOKENIZER_EXCEPTIONS = _exc
```
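An exception table like this lets a surface form bypass the tokenizer's normal punctuation splitting. A minimal, hypothetical sketch (not spaCy's actual implementation) of how such a table is consumed:

```python
import re

# Hypothetical sketch: surface forms listed in the exception table are
# kept as a single token; everything else falls through to the generic
# word/punctuation splitting.
EXCEPTIONS = {"po'": ("po'", "poco")}  # surface form -> (orth, lemma)

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            tokens.append(chunk)  # exception wins: no further splitting
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("Vuoi un po' di zucchero?"))
```

Without the exception, `po'` would be split into `po` and `'` by the generic rule.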
New file, 24 additions (regression test for explosion#2656):

```python
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English


def test_issue2656():
    """Test that the tokenizer correctly splits off punctuation after numbers with decimal points."""
    text = "I went for 40.3, and got home by 10.0."
    nlp = English()
    doc = nlp(text)

    assert len(doc) == 11

    assert doc[0].text == "I"
    assert doc[1].text == "went"
    assert doc[2].text == "for"
    assert doc[3].text == "40.3"
    assert doc[4].text == ","
    assert doc[5].text == "and"
    assert doc[6].text == "got"
    assert doc[7].text == "home"
    assert doc[8].text == "by"
    assert doc[9].text == "10.0"
    assert doc[10].text == "."
```
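The behaviour this test pins down (commas and a sentence-final period split off, while decimals like `40.3` stay whole) can be approximated with a single regex in which the decimal pattern is tried before the generic alternatives. A hypothetical sketch, not spaCy's actual prefix/suffix machinery:

```python
import re

# Hypothetical sketch: decimals are matched as one unit before the
# generic word and punctuation alternatives get a chance to split them.
TOKEN = re.compile(r"\d+\.\d+|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("I went for 40.3, and got home by 10.0."))
```

Note that the final `.` in `10.0.` is still split off, because the decimal pattern only consumes one dot between digit runs.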
New file, 21 additions (regression test for explosion#2822):

```python
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.it import Italian


def test_issue2822():
    """Test that the abbreviation of poco is kept as one word."""
    nlp = Italian()
    text = "Vuoi un po' di zucchero?"

    doc = nlp(text)

    assert len(doc) == 6

    assert doc[0].text == "Vuoi"
    assert doc[1].text == "un"
    assert doc[2].text == "po'"
    assert doc[2].lemma_ == "poco"
    assert doc[3].text == "di"
    assert doc[4].text == "zucchero"
    assert doc[5].text == "?"
```
New file, 21 additions (regression test for explosion#2926):

```python
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.fr import French


def test_issue2926():
    """Test that the tokenizer splits on a slash (/) even when the preceding token ends in a digit."""
    nlp = French()
    text = "Learn html5/css3/javascript/jquery"
    doc = nlp(text)

    assert len(doc) == 8

    assert doc[0].text == "Learn"
    assert doc[1].text == "html5"
    assert doc[2].text == "/"
    assert doc[3].text == "css3"
    assert doc[4].text == "/"
    assert doc[5].text == "javascript"
    assert doc[6].text == "/"
    assert doc[7].text == "jquery"
```
New file, 11 additions (regression test for explosion#3002):

```python
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.de import German


def test_issue3002():
    """Test that the tokenizer doesn't hang on a long list of dots"""
    nlp = German()
    doc = nlp('880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl')
    assert len(doc) == 5
```
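The hang this test guards against comes from punctuation rules interacting badly with very long dotted numbers. A hypothetical sketch (not spaCy's implementation) of a pattern that consumes such a number in one linear pass, so the digit groups never trigger repeated re-matching:

```python
import re

# Hypothetical sketch: a dotted number is matched as a single unit,
# so each dot is consumed exactly once regardless of length.
TOKEN = re.compile(r"\d+(?:\.\d+)*|\w+")

def tokenize(text):
    return TOKEN.findall(text)

text = ("880.794.982.218.444.893.023.439.794.626.120.190."
        "780.624.990.275.671 ist eine lange Zahl")
print(len(tokenize(text)))
```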