forked from explosion/spaCy
Commit
Replacing regex library with re to increase tokenization speed (explosion#3218)

* replace unicode categories with raw list of code points
* simplifying ranges
* fixing variable length quotes
* removing redundant regular expression
* small cleanup of regexp notations
* quotes and alpha as ranges instead of alterations
* removed most regexp dependencies and features
* exponential backtracking - unit tests
* rewrote expression with pathological backtracking
* disabling double hyphen tests for now
* test additional variants of repeating punctuation
* remove regex and redundant backslashes from load_reddit script
* small typo fixes
* disable double punctuation test for russian
* clean up old comments
* format block code
* final cleanup
* naming consistency
* french strings as unicode for python 2 support
* french regular expression case insensitive
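The first item in the commit message (raw code points instead of Unicode categories) follows from a difference between the two libraries: the third-party `regex` module understands property classes such as `\p{Lu}`, while the standard-library `re` module does not and, on Python 3.6 and later, rejects them at compile time. The sketch below illustrates that difference. The name `UPPER_RANGES` and its contents are hypothetical stand-ins chosen for the example, not spaCy's actual character lists.

```python
import re


def compiles(pattern):
    """Return True if the standard-library `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Unicode property syntax from the `regex` library; stdlib `re` has no equivalent.
property_class = r"\p{Lu}"

# Hypothetical hand-written replacement: ASCII plus Latin-1 uppercase letters only.
UPPER_RANGES = "A-Z\u00C0-\u00D6\u00D8-\u00DE"
explicit_class = "[{u}]".format(u=UPPER_RANGES)

print(compiles(property_class))   # rejected by `re` on Python 3.6+
print(compiles(explicit_class))   # plain ranges compile fine
print(re.findall(explicit_class, "\u00c9mile Zola, \u00d8rsted"))
```

The trade-off is that explicit ranges must be maintained by hand and only cover the code points that were listed, whereas a property class tracks the Unicode database automatically.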
Showing 32 changed files with 258 additions and 222 deletions.
@@ -4,7 +4,7 @@
 
 from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
 from ..char_classes import LIST_ICONS, ALPHA_LOWER, ALPHA_UPPER, ALPHA, HYPHENS
-from ..char_classes import QUOTES, CURRENCY
+from ..char_classes import CONCAT_QUOTES, CURRENCY
 
 _units = (
 "km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft "
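The import change above swaps `QUOTES` for `CONCAT_QUOTES`, which matches the commit-message item about using ranges instead of alternations. A minimal sketch of the two forms follows, assuming `QUOTES` is a `|`-joined alternation and `CONCAT_QUOTES` is the same characters concatenated for use inside a character class. The sample quote list and the helper `period_positions` are hypothetical and only serve the illustration.

```python
import re

# Hypothetical sample; the real list in char_classes is assumed to be longer.
_quotes = "' \" « »"

# Alternation form: one branch per character, usable inside a group.
QUOTES_ALT = "|".join(re.escape(q) for q in _quotes.split())
# Concatenated form: the same characters, usable inside [...].
QUOTES_CONCAT = "".join(_quotes.split())

# "A period preceded by a digit or a quote", written both ways.
with_alternation = re.compile(r"(?<=(?:[0-9]|{q}))\.".format(q=QUOTES_ALT))
with_char_class = re.compile(r"(?<=[0-9{q}])\.".format(q=QUOTES_CONCAT))


def period_positions(pattern, text):
    """Return the start offset of every match of `pattern` in `text`."""
    return [m.start() for m in pattern.finditer(text)]


text = "He said «no». It cost 5."
print(period_positions(with_alternation, text))
print(period_positions(with_char_class, text))
```

Both patterns find the same periods, but the character class is a single set lookup rather than a sequence of branches to try, which is generally the cheaper form for single-character alternatives.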
@@ -57,10 +57,10 @@ def merge_chars(char):
 r"^([0-9]){1}\)$",  # 12)
 r"(?<=°[FfCcKk])\.",
 r"([0-9])+\&",  # 12&
-r"(?<=[0-9])(?:{})".format(CURRENCY),
-r"(?<=[0-9])(?:{})".format(UNITS),
-r"(?<=[0-9{}{}(?:{})])\.".format(ALPHA_LOWER, r"²\-\)\]\+", QUOTES),
-r"(?<=[{a}][{a}])\.".format(a=ALPHA_UPPER),
+r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+r"(?<=[0-9])(?:{u})".format(u=UNITS),
+r"(?<=[0-9{al}{e}(?:{q})])\.".format(al=ALPHA_LOWER, e=r"²\-\+", q=CONCAT_QUOTES),
+r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
 r"(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\-",  # όνομα-
 r"(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\.",
 r"^[Α-Ω]{1}\.",
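The hunk above replaces positional `{}` fields with named ones, which makes it easier to see which constant lands where in the pattern. The sketch below checks that both styles build the same string when given the same arguments. It also illustrates a general property of `re` syntax, offered as an observation rather than a statement about the author's intent: inside a character class, `(?:...)` is not a group, so `(`, `?`, `:` and `)` become literal members of the class. The constant values and the `without_group_syntax` variant are hypothetical stand-ins.

```python
import re

# Stand-in values; the real constants come from spaCy's char_classes module.
ALPHA_LOWER = "a-z"
CONCAT_QUOTES = "'\""
EXTRA = r"²\-\+"

# Positional and named fields produce the same pattern for the same arguments.
positional = r"(?<=[0-9{}{}(?:{})])\.".format(ALPHA_LOWER, EXTRA, CONCAT_QUOTES)
named = r"(?<=[0-9{al}{e}(?:{q})])\.".format(al=ALPHA_LOWER, e=EXTRA, q=CONCAT_QUOTES)
assert positional == named

# Hypothetical variant without the group syntax inside the class.
without_group_syntax = r"(?<=[0-9{al}{e}{q}])\.".format(
    al=ALPHA_LOWER, e=EXTRA, q=CONCAT_QUOTES
)

as_written = re.compile(named)
stricter = re.compile(without_group_syntax)

# "?" is a literal member of the first class, so a period after "?" matches there.
print(bool(as_written.search("why?.")))
print(bool(stricter.search("why?.")))
# Both variants match a period after a lowercase letter.
print(bool(as_written.search("abc.")), bool(stricter.search("abc.")))
```

Named fields also sidestep a common pitfall when mixing `str.format` with regex quantifiers: a literal `{1}` in a formatted pattern would be read as a positional field unless the braces are doubled.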
@@ -85,10 +85,10 @@ def merge_chars(char):
 r"([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}",
 r"[A-Za-z]+\@[A-Za-z]+(\-[A-Za-z]+)*\.[A-Za-z]+",  # [email protected]
 r"([a-zA-Z]+)(\-([a-zA-Z]+))+",  # abc-abc
-r"(?<=[{}])\.(?=[{}])".format(ALPHA_LOWER, ALPHA_UPPER),
+r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
 r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
-r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA),
+r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
+r'(?<=[{a}])[:<>=/](?=[{a}])'.format(a=ALPHA),
 ]
 )
 
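The last two changed lines in this hunk are not only a notation cleanup: dropping the optional punctuation run before the hyphen and the `"` from the lookbehind changes which strings the infix rules match. The sketch below compares the old and new patterns on a few inputs. `ALPHA` and `HYPHENS` are simplified ASCII stand-ins, assumed here to be a set of ranges and a `|`-joined alternation respectively, and `matches` is a hypothetical helper.

```python
import re

# Simplified stand-ins for the char_classes constants.
ALPHA = "a-zA-Z"
HYPHENS = "-|–|—"

old_hyphen = re.compile(r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS))
new_hyphen = re.compile(r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS))

old_symbol = re.compile(r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA))
new_symbol = re.compile(r'(?<=[{a}])[:<>=/](?=[{a}])'.format(a=ALPHA))


def matches(pattern, text):
    """Return the text of every match of `pattern` in `text`."""
    return [m.group() for m in pattern.finditer(text)]


# A plain hyphen between letters is an infix under both versions.
print(matches(old_hyphen, "well-known"), matches(new_hyphen, "well-known"))
# Punctuation directly before the hyphen was absorbed by the old pattern only.
print(matches(old_hyphen, "yes,-no"), matches(new_hyphen, "yes,-no"))
# A colon after a closing double quote matched the old pattern only.
print(matches(old_symbol, 'say "a":b'), matches(new_symbol, 'say "a":b'))
```

Under these assumptions the new patterns are strictly narrower, so inputs such as `yes,-no` are left to other tokenizer rules.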