Fix mixed-direction text layout in translations (Resolves #34) #37
base: main
Conversation
* remove redundant line (the removed line was redundant: `source_text` is already replaced through the f-string)
* Update utils.py
@Salah-Sal
Thanks for your contribution. I've added some comments.
words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)

new_text = []
ltr_pattern = re.compile(
@Salah-Sal
Thanks for your contribution.
Can we modify `ltr_pattern` to include numerical digits?
ltr_pattern = re.compile("[A-Za-z0-9]+")
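A minimal sketch of why the suggested change matters (the sample string is hypothetical, not from the PR): without digits in the character class, numeric runs are not recognized as left-to-right segments.

```python
import re

# Letters-only pattern misses numeric runs entirely
letters_only = re.compile("[A-Za-z]+")
# Suggested pattern treats ASCII letters *and* digits as LTR runs
ltr_pattern = re.compile("[A-Za-z0-9]+")

sample = "PyTorch 2024"  # hypothetical mixed-content string
print(letters_only.findall(sample))  # ['PyTorch'] -- '2024' is dropped
print(ltr_pattern.findall(sample))   # ['PyTorch', '2024']
```

In RTL output, a dropped numeric run would keep its surrounding whitespace and punctuation from being reordered correctly, which is why digits belong in the LTR class.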
is_rtl = language in rtl_languages

# Regex to capture words, non-word characters, and any whitespace
words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)
Can we pre-compile the regular expression outside the function for better performance?
tokenizer_pattern = re.compile(r"\w+|[^\w\s]+|\s+")
# ... Inside the function:
words_and_delimiters = tokenizer_pattern.findall(text)
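A sketch of the suggested restructuring, assuming the pattern is hoisted to module level (the function name and constant name here are illustrative, not from the PR): `re.compile` runs once at import time instead of on every call.

```python
import re

# Compiled once at import time; reused by every call below
TOKENIZER_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")

def tokenize(text: str) -> list[str]:
    """Split text into word runs, punctuation runs, and whitespace runs."""
    return TOKENIZER_PATTERN.findall(text)

print(tokenize("Hello, world!"))  # ['Hello', ',', ' ', 'world', '!']
```

Note that Python's `re` module also caches recently compiled patterns internally, so the gain is modest, but hoisting the compile still avoids the cache lookup on each call and documents the pattern as a reusable constant.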
"Urdu", | ||
} | ||
is_rtl = language in rtl_languages | ||
|
I suggest adding an appropriate Unicode normalization form before tokenizing, but it may alter some unimportant text, so you can skip it.
# Before tokenizing the text:
text = unicodedata.normalize("NFC", text)
Context
What is the purpose of this PR?
Please link to any issues this PR addresses.
Fixes #34
Changelog
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)
pre-commit install
pytest tests