Fix mixed-direction text layout in translations (Resolves #34) #37

Open. Wants to merge 1 commit into base: main.

Salah-Sal

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.
Fixes #34

Changelog

  • Removed a redundant line in utils.py where source_text was being replaced unnecessarily.
  • Added the tokenize_mixed_direction_text function in process.py to handle mixed-direction text (LTR within RTL).
  • Updated the translator and translator_sec functions in process.py to use tokenize_mixed_direction_text for processing the translations.

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • include relevant commands and any other artifacts in this summary (eval results, etc.)

* remove redundant line

The removed line was redundant: `source_text` is already substituted through the f-string.

* Update utils.py
@Salah-Sal changed the title from "remove redundant line (#8)" to "Fix mixed-direction text layout in translations (Resolves #34) (#8)" on Jul 12, 2024
@Salah-Sal changed the title from "Fix mixed-direction text layout in translations (Resolves #34) (#8)" to "Fix mixed-direction text layout in translations (Resolves #34)" on Jul 12, 2024

@abodacs left a comment

@Salah-Sal
Thanks for your contribution.
I added some comments.

words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)

new_text = []
ltr_pattern = re.compile(

@Salah-Sal
Thanks for your contribution.
Can we modify `ltr_pattern` to include numerical digits?

ltr_pattern = re.compile("[A-Za-z0-9]+")
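To illustrate the effect of the suggestion above, here is a minimal sketch comparing a letters-only pattern with the suggested one. The letters-only pattern, the helper `is_ltr_token`, and the use of `fullmatch` are assumptions made for illustration; the diff does not show the PR's full pattern or how it is applied.

```python
import re

# Hypothetical "before" pattern (letters only); the PR's actual pattern is truncated in the diff.
letters_only = re.compile("[A-Za-z]+")
# Pattern proposed in the review comment.
letters_and_digits = re.compile("[A-Za-z0-9]+")


def is_ltr_token(token, pattern):
    """Return True when the entire token is matched by the given pattern."""
    return pattern.fullmatch(token) is not None


tokens = ["GPT", "2024", "v2", "مرحبا"]

# Classify every token with each pattern.
before = [is_ltr_token(t, letters_only) for t in tokens]
after = [is_ltr_token(t, letters_and_digits) for t in tokens]

print(before)  # [True, False, False, False]
print(after)   # [True, True, True, False]
```

Under these assumptions, purely numeric tokens such as `2024` and mixed tokens such as `v2` are only treated as LTR once digits are part of the character class.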

is_rtl = language in rtl_languages

# Regex to capture words, non-word characters, and any whitespace
words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)

Can we pre-compile the regular expression outside the function for better performance?

tokenizer_pattern = re.compile(r"\w+|[^\w\s]+|\s+")
# ... Inside the function:
words_and_delimiters = tokenizer_pattern.findall(text)
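A self-contained version of this suggestion might look like the sketch below. The constant name `TOKENIZER_PATTERN`, the function `split_words_and_delimiters`, and the sample string are hypothetical and not taken from the PR.

```python
import re

# Compiled once at import time instead of on every call.
TOKENIZER_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")


def split_words_and_delimiters(text):
    """Split text into runs of word characters, punctuation, and whitespace."""
    return TOKENIZER_PATTERN.findall(text)


sample = "مرحبا GPT-4!"
tokens = split_words_and_delimiters(sample)

print(tokens)
# The three alternatives cover every character, so joining the tokens restores the input.
print("".join(tokens) == sample)  # True
```

Python's `re` module already caches recently compiled patterns, so the speed-up is likely modest. The main benefit of the module-level constant is that the pattern is defined and named in one place.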

"Urdu",
}
is_rtl = language in rtl_languages

I suggest normalizing the text to an appropriate Unicode normalization form before tokenizing. It may change or drop some unimportant characters, so feel free to ignore this suggestion.

# ... before tokenizing the text:
text = unicodedata.normalize("NFC", text)
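As a rough illustration of why normalizing first can matter for the tokenizer regex, the sketch below feeds it a decomposed Arabic sequence before and after NFC. The sample characters and variable names are chosen for illustration and do not come from the PR.

```python
import re
import unicodedata

tokenizer_pattern = re.compile(r"\w+|[^\w\s]+|\s+")

# ALEF (U+0627) followed by the combining mark HAMZA ABOVE (U+0654).
decomposed = "\u0627\u0654"
# NFC composes the pair into the single code point ALEF WITH HAMZA ABOVE (U+0623).
composed = unicodedata.normalize("NFC", decomposed)

# The combining mark is not a word character, so the decomposed form splits in two.
tokens_before = tokenizer_pattern.findall(decomposed)
tokens_after = tokenizer_pattern.findall(composed)

print(len(tokens_before))  # 2
print(len(tokens_after))   # 1
```

In this example NFC only recomposes a canonically equivalent sequence and nothing is removed. Whether that holds for all text the translator produces would need checking against real outputs.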

Successfully merging this pull request may close these issues.

LTR to RTL Translation Layout Misalignment in Web UI
3 participants