Fix mixed-direction text layout in translations (Resolves #34) #37
base: main
Conversation
* remove redundant line (the removed line was redundant: `source_text` is already replaced through the f-string)
* Update utils.py
@Salah-Sal
Thanks for your contribution. I've added some comments.
words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)

new_text = []
ltr_pattern = re.compile(
@Salah-Sal
Thanks for your contribution.
Can we modify `ltr_pattern` to include numerical digits?
ltr_pattern = re.compile("[A-Za-z0-9]+")
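A minimal sketch of why the suggested change matters (the sample string is hypothetical, not from the PR): without digits in the character class, numeric runs are not recognized as left-to-right segments.

```python
import re

# Letters-only pattern misses numeric runs entirely
letters_only = re.compile("[A-Za-z]+")
# Suggested pattern treats ASCII letters *and* digits as LTR runs
ltr_pattern = re.compile("[A-Za-z0-9]+")

sample = "PyTorch 2024"  # hypothetical mixed-content string
print(letters_only.findall(sample))  # ['PyTorch'] -- '2024' is dropped
print(ltr_pattern.findall(sample))   # ['PyTorch', '2024']
```

In RTL output, a dropped numeric run would keep its surrounding whitespace and punctuation from being reordered correctly, which is why digits belong in the LTR class.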
is_rtl = language in rtl_languages

# Regex to capture words, non-word characters, and any whitespace
words_and_delimiters = re.findall(r"\w+|[^\w\s]+|\s+", text)
Can we pre-compile the regular expression outside the function for better performance?
tokenizer_pattern = re.compile(r"\w+|[^\w\s]+|\s+")
# ... Inside the function:
words_and_delimiters = tokenizer_pattern.findall(text)
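A sketch of the suggested restructuring, assuming the pattern is hoisted to module level (the function name and constant name here are illustrative, not from the PR): `re.compile` runs once at import time instead of on every call.

```python
import re

# Compiled once at import time; reused by every call below
TOKENIZER_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")

def tokenize(text: str) -> list[str]:
    """Split text into word runs, punctuation runs, and whitespace runs."""
    return TOKENIZER_PATTERN.findall(text)

print(tokenize("Hello, world!"))  # ['Hello', ',', ' ', 'world', '!']
```

Note that Python's `re` module also caches recently compiled patterns internally, so the gain is modest, but hoisting the compile still avoids the cache lookup on each call and documents the pattern as a reusable constant.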
"Urdu", | ||
} | ||
is_rtl = language in rtl_languages | ||
|
I suggest adding an appropriate Unicode normalization form before tokenizing, but it may alter some unimportant text, so you can skip it.
# Before tokenizing the text:
text = unicodedata.normalize("NFC", text)
Context
What is the purpose of this PR?
Please link to any issues this PR addresses.
Fixes #34
Changelog
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)
pre-commit install
pytest tests