
Added list of string support to sent_tokenize #927

Open · wants to merge 5 commits into dev

Conversation


@ayaan-qadri ayaan-qadri commented Oct 8, 2024

What does this change

The sent_tokenize function now also supports a list of strings as input.

What was wrong

Before this change, sent_tokenize accepted only a single string as its parameter.

How this fixes it

The list of strings is joined into one string with the join method before tokenization.

Fixes #906

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

Member

@wannaphong wannaphong left a comment


Hello! Thank you for your pull request! It looks like it may not fix the issue yet.

I think it should output a list of words too.

Example: ["ผม", "กิน", "ข้าว", " ", "เธอ", "เล่น", "เกม"] -> "ผมกินข้าว เธอเล่นเกม" (store index 0-1 as "ผม", 2-4 as "กิน", ...) -> ["ผมกินข้าว", " ", "เธอเล่นเกม"] -> [["ผม", "กิน", "ข้าว"], [" "], ["เธอ", "เล่น", "เกม"]]

Can you edit the code?

@ayaan-qadri
Author

Hi @wannaphong, you did not specify those changes before. I made the changes as you requested. Feel free to ask if you need further changes.

@wannaphong
Member

wannaphong commented Oct 12, 2024

Can you add those functions to sent_tokenize?

from pythainlp.tokenize import sent_tokenize


def indices_words(words):
    # Record the (start, end) character positions of each word in the joined text.
    indices = []
    start_index = 0

    for word in words:
        if len(word) > 1:
            _temp = len(word) - 1
        else:
            _temp = 1
        indices.append((start_index, start_index + _temp))
        start_index += len(word)

    return indices


list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

import copy


def map_indices_to_words(index_list, sentences):
    # Walk the stored spans and cut each sentence back into its original words.
    result = []
    c = copy.copy(index_list)  # spans not yet assigned to a sentence
    n_sum = 0  # number of characters consumed by earlier sentences

    for sentence in sentences:
        words = sentence
        sentence_result = []
        n = 0  # spans used up by this sentence

        for start, end in c:
            if start > n_sum + len(words) - 1:
                break  # this span starts after the current sentence ends
            word = sentence[start - n_sum:end + 1 - n_sum]
            sentence_result.append(word)
            n += 1

        result.append(sentence_result)
        n_sum += len(words)
        for _ in range(n):  # drop the spans already mapped
            del c[0]

    return result


list_sent_word = map_indices_to_words(index, list_sent)  # [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]
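
For illustration, a minimal sketch of how these two helpers could be wired into sent_tokenize when it is given a list of words. The wrapper name sent_tokenize_list is hypothetical; this is not the code in this PR, just the intended flow:

def sent_tokenize_list(words, engine="crfcut"):
    # hypothetical wrapper: remember each word's character span, join the words
    # into the string the engine expects, split into sentences, then map the
    # spans back onto each sentence
    word_indices = indices_words(words)
    text = "".join(words)
    sentences = sent_tokenize(text, engine=engine)
    return map_indices_to_words(word_indices, sentences)


sent_tokenize_list(['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม'])
# [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]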

@ayaan-qadri
Author

@wannaphong , I'm a bit confused.

  1. Where do you want the map_indices_to_words and indices_words functions implemented?
  2. Are the last changes to the sent_tokenize function fine?
  3. list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
     index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
     list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

  • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) when keep_whitespace is False.

@wannaphong
Member

> @wannaphong , I'm a bit confused.
>
> 1. Where do you want the map_indices_to_words and indices_words functions implemented?
>
> 2. Are the last changes to the sent_tokenize function fine?
>
> 3. list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
>    index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
>    list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']
>
> • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) when keep_whitespace is False.

Yes, it can still keep the list of words. Every sentence tokenization engine has its own word tokenizer inside it, but each engine can only output a list of sentences, so we keep the original list of words and use it to restore the word boundaries within each sentence.

Example: I tokenized my text with the deepcut word tokenizer and then used crfcut (which uses newmm inside the engine) for sentence tokenization. The output is a list of sentences. If I then apply map_indices_to_words, it can output the list of words (from deepcut) within each sentence.
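
For concreteness, a rough sketch of that pipeline; it assumes deepcut is installed and reuses indices_words and map_indices_to_words from my earlier comment:

from pythainlp.tokenize import word_tokenize, sent_tokenize

text = "ผมกินข้าว เธอเล่นเกม"
words = word_tokenize(text, engine="deepcut")  # word boundaries come from deepcut
spans = indices_words(words)  # remember where each deepcut word sits in the text
sentences = sent_tokenize("".join(words), engine="crfcut")  # crfcut segments the sentences
words_in_sentences = map_indices_to_words(spans, sentences)
# words_in_sentences groups the deepcut words by crfcut sentence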

@ayaan-qadri
Author

Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

@wannaphong
Member

> Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

From https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/core.py#L325-L507:

Here are the sentence tokenization engines:

  • crfcut - (default) split by CRF trained on the TED dataset
  • thaisum - the implementation of a sentence segmenter from Nakhun Chumpolsathien, 2020
  • tltk - split by TLTK (https://pypi.org/project/tltk/)
  • wtp - split by wtpsplit (https://github.com/bminixhofer/wtpsplit)
  • whitespace+newline - split by whitespace and newline
  • whitespace - split by whitespace, specifically with the regex pattern r" +"

crfcut, thaisum, tltk, and wtp each have their own word tokenizer inside the engine, so I think the list input should be preprocessed by storing the word indices and then using those indices to split the words back out of the list of sentences.

List of words (['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']) -> text ("ผมกินข้าว เธอเล่นเกม", with the word indices stored) -> sentence tokenizer -> list of sentences (['ผมกินข้าว ', 'เธอเล่นเกม']) -> list of sentences, each a list of words ([['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']])

@pep8speaks

pep8speaks commented Oct 13, 2024

Hello @ayaan-qadri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 332:30: E226 missing whitespace around arithmetic operator
Line 352:29: E226 missing whitespace around arithmetic operator
Line 352:40: E226 missing whitespace around arithmetic operator
Line 355:38: E226 missing whitespace around arithmetic operator
Line 355:48: E226 missing whitespace around arithmetic operator
Line 355:50: E226 missing whitespace around arithmetic operator

Comment last updated at 2024-10-13 17:37:56 UTC

@ayaan-qadri
Author

@wannaphong , thanks for the explanation. Please check again; I made a mistake in a past commit too, sorry for that.

@wannaphong
Member

I found a bug.

The indices include the whitespace tokens, but the whitespace and whitespace+newline engines tokenize by splitting on whitespace. Can you re-create word_indices in those engines?

word_indices = indices_words(the text without whitespace for the whitespace engine, and without whitespace and newlines for the whitespace+newline engine)
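
For example, something along these lines; this is only a rough sketch of what I mean, and the names words and engine are placeholders, not the PR's code:

words = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]  # example word list
engine = "whitespace"

if engine == "whitespace":
    # this engine splits on runs of spaces (r" +"), so space-only tokens never
    # appear in its output and should not be counted in the indices
    words_for_indices = [w for w in words if w.strip(" ") != ""]
elif engine == "whitespace+newline":
    # here both spaces and newlines are separators, so drop all whitespace tokens
    words_for_indices = [w for w in words if not w.isspace()]
else:
    words_for_indices = words

word_indices = indices_words(words_for_indices)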


sonarcloud bot commented Oct 13, 2024

@ayaan-qadri
Author

@wannaphong , Check it now.

@wannaphong
Member

I found two bugs in two cases.

list_word=["ผม","กิน","ข้าว"," ","\n","เธอ","เล่น","เกม"]
sent_tokenize(list_word)

Results: [['ผม', 'กิน', 'ข้าว', ' \n', '\nเ', 'เธอ', 'เล่น', 'เกม']]

Expected results: [['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']]

list_word=["ผม","กิน","ข้าว"," ","\n","เธอ","เล่น","เกม"]
sent_tokenize(list_word,engine="whitespace")

Results: [['ผม', 'กิน', 'ข้าว'], ['\nเธ', 'อเล่', 'นเก']]

Expected results: [['ผม', 'กิน', 'ข้าว'], ['\n', 'เธอ', 'เล่น', 'เกม']]
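
Once fixed, the two cases above could be kept as regression tests. A minimal sketch, assuming sent_tokenize accepts a list of words as in this PR:

from pythainlp.tokenize import sent_tokenize

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]

# default engine (crfcut): the space and newline tokens stay separate words
assert sent_tokenize(list_word) == [["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]]

# whitespace engine: the newline starts the second sentence's word list
assert sent_tokenize(list_word, engine="whitespace") == [
    ["ผม", "กิน", "ข้าว"],
    ["\n", "เธอ", "เล่น", "เกม"],
]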

Successfully merging this pull request may close these issues.

Get List of words to sent_tokenize
3 participants