
Added list of string support to sent_tokenize #927

Open · wants to merge 5 commits into dev

Conversation


@ayaan-qadri ayaan-qadri commented Oct 8, 2024

What does this change

The sent_tokenize function now also supports a list of strings as input.

What was wrong

Before this change, sent_tokenize accepted only a single string as its parameter.

How this fixes it

The list of strings is joined into one string with the join method before tokenization.

Fixes #906

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

Member

@wannaphong wannaphong left a comment


Hello! Thank you for your pull request! It looks like it may not fix the issue yet.

I think it should output a list of words too.

Example: ["ผม", "กิน", "ข้าว", " ", "เธอ", "เล่น", "เกม"] -> "ผมกินข้าว เธอเล่นเกม" (store index 0-1 as "ผม", 2-4 as "กิน", ...) -> ["ผมกินข้าว", " ", "เธอเล่นเกม"] -> [["ผม", "กิน", "ข้าว"], [" "], ["เธอ", "เล่น", "เกม"]]

Can you edit the code?

@ayaan-qadri
Author

Hi @wannaphong, you did not specify those changes before. I made the changes as you requested. Feel free to ask if you need further changes.

@wannaphong
Member

wannaphong commented Oct 12, 2024

Can you add those functions to sent_tokenize?

from pythainlp.tokenize import sent_tokenize


def indices_words(words):
    # Record the (start, end) character positions of each word in the joined text.
    indices = []
    start_index = 0

    for word in words:
        if len(word) > 1:
            _temp = len(word) - 1
        else:
            _temp = 1
        indices.append((start_index, start_index + _temp))
        start_index += len(word)

    return indices


list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

import copy


def map_indices_to_words(index_list, sentences):
    # Walk the stored spans and cut each sentence back into its original words.
    result = []
    c = copy.copy(index_list)  # spans not yet assigned to a sentence
    n_sum = 0  # number of characters consumed by earlier sentences

    for sentence in sentences:
        words = sentence
        sentence_result = []
        n = 0  # spans used up by this sentence

        for start, end in c:
            if start > n_sum + len(words) - 1:
                break  # this span starts after the current sentence ends
            word = sentence[start - n_sum:end + 1 - n_sum]
            sentence_result.append(word)
            n += 1

        result.append(sentence_result)
        n_sum += len(words)
        for _ in range(n):  # drop the spans already mapped
            del c[0]

    return result


list_sent_word = map_indices_to_words(index, list_sent)  # [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]
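
For illustration, a minimal sketch of how these two helpers could be wired into sent_tokenize when it is given a list of words. The wrapper name sent_tokenize_list is hypothetical; this is not the code in this PR, just the intended flow:

def sent_tokenize_list(words, engine="crfcut"):
    # hypothetical wrapper: remember each word's character span, join the words
    # into the string the engine expects, split into sentences, then map the
    # spans back onto each sentence
    word_indices = indices_words(words)
    text = "".join(words)
    sentences = sent_tokenize(text, engine=engine)
    return map_indices_to_words(word_indices, sentences)


sent_tokenize_list(['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม'])
# [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]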

@ayaan-qadri
Author

@wannaphong , I'm a bit confused.

  1. Where do you want the map_indices_to_words and indices_words functions implemented?
  2. Are the last changes to the sent_tokenize function fine?
  3. list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
     index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
     list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

  • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) when keep_whitespace is False.

@wannaphong
Member

> @wannaphong , I'm a bit confused.
>
> 1. Where do you want the map_indices_to_words and indices_words functions implemented?
>
> 2. Are the last changes to the sent_tokenize function fine?
>
> 3. list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
>    index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
>    list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']
>
> • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) when keep_whitespace is False.

Yes, it can still keep the list of words. Every sentence tokenization engine has its own word tokenizer inside it, but each engine can only output a list of sentences, so we keep the original list of words and use it to restore the word boundaries within each sentence.

Example: I tokenized my text with the deepcut word tokenizer and then used crfcut (which uses newmm inside the engine) for sentence tokenization. The output is a list of sentences. If I then apply map_indices_to_words, it can output the list of words (from deepcut) within each sentence.
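
For concreteness, a rough sketch of that pipeline; it assumes deepcut is installed and reuses indices_words and map_indices_to_words from my earlier comment:

from pythainlp.tokenize import word_tokenize, sent_tokenize

text = "ผมกินข้าว เธอเล่นเกม"
words = word_tokenize(text, engine="deepcut")  # word boundaries come from deepcut
spans = indices_words(words)  # remember where each deepcut word sits in the text
sentences = sent_tokenize("".join(words), engine="crfcut")  # crfcut segments the sentences
words_in_sentences = map_indices_to_words(spans, sentences)
# words_in_sentences groups the deepcut words by crfcut sentence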

@ayaan-qadri
Author

Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

@wannaphong
Member

> Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

From https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/core.py#L325-L507:

Here are the sentence tokenization engines:

  • crfcut - (default) split by CRF trained on the TED dataset
  • thaisum - the implementation of a sentence segmenter from Nakhun Chumpolsathien, 2020
  • tltk - split by TLTK (https://pypi.org/project/tltk/)
  • wtp - split by wtpsplit (https://github.com/bminixhofer/wtpsplit)
  • whitespace+newline - split by whitespace and newline
  • whitespace - split by whitespace, specifically with the regex pattern r" +"

crfcut, thaisum, tltk, and wtp each have their own word tokenizer inside the engine, so I think the list input should be preprocessed by storing the word indices and then using those indices to split the words back out of the list of sentences.

List of words (['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']) -> text ("ผมกินข้าว เธอเล่นเกม", with the word indices stored) -> sentence tokenizer -> list of sentences (['ผมกินข้าว ', 'เธอเล่นเกม']) -> list of sentences, each a list of words ([['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']])

@pep8speaks

pep8speaks commented Oct 13, 2024

Hello @ayaan-qadri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 332:30: E226 missing whitespace around arithmetic operator
Line 352:29: E226 missing whitespace around arithmetic operator
Line 352:40: E226 missing whitespace around arithmetic operator
Line 355:38: E226 missing whitespace around arithmetic operator
Line 355:48: E226 missing whitespace around arithmetic operator
Line 355:50: E226 missing whitespace around arithmetic operator

Comment last updated at 2024-10-13 17:37:56 UTC

@ayaan-qadri
Author

@wannaphong , thanks for the explanation. Please check again; I made a mistake in a past commit too, sorry for that.

@wannaphong
Member

I found a bug.

The indices include the whitespace tokens, but the whitespace and whitespace+newline engines tokenize by splitting on whitespace. Can you re-create word_indices in those engines?

word_indices = indices_words(the text without whitespace for the whitespace engine, and without whitespace and newlines for the whitespace+newline engine)
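
For example, something along these lines; this is only a rough sketch of what I mean, and the names words and engine are placeholders, not the PR's code:

words = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]  # example word list
engine = "whitespace"

if engine == "whitespace":
    # this engine splits on runs of spaces (r" +"), so space-only tokens never
    # appear in its output and should not be counted in the indices
    words_for_indices = [w for w in words if w.strip(" ") != ""]
elif engine == "whitespace+newline":
    # here both spaces and newlines are separators, so drop all whitespace tokens
    words_for_indices = [w for w in words if not w.isspace()]
else:
    words_for_indices = words

word_indices = indices_words(words_for_indices)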


sonarcloud bot commented Oct 13, 2024

@ayaan-qadri
Author

@wannaphong , Check it now.

@wannaphong
Member

I found two bugs in two cases.

list_word=["ผม","กิน","ข้าว"," ","\n","เธอ","เล่น","เกม"]
sent_tokenize(list_word)

Results: [['ผม', 'กิน', 'ข้าว', ' \n', '\nเ', 'เธอ', 'เล่น', 'เกม']]

Expected results: [['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']]

list_word=["ผม","กิน","ข้าว"," ","\n","เธอ","เล่น","เกม"]
sent_tokenize(list_word,engine="whitespace")

Results: [['ผม', 'กิน', 'ข้าว'], ['\nเธ', 'อเล่', 'นเก']]

Expected results: [['ผม', 'กิน', 'ข้าว'], ['\n', 'เธอ', 'เล่น', 'เกม']]
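
Once fixed, the two cases above could be kept as regression tests. A minimal sketch, assuming sent_tokenize accepts a list of words as in this PR:

from pythainlp.tokenize import sent_tokenize

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]

# default engine (crfcut): the space and newline tokens stay separate words
assert sent_tokenize(list_word) == [["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]]

# whitespace engine: the newline starts the second sentence's word list
assert sent_tokenize(list_word, engine="whitespace") == [
    ["ผม", "กิน", "ข้าว"],
    ["\n", "เธอ", "เล่น", "เกม"],
]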

Successfully merging this pull request may close these issues.

Get List of words to sent_tokenize
3 participants