Model Cards

Wannaphong Phatthiyaphaibun edited this page Jul 31, 2021 · 19 revisions

These model cards contain technical details of the models developed and used in PyThaiNLP.

LST20 CLS

v0.2

Model Details

  • Developer: Wannaphong Phatthiyaphaibun
  • Model date: 2020-10-03
  • Model version: 0.2
  • Used in PyThaiNLP version: 2.2.4 +
  • Filename: ~/pythainlp-data/cls-v0.2.crfsuite
  • GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479
  • CRF Model
  • License: CC0

Intended Use

  • Segmenting Thai text into clauses (smaller than a sentence but bigger than a word)
  • Not suitable for other languages or non-news domains.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.
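
As a rough illustration of how per-tag precision, recall and f1-score can be derived from gold and predicted tag sequences, here is a minimal sketch. The function name `per_tag_scores` and the toy tag lists are hypothetical and are not taken from the LST20 corpus or from the PyThaiNLP code base.

```python
from collections import Counter


def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, f1)} for two equal-length tag lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # tag p was predicted but is wrong here
            fn[g] += 1  # gold tag g was missed here
    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Hypothetical toy data, for illustration only.
gold = ["B_CLS", "I_CLS", "E_CLS", "B_CLS", "E_CLS"]
pred = ["B_CLS", "I_CLS", "I_CLS", "B_CLS", "E_CLS"]
scores = per_tag_scores(gold, pred)
for tag, (p, r, f) in scores.items():
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case one E_CLS token is mislabelled as I_CLS, which lowers the precision of I_CLS and the recall of E_CLS while leaving B_CLS untouched.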

Training Data

LST20 Corpus Train set (news domain)

Evaluation Data

LST20 Corpus Test set (news domain)

Quantitative Analyses

              precision    recall  f1-score   support

       B_CLS       0.90      0.94      0.92     16111
       E_CLS       0.90      0.94      0.92     15947
       I_CLS       0.99      0.97      0.98    169565

   micro avg       0.97      0.97      0.97    201623
   macro avg       0.93      0.95      0.94    201623
weighted avg       0.97      0.97      0.97    201623
 samples avg       0.94      0.94      0.94    201623
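
The macro and weighted averages in the report above can be checked by hand from the per-tag rows. The sketch below assumes the usual definitions (macro: unweighted mean over tags; weighted: mean weighted by support) and uses only the rounded f1-score and support values printed in the table, so it reproduces the averages only to two decimal places.

```python
# (f1-score, support) pairs copied from the report above.
rows = {
    "B_CLS": (0.92, 16111),
    "E_CLS": (0.92, 15947),
    "I_CLS": (0.98, 169565),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every tag counts equally.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: every tag counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

Because I_CLS accounts for most of the tokens, the weighted average sits close to the I_CLS score, while the macro average is pulled down by the two boundary tags.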

Ethical Considerations

None identified.

Caveats and Recommendations

  • Word segmentation must be performed before using this model.
  • Thai text only
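
Since the model works on pre-segmented words and labels each one with B_CLS, I_CLS or E_CLS, the labels still have to be turned into clause spans. The sketch below is one possible way to do that. It assumes that B_CLS opens a clause, I_CLS continues it and E_CLS closes it; the helper `group_clauses`, the example words and their tags are all hypothetical and were written by hand, not produced by the model.

```python
def group_clauses(words, tags):
    """Group word tokens into clauses using B_CLS / I_CLS / E_CLS tags."""
    clauses, current = [], []
    for word, tag in zip(words, tags):
        # A new B_CLS closes any clause that was left open.
        if tag == "B_CLS" and current:
            clauses.append(current)
            current = []
        current.append(word)
        # E_CLS ends the current clause.
        if tag == "E_CLS":
            clauses.append(current)
            current = []
    if current:  # trailing words without a closing E_CLS
        clauses.append(current)
    return clauses


# Hand-segmented, hand-tagged toy input (assumption, not model output).
words = ["ฉัน", "กิน", "ข้าว", "แล้ว", "เขา", "ไป", "เรียน"]
tags = ["B_CLS", "I_CLS", "I_CLS", "E_CLS", "B_CLS", "I_CLS", "E_CLS"]
clauses = group_clauses(words, tags)
for clause in clauses:
    print("".join(clause))
```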

CRFcut

v1.0

Model Details

  • Developer: Chonlapat Patanajirasit
  • Model date: 2020-05-09
  • Model version: 1.0
  • Used in PyThaiNLP version: 2.2 +
  • Filename: pythainlp/corpus/sentenceseg_crfcut.model
  • GitHub: https://github.com/vistec-AI/crfcut
  • CRF Model
  • License: CC0

Intended Use

  • Segmenting Thai text into sentences.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

Ted + Orchid + Fake review

Evaluation Data

Results for CRFcut trained on different datasets are as follows:

dataset-train               dataset-validate  I-precision  I-recall  I-fscore  E-precision  E-recall  E-fscore  space-correct
Ted                         Ted               0.99         0.99      0.99      0.74         0.70      0.72      0.82
Ted                         Orchid            0.95         0.99      0.97      0.73         0.24      0.36      0.73
Ted                         Fake review       0.98         0.99      0.98      0.86         0.70      0.77      0.78
Orchid                      Ted               0.98         0.98      0.98      0.56         0.59      0.58      0.71
Orchid                      Orchid            0.98         0.99      0.99      0.85         0.71      0.77      0.87
Orchid                      Fake review       0.97         0.99      0.98      0.77         0.63      0.69      0.70
Fake review                 Ted               0.99         0.95      0.97      0.42         0.85      0.56      0.56
Fake review                 Orchid            0.97         0.96      0.96      0.48         0.59      0.53      0.67
Fake review                 Fake review       1.00         1.00      1.00      0.98         0.96      0.97      0.97
Ted + Orchid + Fake review  Ted               0.99         0.98      0.99      0.66         0.77      0.71      0.78
Ted + Orchid + Fake review  Orchid            0.98         0.98      0.98      0.73         0.66      0.69      0.82
Ted + Orchid + Fake review  Fake review       1.00         1.00      1.00      0.98         0.95      0.96      0.96

Quantitative Analyses

Not reported.

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai text only

Thai NER

v1.4

Model Details

Intended Use

  • Named-entity tagging for Thai.
  • Not suitable for other languages or non-news domains.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

ThaiNER 1.3 Corpus Train set

Evaluation Data

ThaiNER 1.3 Corpus Test set

Quantitative Analyses


                precision    recall  f1-score   support

        B-DATE       0.92      0.86      0.89       375
        I-DATE       0.94      0.94      0.94       747
       B-EMAIL       1.00      1.00      1.00         5
       I-EMAIL       1.00      1.00      1.00        28
         B-LAW       0.71      0.56      0.62        43
         I-LAW       0.74      0.70      0.72       154
         B-LEN       0.96      0.93      0.95        29
         I-LEN       0.98      0.94      0.96        69
    B-LOCATION       0.88      0.77      0.82       864
    I-LOCATION       0.86      0.73      0.79       852
       B-MONEY       0.98      0.85      0.91       105
       I-MONEY       0.96      0.95      0.95       239
B-ORGANIZATION       0.90      0.78      0.84      1166
I-ORGANIZATION       0.84      0.77      0.81      1338
     B-PERCENT       1.00      0.97      0.99        34
     I-PERCENT       1.00      0.96      0.98        51
      B-PERSON       0.96      0.82      0.88       676
      I-PERSON       0.94      0.92      0.93      2424
       B-PHONE       1.00      0.72      0.84        29
       I-PHONE       0.96      0.92      0.94        78
        B-TIME       0.87      0.73      0.79       172
        I-TIME       0.94      0.83      0.88       336
         B-URL       0.89      1.00      0.94        24
         I-URL       0.96      1.00      0.98       371
         B-ZIP       1.00      1.00      1.00         4

     micro avg       0.91      0.84      0.87     10213
     macro avg       0.93      0.87      0.89     10213
  weighted avg       0.91      0.84      0.87     10213
   samples avg       0.17      0.17      0.17     10213
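
The report above is computed per B-/I- tag rather than per entity. To recover whole entities from such tags, the tagged tokens have to be merged. The following is a minimal sketch of that step under the common BIO convention (B- starts an entity, I- continues an entity of the same type, anything else is outside). The helper `bio_to_entities` and the example tokens and tags are hypothetical and hand-written; this is not the decoding code used in PyThaiNLP.

```python
def bio_to_entities(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, "".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


# Hand-written toy input (assumption, not model output).
tokens = ["นาย", "สมชาย", "ไป", "เชียงใหม่"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

Tokens are joined without a separator because Thai is written without spaces between words; a stray I- tag with no matching B- is dropped in this sketch, which is one of several reasonable policies.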

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai text only

v1.5

Model Details

  • Developer: Wannaphong Phatthiyaphaibun
  • Model date: 2021-01-16
  • Model version: 1.5
  • Used in PyThaiNLP version: 2.3 +
  • Filename: ~/pythainlp-data/thai-ner-1-5-newmm-lst20.crfsuite
  • CRF Model
  • License: CC0
  • GitHub for Thai NER 1.5 (Data and train notebook): thai-ner-1-5-newmm-lst20.ipynb https://github.com/wannaphong/thai-ner/tree/master/model/1.5

Intended Use

  • Named-entity tagging for Thai.
  • Not suitable for other languages or non-news domains.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

ThaiNER 1.5 Corpus Train set (5,089 sentences)

Evaluation Data

ThaiNER 1.5 Corpus Test set (1,274 sentences)

Quantitative Analyses

                precision    recall  f1-score   support

        B-DATE       0.93      0.82      0.87       350
        I-DATE       0.95      0.94      0.95       665
         B-LAW       0.85      0.54      0.66        87
         I-LAW       0.85      0.64      0.73       253
         B-LEN       1.00      0.75      0.86        12
         I-LEN       1.00      0.69      0.82        26
    B-LOCATION       0.81      0.70      0.75       620
    I-LOCATION       0.74      0.72      0.73       533
       B-MONEY       1.00      0.91      0.95       131
       I-MONEY       0.99      0.95      0.97       321
B-ORGANIZATION       0.92      0.70      0.80      1334
I-ORGANIZATION       0.80      0.73      0.76      1198
     B-PERCENT       0.94      0.88      0.91        17
     I-PERCENT       0.91      0.95      0.93        22
      B-PERSON       0.96      0.78      0.86       607
      I-PERSON       0.94      0.88      0.91      2181
       B-PHONE       1.00      0.50      0.67         2
       I-PHONE       1.00      1.00      1.00         8
        B-TIME       0.93      0.66      0.77        87
        I-TIME       0.97      0.77      0.86       158
         B-URL       0.91      0.83      0.87        12
         I-URL       0.93      0.96      0.94        94

     micro avg       0.89      0.79      0.84      8718
     macro avg       0.92      0.79      0.84      8718
  weighted avg       0.90      0.79      0.84      8718
   samples avg       0.16      0.16      0.16      8718

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai text only

Part of speech

orchid perceptron

Model Details

Intended Use

  • Part-of-speech tagging for Thai.
  • Not suitable for other languages or for domains other than that of the Orchid corpus.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

Orchid Corpus

Evaluation Data

Orchid Corpus

Quantitative Analyses

No data

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai word tokens only

LST20 perceptron

Model Details

Intended Use

  • Part-of-speech tagging for Thai.
  • Not suitable for other languages or for domains other than that of the LST20 corpus.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

LST20 Corpus Train set

Evaluation Data

LST20 Corpus Test set

Quantitative Analyses

              precision    recall  f1-score   support

          AJ       0.90      0.87      0.88      4403
          AV       0.88      0.79      0.83      6722
          AX       0.95      0.94      0.95      7556
          CC       0.94      0.97      0.95     17613
          CL       0.87      0.85      0.86      3739
          FX       0.99      0.99      0.99      6918
          IJ       1.00      0.25      0.40         4
          NG       1.00      1.00      1.00      1694
          NN       0.97      0.98      0.98     58568
          NU       0.98      0.98      0.98      6256
          PA       0.88      0.89      0.88       194
          PR       0.88      0.85      0.86      2139
          PS       0.94      0.93      0.94     10886
          PU       1.00      1.00      1.00     37973
          VV       0.95      0.97      0.96     42586
          XX       0.00      0.00      0.00        27

    accuracy                           0.96    207278
   macro avg       0.88      0.83      0.84    207278
weighted avg       0.96      0.96      0.96    207278

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai word tokens only

UD_Thai-PUD Part-of-speech

v0.1

Model Details

Intended Use

  • Part-of-speech tagging for Thai.
  • Not suitable for other languages or for domains other than that of the UD_Thai-PUD corpus.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

None

Training Data

UD_Thai-PUD v2.2 https://github.com/UniversalDependencies/UD_Thai-PUD/releases/tag/r2.2

Evaluation Data

None

Quantitative Analyses

None

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai word tokens only

v0.2

Model Details

Intended Use

  • Part-of-speech tagging for Thai.
  • Not suitable for other languages or for domains other than that of the UD_Thai-PUD corpus.

Factors

  • Based on known problems with Thai natural language processing.

Metrics

None

Training Data

UD_Thai-PUD v2.8 https://github.com/UniversalDependencies/UD_Thai-PUD/releases/tag/r2.8

Evaluation Data

None

Quantitative Analyses

None

Ethical Considerations

None identified.

Caveats and Recommendations

  • Thai word tokens only

Thai W2P

Model Details

Intended Use

  • Converts a Thai word to Thai phonemes.
  • Not suitable for other languages.

Factors

  • Based on problems in converting Thai words to Thai phonemes.

Metrics

  • Evaluation metric: phoneme error rate (number of errors / number of phonemes).
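
The card defines phoneme error rate only as errors divided by phonemes. One common way to count errors is the Levenshtein (edit) distance between the predicted and reference phoneme sequences; the sketch below assumes that definition, which may differ from the exact computation used when training Thai W2P. The function names and the toy phoneme sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit errors divided by total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Toy phoneme sequences, for illustration only.
refs = [["k", "a", "n"], ["m", "aa"]]
hyps = [["k", "a", "m"], ["m", "aa"]]
per = phoneme_error_rate(refs, hyps)
print(per)  # 1 substitution over 5 reference phonemes
```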

Training Data

Thai W2P

Evaluation Data

Thai W2P

Quantitative Analyses

epoch: 100
step: 100, loss: 0.03179970383644104
step: 200, loss: 0.04126007482409477
step: 300, loss: 0.01877519115805626
step: 400, loss: 0.03311225399374962
per: 0.0432
per: 0.0419

Ethical Considerations

The Thai phoneme data is based on websites (Wiktionary, the Royal Institute, et cetera). It may not match the dialect that you use in everyday life.

Caveats and Recommendations

  • Input must be a single Thai word.

Chunk Parser

CRFChunk orchidpp

v0.2

Model Details

Intended Use

  • Parses a Thai sentence into phrase structure.
  • Not suitable for other languages or for domains other than that of the Orchid corpus.

Factors

  • Based on Thai chunk parsing problems.

Metrics

  • Evaluation metrics include precision, recall and f1-score.

Training Data

ORCHID++ (90%) from Thai Treebanks Dataset

Evaluation Data

ORCHID++ (10%) from Thai Treebanks Dataset

Quantitative Analyses

              precision    recall  f1-score   support

        B-NP       0.95      0.98      0.96       518
        I-NP       0.86      0.91      0.88      2128
           O       0.87      0.91      0.89       280
        B-PP       0.91      0.77      0.83        65
        I-PP       0.66      0.52      0.59       252
         B-S       0.65      0.49      0.56        90
         I-S       0.67      0.49      0.56      1082
        B-VP       0.86      0.89      0.88       515
        I-VP       0.90      0.94      0.92      4565

   micro avg       0.86      0.86      0.86      9495
   macro avg       0.81      0.77      0.79      9495
weighted avg       0.86      0.86      0.86      9495
 samples avg       0.86      0.86      0.86      9495


Ethical Considerations

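None identified.

Caveats and Recommendations

  • Input must be a single Thai sentence given as a list of (word, part-of-speech) pairs: [(word, part-of-speech)]. The part-of-speech tags should come from a model trained on the Orchid corpus.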
