Resources connected to Tibetan part of speech

Esukhia/bo-pos

bo-pos

This repo contains resources for Tibetan part-of-speech (POS) tagging.

Resources

ACTib-one-tag.txt -- tokens from the ACTIBII annotated corpus, each with its most frequent tag. The corpus was segmented and tagged with the TiMBL Memory Based Tagger trained on the SOAS corpus. (1,463,920 types)

SOAS-lexicon.txt -- tokens from the SOAS corpus with multiple tags per token. (15,643 types)

SOAS-one-tag.txt -- tokens from the SOAS corpus, each with its most frequent tag. (14,983 types)
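
The "most frequent tag for each token" reduction used for the one-tag files can be sketched as follows. This is only an illustration: the input is assumed to be a stream of (token, tag) pairs, the function name `most_frequent_tags` is hypothetical, and the tags in the usage example are placeholders rather than the actual SOAS or ACTib tag sets. The on-disk format of the .txt files is not described here, so no parsing is shown.

```python
from collections import Counter, defaultdict


def most_frequent_tags(tagged_tokens):
    """Map each token to the tag it carries most often.

    `tagged_tokens` is an iterable of (token, tag) pairs (an assumed
    input shape, not the documented file format).
    """
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token][tag] += 1
    # most_common(1) returns a one-element list of (tag, count)
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


if __name__ == "__main__":
    # Placeholder tags, for illustration only
    pairs = [("ལ", "ADP"), ("ལ", "NOUN"), ("ལ", "ADP"), ("མི", "NOUN")]
    print(most_frequent_tags(pairs))
```

Keeping the full per-token counters around (rather than only the winner) would also give the multi-tag view that SOAS-lexicon.txt provides.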

MGD-dict.yml -- Monlam Grand Dictionary data normalized and re-organized into a structured .yml file. (107,064 types)

To Do - Prep

  • convert MGD POS to UD with attributes
    • find POS attributes in MGD definitions
    • extract tokens with fewer than 5 syllables
    • extract POS + attributes when available
    • convert MGD POS to UD
    • give priority to tags from ACTib-one-tag when available
  • revise the mapping of SOAS to UD
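
Two of the prep steps above (keeping tokens with fewer than 5 syllables, and preferring ACTib-one-tag tags when available) could look roughly like this. It is a sketch under stated assumptions: syllables are counted by splitting on the tsheg (U+0F0B) only, other delimiters are ignored, the lexicons are assumed to be plain token-to-tag dicts, and all function names are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def count_syllables(token):
    """Count tsheg-delimited syllables; empty pieces (e.g. from a trailing tsheg) are ignored."""
    return len([syl for syl in token.split(TSHEG) if syl])


def short_tokens(tokens, max_syllables=4):
    """Keep tokens with at most `max_syllables` syllables (fewer than 5 by default)."""
    return [t for t in tokens if count_syllables(t) <= max_syllables]


def merge_with_priority(mgd_tags, actib_tags):
    """Merge two token -> tag dicts; ACTib tags win when a token is in both."""
    merged = dict(mgd_tags)      # start from the MGD-derived tags
    merged.update(actib_tags)    # overwrite with ACTib tags where available
    return merged


if __name__ == "__main__":
    five = TSHEG.join(["ཀ", "ཁ", "ག", "ང", "ཅ"])   # 5 syllables, filtered out
    two = TSHEG.join(["བཀྲ", "ཤིས"]) + TSHEG       # 2 syllables, trailing tsheg
    print(short_tokens([five, two]))
    print(merge_with_priority({"x": "NOUN", "y": "VERB"}, {"x": "ADJ"}))
```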

To Do - Segmentation

  • convert segmentation and vocabularies to XYZ
  • train an RDR model
  • convert model into pybo matcher syntax

To Do - Tagging

  • plug pybo into spaCy with the init file
  • train spacy models
  • ...

To Do - Suggestions

  • train a 3-gram model on MGD
  • plug it into pybo
  • plug bopho into pybo
  • ...
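
As a rough idea of what the 3-gram suggestion model could involve, here is a minimal count-based sketch. It assumes the training data is already available as sequences of units (syllables or tokens; the to-do list does not say which), uses raw counts with no smoothing, and the names `train_trigrams` and `suggest` are hypothetical. It does not use pybo or any MGD-specific loader.

```python
from collections import Counter, defaultdict


def train_trigrams(sequences):
    """Count, for every pair of consecutive units, which unit follows it."""
    model = defaultdict(Counter)
    for seq in sequences:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            model[(a, b)][c] += 1
    return model


def suggest(model, a, b, k=3):
    """Return up to k most frequent continuations of the context (a, b)."""
    followers = model.get((a, b), Counter())  # unseen context -> no suggestions
    return [unit for unit, _ in followers.most_common(k)]


if __name__ == "__main__":
    # Toy sequences, for illustration only
    data = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
    model = train_trigrams(data)
    print(suggest(model, "a", "b"))
```

A real suggestion feature would likely need smoothing or backoff to shorter contexts for unseen pairs; that is left out here.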
