bo-pos

This repo contains various resources for Tibetan part-of-speech (POS) tagging.

Resources

ACTib-one-tag.txt -- tokens from the ACTIBII annotated corpus with the most frequent tag for each token. The corpus was segmented and tagged with the TiMBL Memory-Based Tagger trained on the SOAS corpus. (1,463,920 types)

SOAS-lexicon.txt -- tokens from the SOAS corpus with multiple tags per token. (15,643 types)

SOAS-one-tag.txt -- tokens from the SOAS corpus with the most frequent tag for each token. (14,983 types)
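The "most frequent tag for each token" reduction used for the one-tag files can be sketched as follows. This is a minimal illustration, not the script that produced the files: the function name `most_frequent_tags`, the `(token, tag)` pair input, and the tag labels in the sample are all assumptions, since the on-disk format of the lexicon files is not described here.

```python
from collections import Counter, defaultdict


def most_frequent_tags(tagged_tokens):
    """Map each token to the tag it carries most often.

    `tagged_tokens` is an iterable of (token, tag) pairs (assumed format).
    Ties go to the tag that was seen first.
    """
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token][tag] += 1
    # most_common(1) returns a one-element list of (tag, count)
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


# Illustrative sample; the tag labels are placeholders, not the SOAS tagset.
sample = [("ལ", "case"), ("ལ", "noun"), ("ལ", "case"), ("དང", "conj")]
one_tag = most_frequent_tags(sample)
```

Here `one_tag` keeps a single entry per token, which is why a one-tag file lists fewer types than the corresponding multi-tag lexicon.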

MGD-dict.yml -- Monlam Grand Dictionary data normalized and re-organized into a structured .yml file. (107,064 types)

To Do - Prep

convert MGD POS to UD with attributes
- find POS attributes in MGD definitions
- extract tokens with fewer than 5 syllables
- extract POS + attributes when available
- convert MGD POS to UD
- give priority to tags from ACTib-one-tag when available
revise the mapping of SOAS to UD
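Two of the prep steps above, the syllable-length filter and the tag priority rule, can be sketched like this. It is only a sketch under stated assumptions: syllables are counted by splitting on the tsheg (U+0F0B, plus its non-breaking variant U+0F0C), tokens are assumed to carry no punctuation, and the helper names and plain-dict inputs (`actib_tags`, `mgd_tags`) are hypothetical rather than taken from the repo.

```python
import re

# Tsheg (U+0F0B) and its non-breaking variant (U+0F0C) separate Tibetan syllables.
SYLLABLE_SEP = re.compile("[\u0f0b\u0f0c]")


def syllable_count(token):
    """Count the non-empty pieces left after splitting on the tsheg."""
    return sum(1 for part in SYLLABLE_SEP.split(token) if part)


def keep_short(tokens, max_syllables=4):
    """Keep tokens with fewer than 5 syllables (at most `max_syllables`)."""
    return [t for t in tokens if syllable_count(t) <= max_syllables]


def pick_tag(token, actib_tags, mgd_tags):
    """Prefer the ACTib tag when the token has one, else fall back to MGD.

    Both arguments are assumed to be dicts mapping token -> tag;
    returns None when neither source knows the token.
    """
    if token in actib_tags:
        return actib_tags[token]
    return mgd_tags.get(token)
```

A trailing tsheg does not add a syllable, because empty pieces are ignored when counting.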

To Do - Segmentation

convert segmentation and vocabularies to XYZ
train an RDR model
convert model into pybo matcher syntax

To Do - Tagging

plug pybo into spaCy with the init file
train spaCy models
...

To Do - Suggestions

train a 3-gram model on MGD
plug it into pybo
plug bopho into pybo
...
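For orientation, a count-based 3-gram suggester could look roughly like the sketch below. It is not the planned implementation: whether the model runs over tokens or syllables, how MGD entries would be turned into sequences, and the names `train_trigrams` and `suggest` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_trigrams(sequences):
    """Count which unit follows each pair of preceding units.

    `sequences` is an iterable of lists of units (tokens or syllables).
    "<s>" and "</s>" are padding symbols chosen for this sketch.
    """
    model = defaultdict(Counter)
    for units in sequences:
        padded = ["<s>", "<s>"] + list(units) + ["</s>"]
        for first, second, third in zip(padded, padded[1:], padded[2:]):
            model[(first, second)][third] += 1
    return model


def suggest(model, prev2, prev1, n=3):
    """Return up to `n` most frequent continuations of (prev2, prev1)."""
    followers = model.get((prev2, prev1), Counter())
    return [unit for unit, _ in followers.most_common(n)]
```

Calling `suggest(model, "<s>", "<s>")` proposes sequence-initial units, and an unseen context yields an empty list, so a caller such as pybo would need its own fallback.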
