-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathworkflow-tech-specs.txt
146 lines (128 loc) · 6.51 KB
/
workflow-tech-specs.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
1 Note reinsertion
a. Reinsert notes [ 1-a-reinsert_notes ]
1. pre-process for processing errors
- Pardrang column
* find all the csv files containing a pardrang column before the note column
(csv_contains_pardrang_col.py)
* rectify the བསྡུར་མཆན་ file
(in nalanda-corpus repo using docx_txt.sh to regenerate the csv file)
- replace all ; by , for the csv to be well-formed
* the " tr ';' ',' " command in docx_txt.sh
- debugging བསྡུར་མཆན་ and/or ཆོས་ཚན་ files
use debug_files() function to :
* find and modify the discrepancies in note number and numbers in ཆོས་ཚན་ file with the enters if the reinsertion stops for a given file.
* reformat the notes in the བསྡུར་མཆན་ file if the execution stops because the note is not well formatted
* show the notes where a note has been included in the note text
Manually separate the comments from the note and put them in the column where no other edition can have a note. if there are 3 editions not counting the basis edition(Derge), column 1 contains the edition name, column 2 contains the note. So the comments should be put after the 6th cell, so on K or L.
* create a backup of the both བསྡུར་མཆན་ and ཆོས་ཚན་ files before:
- deleting the lines that are too long to be correctly processed
- deleting both files in case the formatting is too bad
2. reinsert notes
comment debug_files() and execute the last lines of reinsertion.py to generate the
b. manually correct the reinsertion [ 1-b-manually_corrected_conc ]
1. Rabten corrects the reinsertion [ 1-b-manually_corrected_conc/corrected ]
(the remaining 3% of the notes that can’t be correctly reinserted by my script)
2. reinsert the removed notes in 1.a.1 [ 1-b-manually_corrected_conc/notes_restored ]
- if they are long, keep them for after
- if they are short, reinsert them
3. Reintegrate discrepancies found by Rabten
- correct the xlsx and docx with the errors found (in nalanda_corpus repo) + regenerate the csv and txt files
- ???reprocess the files if the reinsertions could not be corrected???
2 Note selection
0 [criteria_points] : a folder with a yaml for each text that will receive all the points
create a template
- Automatic categorise (full note)
- ཀ་ mistake raised :
- ༡ missing vowel
- ༢ nga/da
- ༣ sskrt
- ༤ non-word
- ill-formed
- well-formed
- ཁ་ particle difference
- ༡ min mod
- ༢ differing part
- ག་ verb difference
- ༡ diff tense
- ༢ diff verb
- ང་ All other notes
- free string
- Kangyur Ngram frequency (full note)
- frequency of note words
- frequency of note+context
- note profile : (note ID)
- (depends for each text)
- Manual categorise (full note)
- more than 2 differing syls
- <F> diff formulation
- <M> major mod
- evaluation
- y (Derge is obviously correct)
- a (great meaning diff)
- b (medium meaning diff)
- c (small meaning diff)
- Non-standard new notes: (note ID)
(<ID>, <string>)
format of each note:
(<ID>, <string>, <category correction>)
ID: a number from 1 to N
category correction: the number of the right category
mistake = {missing_vowel: [], nga_da: [], sskrt: [], non_word: [], other: []}
1 categorise():
0. prepare(ID + full note):
def find_words()
for each edition:
return (ID, (left words, words at note, right words))
for each note:
for each edition:
words = find_words()
return (ID, words)
note = prepare()
1. raised_mistake(note)
0. find_parts()
try to cut in two each syllable
return parts / ill-formed syl
if not ill-formed:
1. if word is a mistake:
1. missing_vowel()
for each vowel:
if part1 + vowel + part 2 makes a word:
return True
2. ga_da()
if nga in second part:
if part1 + replaced part2 makes a word:
return True
2. if the difference is a particle:
1. if is_min_mod()
- it is in the min-mod groups
- it is in the particle groups
- it is a particle agreement difference
2. else:
it is a differing particle
3. if the note word is a verb:
use the list of verbs from Monlam
0. extract the different verbs from the editions
1. if is_diff_tense()
return true
2. if is_diff_verb()
return true
4. if none of the above, it is a wellformed non-word
else:
either sskrt or non-word
4. if is_sskrt()
if a regex of Hackett detects is as sskrt:
return True
else:
5. it is a non-word
for notes in the dunno category:
if differing number of note syllables:
6. if 2 or less:
it is most likely a addition or deletion
7. if more than 2:
keep for manual selection (big-mod or formulation)
else:
8. keep all other notes for manual selection
2 Ngram frequency
use the raw ngrams from 1 to 12
from bigram onwards, the ngrams with lower frequency than 40 were removed from the results
3 Note profiling