Skip to content

Latest commit

 

History

History
25 lines (18 loc) · 761 Bytes

File metadata and controls

25 lines (18 loc) · 761 Bytes

Process arabic corpora for pronouncing dictionary

Pre-usage

Tashkeela

  • Add corpus in a folder named "tashkeela-raw" in the root folder

Nawar's Halabi Corpus

  • Add corpus in a folder named "tashkeela-raw" in the root folder

Kasct Corpus

  • Add corpus in a folder named "kasct-raw" in the root folder
  • Open both files and in VS Code
    • Reopen with encoding "Arabic Windows(1256)"
    • Save with encoding "utf-8"

Usage

  • python .\process_corpus.py --corpus nawar
  • python .\process_corpus.py --corpus kasct
  • python .\process_corpus.py --corpus tashkeela
  • python .\concat_corpora.py

Post Usage

  • Use ara pronounciaiton tool to produce the dict file using python corpus2cmudict.py -i output_with_last_haraka.txt -p haraka