This project is the first homework assignment for CS 549. It uses the Reuters dataset to explore several text preprocessing techniques and to build unigram, bigram, and trigram models.
The project involves the following key steps:
- Utilizing the Reuters dataset
- Applying a variety of preprocessing methods
- Building unigram, bigram, and trigram models
- Evaluating model performance using a test set
- Employing metrics such as recall, precision, and F1 score for assessment
The assignment is an introduction to text preprocessing and n-gram modeling, with a focus on practical implementation and evaluation on real-world data.
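As a rough illustration of the n-gram step, the sketch below counts unigrams, bigrams, and trigrams over a couple of toy sentences. It is not taken from main.py: the helper names (`tokenize`, `ngrams`, `count_ngrams`), the lowercase-and-strip-punctuation preprocessing, and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical preprocessing: lowercase and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, n):
    """Count n-grams across an iterable of raw text documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(tokenize(doc), n))
    return counts


# Toy input standing in for Reuters articles (not real dataset content).
docs = ["Oil prices rose sharply.", "Oil prices fell."]
unigrams = count_ngrams(docs, 1)
bigrams = count_ngrams(docs, 2)
trigrams = count_ngrams(docs, 3)
```

Storing each n-gram as a tuple keeps the three models in the same `Counter` shape, so the same code path can serve all values of n.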
This project was developed with Python 3.8. Install the dependencies with:
pip install -r requirements.txt
Before running the code, change the dataset path in main.py. The default is path = 'reuters21578'. Then run:
python main.py
main.py prints the metrics to the terminal.
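For readers unfamiliar with the reported metrics, here is a minimal sketch of how precision, recall, and F1 can be computed for a single class. How main.py actually computes them is not shown here, so the function name `precision_recall_f1`, the per-class setup, and the example labels are assumptions for illustration only.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical gold and predicted labels, not actual program output.
y_true = ["earn", "acq", "earn", "earn"]
y_pred = ["earn", "earn", "acq", "earn"]
p, r, f = precision_recall_f1(y_true, y_pred, "earn")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```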