This project investigates how different text preprocessing techniques affect fine-tuning a RoBERTa model for news category classification. We run an experiment comparing two preprocessing approaches and analyze their effectiveness.
- Original Dataset: News Category Dataset
- Sub-dataset Creation: 18,981 items were randomly selected from the original dataset in a stratified manner to preserve the proportional representation of the news categories (see the sampling sketch after this list).
- Preprocessing Techniques:
  - Dataset 1: Removal of numbers and punctuation, with stop words retained.
  - Dataset 2: Removal of numbers and punctuation, plus removal of stop words (both cleaning pipelines are sketched after this list).
- Models:
  - Two RoBERTa models were fine-tuned, one on each of the preprocessed datasets.
  - A final model was then trained on the full dataset using the preprocessing technique that performed better in the comparison.
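The stratified sub-sampling step can be reproduced with pandas. Below is a minimal sketch, assuming the JSON-lines release of the News Category Dataset with a `category` column; the file name and random seed are illustrative, not taken from the notebooks:

```python
import pandas as pd

# Load the News Category Dataset (distributed as JSON lines).
df = pd.read_json("News_Category_Dataset_v3.json", lines=True)

# Sample the same fraction from every category so the sub-dataset
# preserves the original category proportions (~18,981 rows in total;
# per-category rounding makes the exact count vary slightly).
fraction = 18_981 / len(df)
subset = df.groupby("category").sample(frac=fraction, random_state=42)

print(len(subset))
print(subset["category"].value_counts(normalize=True).head())
```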
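A minimal sketch of the two cleaning pipelines, assuming NLTK's English stop word list; the regular expressions are illustrative, not taken from the notebooks:

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_keep_stopwords(text: str) -> str:
    """Dataset 1: strip numbers and punctuation, keep stop words."""
    text = re.sub(r"\d+", " ", text.lower())   # drop numbers
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def clean_remove_stopwords(text: str) -> str:
    """Dataset 2: same cleaning, then drop stop words as well."""
    tokens = clean_keep_stopwords(text).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_keep_stopwords("The 3 stocks fell 5% on Monday!"))
# -> "the stocks fell on monday"
print(clean_remove_stopwords("The 3 stocks fell 5% on Monday!"))
# -> "stocks fell monday"
```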
experiments/
- Contains 4 Jupyter notebooks detailing the experimental process:
  - data-preprocessing-nostopwords-vs-stopwords.ipynb: Creation of the two experimental datasets, one with and one without stop words.
  - roberta-fine-tuning-stopwords.ipynb: Fine-tuning of the RoBERTa model on the dataset with stop words (the shared training setup is sketched after this list).
  - roberta-fine-tuning-nostopwords.ipynb: Fine-tuning of the RoBERTa model on the dataset without stop words.
  - dataset-preprocessing.ipynb: Processes the entire dataset based on the insights gained from the fine-tuning experiments.
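Both fine-tuning notebooks follow the standard Hugging Face recipe. A minimal sketch, assuming a `subset` DataFrame with `text` and integer `label` columns; the checkpoint, sequence length, and hyperparameters here are illustrative, not taken from the notebooks:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=subset["label"].nunique())

def tokenize(batch):
    # News headlines are short, so a modest maximum length is enough.
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

ds = Dataset.from_pandas(subset[["text", "label"]].reset_index(drop=True))
ds = ds.map(tokenize, batched=True).train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-news",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
print(trainer.evaluate())
```

The same setup is run twice, once per preprocessed dataset; only the input text changes between the two experiments.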
news-classification-big-dataset.ipynb
- Trains and evaluates the final model on the full dataset using the best preprocessing technique identified in the experiments.

README.md
- Compares the performance of the two fine-tuned models.
- Discusses which preprocessing technique proved more effective.
- Explains the rationale behind the final model training.
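For the comparison itself, standard classification metrics are sufficient. A minimal sketch using scikit-learn; the prediction arrays below are illustrative placeholders standing in for each model's predictions on a shared test set:

```python
from sklearn.metrics import accuracy_score, f1_score

def report(name, y_true, y_pred):
    # Macro F1 weights every news category equally, which is useful
    # because the category distribution is far from uniform.
    print(f"{name}: accuracy={accuracy_score(y_true, y_pred):.4f}  "
          f"macro-F1={f1_score(y_true, y_pred, average='macro'):.4f}")

# Placeholder predictions; in practice use trainer.predict() outputs.
y_true = [0, 1, 2, 1, 0]
preds_with_stopwords = [0, 1, 1, 1, 0]
preds_without_stopwords = [0, 1, 2, 0, 0]

report("with stop words", y_true, preds_with_stopwords)
report("without stop words", y_true, preds_without_stopwords)
```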