This project implements AdaBoost and Random Forest classifiers, as well as a Stacked Bidirectional RNN with GRU cells, to classify movie reviews as expressing positive or negative opinions. The dataset used is the Large Movie Review Dataset (IMDB Dataset).
- Represent text as binary feature vectors, where each feature corresponds to the presence (`1`) or absence (`0`) of a word in the review.
- Construct a vocabulary by removing the `n` most frequent and `k` rarest words and selecting the top `m` words with the highest information gain.
- Evaluate the classifier on a subset of the training data (development set) and on the test data.
- GPU acceleration is used to speed up training and prediction.
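The binary bag-of-words representation described above can be sketched as follows (a minimal illustration, not the repo's actual code; the `vectorize` helper and its signature are assumptions):

```python
import numpy as np

def vectorize(texts, vocab):
    """Binary bag-of-words: entry (i, j) is 1 iff vocab word j appears in review i.

    `vocab` is the ordered list of the m selected words.
    """
    index = {word: j for j, word in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)), dtype=np.int8)
    for i, text in enumerate(texts):
        # set() ignores repeats: only presence/absence matters
        for token in set(text.lower().split()):
            j = index.get(token)
            if j is not None:
                X[i, j] = 1
    return X

# Example: two tiny "reviews" against a 3-word vocabulary
X = vectorize(["good movie", "bad bad movie"], ["good", "bad", "movie"])
# → [[1, 0, 1], [0, 1, 1]]
```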
Clone this repository:
```shell
git clone https://github.com/Thanos-png/IMDB-ML-Classifiers.git
cd IMDB-ML-Classifiers
pip install -r requirements.txt
```
```shell
python -c "import torch; print(torch.cuda.is_available())"
```
If `True`, your GPU is ready. If `False`, check your CUDA installation; otherwise the CPU will be used automatically.
```shell
cd src/
python train_adaboost.py
```
- Load and preprocess the IMDB dataset.
- Construct a vocabulary by removing frequent/rare words and selecting words based on information gain.
- Convert reviews into binary feature vectors.
- Train an AdaBoost classifier with T=200 boosting iterations.
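The training step above can be sketched as a standard AdaBoost loop over one-word decision stumps (a simplified NumPy sketch under assumed names; the repo's implementation may differ, e.g. in how the best stump is searched):

```python
import numpy as np

def train_adaboost(X, y, T=200):
    """AdaBoost with single-feature decision stumps on binary features.

    X: (N, m) 0/1 matrix, y: labels in {-1, +1}.
    Returns a list of (feature index, polarity, alpha) weak learners.
    """
    n, m = X.shape
    w = np.full(n, 1.0 / n)          # uniform example weights
    learners = []
    for _ in range(T):
        # pick the stump (feature, polarity) with the lowest weighted error
        best = None
        for j in range(m):
            for polarity in (1, -1):
                pred = np.where(X[:, j] == 1, polarity, -polarity)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = max(err, 1e-12)                     # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # learner weight
        pred = np.where(X[:, j] == 1, polarity, -polarity)
        w *= np.exp(-alpha * y * pred)            # upweight mistakes
        w /= w.sum()
        learners.append((j, polarity, alpha))
    return learners

def predict(learners, X):
    """Weighted vote of all weak learners."""
    score = np.zeros(X.shape[0])
    for j, polarity, alpha in learners:
        score += alpha * np.where(X[:, j] == 1, polarity, -polarity)
    return np.sign(score)
```

This exhaustive stump search is O(T·m·N) and is meant to show the algorithm, not to be fast; a vectorized search over all features at once is the usual optimization.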
| Parameter | Value | Description |
|---|---|---|
| `T` | 200 | Number of AdaBoost iterations |
| `m` | 5000 | Vocabulary size |
| `n_most` | 50 | Most frequent words removed |
| `k_rarest` | 50 | Rarest words removed |
```
Loading training data...
Loaded 25000 training examples.
--- Training with T=200, m=5000, n_most=50, k_rarest=50 ---
Building vocabulary...
Vectorizing texts...
Training AdaBoost classifier...
(Iterations)
Development Accuracy: 82.60%
Training Sklearn AdaBoost classifier...
Sklearn AdaBoost Dev Accuracy: 80.90%
Custom Model and vocabulary saved to ../results/adaboost_model.pkl and ../results/vocab.pkl
Sklearn AdaBoost model saved to ../results/sklearn_adaboost.pkl
Running learning curve experiment (evaluating for positive class)...
(Table)
--- Best Hyperparameters ---
{'T': 200, 'm': 5000, 'n_most': 50, 'k_rarest': 50}
```
```shell
cd src/
python test_adaboost.py
```
- Load the trained AdaBoost model and vocabulary.
- Convert test reviews into binary feature vectors.
- Make predictions and compute test accuracy.
- Evaluate both the custom AdaBoost and Sklearn's AdaBoost models.
- Compute and display precision, recall, and F1-score for positive and negative sentiment classes.
- Report micro- and macro-averaged evaluation metrics for both models.
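The per-class and averaged metrics above follow the standard definitions, sketched here from scratch (illustrative helper names; the repo may compute these differently, e.g. via sklearn):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class treated as 'positive'."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Macro-averaging takes the plain mean of per-class metrics;
# micro-averaging pools TP/FP/FN over classes first (for single-label
# binary classification, micro precision = recall = F1 = accuracy).
```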
```
Loading test data...
Loaded 25000 test examples.
Test Accuracy: 82.37%
Loaded trained Sklearn AdaBoost model from ../results/sklearn_adaboost.pkl
Sklearn AdaBoost Test Accuracy: 80.77%
Custom AdaBoost Test Evaluation Metrics:
Category   Precision   Recall   F1
Positive   0.8098      0.8462   0.8276
Negative   0.8390      0.8012   0.8197
Sklearn AdaBoost Test Evaluation Metrics:
Category   Precision   Recall   F1
Positive   0.7875      0.8428   0.8142
Negative   0.8309      0.7726   0.8007
Custom AdaBoost Micro-averaged: Precision: 0.8237, Recall: 0.8237, F1: 0.8237
Custom AdaBoost Macro-averaged: Precision: 0.8244, Recall: 0.8237, F1: 0.8236
Sklearn AdaBoost Micro-averaged: Precision: 0.8077, Recall: 0.8077, F1: 0.8077
Sklearn AdaBoost Macro-averaged: Precision: 0.8092, Recall: 0.8077, F1: 0.8075
```
```shell
cd src/
python train_rnnmodel.py
```
- Load and preprocess the IMDB dataset.
- Construct a vocabulary and convert reviews into sequences of token indices.
- Use pre-trained word embeddings to initialize the embedding layer.
- Train a Stacked Bidirectional RNN with GRU cells using Adam optimizer.
- Monitor training and development loss across epochs.
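A training loop matching the steps above might look like this (a sketch, not the repo's exact code; `train_epochs` and its signature are illustrative, assuming `DataLoader` batches of token indices and integer labels):

```python
import torch
import torch.nn as nn

def train_epochs(model, train_loader, dev_loader, num_epochs=10, lr=2e-4):
    """Adam + cross-entropy training, reporting train/dev loss per epoch."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    for epoch in range(num_epochs):
        model.train()
        total = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)
        # monitor development loss with gradients disabled
        model.eval()
        with torch.no_grad():
            dev_total = sum(loss_fn(model(xb.to(device)), yb.to(device)).item() * len(xb)
                            for xb, yb in dev_loader)
        history.append((total / len(train_loader.dataset),
                        dev_total / len(dev_loader.dataset)))
    return history
```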
| Parameter | Value | Description |
|---|---|---|
| `embedding_dim` | 300 | Dimension of the pre-trained word embeddings |
| `hidden_dim` | 256 | Hidden dimension of the RNN |
| `num_layers` | 2 | Number of stacked RNN layers |
| `dropout` | 0.2 | Dropout probability |
| `num_epochs` | 10 | Number of epochs |
| `lr` | 0.0002 | Learning rate |
```
Loading training data...
Loaded 25000 training examples.
--- Training with embedding_dim=300, hidden_dim=256, num_layers=2, dropout=0.2, lr=0.0002, num_epochs=10 ---
(Iterations)
Dev Accuracy: 87.00%
RNN Model and vocabulary saved to ../results/rnn_model.pth and ../results/vocab.pkl
--- Best Hyperparameters ---
{'embedding_dim': 300, 'hidden_dim': 256, 'num_layers': 2, 'dropout': 0.2, 'lr': 0.0002, 'num_epochs': 10}
```
```shell
cd src/
python test_rnnmodel.py
```
- Load the saved RNN model and vocabulary.
- Preprocess the IMDB test dataset, converting reviews into sequences of token indices.
- Evaluate the model on test data using mini-batches to prevent memory issues.
- Compute and display test accuracy.
- Calculate precision, recall, and F1-score for both positive and negative sentiment classes.
- Report micro- and macro-averaged evaluation metrics.
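The mini-batch evaluation mentioned above can be sketched as follows (illustrative helper, assuming `X` is a tensor of padded token-index sequences and `y` holds integer labels):

```python
import torch

def evaluate_in_batches(model, X, y, batch_size=64, device="cpu"):
    """Test-set accuracy computed in mini-batches to bound memory use."""
    model.eval()
    correct = 0
    with torch.no_grad():                      # no gradients needed at test time
        for i in range(0, len(X), batch_size):
            xb = X[i:i + batch_size].to(device)
            preds = model(xb).argmax(dim=1).cpu()
            correct += (preds == y[i:i + batch_size]).sum().item()
    return correct / len(X)
```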
```
Loading test data...
Loaded 25000 test examples.
Test Accuracy: 86.91%
Evaluation Metrics on Test Data:
Category   Precision   Recall   F1
Positive   0.8686      0.8698   0.8692
Negative   0.8697      0.8684   0.8690
Micro-averaged: Precision: 0.8691, Recall: 0.8691, F1: 0.8691
Macro-averaged: Precision: 0.8691, Recall: 0.8691, F1: 0.8691
```
- Tokenization: Splitting text into words.
- Vocabulary Selection:
  - Remove the `n` most frequent and `k` rarest words.
  - Select the `m` words with the highest information gain.
- Vectorization:
  - Convert reviews into binary feature vectors (`1` = word present, `0` = word absent).
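The information-gain criterion used for vocabulary selection is IG(Y; X_j) = H(Y) − H(Y | X_j), computable per word from the binary feature matrix. A NumPy sketch (illustrative, not the repo's code):

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits; p is P(class = 1)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)          # guard against log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(X, y):
    """IG(y; x_j) for each binary feature column of X (labels y in {0, 1})."""
    h_y = entropy(y.mean())
    p_x1 = X.mean(axis=0)                      # P(word present)
    # P(class = 1 | word present) and P(class = 1 | word absent)
    p_y_x1 = (X * y[:, None]).sum(axis=0) / np.maximum(X.sum(axis=0), 1)
    p_y_x0 = ((1 - X) * y[:, None]).sum(axis=0) / np.maximum((1 - X).sum(axis=0), 1)
    conditional = p_x1 * entropy(p_y_x1) + (1 - p_x1) * entropy(p_y_x0)
    return h_y - conditional
```

Selecting the vocabulary is then `np.argsort(information_gain(X, y))[-m:]` over the candidate words. A word that perfectly predicts the label has IG equal to H(Y) (1 bit for balanced classes); an uninformative word has IG near 0.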
- Uses decision stumps as weak learners.
- Each weak learner classifies reviews based on the presence/absence of a single word.
- Weighted voting of multiple weak learners improves accuracy.
- Uses a Stacked Bidirectional GRU network for sequential text classification.
- Each review is represented as a sequence of token indices, processed using pre-trained word embeddings.
- The GRU layers capture contextual dependencies in both forward and backward directions.
- A global max pooling layer extracts the most important features from the hidden states.
- The final classification is performed using a fully connected layer with softmax activation.
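The architecture described above can be sketched in PyTorch roughly as follows (a minimal sketch with illustrative names; the repo's model may differ in details, e.g. this version returns raw logits and leaves the softmax to the loss function):

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Stacked bidirectional GRU + global max pooling + linear classifier."""

    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256,
                 num_layers=2, dropout=0.2, num_classes=2):
        super().__init__()
        # in the repo, this layer is initialized from pre-trained embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.gru = nn.GRU(embedding_dim, hidden_dim, num_layers=num_layers,
                          bidirectional=True, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)  # 2x for both directions

    def forward(self, x):                       # x: (batch, seq_len) token indices
        out, _ = self.gru(self.embedding(x))    # (batch, seq_len, 2 * hidden_dim)
        pooled, _ = out.max(dim=1)              # global max pooling over time
        return self.fc(pooled)                  # logits; softmax applied in the loss
```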
| Size | Train Prec | Train Rec | Train F1 | Dev Prec | Dev Rec | Dev F1 |
|---|---|---|---|---|---|---|
| 2000 | 0.8365 | 0.8797 | 0.8576 | 0.7699 | 0.8423 | 0.8045 |
| 4000 | 0.8239 | 0.8600 | 0.8416 | 0.7815 | 0.8483 | 0.8135 |
| 8000 | 0.8236 | 0.8395 | 0.8315 | 0.8048 | 0.8302 | 0.8173 |
| 12000 | 0.8273 | 0.8429 | 0.8350 | 0.8048 | 0.8318 | 0.8181 |
| 16000 | 0.8251 | 0.8437 | 0.8343 | 0.8050 | 0.8294 | 0.8170 |
| 20000 | 0.8176 | 0.8528 | 0.8349 | 0.7992 | 0.8491 | 0.8234 |
| Size | Custom Train F1 | Custom Dev F1 | Sklearn Train F1 | Sklearn Dev F1 |
|---|---|---|---|---|
| 2000 | 0.8564 | 0.8067 | 0.8483 | 0.7970 |
| 4000 | 0.8515 | 0.8137 | 0.8331 | 0.8068 |
| 8000 | 0.8378 | 0.8160 | 0.8272 | 0.8092 |
| 12000 | 0.8334 | 0.8191 | 0.8205 | 0.8081 |
| 16000 | 0.8320 | 0.8189 | 0.8196 | 0.8115 |
| 20000 | 0.8327 | 0.8272 | 0.8190 | 0.8132 |
- IMDB Dataset: Stanford AI Lab