This repository contains code for training a spam classification model using the Naive Bayes algorithm. It also includes functions for evaluating the model's performance and visualizing the spamicity of a given file. An explanation of the algorithm is available on my GitHub page.
- Python 3.x
- NLTK library
- Matplotlib library
- NumPy library
- Clone the repository:
git clone https://github.com/your-username/your-repository.git
- Install the required dependencies:
pip install nltk matplotlib numpy
- Install nltk stop words:
import nltk
nltk.download('stopwords')
- Import the necessary modules:
import os
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import numpy as np
import re
- Train the spam classification model by calling the train_model function:
train_model(training_percent=0.8, SPAM_FOLDER='SPAMS', HAM_FOLDER='HAMS')
This function randomly selects a fraction of the files (training_percent, here 0.8) from the given spam and ham folders to train the model; the remaining files are held out for testing. The training and testing file lists are stored in separate text files.
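The random split described above can be sketched as follows. This is only an illustration, not the repository's implementation: split_files, save_file_list, and the seed parameter are hypothetical names introduced here, and the one-name-per-line list format is an assumption.

```python
import random
from pathlib import Path


def split_files(folder, training_percent=0.8, seed=None):
    """Randomly split the files in `folder` into (training, testing) name lists.

    Hypothetical helper: the real train_model may select files differently.
    """
    # Sort first so that a fixed seed always produces the same split.
    files = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    rng = random.Random(seed)
    rng.shuffle(files)
    cut = int(len(files) * training_percent)
    return files[:cut], files[cut:]


def save_file_list(names, list_path):
    """Write one file name per line (assumed format) so the split can be reused."""
    Path(list_path).write_text("\n".join(names), encoding="utf-8")
```

Passing a fixed seed makes the split reproducible, which helps when comparing misclassification rates across runs.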
- Classify a file's spamicity using the get_file_spamicity function:
spamicity = get_file_spamicity(filename, n=8, plot=False)
This function calculates the spamicity of a given file by comparing the words in the file to the trained word count dictionary. It returns the calculated spamicity value.
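One common way to turn word counts into a single spamicity value is sketched below. It is a sketch under stated assumptions rather than the code in this repository: word_spamicity and file_spamicity are hypothetical names, the counts are assumed to be "number of training files containing the word" per class, the 0.01/0.99 clamp is an illustrative choice, and the unseen_spamicity default is borrowed from the test_misclassification signature.

```python
import math


def word_spamicity(word, spam_counts, ham_counts, n_spam, n_ham,
                   unseen_spamicity=0.4):
    """Estimate P(spam | word) from per-class file counts (assumed layout)."""
    s = spam_counts.get(word, 0)
    h = ham_counts.get(word, 0)
    if s + h == 0:
        # Word never seen in training: fall back to a fixed prior.
        return unseen_spamicity
    p_word_given_spam = s / n_spam
    p_word_given_ham = h / n_ham
    p = p_word_given_spam / (p_word_given_spam + p_word_given_ham)
    # Clamp so a single word can never force a probability of exactly 0 or 1.
    return min(max(p, 0.01), 0.99)


def file_spamicity(words, spam_counts, ham_counts, n_spam, n_ham, n=8,
                   unseen_spamicity=0.4):
    """Combine the n most decisive word scores with the naive Bayes rule."""
    scores = [
        word_spamicity(w, spam_counts, ham_counts, n_spam, n_ham,
                       unseen_spamicity)
        for w in set(words)
    ]
    # "Most decisive" here means farthest from the neutral value 0.5.
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = scores[:n]
    # Work in log space to avoid underflow when multiplying many probabilities.
    log_spam = sum(math.log(p) for p in top)
    log_ham = sum(math.log(1.0 - p) for p in top)
    # Cap the exponent so very large n cannot overflow math.exp.
    return 1.0 / (1.0 + math.exp(min(log_ham - log_spam, 700.0)))
```

With no usable words the function returns 0.5, meaning the file carries no evidence either way.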
- Test misclassification for a given n using the test_misclassification function:
test_misclassification(testing_files_spams, testing_files_hams, n=(8, 16, 32), threshold=0.6, unseen_spamicity=0.4, plot=False, verbose=False)
This function measures the misclassification rate of the spam classification model on the provided testing files. It compares the calculated spamicity of each file to a threshold value and counts the false positives and false negatives. It accepts an optional n parameter to specify the number of words used for classification.
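The threshold comparison can be illustrated with a small sketch. The helper name misclassification_counts is hypothetical, it takes precomputed spamicity scores rather than file lists, and treating a score equal to the threshold as spam is an assumption rather than documented behavior.

```python
def misclassification_counts(spam_scores, ham_scores, threshold=0.6):
    """Return (false_positives, false_negatives, error_rate).

    spam_scores: spamicity values of files known to be spam.
    ham_scores:  spamicity values of files known to be ham.
    A file is labeled spam when its score is >= threshold (assumed rule).
    """
    # Spam that scored below the threshold was missed by the classifier.
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    # Ham that reached the threshold was wrongly flagged as spam.
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    total = len(spam_scores) + len(ham_scores)
    error_rate = (false_positives + false_negatives) / total if total else 0.0
    return false_positives, false_negatives, error_rate
```

For example, scoring the test files once per value of n and passing each pair of score lists to this helper would show how the error rate changes with the number of words considered.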