Skip to content

Latest commit

 

History

History
48 lines (39 loc) · 2.3 KB

README.MD

File metadata and controls

48 lines (39 loc) · 2.3 KB

Email Spam Classification

This repository contains code for training a spam classification model using the Naive Bayes algorithm. It also includes functions for evaluating the model's performance and visualizing the spamicity of a given file. An explanation of the algorithm is given on my github page.

Prerequisites

  • Python 3.x
  • NLTK library
  • Matplotlib library
  • NumPy library

Installation

  1. Clone the repository: git clone https://github.com/your-username/your-repository.git
  2. Install the required dependencies: pip install nltk matplotlib numpy
  3. Install nltk stop words: import nltk nltk.download('stopwords')

Usage

  1. Import the necessary modules:
import os
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import numpy as np
import re
  1. Train the spam classification model by calling the train_model function:
train_model(training_percent=0.8, SPAM_FOLDER='HAMS', HAM_FOLDER='SPAMS')

This function will randomly select a percentage of files from the provided spam and ham folders for training the model. It will store the training and testing file lists in separate text files.

  1. Classify a file's spamicity using the get_file_spamicity function:
spamicity = get_file_spamicity(filename, n=8, plot=False)

This function calculates the spamicity of a given file by comparing the words in the file to the trained word count dictionary. It returns the calculated spamicity value.

alt text

  1. Test misclassification for a given n using the test_misclassification function:
test_misclassification(testing_files_spams, testing_files_hams, n=(8, 16, 32), threshold=0.6, unseen_spamicity=0.4, plot=False, verbose=False)

This function tests the misclassification rate of the spam classification model on the provided testing files. It compares the calculated spamicity of each file to a threshold value and counts the false positives and true negatives. It accepts an optional n parameter to specify the number of words used for classification.

alt text