https://github.com/an0n1mity/spamclassifiereval

A repository for evaluating the misclassification rate of spam classification models using a threshold-based approach.
https://github.com/an0n1mity/spamclassifiereval

data-analysis machine-learning natural-language-processing python-programming spam-classification text-classification

Last synced: 3 months ago
JSON representation

A repository for evaluating the misclassification rate of spam classification models using a threshold-based approach.

Host: GitHub
URL: https://github.com/an0n1mity/spamclassifiereval
Owner: An0n1mity
Created: 2023-05-29T18:23:07.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2023-06-12T21:36:33.000Z (almost 2 years ago)
Last Synced: 2024-12-26T18:44:36.218Z (5 months ago)
Topics: data-analysis, machine-learning, natural-language-processing, python-programming, spam-classification, text-classification
Language: Python
Homepage:
Size: 4.86 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.MD

Awesome Lists containing this project

README

        # Email Spam Classification

This repository contains code for training a spam classification model using the Naive Bayes algorithm. It also includes functions for evaluating the model's performance and visualizing the spamicity of a given file.

An explanation of the algorithm is given on my [github page](https://an0n1mity.github.io/posts/spam_classifier/).

## Prerequisites

- Python 3.x

- NLTK library

- Matplotlib library

- NumPy library

## Installation

1. Clone the repository: `git clone https://github.com/your-username/your-repository.git`

2. Install the required dependencies: `pip install nltk matplotlib numpy`

3. Install nltk stop words: `import nltk nltk.download('stopwords')`

## Usage

1. Import the necessary modules:

```python

import os

from nltk.stem import WordNetLemmatizer

from nltk.corpus import stopwords

from matplotlib import pyplot as plt

import numpy as np

import re

```

2. Train the spam classification model by calling the `train_model` function:

```python

train_model(training_percent=0.8, SPAM_FOLDER='HAMS', HAM_FOLDER='SPAMS')

```

This function will randomly select a percentage of files from the provided spam and ham folders for training the model. It will store the training and testing file lists in separate text files.

3. Classify a file's spamicity using the `get_file_spamicity` function:

```python

spamicity = get_file_spamicity(filename, n=8, plot=False)

```

This function calculates the spamicity of a given file by comparing the words in the file to the trained word count dictionary. It returns the calculated spamicity value.

![alt text](https://github.com/An0n1mity/SpamClassifierEval/blob/master/get_file_spamicity_plot.png)

4. Test misclassification for a given `n` using the `test_misclassification` function:

```python

test_misclassification(testing_files_spams, testing_files_hams, n=(8, 16, 32), threshold=0.6, unseen_spamicity=0.4, plot=False, verbose=False)

```

This function tests the misclassification rate of the spam classification model on the provided testing files. It compares the calculated spamicity of each file to a threshold value and counts the false positives and true negatives. It accepts an optional `n` parameter to specify the number of words used for classification. 

![alt text](https://github.com/An0n1mity/SpamClassifierEval/blob/master/test_misclassification_plot.png)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/an0n1mity/spamclassifiereval

Awesome Lists containing this project

README