Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/an0n1mity/spamclassifiereval
A repository for evaluating the misclassification rate of spam classification models using a threshold-based approach.
data-analysis machine-learning natural-language-processing python-programming spam-classification text-classification
Last synced: 20 days ago
- Host: GitHub
- URL: https://github.com/an0n1mity/spamclassifiereval
- Owner: An0n1mity
- Created: 2023-05-29T18:23:07.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-06-12T21:36:33.000Z (over 1 year ago)
- Last Synced: 2024-11-07T00:49:50.425Z (2 months ago)
- Topics: data-analysis, machine-learning, natural-language-processing, python-programming, spam-classification, text-classification
- Language: Python
- Homepage:
- Size: 4.86 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.MD
README
# Email Spam Classification
This repository contains code for training a spam classification model using the Naive Bayes algorithm. It also includes functions for evaluating the model's performance and visualizing the spamicity of a given file.
An explanation of the algorithm is available on my [GitHub page](https://an0n1mity.github.io/posts/spam_classifier/).

## Prerequisites
- Python 3.x
- NLTK library
- Matplotlib library
- NumPy library

## Installation
1. Clone the repository: `git clone https://github.com/An0n1mity/SpamClassifierEval.git`
2. Install the required dependencies: `pip install nltk matplotlib numpy`
3. Install the NLTK stop words by running the following in a Python session: `import nltk; nltk.download('stopwords')`

## Usage
1. Import the necessary modules:
```python
import os
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import numpy as np
import re
```

2. Train the spam classification model by calling the `train_model` function:
```python
train_model(training_percent=0.8, SPAM_FOLDER='SPAMS', HAM_FOLDER='HAMS')
```
This function randomly selects the given fraction of files (`training_percent`) from the spam and ham folders to train the model, and stores the training and testing file lists in separate text files.

3. Classify a file's spamicity using the `get_file_spamicity` function:
```python
spamicity = get_file_spamicity(filename, n=8, plot=False)
```
This function calculates the spamicity of a given file by comparing the words in the file against the trained word-count dictionary, and returns the resulting value.

![alt text](https://github.com/An0n1mity/SpamClassifierEval/blob/master/get_file_spamicity_plot.png)
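To make the idea of spamicity more concrete, here is a minimal, self-contained sketch. It is not the code from this repository: the helper names (`tokenize`, `build_counts`, `word_spamicity`, `file_spamicity`), the clamping to `[0.01, 0.99]`, and the choice to combine only the `n` words whose spamicity is furthest from 0.5 are all assumptions made for illustration. Unseen words fall back to a default value, mirroring the `unseen_spamicity` argument shown elsewhere in this README.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_counts(documents):
    """Count in how many documents each word appears (hypothetical helper)."""
    counts = Counter()
    for doc in documents:
        counts.update(set(tokenize(doc)))
    return counts


def word_spamicity(word, spam_counts, ham_counts, n_spam, n_ham,
                   unseen_spamicity=0.4):
    """Estimate P(spam | word) from per-class document frequencies."""
    s = spam_counts.get(word, 0)
    h = ham_counts.get(word, 0)
    if s + h == 0:
        return unseen_spamicity  # word never seen during training
    p_spam = s / n_spam
    p_ham = h / n_ham
    p = p_spam / (p_spam + p_ham)
    # Clamp so that a single word can never force the result to exactly 0 or 1.
    return min(max(p, 0.01), 0.99)


def file_spamicity(text, spam_counts, ham_counts, n_spam, n_ham,
                   n=8, unseen_spamicity=0.4):
    """Combine the n most informative word spamicities into one score."""
    probs = [
        word_spamicity(w, spam_counts, ham_counts, n_spam, n_ham,
                       unseen_spamicity)
        for w in set(tokenize(text))
    ]
    if not probs:
        return unseen_spamicity
    # Assumption: "most informative" means furthest from the neutral 0.5.
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    # prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    log_spam = sum(math.log(p) for p in top)
    log_ham = sum(math.log(1.0 - p) for p in top)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Toy training data, invented for this example only.
spam_docs = ["win free money now", "free prize win"]
ham_docs = ["meeting schedule tomorrow", "project meeting notes"]
spam_counts = build_counts(spam_docs)
ham_counts = build_counts(ham_docs)

score = file_spamicity("win a free prize", spam_counts, ham_counts,
                       len(spam_docs), len(ham_docs), n=8)
print(f"spamicity: {score:.3f}")
```

With this toy data, a message made of words seen only in spam scores close to 1, and a message made of words seen only in ham scores close to 0. Limiting the combination to the `n` most informative words keeps long messages from being dominated by many near-neutral words.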
4. Test misclassification for a given `n` using the `test_misclassification` function:
```python
test_misclassification(testing_files_spams, testing_files_hams, n=(8, 16, 32), threshold=0.6, unseen_spamicity=0.4, plot=False, verbose=False)
```
This function tests the misclassification rate of the spam classification model on the provided testing files. It compares the calculated spamicity of each file to a threshold value and counts the misclassified files: false positives (ham flagged as spam) and false negatives (spam that passes as ham). The optional `n` parameter specifies the number of words used for classification; the example above passes a tuple of values.

![alt text](https://github.com/An0n1mity/SpamClassifierEval/blob/master/test_misclassification_plot.png)
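The threshold-based evaluation can be sketched as follows. This is an illustration under stated assumptions, not the repository's `test_misclassification` implementation: the function name `misclassification_counts` is hypothetical, it takes precomputed spamicity scores rather than file lists, and it assumes a file is classified as spam when its score is greater than or equal to the threshold.

```python
def misclassification_counts(spam_scores, ham_scores, threshold=0.6):
    """Count threshold-based misclassifications (hypothetical helper).

    spam_scores: spamicity values of files known to be spam
    ham_scores:  spamicity values of files known to be ham
    Assumption: score >= threshold means "classified as spam".
    """
    # Spam that scored below the threshold slips through as ham.
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    # Ham that scored at or above the threshold is wrongly flagged as spam.
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    total = len(spam_scores) + len(ham_scores)
    rate = (false_negatives + false_positives) / total if total else 0.0
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "misclassification_rate": rate,
    }


# Scores invented for this example only.
spam_scores = [0.95, 0.70, 0.30]
ham_scores = [0.10, 0.65, 0.20, 0.05]

result = misclassification_counts(spam_scores, ham_scores, threshold=0.6)
print(result)
```

Raising the threshold generally trades false positives for false negatives, so it can be useful to run the same scores through several thresholds and compare the counts.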