An open API service indexing awesome lists of open source software.

https://github.com/professorlearncode/email-classifier-model

This project implements a spam email classifier using Natural Language Processing (NLP) and a Naive Bayes model. The pipeline preprocesses email text and uses the model to classify messages as spam or not spam, with an evaluation on test data.
https://github.com/professorlearncode/email-classifier-model

Last synced: about 1 year ago
JSON representation

This project implements a spam email classifier using Natural Language Processing (NLP) and a Naive Bayes model. The pipeline preprocesses email text and uses the model to classify messages as spam or not spam, with an evaluation on test data.

Awesome Lists containing this project

README

          

### Documentation for Spam Email Classifier Using Naive Bayes

#### Overview
This repository contains a Python implementation of a spam email classifier using Natural Language Processing (NLP) techniques and a Naive Bayes model. The model is trained on a labeled dataset of emails, and it predicts whether new emails are spam or not. The project includes text preprocessing, model training, prediction, and evaluation steps.

#### Prerequisites
Ensure the following Python libraries are installed before running the code:
- `nltk`
- `pandas`
- `scikit-learn`

You can install the necessary libraries using:
```bash
pip install nltk pandas scikit-learn
```

#### Code Breakdown

1. **Importing Libraries**
```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
```
The required libraries for text preprocessing, model building, and evaluation are imported.

2. **Downloading NLTK Data**
```python
nltk.download('punkt')
nltk.download('stopwords')
```
The necessary NLTK data for tokenization and stopword filtering is downloaded.

3. **Text Preprocessing Function**
```python
def preprocess_text(text):
tokens = word_tokenize(text.lower())
filtered_words = [word for word in tokens if word.isalnum() and word not in stopwords.words('english')]
return ' '.join(filtered_words)
```
This function tokenizes the text, converts it to lowercase, removes stopwords, and keeps only alphanumeric tokens. The resulting cleaned text is returned as a string.

4. **Sample Text Preprocessing**
```python
sample_text = "This is an example email!"
processed_text = preprocess_text(sample_text)
print("Processed Sample Text:", processed_text)
```
A sample email is processed using the `preprocess_text` function to demonstrate text preprocessing.

5. **Loading and Preparing the Dataset**
```python
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']] # Assuming the columns are named 'v1' for Category and 'v2' for Message
data.columns = ['Category', 'Message']
data['Spam'] = data['Category'].apply(lambda x: 1 if x == 'spam' else 0)
```
The dataset is loaded and prepared for modeling. The dataset is assumed to have columns for email category (`v1`) and message (`v2`). The 'Category' column is renamed, and a new binary column 'Spam' is created where spam is labeled as `1` and non-spam as `0`.

6. **Text Preprocessing for the Dataset**
```python
data['Message'] = data['Message'].apply(preprocess_text)
```
The `Message` column is preprocessed using the `preprocess_text` function.

7. **Splitting the Dataset**
```python
X_train, X_test, y_train, y_test = train_test_split(data['Message'], data['Spam'], test_size=0.25, random_state=42)
```
The dataset is split into training and testing sets with a 75-25 ratio.

8. **Building and Training the Model Pipeline**
```python
clf = Pipeline([
('vectorizer', CountVectorizer()),
('nb', MultinomialNB())
])
clf.fit(X_train, y_train)
```
A pipeline is created that first converts text into a matrix of token counts using `CountVectorizer` and then applies the `MultinomialNB` classifier. The model is trained on the training data.

9. **Making Predictions**
```python
emails = [
'Sounds great! Are you home now?',
'Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES'
]
predictions = clf.predict(emails)
print("Predictions:", predictions)
```
The model predicts whether a list of sample emails is spam or not.

10. **Evaluating the Model**
```python
accuracy = clf.score(X_test, y_test)
print("Accuracy on test data:", accuracy)
```
The model's accuracy is calculated on the test data, providing a measure of how well the model generalizes to unseen data.

#### Conclusion
This project provides a complete pipeline for building a spam email classifier using a Naive Bayes model. The process includes text preprocessing, model training, and evaluation, offering a robust approach to identifying spam emails.

#### Future Improvements
- **Feature Engineering**: Explore different feature extraction techniques, such as TF-IDF, to potentially improve model accuracy.
- **Model Optimization**: Fine-tune the hyperparameters of the Naive Bayes model or explore other classifiers like SVM or Random Forest.

#### License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

This documentation should help guide software developers in understanding and using the code effectively.