https://github.com/professorlearncode/email-classifier-model
This project implements a spam email classifier using Natural Language Processing (NLP) and a Naive Bayes model. The pipeline preprocesses email text and uses the model to classify messages as spam or not spam, with an evaluation on test data.
- Host: GitHub
- URL: https://github.com/professorlearncode/email-classifier-model
- Owner: ProfessorlearnCode
- Created: 2024-08-18T00:20:54.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-18T00:22:52.000Z (almost 2 years ago)
- Last Synced: 2025-02-16T21:30:06.169Z (over 1 year ago)
- Language: Python
- Size: 238 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
### Documentation for Spam Email Classifier Using Naive Bayes
#### Overview
This repository contains a Python implementation of a spam email classifier using Natural Language Processing (NLP) techniques and a Naive Bayes model. The model is trained on a labeled dataset of emails, and it predicts whether new emails are spam or not. The project includes text preprocessing, model training, prediction, and evaluation steps.
#### Prerequisites
Ensure the following Python libraries are installed before running the code:
- `nltk`
- `pandas`
- `scikit-learn`
You can install the necessary libraries using:
```bash
pip install nltk pandas scikit-learn
```
#### Code Breakdown
1. **Importing Libraries**
```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
```
The required libraries for text preprocessing, model building, and evaluation are imported.
2. **Downloading NLTK Data**
```python
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables that word_tokenize looks up in NLTK 3.9+
nltk.download('stopwords')
```
The necessary NLTK data for tokenization and stopword filtering is downloaded.
3. **Text Preprocessing Function**
```python
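STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def preprocess_text(text):
    # Lowercase, tokenize, then keep alphanumeric tokens that are not stopwords
    tokens = word_tokenize(text.lower())
    filtered_words = [word for word in tokens if word.isalnum() and word not in STOP_WORDS]
    return ' '.join(filtered_words)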
```
This function converts the text to lowercase, tokenizes it, removes English stopwords, and keeps only alphanumeric tokens (which drops punctuation). The cleaned tokens are joined with spaces and returned as a single string.
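To make the filtering logic easy to try without downloading any NLTK data, here is a minimal sketch of the same idea. It is not the project's function: `preprocess_sketch`, the regex tokenizer, and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
TOY_STOP_WORDS = {"this", "is", "an", "a", "the"}

def preprocess_sketch(text):
    # Split into runs of word characters or single punctuation marks,
    # a rough stand-in for word_tokenize
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Same filter as in the step above: alphanumeric and not a stopword
    kept = [tok for tok in tokens if tok.isalnum() and tok not in TOY_STOP_WORDS]
    return " ".join(kept)

print(preprocess_sketch("This is an example email!"))  # example email
```

The punctuation mark fails the `isalnum()` check and the three leading words are in the stopword set, so only the two content words survive.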
4. **Sample Text Preprocessing**
```python
sample_text = "This is an example email!"
processed_text = preprocess_text(sample_text)
print("Processed Sample Text:", processed_text)
```
A sample email is processed using the `preprocess_text` function to demonstrate text preprocessing.
5. **Loading and Preparing the Dataset**
```python
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']] # Assuming the columns are named 'v1' for Category and 'v2' for Message
data.columns = ['Category', 'Message']
data['Spam'] = data['Category'].apply(lambda x: 1 if x == 'spam' else 0)
```
The dataset is loaded and prepared for modeling. It is assumed to have a label column (`v1`) and a message column (`v2`); only these two columns are kept, and they are renamed to `Category` and `Message`. A new binary column, `Spam`, is then created in which spam is labeled `1` and non-spam `0`.
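The column handling above can be tried on an in-memory frame. This is a sketch under assumptions: the three rows and the stray `Unnamed: 2` column are made up for illustration and are not taken from `spam.csv`.

```python
import pandas as pd

# Hypothetical stand-in for spam.csv; the rows and the extra column are invented
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at lunch", "WIN a free prize now", "Call me later"],
    "Unnamed: 2": [None, None, None],  # assumed stray column, dropped below
})

# Keep the two useful columns and give them descriptive names
data = raw[["v1", "v2"]].rename(columns={"v1": "Category", "v2": "Message"})

# Vectorized equivalent of the lambda: 1 for spam, 0 for everything else
data["Spam"] = (data["Category"] == "spam").astype(int)

# Checking the class balance early helps interpret accuracy later
spam_rate = data["Spam"].mean()
print(data[["Category", "Spam"]])
print("Spam rate:", round(spam_rate, 3))
```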
6. **Text Preprocessing for the Dataset**
```python
data['Message'] = data['Message'].apply(preprocess_text)
```
The `Message` column is preprocessed using the `preprocess_text` function.
7. **Splitting the Dataset**
```python
X_train, X_test, y_train, y_test = train_test_split(data['Message'], data['Spam'], test_size=0.25, random_state=42)
```
The dataset is split into training and testing sets with a 75-25 ratio. Fixing `random_state=42` makes the split reproducible across runs.
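One optional variation, not used in the original code, is a stratified split, which keeps the spam ratio the same in both sets. The sketch below uses made-up placeholder messages and labels to show the effect of the `stratify` argument.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 placeholder messages, 4 of them labeled spam (20%)
messages = [f"message {i}" for i in range(20)]
labels = [1] * 4 + [0] * 16

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,  # preserve the 20% spam ratio in both splits
)

print(len(X_train), len(X_test))  # 15 5
print(sum(y_train), sum(y_test))  # 3 1
```

Without `stratify`, a small or imbalanced dataset can end up with very few spam messages in the test set, which makes the evaluation noisier.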
8. **Building and Training the Model Pipeline**
```python
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])
clf.fit(X_train, y_train)
```
A pipeline is created that first converts text into a matrix of token counts using `CountVectorizer` and then applies the `MultinomialNB` classifier. The model is trained on the training data.
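To illustrate what the classifier computes from the token counts, here is a hand-rolled sketch of multinomial Naive Bayes with add-alpha (Laplace) smoothing. It is a simplified illustration, not scikit-learn's implementation; the toy corpus and the helper names `log_score` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already preprocessed; 1 = spam, 0 = not spam
train = [
    ("free prize win", 1),
    ("win cash free", 1),
    ("meeting lunch today", 0),
    ("lunch home today", 0),
]

# Count words per class and documents per class
vocab = {word for text, _ in train for word in text.split()}
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_score(text, label, alpha=1.0):
    """Return log P(label) + sum of log P(word | label) with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words unseen in training carry no information here
        smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(smoothed)
    return score

def predict(text):
    # Pick the class with the higher log score
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("free cash"))      # 1
print(predict("lunch meeting"))  # 0
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero; `alpha=1.0` matches the default of `MultinomialNB`.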
9. **Making Predictions**
```python
emails = [
    'Sounds great! Are you home now?',
    'Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES'
]
# Apply the same preprocessing that was used on the training messages
predictions = clf.predict([preprocess_text(email) for email in emails])
print("Predictions:", predictions)
```
For each sample email, the model predicts whether it is spam (`1`) or not spam (`0`).
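Beyond hard labels, the same kind of pipeline can report class probabilities through `predict_proba`. The sketch below trains on a made-up four-message corpus (`toy_messages`, `toy_labels`, and `toy_clf` are hypothetical names) so that it runs without `spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, already preprocessed; 1 = spam, 0 = not spam
toy_messages = ["free prize win", "win cash free", "meeting lunch today", "lunch home today"]
toy_labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_clf.fit(toy_messages, toy_labels)

# Columns of predict_proba follow the order of the classifier's classes_
probs = toy_clf.predict_proba(["free cash", "lunch meeting"])
spam_col = list(toy_clf.named_steps["nb"].classes_).index(1)

for text, row in zip(["free cash", "lunch meeting"], probs):
    print(f"{text!r}: P(spam) = {row[spam_col]:.3f}")
```

Probabilities make it possible to choose a decision threshold other than 0.5, for example to reduce the number of legitimate emails flagged as spam.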
10. **Evaluating the Model**
```python
accuracy = clf.score(X_test, y_test)
print("Accuracy on test data:", accuracy)
```
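The model's accuracy, the fraction of test messages classified correctly, is computed on the held-out test set as an estimate of how well the model generalizes to unseen data. Because spam datasets are often imbalanced, accuracy alone can look high even when many spam messages are missed.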
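Precision and recall give a fuller picture than accuracy. The sketch below uses hand-written labels (not results from this project) to show how these metrics can be computed with `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten test messages: 1 = spam, 0 = not spam
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 5 1 2 2

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all = 0.7
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2/3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 0.5
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy is 0.70 even though only half of the spam messages are caught, which is the kind of gap that accuracy on its own hides.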
#### Conclusion
This project provides a complete pipeline for building a spam email classifier with a Naive Bayes model. The process covers text preprocessing, model training, prediction, and evaluation, and serves as a simple baseline for identifying spam emails.
#### Future Improvements
- **Feature Engineering**: Explore different feature extraction techniques, such as TF-IDF, to potentially improve model accuracy.
- **Model Optimization**: Fine-tune the hyperparameters of the Naive Bayes model or explore other classifiers like SVM or Random Forest.
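As a starting point for both ideas, the sketch below swaps `CountVectorizer` for `TfidfVectorizer` and searches over the Naive Bayes smoothing parameter `alpha` with `GridSearchCV`. The toy corpus, the candidate `alpha` values, and `cv=2` are assumptions chosen only to keep the example small; a real run would use the full dataset and more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, already preprocessed; 1 = spam, 0 = not spam
toy_messages = [
    "free prize win", "win cash free", "free cash prize", "claim free prize",
    "meeting lunch today", "lunch home today", "home late today", "meeting moved today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

tfidf_clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),  # TF-IDF weights instead of raw counts
    ("nb", MultinomialNB()),
])

# "nb__alpha" addresses the alpha parameter of the pipeline step named "nb"
search = GridSearchCV(tfidf_clf, param_grid={"nb__alpha": [0.1, 0.5, 1.0]}, cv=2)
search.fit(toy_messages, toy_labels)

prediction = search.best_estimator_.predict(["free prize today"])
print("Best parameters:", search.best_params_)
print("Prediction:", prediction)
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validation scores do not leak vocabulary from the validation fold.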
#### License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
This documentation should help guide software developers in understanding and using the code effectively.