Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/arif-miad/email-spam-classification-project

keras machine-learning nlp nltk numpy pandas-dataframe svm-classifier
Last synced: about 14 hours ago
JSON representation
Host: GitHub
URL: https://github.com/arif-miad/email-spam-classification-project
Owner: Arif-miad
License: gpl-3.0
Created: 2025-01-14T14:59:40.000Z (3 days ago)
Default Branch: main
Last Pushed: 2025-01-14T15:24:53.000Z (3 days ago)
Last Synced: 2025-01-14T16:55:20.143Z (3 days ago)
Topics: keras, machine-learning, nlp, nltk, numpy, pandas-dataframe, svm-classifier
Homepage:
Size: 56.6 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # Email-Spam-Classification-Project

### Dataset Overview

1. **Dataset Name**

   - Email Spam Classification Dataset

2. **Description**

   - This dataset contains email messages labeled as spam or ham (non-spam).

   - Each entry includes the full content of the email and its classification (0 for ham, 1 for spam).

   - Total observations: 5,728

3. **Columns**

   - `text`: Full content of the email message.

   - `spam`: Binary label indicating whether the email is spam (1) or ham (0).

4. **Dataset Characteristics**

   - **Format**: Tab-delimited (easy to import and process).

   - **Content**: Diverse range of topics and styles typical of real-world emails.

   - **Use Case**: Suitable for natural language processing (NLP), machine learning, and email filtering system development.

5. **Objective**

   - Perform Exploratory Data Analysis (EDA) to understand data distribution, text characteristics, and common patterns in spam and ham emails.

   - Build and evaluate machine learning models (e.g., Logistic Regression, Random Forest) for accurate email classification.

6. **Tasks and Steps**

   - **Data Loading and Inspection**

     - Load dataset into Pandas DataFrame.

     - Check basic statistics and structure.

     - Handle missing values if any.

   - **Exploratory Data Analysis (EDA)**

     - Visualize class distribution (spam vs. ham).

     - Analyze text length distribution for spam and ham emails.

     - Generate word clouds to identify common words in spam and ham emails.

     - Explore common words and n-grams using techniques like TF-IDF.

   - **Data Preprocessing**

     - Clean and preprocess text data (lowercase, remove punctuation, etc.).

     - Convert text data into numerical features suitable for machine learning models.

     - Handle categorical variables (if any).

   - **Model Building and Evaluation**

     - Split dataset into training and testing sets.

     - Implement various classification models (Logistic Regression, SVM, etc.).

     - Evaluate model performance using metrics like accuracy, precision, recall, F1 score, and ROC-AUC.

     - Compare and select the best-performing model based on evaluation metrics and business requirements.

7. **Conclusion**

   - Summarize findings from EDA and model evaluation.

   - Discuss insights gained and potential improvements for future work.

 

## Overview

This project focuses on classifying emails as either spam (1) or ham (0). Using machine learning and natural language processing (NLP) techniques, we perform exploratory data analysis (EDA) and develop robust models to accurately classify emails.

---

## Table of Contents

1. [Introduction](#introduction)  

2. [Dataset Description](#dataset-description)  

3. [Key Objectives](#key-objectives)  

4. [Workflow](#workflow)  

5. [Requirements](#requirements)  

6. [Data Loading and Preprocessing](#data-loading-and-preprocessing)  

7. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)  

8. [Feature Engineering](#feature-engineering)  

9. [Model Development](#model-development)  

10. [Performance Metrics and Equations](#performance-metrics-and-equations)  

11. [Results](#results)  

12. [Conclusion](#conclusion)  

13. [References](#references)  

---

## Introduction

Spam emails have become a major concern in the digital era. This project aims to create a robust email classification system using the Email Spam Classification Dataset, which consists of text-based emails and their labels (spam or ham).  

---

## Dataset Description

- **Dataset Name:** Email Spam Classification  

- **Columns:**

  - `text`: Content of the email.

  - `spam`: Label indicating spam (1) or ham (0).  

- **Total Observations:** 5,728 emails  

- **Format:** Tab-delimited  

---

## Key Objectives

1. Perform detailed exploratory data analysis (EDA).  

2. Clean and preprocess the text data.  

3. Engineer features using tokenization, TF-IDF, etc.  

4. Build and evaluate classification models.  

5. Compare models using performance metrics.  

---

## Workflow

### Step-by-Step Implementation:

1. Import necessary libraries.  

2. Load the dataset.  

3. Check for missing values.  

4. Visualize spam vs. ham distribution.  

5. Analyze word frequencies for spam and ham.  

6. Generate word clouds for better insights.  

7. Preprocess text data (lowercase, punctuation removal, etc.).  

8. Remove stopwords and perform lemmatization.  

9. Perform text vectorization using TF-IDF.  

10. Split the dataset into training and testing sets.  

11. Build the following models:

    - Logistic Regression  

    - Naive Bayes  

    - Support Vector Machine (SVM)  

    - Random Forest  

    - Gradient Boosting  

12. Evaluate models using performance metrics:

    - Accuracy  

    - Precision  

    - Recall  

    - F1 Score  

    - ROC and AUC  

13. Perform hyperparameter tuning using GridSearchCV.  

14. Visualize the confusion matrix.  

15. Compare model performance and finalize the best model.  

---

## Requirements

Install the required Python libraries using the following command:

```bash

pip install numpy pandas matplotlib seaborn scikit-learn wordcloud nltk

```

---

## Data Loading and Preprocessing

### Loading Dataset:

```python

# Step 1: Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score

# Step 2: Load the dataset

data = pd.read_csv("email_spam_dataset.csv")  # Replace with your dataset path

data.columns = ['text', 'spam']  # Ensure column names are correct

# Step 3: Inspect the dataset

print(data.head())

print(data.info())

print(data.describe())

# Step 4: Check for missing values

print("Missing values:\n", data.isnull().sum())

# Step 5: Check class distribution

sns.countplot(data['spam'])

plt.title("Spam vs Ham Distribution")

plt.show()

# Step 6: Convert target column to numeric if not already

data['spam'] = data['spam'].map({'ham': 0, 'spam': 1})  # Adjust based on dataset

# Step 7: Analyze text length

data['text_length'] = data['text'].apply(len)

sns.histplot(data, x='text_length', hue='spam', kde=True)

plt.title("Text Length Distribution")

plt.show()

# Step 8: Word cloud for spam

from wordcloud import WordCloud

spam_words = ' '.join(data[data['spam'] == 1]['text'])

wordcloud = WordCloud(width=800, height=400, background_color='black').generate(spam_words)

plt.imshow(wordcloud, interpolation='bilinear')

plt.axis('off')

plt.title("Word Cloud for Spam Emails")

plt.show()

# Step 9: Word cloud for ham

ham_words = ' '.join(data[data['spam'] == 0]['text'])

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(ham_words)

plt.imshow(wordcloud, interpolation='bilinear')

plt.axis('off')

plt.title("Word Cloud for Ham Emails")

plt.show()

# Step 10: Analyze most common words

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english', max_features=20)

common_words = cv.fit_transform(data['text'])

common_words_df = pd.DataFrame(cv.get_feature_names_out(), columns=['Word'])

print("Most common words:\n", common_words_df)

# Step 11: Preprocess text (lowercase, remove punctuation, etc.)

import string

import re

def preprocess_text(text):

    text = text.lower()  # Convert to lowercase

    text = re.sub(r'\d+', '', text)  # Remove numbers

    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation

    text = text.strip()  # Remove leading/trailing spaces

    return text

data['cleaned_text'] = data['text'].apply(preprocess_text)

# Step 12: Split the dataset

X = data['cleaned_text']

y = data['spam']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 13: Transform text data using TF-IDF

tfidf = TfidfVectorizer(max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)

X_test_tfidf = tfidf.transform(X_test)

# Step 14: Build logistic regression model

lr_model = LogisticRegression()

lr_model.fit(X_train_tfidf, y_train)

# Step 15: Evaluate logistic regression model

y_pred_lr = lr_model.predict(X_test_tfidf)

print("Logistic Regression Classification Report:\n", classification_report(y_test, y_pred_lr))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))

# Step 16: Random Forest model

rf_model = RandomForestClassifier()

rf_model.fit(X_train_tfidf, y_train)

# Step 17: Evaluate Random Forest model

y_pred_rf = rf_model.predict(X_test_tfidf)

print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))

# Step 18: Support Vector Machine (SVM) model

svm_model = SVC(probability=True)

svm_model.fit(X_train_tfidf, y_train)

# Step 19: Evaluate SVM model

y_pred_svm = svm_model.predict(X_test_tfidf)

print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

# Step 20: ROC Curve for Logistic Regression

lr_probs = lr_model.predict_proba(X_test_tfidf)[:, 1]

fpr, tpr, _ = roc_curve(y_test, lr_probs)

plt.plot(fpr, tpr, label="Logistic Regression")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

plt.title("ROC Curve")

plt.legend()

plt.show()

# Step 21: Compare model performance

models = ['Logistic Regression', 'Random Forest', 'SVM']

accuracies = [accuracy_score(y_test, y_pred_lr), accuracy_score(y_test, y_pred_rf), accuracy_score(y_test, y_pred_svm)]

plt.bar(models, accuracies, color=['blue', 'green', 'red'])

plt.ylabel("Accuracy")

plt.title("Model Comparison")

plt.show()

# Step 22: Hyperparameter tuning for Random Forest

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}

grid_rf = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, scoring='accuracy')

grid_rf.fit(X_train_tfidf, y_train)

print("Best Random Forest Parameters:\n", grid_rf.best_params_)

# Step 23: Train tuned Random Forest

rf_tuned = grid_rf.best_estimator_

rf_tuned.fit(X_train_tfidf, y_train)

# Step 24: Evaluate tuned Random Forest

y_pred_rf_tuned = rf_tuned.predict(X_test_tfidf)

print("Tuned Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf_tuned))

# Step 25: Save the model and vectorizer

import joblib

joblib.dump(rf_tuned, "spam_classifier_model.pkl")

joblib.dump(tfidf, "tfidf_vectorizer.pkl")

# Step 26: Load saved model

loaded_model = joblib.load("spam_classifier_model.pkl")

loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")

# Step 27: Test saved model

new_email = ["Congratulations! You've won a prize!"]

new_email_tfidf = loaded_vectorizer.transform(new_email)

print("Prediction for new email:", loaded_model.predict(new_email_tfidf))

# Step 28: Feature importance for Random Forest

feature_importances = rf_tuned.feature_importances_

top_features = np.argsort(feature_importances)[-10:]

plt.barh(np.array(tfidf.get_feature_names_out())[top_features], feature_importances[top_features])

plt.title("Top Features in Random Forest")

plt.show()

# Step 29: Check misclassified emails

misclassified = X_test[(y_test != y_pred_rf_tuned)]

print("Misclassified Emails:\n", misclassified)

# Step 30: Summarize findings

print("Summary: Logistic Regression performed better in terms of [metrics]. Random Forest had higher interpretability due to feature importance.")

```

![](https://github.com/Arif-miad/Email-Spam-Classification-Project/blob/main/e1.PNG)

---

## Performance Metrics and Equations

### Key Metrics:

1. **Accuracy**:

![output](https://github.com/Arif-miad/Email-Spam-Classification-Project/blob/main/eial.PNG)

---

## Results

- **Best Model:** Logistic Regression achieved the highest accuracy (e.g., 95%).

- **Key Insights:**

  - Spam emails often have shorter text lengths.

  - Common spam words include "offer," "free," and "win."  

---

## Conclusion

This project demonstrates a complete workflow for spam classification, starting from data loading to model evaluation. The results show that NLP and machine learning techniques can effectively classify emails as spam or ham.  

---

## References

1. [Scikit-learn Documentation](https://scikit-learn.org/)

2. [Kaggle - Spam Email Datasets](https://www.kaggle.com/)

3. [NLTK Library](https://www.nltk.org/)

```