https://github.com/sergio11/spam_email_classifier_lstm
This project uses a Bi-directional LSTM model to classify emails as spam or legitimate, utilizing NLP techniques like tokenization, padding, and stopword removal. It aims to create an effective email classifier while addressing overfitting with strategies like early stopping.
- Host: GitHub
- URL: https://github.com/sergio11/spam_email_classifier_lstm
- Owner: sergio11
- License: mit
- Created: 2016-01-26T09:22:26.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2025-03-11T07:38:33.000Z (7 months ago)
- Last Synced: 2025-03-29T06:11:16.760Z (6 months ago)
- Topics: bilstm, confusion-matrix, data-preprocessing, deep-learning, lstm, lstm-model, lstm-neural-networks, machine-learning, natural-language-processing, sentiment-analysis, spam-detection, text-classification, word-cloud
- Language: Jupyter Notebook
- Homepage:
- Size: 5.92 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Spam Email Classification using LSTM
This project explores building a classification model to differentiate between **Spam** and **Legitimate (Ham)** emails using **Long Short-Term Memory (LSTM)** networks. The notebook details the end-to-end process of preparing the data, training the model, and evaluating its performance.
The focus is on using **Natural Language Processing (NLP)** techniques for text preprocessing and **Deep Learning** to classify emails based on their content. By the end of the project, we aim to have a trained model that can effectively predict whether an email is spam or legitimate.
## Disclaimer
**This project was developed for learning and research purposes only.** It is an educational exercise aimed at exploring **Natural Language Processing (NLP) techniques and Deep Learning models**, specifically **Long Short-Term Memory (LSTM) networks**, for spam email classification. The model and findings presented in this project should not be used for real-world email filtering or commercial applications, as they have not been rigorously tested for deployment. Additionally, this project leverages publicly available datasets and references existing research contributions for educational insights.
## Explore My Other Cutting-Edge AI Projects!
If you found this project intriguing, I invite you to check out my other AI and machine learning initiatives, where I tackle real-world challenges across various domains:
+ [Advanced Classification of Disaster-Related Tweets Using Deep Learning](https://github.com/sergio11/disasters_prediction)
  Uncover how social media responds to crises in real time using **deep learning** to classify tweets related to disasters.
+ [Fighting Misinformation: Source-Based Fake News Classification](https://github.com/sergio11/fake_news_classifier)
  Combat misinformation by classifying news articles as real or fake based on their source using **machine learning** techniques.
+ [IoT Network Malware Classifier with Deep Learning Neural Network Architecture](https://github.com/sergio11/iot_network_malware_classifier)
  Detect malware in IoT network traffic using **Deep Learning Neural Networks**, offering proactive cybersecurity solutions.
+ [Spam Email Classification using LSTM](https://github.com/sergio11/spam_email_classifier_lstm)
  Classify emails as spam or legitimate using a **Bi-directional LSTM** model, implementing NLP techniques like tokenization and stopword removal.
+ [Fraud Detection Model with Deep Neural Networks (DNN)](https://github.com/sergio11/online_payment_fraud)
  Detect fraudulent transactions in financial data with **Deep Neural Networks**, addressing imbalanced datasets and offering scalable solutions.
+ [AI-Powered Brain Tumor Classification](https://github.com/sergio11/brain_tumor_classification_cnn)
  Classify brain tumors from MRI scans using **Deep Learning**, CNNs, and Transfer Learning for fast and accurate diagnostics.
+ [Predicting Diabetes Diagnosis Using Machine Learning](https://github.com/sergio11/diabetes_prediction_ml)
  Create a machine learning model to predict the likelihood of diabetes using medical data, helping with early diagnosis.
+ [LLM Fine-Tuning and Evaluation](https://github.com/sergio11/llm_finetuning_and_evaluation)
  Fine-tune large language models like **FLAN-T5**, **TinyLLAMA**, and **Aguila7B** for various NLP tasks, including summarization and question answering.
+ [Headline Generation Models: LSTM vs. Transformers](https://github.com/sergio11/headline_generation_lstm_transformers)
  Compare **LSTM** and **Transformer** models for generating contextually relevant headlines, leveraging their strengths in sequence modeling.
+ [Breast Cancer Diagnosis with MLP](https://github.com/sergio11/breast_cancer_diagnosis_mlp)
  Automate breast cancer diagnosis using a **Multi-Layer Perceptron (MLP)** model to classify tumors as benign or malignant based on biopsy data.
+ [Deep Learning for Safer Roads: Exploring CNN-Based and YOLOv11 Driver Drowsiness Detection](https://github.com/sergio11/safedrive_drowsiness_detection)
  Comparing driver drowsiness detection with CNN + MobileNetV2 vs YOLOv11 for real-time accuracy and efficiency. Exploring both deep learning models to prevent fatigue-related accidents.
## Key Steps in the Process

### 1. **Data Collection & Preprocessing**
- **Loading the Data**: The dataset consists of emails labeled as **Spam (1)** or **Legitimate (0)**.
- **Text Normalization**: We start by converting text to lowercase and removing unnecessary characters, such as numbers, punctuation, and special symbols.
- **Stopword Removal**: Common words that do not contribute to meaningful classification (like "the", "and", etc.) are removed.
- **Hyperlink Removal**: URLs and hyperlinks in the text are deleted as they do not provide useful information for classification.
- **Tokenization**: We split the email text into individual words (tokens) for easier processing; a sketch of this cleaning pipeline follows below.
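A minimal sketch of this cleaning step, assuming NLTK stopwords and plain regular expressions (the exact helper functions in the notebook may differ):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_email(text: str) -> str:
    """Lowercase, strip hyperlinks, digits and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^a-z\s]", " ", text)                # remove numbers, punctuation, symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

sample = "WIN a FREE prize!!! Visit http://spam.example.com now to claim it"  # made-up example
print(clean_email(sample))  # hyperlink, punctuation, digits and stopwords are gone
```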
### 2. **Exploratory Data Analysis (EDA)**

- **Visualizing the Data**: The notebook includes visualizations such as word clouds and n-gram analysis, which help in understanding the most common terms used in spam and legitimate emails.
- **Class Distribution**: The dataset is explored to understand the distribution of spam vs. legitimate emails, which helps in deciding model evaluation strategies; see the word-cloud sketch below.
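As an illustration, a spam-class word cloud and a class-distribution check could be produced roughly like this (it assumes a pandas DataFrame `df` with hypothetical `text` and `label` columns; the notebook's variable names may differ):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# `df` is assumed to hold the cleaned emails: `text` (str) and `label` (1 = spam, 0 = ham)
print(df["label"].value_counts(normalize=True))  # class distribution of spam vs. ham

spam_text = " ".join(df.loc[df["label"] == 1, "text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(spam_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Most frequent terms in spam emails")
plt.show()
```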
### 3. **Feature Engineering**

- **Text Tokenization**: The email text is tokenized into sequences, and the vocabulary is built.
- **Padding**: As the text data varies in length, padding is applied to ensure that all input sequences have the same size, making them suitable for model input.
- **Label Encoding**: The target labels (spam or legitimate) are encoded into numeric values (0 or 1) using `LabelEncoder`; these steps are sketched below.
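These three steps could look roughly as follows with Keras and scikit-learn (the vocabulary size, sequence length, and the `texts`/`labels` variables are illustrative assumptions, not the notebook's exact values):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 50_000  # vocabulary size (illustrative)
MAX_LEN = 200       # padded sequence length (illustrative)

# `texts` = cleaned email strings, `labels` = "spam"/"ham" tags (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)  # build the vocabulary on the training split only

X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=MAX_LEN, padding="post")
X_test_pad = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=MAX_LEN, padding="post")

encoder = LabelEncoder()                      # with string labels, "ham" -> 0 and "spam" -> 1
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)
```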
### 4. **Model Construction**

- **Bi-directional LSTM**: We use a **Bi-directional LSTM** model to process the sequence of words in both forward and backward directions. This helps capture contextual information from both past and future words.
- **Dense Layer**: A fully connected layer with **ReLU** activation is added to capture non-linear relationships between features.
- **Dropout**: A dropout layer is included to prevent overfitting and help the model generalize better; a possible layer stack is sketched below.
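One way such an architecture could be assembled in Keras; the layer sizes are illustrative, and `MAX_WORDS`/`MAX_LEN` come from the feature-engineering sketch above:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

EMBEDDING_DIM = 64  # illustrative embedding size

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(input_dim=MAX_WORDS, output_dim=EMBEDDING_DIM),
    Bidirectional(LSTM(64)),         # reads each sequence forwards and backwards
    Dense(32, activation="relu"),    # non-linear combination of the LSTM features
    Dropout(0.5),                    # regularization against overfitting
    Dense(1, activation="sigmoid"),  # probability that the email is spam
])
model.summary()
```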
### 5. **Model Training**

- The model is trained on the preprocessed training data using **binary cross-entropy loss** and the **Adam optimizer**.
- **Early Stopping** is implemented to monitor the validation loss and stop training once the model starts overfitting.
- **Evaluation**: The model is evaluated on a separate test set to determine its accuracy and ability to generalize to unseen data; a training sketch follows below.
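A possible training setup matching this description (patience, epochs, and batch size are illustrative; variables continue from the previous sketches):

```python
from tensorflow.keras.callbacks import EarlyStopping

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

early_stop = EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=3,                 # stop after 3 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch seen
)

history = model.fit(
    X_train_pad, y_train_enc,
    validation_split=0.2,
    epochs=20,
    batch_size=64,
    callbacks=[early_stop],
)
```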
### 6. **Model Evaluation and Results**

- **Training Metrics**: The model's performance during training is tracked by monitoring the loss and accuracy.
- **Validation Metrics**: The validation loss and accuracy provide insight into how well the model generalizes.
- **Overfitting**: If the validation accuracy starts to drop while training accuracy continues to rise, it indicates overfitting. This is addressed by using techniques like **early stopping**; an evaluation sketch follows below.
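An evaluation sketch along these lines; the confusion-matrix and classification-report calls are assumptions suggested by the repository topics, and variable names continue from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

# Learning curves: a widening gap between the two lines signals overfitting
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()

# Held-out test set
loss, acc = model.evaluate(X_test_pad, y_test_enc, verbose=0)
print(f"Test accuracy: {acc:.4f}")

y_pred = (model.predict(X_test_pad) > 0.5).astype(int).ravel()
print(confusion_matrix(y_test_enc, y_pred))
print(classification_report(y_test_enc, y_pred, target_names=["ham", "spam"]))
```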
## Goals of the Project

- To classify emails as **Spam** or **Legitimate** using deep learning.
- To explore NLP techniques for text preprocessing and sequence classification.
- To evaluate the model's performance on both training and validation sets, and improve it through strategies like **early stopping** and **dropout**.

## Results
- The model's performance on the training data is typically high, with **99%** accuracy.
- On the validation data, accuracy usually reaches around **97%**, though slight fluctuations are observed due to overfitting.

## Conclusion
By the end of this project, you will have a functional **Bi-LSTM model** for spam email classification that can be further fine-tuned, deployed, or integrated into a larger system for filtering unwanted emails. Techniques like **early stopping** are crucial to prevent overfitting and ensure the model's generalizability.

## **References**
- [Keras Documentation](https://keras.io/)
- [TensorFlow Documentation](https://www.tensorflow.org/)
- [NLP with Disaster Tweets Challenge](https://www.kaggle.com/c/nlp-getting-started)
- [Detecting Spam in Emails with LSTMs (99% accuracy)](https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy)
- [Spam Email - XGBoost - 99%](https://www.kaggle.com/code/rem4000/spam-email-xgboost-99)
## **Acknowledgments**
A huge **thank you** to **purusinghvi** for providing the dataset that made this project possible! The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset). Your contribution is greatly appreciated!
I would like to extend my heartfelt gratitude to [Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence](https://www.udemy.com/user/shramos/). His incredible course on Deep Learning, available on Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.
Additionally, this project was **inspired by amazing contributions** from the Kaggle community:
- **[Detecting Spam in Emails with LSTMs (99% accuracy)](https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy)** by **hrhuynguyen**, which provided valuable insights into applying LSTMs for spam detection.
- **[Spam Email - XGBoost - 99%](https://www.kaggle.com/code/rem4000/spam-email-xgboost-99)** by **rem4000**, which demonstrated the power of ensemble learning methods for this task.

Your contributions have been incredibly helpful in refining and optimizing this project. **Thank you!**
## Please Share & Star the repository to keep me motivated.
## License
This project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained.
Please note the following limitations:
- The software is provided "as is", without any warranties, express or implied.
- If you distribute the software, whether in original or modified form, you must include the original copyright notice and license.
- The license allows for commercial use, but you cannot claim ownership over the software itself.

The goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.
```
MIT License

Copyright (c) 2024 Dream software - Sergio Sánchez
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```