{"id":25096552,"url":"https://github.com/sergio11/spam_email_classifier_lstm","last_synced_at":"2025-04-17T12:43:12.967Z","repository":{"id":90143787,"uuid":"50417885","full_name":"sergio11/spam_email_classifier_lstm","owner":"sergio11","description":"This project uses a Bi-directional LSTM model 📧🤖 to classify emails as spam or legitimate, utilizing NLP techniques like tokenization, padding, and stopword removal. It aims to create an effective email classifier 💻📊 while addressing overfitting with strategies like early stopping 🚫.","archived":false,"fork":false,"pushed_at":"2025-03-11T07:38:33.000Z","size":6207,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T06:11:16.760Z","etag":null,"topics":["bilstm","confusion-matrix","data-preprocessing","deep-learning","lstm","lstm-model","lstm-neural-networks","machine-learning","natural-language-processing","sentiment-analysis","spam-detection","text-classification","word-cloud"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sergio11.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-01-26T09:22:26.000Z","updated_at":"2025-03-18T08:03:18.000Z","dependencies_parsed_at":"2025-01-14T21:29:59.574Z","dependency_job_id":"f9b117a6-5eea-45e5-aa09-5d00160951d3","html_url":"https://github.com/sergio11/spam_email_classifier_lstm","commit_stats":null,"previous_names":["sergio11/spam_email_classifier_lstm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Fspam_email_classifier_lstm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Fspam_email_classifier_lstm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Fspam_email_classifier_lstm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Fspam_email_classifier_lstm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sergio11","download_url":"https://codeload.github.com/sergio11/spam_email_classifier_lstm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249342155,"owners_count":21254234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bilstm","confusion-matrix","data-preprocessing","deep-learning","lstm","lstm-model","lstm-neural-networks","machine-learning","natural-language-processing","sentiment-analysis","spam-detection","text-classification","word-cloud"],"created_at":"2025-02-07T16:48:09.508Z","updated_at":"2025-04-17T12:43:12.960Z","avatar_url":"https://github.com/sergio11.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spam Email Classification using LSTM 📧🤖\n\nThis project explores building a classification model to differentiate between **Spam** and **Legitimate (Ham)** emails using **Long Short-Term Memory (LSTM)** networks. The notebook details the end-to-end process of preparing the data, training the model, and evaluating its performance.\n\nThe focus is on using **Natural Language Processing (NLP)** techniques for text preprocessing and **Deep Learning** to classify emails based on their content. By the end of the project, we aim to have a trained model that can effectively predict whether an email is spam or legitimate.\n\n## ⚠️ Disclaimer  \n**This project was developed for learning and research purposes only.** It is an educational exercise aimed at exploring **Natural Language Processing (NLP) techniques and Deep Learning models**—specifically **Long Short-Term Memory (LSTM) networks**—for spam email classification.  \n\nThe model and findings presented in this project should not be used for real-world email filtering or commercial applications, as they have not been rigorously tested for deployment. Additionally, this project leverages publicly available datasets and references existing research contributions for educational insights.  
\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://img.shields.io/badge/pypi-3775A9?style=for-the-badge\u0026logo=pypi\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Python-FFD43B?style=for-the-badge\u0026logo=python\u0026logoColor=blue\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Keras-FF0000?style=for-the-badge\u0026logo=keras\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/TensorFlow-FF6F00?style=for-the-badge\u0026logo=tensorflow\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Jupyter-F37626.svg?\u0026style=for-the-badge\u0026logo=Jupyter\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Pandas-2C2D72?style=for-the-badge\u0026logo=pandas\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Numpy-777BB4?style=for-the-badge\u0026logo=numpy\u0026logoColor=white\" /\u003e\n\u003c/p\u003e\n\n🙏 I would like to extend my heartfelt gratitude to [Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence](https://www.udemy.com/user/shramos/). His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.\n\nWe would like to express our gratitude to **purusinghvi** for creating and sharing the **Spam Email Classification Dataset - Combined Spam Email CSV of 2007 TREC Public Spam Corpus and Enron-Spam Dataset** on Kaggle. This dataset, which contains detailed information about spam and legitimate emails, has been invaluable in building and training the machine learning model for spam detection.\n\n🌟 The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset). Your contribution is greatly appreciated! 
🙌\n\n📌 Additionally, this project was **inspired by amazing contributions** from the Kaggle community:  \n- **[Detecting Spam in Emails with LSTMs (99% accuracy)](https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy)** by **hrhuynguyen**, which provided valuable insights into applying LSTMs for spam detection.  \n- **[Spam Email - XGBoost - 99%](https://www.kaggle.com/code/rem4000/spam-email-xgboost-99)** by **rem4000**, which demonstrated the power of ensemble learning methods for this task.  \n\n## 🌟 Explore My Other Cutting-Edge AI Projects! 🌟\n\nIf you found this project intriguing, I invite you to check out my other AI and machine learning initiatives, where I tackle real-world challenges across various domains:\n\n+ [🌍 Advanced Classification of Disaster-Related Tweets Using Deep Learning 🚨](https://github.com/sergio11/disasters_prediction)  \nUncover how social media responds to crises in real time using **deep learning** to classify tweets related to disasters.\n\n+ [📰 Fighting Misinformation: Source-Based Fake News Classification 🕵️‍♂️](https://github.com/sergio11/fake_news_classifier)  \nCombat misinformation by classifying news articles as real or fake based on their source using **machine learning** techniques.\n\n+ [🛡️ IoT Network Malware Classifier with Deep Learning Neural Network Architecture 🚀](https://github.com/sergio11/iot_network_malware_classifier)  \nDetect malware in IoT network traffic using **Deep Learning Neural Networks**, offering proactive cybersecurity solutions.\n\n+ [📧 Spam Email Classification using LSTM 🤖](https://github.com/sergio11/spam_email_classifier_lstm)  \nClassify emails as spam or legitimate using a **Bi-directional LSTM** model, implementing NLP techniques like tokenization and stopword removal.\n\n+ [💳 Fraud Detection Model with Deep Neural Networks (DNN)](https://github.com/sergio11/online_payment_fraud) \nDetect fraudulent transactions in financial data with **Deep Neural 
Networks**, addressing imbalanced datasets and offering scalable solutions.\n\n+ [🧠🚀 AI-Powered Brain Tumor Classification](https://github.com/sergio11/brain_tumor_classification_cnn)  \nClassify brain tumors from MRI scans using **Deep Learning**, CNNs, and Transfer Learning for fast and accurate diagnostics.\n\n+ [📊💉 Predicting Diabetes Diagnosis Using Machine Learning](https://github.com/sergio11/diabetes_prediction_ml)  \nCreate a machine learning model to predict the likelihood of diabetes using medical data, helping with early diagnosis.\n\n+ [🚀🔍 LLM Fine-Tuning and Evaluation](https://github.com/sergio11/llm_finetuning_and_evaluation)  \nFine-tune large language models like **FLAN-T5**, **TinyLLAMA**, and **Aguila7B** for various NLP tasks, including summarization and question answering.\n\n+ [📰 Headline Generation Models: LSTM vs. Transformers](https://github.com/sergio11/headline_generation_lstm_transformers)  \nCompare **LSTM** and **Transformer** models for generating contextually relevant headlines, leveraging their strengths in sequence modeling.\n\n+ [🩺💻 Breast Cancer Diagnosis with MLP](https://github.com/sergio11/breast_cancer_diagnosis_mlp)  \nAutomate breast cancer diagnosis using a **Multi-Layer Perceptron (MLP)** model to classify tumors as benign or malignant based on biopsy data.\n\n+ [Deep Learning for Safer Roads 🚗 Exploring CNN-Based and YOLOv11 Driver Drowsiness Detection 💤](https://github.com/sergio11/safedrive_drowsiness_detection)  \nComparing driver drowsiness detection with CNN + MobileNetV2 vs YOLOv11 for real-time accuracy and efficiency 🧠🚗. Exploring both deep learning models to prevent fatigue-related accidents 😴💡.\n\n## Key Steps in the Process 🛠️\n\n### 1. 
**Data Collection \u0026 Preprocessing 📊**\n- **Loading the Data**: The dataset consists of emails labeled as **Spam (1)** or **Legitimate (0)**.\n- **Text Normalization**: We start by converting text to lowercase and removing unnecessary characters, such as numbers, punctuation, and special symbols.\n- **Stopword Removal**: Common words that do not contribute to meaningful classification (like \"the\", \"and\", etc.) are removed.\n- **Hyperlink Removal**: URLs and hyperlinks in the text are deleted as they do not provide useful information for classification.\n- **Tokenization**: We split the email text into individual words (tokens) for easier processing.\n\n### 2. **Exploratory Data Analysis (EDA) 🔍**\n- **Visualizing the Data**: The notebook includes visualizations such as word clouds and n-gram analysis, which help in understanding the most common terms used in spam and legitimate emails.\n- **Class Distribution**: The dataset is explored to understand the distribution of spam vs. legitimate emails, which helps in deciding model evaluation strategies.\n\n### 3. **Feature Engineering ⚙️**\n- **Text Tokenization**: The email text is tokenized into sequences, and the vocabulary is built.\n- **Padding**: As the text data varies in length, padding is applied to ensure that all input sequences have the same size, making them suitable for model input.\n- **Label Encoding**: The target labels (spam or legitimate) are encoded into numeric values (0 or 1) using LabelEncoder.\n\n### 4. **Model Construction 🏗️**\n- **Bi-directional LSTM**: We use a **Bi-directional LSTM** model to process the sequence of words in both forward and backward directions. This helps capture contextual information from both past and future words.\n- **Dense Layer**: A fully connected layer with **ReLU** activation is added to capture non-linear relationships between features.\n- **Dropout**: A dropout layer is included to prevent overfitting and help the model generalize better.\n\n### 5. 
**Model Training 🚀**\n- The model is trained on the preprocessed training data using **binary cross-entropy loss** and the **Adam optimizer**.\n- **Early Stopping** is implemented to monitor the validation loss and stop training once the model starts overfitting.\n- **Evaluation**: The model is evaluated on a separate test set to determine its accuracy and ability to generalize to unseen data.\n\n### 6. **Model Evaluation and Results 📊**\n- **Training Metrics**: The model's performance during training is tracked by monitoring the loss and accuracy.\n- **Validation Metrics**: The validation loss and accuracy provide insight into how well the model generalizes.\n- **Overfitting**: If the validation accuracy starts to drop while training accuracy continues to rise, it indicates overfitting. This is addressed by using techniques like **early stopping**.\n\n## Goals of the Project 🎯\n- To classify emails as **Spam** or **Legitimate** using deep learning.\n- To explore NLP techniques for text preprocessing and sequence classification.\n- To evaluate the model's performance on both training and validation sets, and improve it through strategies like **early stopping** and **dropout**.\n\n## Results 📈\n- The model’s performance on the training data is typically high, with **99%** accuracy.\n- On the validation data, accuracy usually reaches around **97%**, though slight fluctuations are observed due to overfitting.\n\n## Conclusion 🎓\nBy the end of this project, you will have a functional **Bi-LSTM model** for spam email classification that can be further fine-tuned, deployed, or integrated into a larger system for filtering unwanted emails. 
Techniques like **early stopping** are crucial to prevent overfitting and ensure the model’s generalizability.\n\n## **📚 References**\n- [Keras Documentation](https://keras.io/)\n- [TensorFlow Documentation](https://www.tensorflow.org/)\n- [NLP with Disaster Tweets Challenge](https://www.kaggle.com/c/nlp-getting-started)\n- [https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy](https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy)\n- [https://www.kaggle.com/code/rem4000/spam-email-xgboost-99](https://www.kaggle.com/code/rem4000/spam-email-xgboost-99)\n\n## **🙏 Acknowledgments**  \n\nA huge **thank you** to **purusinghvi** for providing the dataset that made this project possible! 🌟 The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset). Your contribution is greatly appreciated! 🙌  
\n\nYour contributions have been incredibly helpful in refining and optimizing this project. **Thank you!** 🙏🚀  \n\n## Visitors Count\n\n\u003cimg width=\"auto\" src=\"https://profile-counter.glitch.me/spam_email_classifier_lstm/count.svg\" /\u003e\n\n## Please Share \u0026 Star the repository to keep me motivated.\n\u003ca href = \"https://github.com/sergio11/spam_email_classifier_lstm/stargazers\"\u003e\n   \u003cimg src = \"https://img.shields.io/github/stars/sergio11/spam_email_classifier_lstm\" /\u003e\n\u003c/a\u003e\n\n## License ⚖️\n\nThis project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. 🛠️ This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained. 📄\n\nPlease note the following limitations:\n\n- The software is provided \"as is\", without any warranties, express or implied. 🚫🛡️\n- If you distribute the software, whether in original or modified form, you must include the original copyright notice and license. 📑\n- The license allows for commercial use, but you cannot claim ownership over the software itself. 
🏷️\n\nThe goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.\n\n```\nMIT License\n\nCopyright (c) 2024 Dream software - Sergio Sánchez \n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsergio11%2Fspam_email_classifier_lstm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsergio11%2Fspam_email_classifier_lstm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsergio11%2Fspam_email_classifier_lstm/lists"}