{"id":15157793,"url":"https://github.com/manjit-baishya-datascience/spam-email-detection","last_synced_at":"2026-02-12T19:31:41.028Z","repository":{"id":253721751,"uuid":"844323773","full_name":"manjit-baishya-datascience/Spam-Email-Detection","owner":"manjit-baishya-datascience","description":" This project demonstrates how to build a spam detection system using Natural Language Processing (NLP) and machine learning techniques.","archived":false,"fork":false,"pushed_at":"2024-08-31T05:27:29.000Z","size":633,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T17:18:05.794Z","etag":null,"topics":["imblearn","nlp","nlp-machine-learning","nltk","scikit-learn","spam-detection"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manjit-baishya-datascience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-19T02:38:54.000Z","updated_at":"2024-08-31T05:27:32.000Z","dependencies_parsed_at":"2024-11-03T03:41:52.460Z","dependency_job_id":"37983bc5-e84e-40a0-898d-1cec71117afd","html_url":"https://github.com/manjit-baishya-datascience/Spam-Email-Detection","commit_stats":{"total_commits":3,"total_committers":1,"mean_commits":3.0,"dds":0.0,"last_synced_commit":"ebb0ea10c3ac2578542ba365e5afbd6559a97abc"},"previous_names":["manjit-baishya-datascience/spam-email-detection"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/manjit-baishya-datascience%2FSpam-Email-Detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manjit-baishya-datascience%2FSpam-Email-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manjit-baishya-datascience%2FSpam-Email-Detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manjit-baishya-datascience%2FSpam-Email-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manjit-baishya-datascience","download_url":"https://codeload.github.com/manjit-baishya-datascience/Spam-Email-Detection/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675633,"owners_count":20977376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imblearn","nlp","nlp-machine-learning","nltk","scikit-learn","spam-detection"],"created_at":"2024-09-26T20:03:41.618Z","updated_at":"2026-02-12T19:31:40.979Z","avatar_url":"https://github.com/manjit-baishya-datascience.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Spam Email Detection**\n![Asset 2](https://github.com/user-attachments/assets/ef019c7e-b82a-4aee-bacd-e7b69567f7ec)\n\n\n## **Overview**\nSpam Email Detection is a critical feature for improving user experience and security by automatically identifying and filtering out unwanted emails. 
This project demonstrates how to build a spam detection system using Natural Language Processing (NLP) and machine learning techniques.\n\n## **Project Structure**\nThis project is organized into several key components:\n\n- **Data**: A dataset of emails labeled as spam or ham (non-spam).\n- **Preprocessing**: Steps to clean and prepare the email data for modeling.\n- **Modeling**: Machine learning algorithms used to classify emails.\n- **Evaluation**: Assessing the performance of the models.\n- **Pipeline**: A complete pipeline from data preprocessing to prediction.\n- **Testing**: Examples of how the model performs on new, unseen emails.\n\n## **Data**\nThe dataset contains two main columns:\n- `Category`: Indicates whether the email is `spam` or `ham`.\n- `Message`: The actual content of the email.\n\nThe data is loaded from a CSV file and initially explored to understand its structure and content.\n\n## **Data Preprocessing**\nEffective preprocessing is essential for building a reliable spam detection model. The preprocessing steps include:\n\n1. **Lowercasing**: Converting all text to lowercase ensures uniformity, making the model less sensitive to case variations.\n2. **Removing Punctuation**: Punctuation marks are typically not useful for spam detection and are removed to simplify the text.\n3. **Removing Non-Alphabetic Characters**: This step eliminates numbers and special characters, focusing on the meaningful words.\n4. **Tokenization**: Breaking down each email into individual words (tokens) to analyze the text more effectively.\n5. **Removing Stop Words**: Common words like \"and,\" \"the,\" and \"is\" are removed since they do not contribute significantly to the classification.\n6. **Lemmatization**: Reducing words to their root form (e.g., \"running\" to \"run\") ensures that different forms of a word are treated as the same.\n\n## **Modeling**\nThe processed data is then used to train a machine learning model. 
In this project, we use the **Extra Trees Classifier**, a powerful ensemble learning method that combines the predictions of multiple decision trees for more accurate results.\n\n### **Key Steps in Modeling:**\n- **Vectorization**: Transforming the text data into numerical format using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).\n- **Training**: The model is trained on the processed and vectorized data.\n- **Oversampling**: Since spam emails are less frequent in the dataset, oversampling is applied to balance the classes and improve model performance.\n\n## **Evaluation**\nThe model's performance is evaluated using metrics such as accuracy, precision, recall, and F1-score. A confusion matrix is also plotted to visualize how well the model distinguishes between spam and ham emails.\n\n## **Pipeline**\nA complete pipeline is created to automate the entire process, from data preprocessing to making predictions on new emails. This pipeline can be easily integrated into email systems to provide real-time spam detection.\n\n## **Testing the Model**\nThe model is tested with a variety of sample emails to demonstrate its effectiveness. For each email, the model predicts whether it is spam or ham, with results that align well with expectations.\n\n### **Example Predictions:**\n- \"Congratulations! You've won a $1000 Walmart gift card. Click here to claim your prize now!\" - **Predicted as Spam**\n- \"Don't forget our meeting at 3 PM today. Looking forward to discussing the new project.\" - **Predicted as Ham**\n\n## **Conclusion**\nThis project successfully implements a spam email detection system using NLP and machine learning. 
The system is accurate and efficient, making it a valuable tool for enhancing email security and user experience.\n\n## **Requirements**\nTo run this project, you'll need to install the necessary Python libraries:\n- pandas\n- matplotlib\n- seaborn\n- scikit-learn\n- imbalanced-learn (imported as `imblearn`)\n- nltk\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanjit-baishya-datascience%2Fspam-email-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanjit-baishya-datascience%2Fspam-email-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanjit-baishya-datascience%2Fspam-email-detection/lists"}