https://github.com/yashrk3103/spam-detector
E-mail/SMS Spam Detector
jupyter-notebook lime naive-bayes-classifier python streamlit
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/yashrk3103/spam-detector
- Owner: yashrk3103
- Created: 2025-06-29T13:52:37.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-29T14:14:40.000Z (12 months ago)
- Last Synced: 2025-06-29T15:26:24.537Z (12 months ago)
- Topics: jupyter-notebook, lime, naive-bayes-classifier, python, streamlit
- Language: HTML
- Homepage: https://spam-detector0.streamlit.app/
- Size: 210 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Email/SMS Spam Detector with Explainability
This project is a complete end-to-end machine learning application that detects whether a given **email or SMS message is spam or not**, using **natural language processing (NLP)** and a **Naive Bayes classifier**, all wrapped in an interactive **Streamlit** web interface.
The application goes beyond simple classification by providing **model explainability using LIME (Local Interpretable Model-Agnostic Explanations)**, helping users understand *why* a message was classified as spam or ham. This makes the system transparent, educational, and more trustworthy for users.
---
## Project Goals
- Build a robust ML model to classify messages as **spam** or **ham**
- Apply **real-world NLP preprocessing**: cleaning URLs, numbers, symbols, and noise
- Use **TF-IDF vectorization** with 1- to 3-gram features for enhanced pattern recognition
- Enable **LIME explainability** to highlight influential words/phrases
- Visualize predictions and explanations in a clean, browser-based UI
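
The cleaning step named in the goals above could look roughly like the sketch below. The regular expressions, the `url` and `num` placeholder tokens, and the function name `clean_text` are illustrative assumptions, not the project's actual preprocessing rules.

```python
import re

# Illustrative patterns only; the real project may clean text differently.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+")
SYMBOL_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(message: str) -> str:
    """Lowercase a message and normalize URLs, numbers, and symbols."""
    text = message.lower()
    text = URL_RE.sub(" url ", text)     # replace every link with one token
    text = NUMBER_RE.sub(" num ", text)  # replace digit runs with one token
    text = SYMBOL_RE.sub(" ", text)      # drop punctuation and other symbols
    return SPACE_RE.sub(" ", text).strip()


print(clean_text("WIN £1000 now!!! Visit http://spam.example/win"))
```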
---
## Key Features
- Custom preprocessing pipeline to normalize text
- TF-IDF vectorizer (1- to 3-grams, max 10,000 features)
- Tuned Multinomial Naive Bayes model with smoothing
- Multi-level spam detection thresholds:
  - 80%+ → Definite spam
  - 60–80% → Likely spam
  - 40–60% → Potential spam
  - Below 40% → Ham
- LIME-powered explainability:
  - Top influential words/phrases
  - Impact strength (strong/moderate/weak)
  - Color-coded bar chart for interpretation
- Uses the **SMS Spam Collection Dataset** (UCI)
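
The multi-level thresholds listed above can be expressed as a small mapping from spam probability to label. This is a minimal sketch: the function name `spam_level` is hypothetical, and assigning the exact boundary values (0.60 and 0.40) to the higher band is an assumption, since the list does not say which band they belong to.

```python
def spam_level(spam_probability: float) -> str:
    """Map a spam probability in [0, 1] to one of the four labels."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= 0.80:
        return "Definite spam"
    if spam_probability >= 0.60:  # assumption: exactly 60% counts as "Likely"
        return "Likely spam"
    if spam_probability >= 0.40:  # assumption: exactly 40% counts as "Potential"
        return "Potential spam"
    return "Ham"


for p in (0.95, 0.70, 0.50, 0.10):
    print(f"{p:.0%}: {spam_level(p)}")
```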
---
## What You'll Learn from This Project
- How to clean and vectorize textual data
- How to train, evaluate, and persist ML models using `scikit-learn`
- How to build modular, production-ready ML pipelines
- How to serve ML models as interactive web apps using Streamlit
- How to apply **Explainable AI (XAI)** to NLP use cases
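
As a rough illustration of the train, evaluate, and persist workflow, the sketch below builds a TF-IDF plus Multinomial Naive Bayes pipeline with the settings this README mentions (1- to 3-grams, at most 10,000 features). The toy messages, the `alpha=0.1` smoothing value, and the file name `spam_model.joblib` are assumptions made for the example. The real project trains on the SMS Spam Collection Dataset, and a two-message held-out set is far too small to give a meaningful score.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, only to keep the example self-contained.
train_texts = [
    "win a free prize now",
    "claim your free cash reward",
    "urgent call now to claim prize",
    "free entry win cash today",
    "are we still meeting for lunch",
    "see you at the office tomorrow",
    "can you send me the report",
    "thanks for dinner last night",
]
train_labels = ["spam"] * 4 + ["ham"] * 4
held_out_texts = ["claim your free prize now", "see you at lunch tomorrow"]
held_out_labels = ["spam", "ham"]

# Vectorizer settings follow the README; alpha is an assumed smoothing value.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=10_000),
    MultinomialNB(alpha=0.1),
)
model.fit(train_texts, train_labels)

# Evaluate on messages the model has not seen.
accuracy = accuracy_score(held_out_labels, model.predict(held_out_texts))
print(f"held-out accuracy: {accuracy:.2f}")

# Persist the whole pipeline so the vectorizer and classifier stay in sync.
path = os.path.join(tempfile.mkdtemp(), "spam_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

spam_index = list(restored.classes_).index("spam")
spam_probability = restored.predict_proba(["free prize claim now"])[0][spam_index]
print(f"spam probability: {spam_probability:.2f}")
```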
---
## Technology Stack
- **Frontend & UI**: Streamlit
- **Data Handling**: Pandas
- **Modeling**: Scikit-learn (Naive Bayes + TF-IDF)
- **Explainability**: LIME
- **Visualization**: Matplotlib
- **Packaging**: Joblib
---
## Dataset
This project is trained on the **SMS Spam Collection Dataset**, a popular benchmark dataset for binary text classification tasks involving spam detection. It contains over 5,000 real SMS messages labeled as `spam` or `ham`.
Dataset source: [SMS Spam Collection Dataset (UCI Machine Learning, hosted on Kaggle)](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
---
## Use Cases
- Educational tool for learning NLP and spam detection
- Lightweight explainable AI (XAI) demo
- Prototype for filtering malicious or promotional SMS/email traffic
- Deployment-ready ML app for showcasing end-to-end ML skills
---
Feel free to explore the code, tweak parameters, add new features, or integrate more advanced models like Logistic Regression or BERT in future versions.