An open API service indexing awesome lists of open source software.

https://github.com/yashrk3103/spam-detector

E-mail/SMS Spam Detector
https://github.com/yashrk3103/spam-detector

jupyter-notebook lime naive-bayes-classifier python streamlit

Last synced: about 2 months ago
JSON representation

E-mail/SMS Spam Detector

Awesome Lists containing this project

README

          

# ๐Ÿ“ฉ Email/SMS Spam Detector with Explainability

This project is a complete end-to-end machine learning application that detects whether a given **email or SMS message is spam or not**, using advanced **natural language processing (NLP)** and a **Naive Bayes classifier** โ€” all wrapped in a modern and interactive **Streamlit** web interface.

The application goes beyond simple classification by providing **model explainability using LIME (Local Interpretable Model-Agnostic Explanations)**, helping users understand *why* a message was classified as spam or ham. This makes the system transparent, educational, and more trustworthy for users.

---

## ๐Ÿš€ Project Goals

- Build a robust ML model to classify messages as **spam** or **ham**
- Apply **real-world NLP preprocessing**: cleaning URLs, numbers, symbols, and noise
- Use **TF-IDF vectorization** with 1โ€“3 gram features for enhanced pattern recognition
- Enable **LIME explainability** to highlight influential words/phrases
- Visualize predictions and explanations in a clean, browser-based UI

---

## ๐Ÿ” Key Features

- ๐Ÿง  Custom preprocessing pipeline to normalize text
- ๐Ÿ“Š TF-IDF vectorizer (1โ€“3 grams, max 10,000 features)
- โš–๏ธ Tuned Multinomial Naive Bayes model with smoothing
- ๐ŸŽฏ Multi-level spam detection thresholds:
- 80%+ โ†’ Definite spam
- 60โ€“80% โ†’ Likely spam
- 40โ€“60% โ†’ Potential spam
- Below 40% โ†’ Ham
- ๐Ÿ”Ž LIME-powered explainability:
- Top influential words/phrases
- Impact strength (strong/moderate/weak)
- Color-coded bar chart for interpretation
- ๐Ÿงช Uses the **SMS Spam Collection Dataset** (UCI)

---

## ๐Ÿ’ก What Youโ€™ll Learn from This Project

- How to clean and vectorize textual data
- How to train, evaluate, and persist ML models using `scikit-learn`
- How to build modular, production-ready ML pipelines
- How to serve ML models as interactive web apps using Streamlit
- How to apply **Explainable AI (XAI)** to NLP use cases

---

## ๐Ÿ“‚ Technology Stack

- **Frontend & UI**: Streamlit
- **Data Handling**: Pandas
- **Modeling**: Scikit-learn (Naive Bayes + TF-IDF)
- **Explainability**: LIME
- **Visualization**: Matplotlib
- **Packaging**: Joblib

---

## ๐Ÿ“ Dataset

This project is trained on the **SMS Spam Collection Dataset**, a popular benchmark dataset for binary text classification tasks involving spam detection. It contains over 5,000 real SMS messages labeled as `spam` or `ham`.

Dataset source: [UCI Machine Learning Repository](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

---

## ๐Ÿ“Œ Use Cases

- Educational tool for learning NLP and spam detection
- Lightweight explainable AI (XAI) demo
- Prototype for filtering malicious or promotional SMS/email traffic
- Deployment-ready ML app for showcasing end-to-end ML skills

---

Feel free to explore the code, tweak parameters, add new features, or integrate more advanced models like Logistic Regression or BERT in future versions.