https://github.com/sahiltiwariiii/email-spam-classifier

This model tells you whether an email is spam or not.

Host: GitHub
URL: https://github.com/sahiltiwariiii/email-spam-classifier
Owner: sahilTiwariiii
Created: 2024-06-07T14:52:40.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-06-07T14:58:22.000Z (about 1 year ago)
Last Synced: 2024-12-31T05:30:04.940Z (7 months ago)
Topics: dataanalysis, datacleaning, datascience, eda, machine-learning, nlp-machine-learning, nltk, numpy, pandas, python, scikit-learn, streamlit, streamlit-webapp, tfidf-vectorizer, wordcloud-visualization, wordtovec
Language: Jupyter Notebook
Homepage: https://email-spam-classifier-sahil.streamlit.app
Size: 1.24 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# 📧 Email Spam Classification Project

## Overview
This project focuses on building a machine learning model to classify emails as spam or not spam using natural language processing (NLP) techniques. The dataset used is the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) from Kaggle, which contains 5572 SMS messages labeled as spam or ham (not spam).

![Problem Statement](images/front.png)

## Motivation
Spam emails are a significant issue, causing inconvenience and security risks. This project aims to develop an effective spam classification model to help filter out unwanted messages, enhancing email security and user experience.

## Problem Statement
The goal is to classify emails as spam or ham using various NLP and machine learning techniques, focusing on achieving high precision to minimize false positives.

## Not Spam Email
![Model](images/m.png)

## Success Metrics
The performance of the models is evaluated using the following metrics:
- **Accuracy**
- **Precision**
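
As a rough illustration of why the two metrics can disagree when ham far outnumbers spam, the sketch below scores two made-up classifiers with scikit-learn. The labels and predictions are invented for the example and are not taken from the project's notebook.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy ground truth (invented): 1 = spam, 0 = ham, with ham in the majority.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical classifier that labels every message as ham.
always_ham = [0] * 10
# Hypothetical classifier that catches both spam messages but also flags one ham.
flags_one_ham = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Accuracy stays high for the useless classifier because ham dominates.
acc_always_ham = accuracy_score(y_true, always_ham)
# No message is flagged as spam, so precision is reported as 0 (zero_division=0).
prec_always_ham = precision_score(y_true, always_ham, zero_division=0)

# Precision is the share of flagged messages that really are spam.
acc_flags = accuracy_score(y_true, flags_one_ham)
prec_flags = precision_score(y_true, flags_one_ham)

print(f"always ham:    accuracy={acc_always_ham:.2f} precision={prec_always_ham:.2f}")
print(f"flags one ham: accuracy={acc_flags:.2f} precision={prec_flags:.2f}")
```

In this toy case the second classifier is 90% accurate, yet one in three of the messages it flags is legitimate, which is the kind of false positive that precision exposes.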

Given the imbalanced nature of the dataset, precision is prioritized over accuracy.

## Spam Email
![Spam Email](images/spam.png)

## Methodology
1. **Data Cleaning** 🧹
- Removed duplicates, handled missing values, and transformed the text data.

2. **Exploratory Data Analysis (EDA)** 📊
- Analyzed the distribution of spam and ham emails.

3. **Text Preprocessing** ✍️
- Converted text to lower case, removed stop words, and applied stemming.

4. **Vectorization** 🧮
- Used Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) techniques for text vectorization.

5. **Model Building** 🛠️
- Implemented various models including:
  - **Multinomial Naive Bayes**
  - **Bernoulli Naive Bayes**
  - **Gaussian Naive Bayes**

6. **Evaluation** 📈
- Evaluated models based on accuracy and precision.

7. **Improvement** 🔧
- Tuned hyperparameters and tried different vectorization techniques to improve performance.

8. **Website** 🌐
- Built a user-friendly web interface using Streamlit.

9. **Deployment** 🚀
- Deployed the application on Streamlit Cloud.
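
The preprocessing, vectorization, and model-building steps above can be sketched roughly as follows. This is a minimal sketch and not the project's actual code: the function name `transform_text`, the small inline stop-word set (used here so the example runs without downloading the NLTK stopwords corpus), and the toy messages are all assumptions made for illustration.

```python
import string

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in stop-word set (assumption); a real run would likely use
# a full list such as the NLTK English stopwords corpus.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your", "for", "at", "are", "we", "and"}
stemmer = PorterStemmer()


def transform_text(text):
    """Lowercase, strip punctuation, drop stop words, and stem each token."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in tokens)


# Invented toy messages; 1 = spam, 0 = ham.
messages = [
    "WINNER! Claim your free prize now",
    "Free entry to win cash, call now",
    "Are we meeting for lunch today?",
    "Call me when you get home",
]
labels = [1, 1, 0, 0]

# Vectorize the transformed text with TF-IDF and fit Multinomial Naive Bayes.
transformed = [transform_text(m) for m in messages]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(transformed)
model = MultinomialNB().fit(X, labels)

# New messages must go through the same transform and the fitted vectorizer.
new_message = transform_text("Claim your free cash prize")
prediction = model.predict(vectorizer.transform([new_message]))[0]
print("spam" if prediction == 1 else "ham")
```

Reusing the fitted vectorizer for new messages matters because the model only understands the vocabulary it saw during training.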

## Best Model
The **Multinomial Naive Bayes** model performed best in terms of precision, making it the chosen model for this project. Despite BernoulliNB and GaussianNB showing better overall performance, the high precision of MultinomialNB makes it more suitable for our needs.
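
A comparison like the one described above could be set up along these lines. The snippet is a sketch under assumptions: the toy messages are invented, the train/test split is hand-picked, and the scores it prints say nothing about the project's real results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Invented toy data; 1 = spam, 0 = ham.
train_texts = [
    "free prize claim now",
    "win cash free entry",
    "urgent claim your reward",
    "lunch at noon today",
    "see you at home",
    "can you call me later",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim free cash", "call me at noon"]
test_labels = [1, 0]

vectorizer = TfidfVectorizer()
# GaussianNB needs dense input, so convert the sparse TF-IDF matrices.
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

# Fit each model and record (accuracy, precision) on the held-out messages.
scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    scores[name] = (
        accuracy_score(test_labels, predictions),
        precision_score(test_labels, predictions, zero_division=0),
    )

for name, (accuracy, precision) in scores.items():
    print(f"{name}: accuracy={accuracy:.2f} precision={precision:.2f}")
```

Ranking the models by the second value in each tuple mirrors the choice to select on precision rather than accuracy.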

## Dataset
The raw dataset contained 5572 rows and 5 columns. After data cleaning and EDA, the focus was on two columns:
- **target**: The label indicating if the message is spam or ham.
- **transformed_text**: The cleaned and preprocessed text of the message.
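
The reduction from five raw columns to a labeled text column might look like the sketch below. The raw column names (`v1`, `v2`, `Unnamed: 2` to `Unnamed: 4`) and the 0/1 label encoding are assumptions about the Kaggle CSV and the notebook, and the small inline DataFrame stands in for the real file.

```python
import pandas as pd

# Stand-in for the raw CSV (assumed column names, invented rows).
raw = pd.DataFrame(
    {
        "v1": ["ham", "spam", "ham", "spam"],
        "v2": ["See you at 5", "Free prize, call now", "See you at 5", "Win cash today"],
        "Unnamed: 2": [None, None, None, None],
        "Unnamed: 3": [None, None, None, None],
        "Unnamed: 4": [None, None, None, None],
    }
)

# Drop the mostly empty columns and give the remaining two readable names.
df = raw.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
df = df.rename(columns={"v1": "target", "v2": "text"})

# Encode the label numerically (assumed convention: ham = 0, spam = 1).
df["target"] = df["target"].map({"ham": 0, "spam": 1})

# Remove duplicate messages, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)

print(df)
```

The `transformed_text` column would then be derived from `text` by the preprocessing step.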

## Requirements
The following libraries were used in this project:
- Streamlit
- NLTK
- Pandas
- Numpy
- Scikit-learn
- Wordcloud

## Steps Followed
1. **Data Cleaning** 🧹
2. **EDA** 📊
3. **Text Preprocessing** ✍️
4. **Model Building** 🛠️
5. **Evaluation** 📈
6. **Improvement** 🔧
7. **Website** 🌐
8. **Deployment** 🚀

## Conclusion
This project successfully built an email spam classifier with high precision using the Multinomial Naive Bayes model. The application is deployed and accessible through a user-friendly Streamlit interface.
