https://github.com/sahiltiwariiii/email-spam-classifier

This model tells you whether an email is spam or not.

dataanalysis datacleaning datascience eda machine-learning nlp-machine-learning nltk numpy pandas python scikit-learn streamlit streamlit-webapp tfidf-vectorizer wordcloud-visualization wordtovec

# 📧 Email Spam Classification Project

## Overview
This project focuses on building a machine learning model to classify emails as spam or not spam using natural language processing (NLP) techniques. The dataset used is the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) from Kaggle, which contains 5572 SMS messages labeled as spam or ham (not spam).

![Problem Statement](images/front.png)

## Motivation
Spam emails are a significant issue, causing inconvenience and security risks. This project aims to develop an effective spam classification model to help filter out unwanted messages, enhancing email security and user experience.

## Problem Statement
The goal is to classify emails as spam or ham using various NLP and machine learning techniques, focusing on achieving high precision to minimize false positives.

## Not Spam Email
![Model](images/m.png)

## Success Metrics
The performance of the models is evaluated using the following metrics:
- **Accuracy**
- **Precision**

## Spam Email
![Spam Email](images/spam.png)

Because the dataset is imbalanced (ham messages far outnumber spam), accuracy alone can look good even for a poor spam filter, so precision is prioritized over accuracy.
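
To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The labels are hypothetical toy data (1 = spam, 0 = ham), not results from this project, and the helper names `accuracy` and `precision` are chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all messages whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the messages flagged as spam, the fraction that really are spam."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


# Hypothetical toy labels: 8 ham messages followed by 2 spam messages.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that catches one spam message but also flags one ham message.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))   # 0.8
print(precision(y_true, y_pred))  # 0.5
```

Accuracy still looks respectable here because most messages are ham, while precision shows that half of the flagged messages were legitimate.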

## Methodology
1. **Data Cleaning** 🧹
   - Removed duplicates, handled missing values, and transformed the text data.

2. **Exploratory Data Analysis (EDA)** 📊
   - Analyzed the distribution of spam and ham emails.

3. **Text Preprocessing** ✍️
   - Converted text to lower case, removed stop words, and applied stemming.

4. **Vectorization** 🧮
   - Used Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) techniques for text vectorization.

5. **Model Building** 🛠️
   - Implemented various models including:
     - **Multinomial Naive Bayes**
     - **Bernoulli Naive Bayes**
     - **Gaussian Naive Bayes**

6. **Evaluation** 📈
   - Evaluated models based on accuracy and precision.

7. **Improvement** 🔧
   - Tuned hyperparameters and tried different vectorization techniques to improve performance.

8. **Website** 🌐
   - Built a user-friendly web interface using Streamlit.

9. **Deployment** 🚀
   - Deployed the application on Streamlit Cloud.
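
The text preprocessing and vectorization steps above can be sketched as follows. This is an illustration using only the standard library, not the project's code: the stop-word list, the `crude_stem` suffix stripper, and all function names are assumptions standing in for what a library such as NLTK or scikit-learn would provide, and the TF-IDF shown is the plain textbook form (library implementations typically add smoothing and normalization).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "for", "and", "now"}


def crude_stem(word):
    """Very rough suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform_text(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Bag of Words: one term-count mapping per tokenized document."""
    return [Counter(doc) for doc in docs]


def tfidf(docs):
    """TF-IDF: weight each term count by log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical example messages.
messages = [
    "Claim your FREE prizes now!",
    "Are you joining the meeting?",
    "Free entry for members",
]
docs = [transform_text(m) for m in messages]
print(docs[0])  # ['claim', 'free', 'prize']
vectors = tfidf(docs)
```

In this toy corpus "free" appears in two of the three messages, so it receives a lower TF-IDF weight than "claim", which appears in only one.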

## Best Model
The **Multinomial Naive Bayes** model performed best in terms of precision, making it the chosen model for this project. Despite BernoulliNB and GaussianNB showing better overall performance, the high precision of MultinomialNB makes it more suitable for our needs.
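
For readers unfamiliar with how Multinomial Naive Bayes scores a message, the following is a from-scratch sketch with Laplace smoothing. It is not the project's implementation (which presumably relies on scikit-learn); the class name `TinyMultinomialNB` and the training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over tokenized messages."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {token for doc in docs for token in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def _log_score(self, doc, label):
        # log P(label) + sum over tokens of log P(token | label)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        total = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
        for token in doc:
            if token in self.vocab:  # ignore tokens never seen in training
                count = self.token_counts[label][token]
                score += math.log((count + self.alpha) / total)
        return score

    def predict(self, doc):
        """Return the class with the highest log score for a tokenized message."""
        return max(self.classes, key=lambda label: self._log_score(doc, label))


# Hypothetical, already-preprocessed training messages.
train_docs = [
    ["free", "prize", "claim"],
    ["win", "free", "cash"],
    ["meet", "lunch", "today"],
    ["call", "me", "later"],
    ["see", "you", "lunch"],
]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["free", "cash", "prize"]))  # spam
print(model.predict(["lunch", "today"]))         # ham
```

Working in log space avoids numeric underflow when many small per-token probabilities are multiplied together.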

## Dataset
The raw dataset contained 5572 rows and 5 columns. After data cleaning and EDA, the focus was on two columns:
- **target**: The label indicating if the message is spam or ham.
- **transformed_text**: The cleaned and preprocessed text of the message.
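
A rough sketch of the reduction from five raw columns to a label and a text column is shown below, using only the standard library. The raw column names (`v1`, `v2`, and three `Unnamed` columns) and the inline sample rows are assumptions made for illustration; `transformed_text` would then be derived from the kept text in the preprocessing step.

```python
import csv
import io

# Hypothetical stand-in for the raw file: 5 columns, only the first two are used.
RAW_CSV = """v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
ham,See you at lunch,,,
spam,Claim your FREE prize now,,,
ham,See you at lunch,,,
spam,Win cash today,,,
"""


def load_clean(csv_text):
    """Keep label and text, encode the label as 0/1, and drop duplicate messages."""
    rows, seen = [], set()
    for record in csv.DictReader(io.StringIO(csv_text)):
        text = record["v2"].strip()
        if not text or text in seen:  # skip missing or duplicate messages
            continue
        seen.add(text)
        rows.append({"target": 1 if record["v1"] == "spam" else 0, "text": text})
    return rows


rows = load_clean(RAW_CSV)
print(len(rows))  # 3
```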

## Requirements
The following libraries were used in this project:
- Streamlit
- NLTK
- pandas
- NumPy
- scikit-learn
- WordCloud

## Steps Followed
1. **Data Cleaning** 🧹
2. **EDA** 📊
3. **Text Preprocessing** ✍️
4. **Model Building** 🛠️
5. **Evaluation** 📈
6. **Improvement** 🔧
7. **Website** 🌐
8. **Deployment** 🚀

## Conclusion
This project successfully built an email spam classifier with high precision using the Multinomial Naive Bayes model. The application is deployed and accessible through a user-friendly Streamlit interface.