https://github.com/thatsabhishek/spam-mail-detection-using-python

Last synced: 10 months ago
JSON representation

Host: GitHub
URL: https://github.com/thatsabhishek/spam-mail-detection-using-python
Owner: thatsabhishek
Created: 2023-07-21T04:15:34.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-12-09T05:58:38.000Z (about 2 years ago)
Last Synced: 2025-01-14T15:28:44.051Z (11 months ago)
Language: Jupyter Notebook
Size: 241 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Spam Mail Detection Using Python

## Overview
This project implements a simple spam mail detection system using Python and machine learning techniques. The code utilizes the scikit-learn library for machine learning tasks and pandas for data manipulation. The model is based on logistic regression and employs the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique to convert text data into numerical features.

## Code Structure
The code is organized into the following sections:

1. **Data Loading and Preprocessing:**
- The dataset is loaded using pandas from a CSV file (`mail_data.csv`).
- Initial exploration of the dataset is performed using `head()`, `info()`, and `isnull().sum()` functions.
- Null values in the dataset are replaced with empty strings.
- A binary classification is created for the 'Category' column, where 'ham' is labeled as 1 (not spam) and others as 0 (spam).
- Unnecessary columns like 'Category' are dropped.

2. **Data Splitting:**
- The data is split into training and testing sets using `train_test_split` from scikit-learn.

3. **Text Vectorization:**
- TF-IDF vectorization is applied to convert the text data into feature vectors. The `TfidfVectorizer` is configured to ignore English stop words and convert all text to lowercase.

4. **Model Training:**
- A logistic regression model is employed for training using the transformed training data.

5. **Model Evaluation:**
- The accuracy of the model is evaluated on both the training and testing datasets.

6. **Prediction:**
- The model is used to predict the category of a new input mail. The input mail is first transformed using the same TF-IDF vectorizer, and the model predicts whether it is spam or not.

## How to Use
1. **Clone the Repository:**
```
git clone https://github.com/your-username/Spam-Mail-Detection-Using-Python.git
cd Spam-Mail-Detection-Using-Python
```

2. **Install Dependencies:**
```
pip install pandas numpy scikit-learn
```

3. **Run the Code:**
- Ensure that the `mail_data.csv` file is present in the same directory.
```
python spam_mail_detection.py
```

4. **Modify Input:**
- You can modify the `input_your_mail` variable with your own email content for testing.

5. **Review Results:**
- The code will print the accuracy on both the training and testing datasets.
- It will also predict whether the input mail is spam or ham.

## Notes
- The dataset used (`mail_data.csv`) should have a 'Category' column indicating whether the mail is 'ham' or 'spam'.
- Ensure that the necessary libraries are installed using the provided requirements file.

Feel free to customize and extend this project according to your needs!

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/thatsabhishek/spam-mail-detection-using-python

Awesome Lists containing this project

README