https://github.com/vedantvare/spam-mail-detection

dataset ipynb machine-learning natural-language-processing spam-classification spam-detection spam-email-classifier spam-email-detection spam-email-recognition spam-filtering

Last synced: 3 months ago
JSON representation

Host: GitHub
URL: https://github.com/vedantvare/spam-mail-detection
Owner: VedantVare
Created: 2024-12-30T15:07:47.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-12-30T15:15:55.000Z (5 months ago)
Last Synced: 2025-01-07T05:20:07.662Z (5 months ago)
Topics: dataset, ipynb, machine-learning, natural-language-processing, spam-classification, spam-detection, spam-email-classifier, spam-email-detection, spam-email-recognition, spam-filtering
Language: Jupyter Notebook
Homepage:
Size: 2.1 MB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Spam Email Classification

This project demonstrates how to classify emails as **Spam** or **Ham (Not Spam)** using Natural Language Processing (NLP) and a Random Forest Classifier.

## Features
- **Preprocessing**: Cleans and processes email text (removes punctuation, converts to lowercase, stems words, and removes stopwords).
- **Vectorization**: Converts text data into numerical format using CountVectorizer.
- **Model Training**: Uses a Random Forest Classifier for prediction.
- **Prediction**: Classifies new emails as Spam or Ham.

## Requirements
- Python 3.7 or higher
- Libraries:
- `numpy`
- `pandas`
- `nltk`
- `scikit-learn`

Install required libraries:
```bash
pip install numpy pandas nltk scikit-learn
```

## Dataset
The dataset used for this project:
- **Columns**:
- `text`: The email content.
- `label_num`: The label (0 for Ham, 1 for Spam).

Replace `'spam_ham_dataset.csv'` with your dataset file.

## How It Works
1. **Data Preprocessing**:
- Converts text to lowercase.
- Removes punctuation.
- Applies stemming to reduce words to their root forms.
- Removes stopwords (e.g., "the", "is", "in").

2. **Feature Extraction**:
- Text is converted to a bag-of-words representation using `CountVectorizer`.

3. **Model Training**:
- Splits data into training and testing sets.
- Trains a Random Forest Classifier on the training data.

4. **Email Prediction**:
- Takes an example email, preprocesses it, and predicts if it's Spam or Ham.

## Usage
1. Load the dataset:
```python
data = pd.read_csv('spam_ham_dataset.csv')
```
2. Run the code to train the model and evaluate accuracy:
```python
cl.score(X_test, y_test)
```
3. Predict an email:
```python
prediction = cl.predict(x_email)
print(f"Prediction: {'Spam' if prediction[0] == 1 else 'Ham'}")
```

## Output
- Prints the model's prediction (Spam or Ham) for a sample email.
- Displays the actual label from the dataset for comparison.

## Notes
- Ensure the dataset is in the correct format before running the notebook.
- The `nltk` library requires downloading stopwords:
```python
nltk.download('stopwords')
```

## License
Feel free to use and modify this project for learning purposes.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vedantvare/spam-mail-detection

Awesome Lists containing this project

README