https://github.com/vedantvare/spam-mail-detection
- Host: GitHub
- URL: https://github.com/vedantvare/spam-mail-detection
- Owner: VedantVare
- Created: 2024-12-30T15:07:47.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-30T15:15:55.000Z (5 months ago)
- Last Synced: 2025-01-07T05:20:07.662Z (5 months ago)
- Topics: dataset, ipynb, machine-learning, natural-language-processing, spam-classification, spam-detection, spam-email-classifier, spam-email-detection, spam-email-recognition, spam-filtering
- Language: Jupyter Notebook
- Size: 2.1 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
# Spam Email Classification
This project demonstrates how to classify emails as **Spam** or **Ham (Not Spam)** using Natural Language Processing (NLP) and a Random Forest Classifier.
## Features
- **Preprocessing**: Cleans and processes email text (removes punctuation, converts to lowercase, stems words, and removes stopwords).
- **Vectorization**: Converts the cleaned text into numerical feature vectors using `CountVectorizer`.
- **Model Training**: Uses a Random Forest Classifier for prediction.
- **Prediction**: Classifies new emails as Spam or Ham.
## Requirements
- Python 3.7 or higher
- Libraries:
  - `numpy`
  - `pandas`
  - `nltk`
  - `scikit-learn`

Install required libraries:
```bash
pip install numpy pandas nltk scikit-learn
```
## Dataset
The dataset used for this project is a CSV file:
- **Columns**:
  - `text`: The email content.
  - `label_num`: The label (0 for Ham, 1 for Spam).

Replace `'spam_ham_dataset.csv'` with the path to your own dataset file.
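
A quick check of the column layout before training can catch a mismatched file early. The sketch below is illustrative only: `check_dataset` is a hypothetical helper (it is not part of the notebook), and a two-row in-memory CSV stands in for the real `spam_ham_dataset.csv`.

```python
import io

import pandas as pd

# Columns the README says the dataset must provide.
REQUIRED_COLUMNS = {"text", "label_num"}


def check_dataset(df):
    """Hypothetical helper: raise ValueError if columns or label values are off."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Labels are expected to be 0 (Ham) or 1 (Spam).
    bad = set(df["label_num"].unique()) - {0, 1}
    if bad:
        raise ValueError(f"Unexpected label values: {sorted(bad)}")
    return df


# Stand-in for pd.read_csv('spam_ham_dataset.csv'): made-up rows, not real data.
sample_csv = io.StringIO(
    "text,label_num\n"
    '"Subject: meeting notes attached",0\n'
    '"Subject: claim your free prize",1\n'
)
data = check_dataset(pd.read_csv(sample_csv))
print(data["label_num"].value_counts())
```

With the real file, the same helper could be applied as `check_dataset(pd.read_csv('spam_ham_dataset.csv'))`.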
## How It Works
1. **Data Preprocessing**:
   - Converts text to lowercase.
   - Removes punctuation.
   - Applies stemming to reduce words to their root forms.
   - Removes stopwords (e.g., "the", "is", "in").
2. **Feature Extraction**:
   - Converts text to a bag-of-words representation using `CountVectorizer`.
3. **Model Training**:
   - Splits data into training and testing sets.
   - Trains a Random Forest Classifier on the training data.
4. **Email Prediction**:
   - Takes an example email, preprocesses it, and predicts whether it is Spam or Ham.
## Usage
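
For orientation, the four stages listed under How It Works can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the notebook's exact code: the eight emails are made up, the `preprocess` helper and the hyperparameters (`n_estimators`, `test_size`, `random_state`) are illustrative choices, and scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for the `nltk` stopword list so the example runs without a corpus download. The names `cl`, `X_test`, `y_test`, and `x_email` mirror the snippets in the steps that follow.

```python
import string

from nltk.stem.porter import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()
# Assumption: scikit-learn's stopword list replaces nltk.corpus.stopwords here.
STOPWORDS = set(ENGLISH_STOP_WORDS)


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, and stem what remains."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stemmer.stem(w) for w in text.split() if w not in STOPWORDS)


# Tiny made-up corpus (hypothetical, not the project's dataset).
emails = [
    "Win a FREE prize now!!! Click here to claim your cash.",
    "Cheap loans available, claim your free cash prize today!",
    "Free cash! Click now to win a cheap prize.",
    "Claim your prize: free loans, click here now.",
    "Meeting moved to Monday, please review the attached report.",
    "Please send the quarterly report before the Monday meeting.",
    "Can you review the schedule for the project meeting?",
    "Attached is the project schedule, please send your review.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Spam, 0 = Ham

# 1. Preprocess every email.
corpus = [preprocess(e) for e in emails]

# 2. Bag-of-words features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# 3. Split, then train a Random Forest on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)
cl = RandomForestClassifier(n_estimators=100, random_state=42)
cl.fit(X_train, y_train)
print("Test accuracy:", cl.score(X_test, y_test))

# 4. Classify a new email with the same preprocessing and fitted vectorizer.
new_email = "Click now to claim your free cash prize"
x_email = vectorizer.transform([preprocess(new_email)])
prediction = cl.predict(x_email)
print(f"Prediction: {'Spam' if prediction[0] == 1 else 'Ham'}")
```

On the real dataset, the `emails` and `labels` lists would come from the `text` and `label_num` columns. Note that the new email is passed through `vectorizer.transform`, not `fit_transform`, so it is encoded with the vocabulary learned during training.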
1. Load the dataset:
```python
import pandas as pd
data = pd.read_csv('spam_ham_dataset.csv')
```
2. Run the notebook to train the model, then evaluate its accuracy on the test set (`cl` is the trained classifier):
```python
cl.score(X_test, y_test)
```
3. Classify an email (`x_email` is the preprocessed, vectorized email):
```python
prediction = cl.predict(x_email)
print(f"Prediction: {'Spam' if prediction[0] == 1 else 'Ham'}")
```
## Output
- Prints the model's prediction (Spam or Ham) for a sample email.
- Displays the actual label from the dataset for comparison.
## Notes
- Ensure the dataset is in the correct format before running the notebook.
- The `nltk` stopword list must be downloaded once before first use:
```python
import nltk
nltk.download('stopwords')
```
## License
Feel free to use and modify this project for learning purposes.