https://github.com/professorlearncode/email-classifier-model

This project implements a spam email classifier using Natural Language Processing (NLP) and a Naive Bayes model. The pipeline preprocesses email text and uses the model to classify messages as spam or not spam, with an evaluation on test data.
https://github.com/professorlearncode/email-classifier-model

Last synced: about 1 year ago
JSON representation

Host: GitHub
URL: https://github.com/professorlearncode/email-classifier-model
Owner: ProfessorlearnCode
Created: 2024-08-18T00:20:54.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-08-18T00:22:52.000Z (almost 2 years ago)
Last Synced: 2025-02-16T21:30:06.169Z (over 1 year ago)
Language: Python
Size: 238 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          ### Documentation for Spam Email Classifier Using Naive Bayes

#### Overview

This repository contains a Python implementation of a spam email classifier using Natural Language Processing (NLP) techniques and a Naive Bayes model. The model is trained on a labeled dataset of emails, and it predicts whether new emails are spam or not. The project includes text preprocessing, model training, prediction, and evaluation steps.

#### Prerequisites

Ensure the following Python libraries are installed before running the code:

- `nltk`

- `pandas`

- `scikit-learn`

You can install the necessary libraries using:

```bash

pip install nltk pandas scikit-learn

```

#### Code Breakdown

1. **Importing Libraries**

   ```python

   import nltk

   import pandas as pd

   from nltk.corpus import stopwords

   from nltk.tokenize import word_tokenize

   from sklearn.naive_bayes import MultinomialNB

   from sklearn.feature_extraction.text import CountVectorizer

   from sklearn.model_selection import train_test_split

   from sklearn.pipeline import Pipeline

   ```

   The required libraries for text preprocessing, model building, and evaluation are imported.

2. **Downloading NLTK Data**

   ```python

   nltk.download('punkt')

   nltk.download('stopwords')

   ```

   The necessary NLTK data for tokenization and stopword filtering is downloaded.

3. **Text Preprocessing Function**

   ```python

   def preprocess_text(text):

       tokens = word_tokenize(text.lower())

       filtered_words = [word for word in tokens if word.isalnum() and word not in stopwords.words('english')]

       return ' '.join(filtered_words)

   ```

   This function tokenizes the text, converts it to lowercase, removes stopwords, and keeps only alphanumeric tokens. The resulting cleaned text is returned as a string.

4. **Sample Text Preprocessing**

   ```python

   sample_text = "This is an example email!"

   processed_text = preprocess_text(sample_text)

   print("Processed Sample Text:", processed_text)

   ```

   A sample email is processed using the `preprocess_text` function to demonstrate text preprocessing.

5. **Loading and Preparing the Dataset**

   ```python

   data = pd.read_csv('spam.csv', encoding='latin-1')

   data = data[['v1', 'v2']]  # Assuming the columns are named 'v1' for Category and 'v2' for Message

   data.columns = ['Category', 'Message']

   data['Spam'] = data['Category'].apply(lambda x: 1 if x == 'spam' else 0)

   ```

   The dataset is loaded and prepared for modeling. The dataset is assumed to have columns for email category (`v1`) and message (`v2`). The 'Category' column is renamed, and a new binary column 'Spam' is created where spam is labeled as `1` and non-spam as `0`.

6. **Text Preprocessing for the Dataset**

   ```python

   data['Message'] = data['Message'].apply(preprocess_text)

   ```

   The `Message` column is preprocessed using the `preprocess_text` function.

7. **Splitting the Dataset**

   ```python

   X_train, X_test, y_train, y_test = train_test_split(data['Message'], data['Spam'], test_size=0.25, random_state=42)

   ```

   The dataset is split into training and testing sets with a 75-25 ratio.

8. **Building and Training the Model Pipeline**

   ```python

   clf = Pipeline([

       ('vectorizer', CountVectorizer()),

       ('nb', MultinomialNB())

   ])

   clf.fit(X_train, y_train)

   ```

   A pipeline is created that first converts text into a matrix of token counts using `CountVectorizer` and then applies the `MultinomialNB` classifier. The model is trained on the training data.

9. **Making Predictions**

   ```python

   emails = [

       'Sounds great! Are you home now?',

       'Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES'

   ]

   predictions = clf.predict(emails)

   print("Predictions:", predictions)

   ```

   The model predicts whether a list of sample emails is spam or not.

10. **Evaluating the Model**

    ```python

    accuracy = clf.score(X_test, y_test)

    print("Accuracy on test data:", accuracy)

    ```

    The model's accuracy is calculated on the test data, providing a measure of how well the model generalizes to unseen data.

#### Conclusion

This project provides a complete pipeline for building a spam email classifier using a Naive Bayes model. The process includes text preprocessing, model training, and evaluation, offering a robust approach to identifying spam emails.

#### Future Improvements

- **Feature Engineering**: Explore different feature extraction techniques, such as TF-IDF, to potentially improve model accuracy.

- **Model Optimization**: Fine-tune the hyperparameters of the Naive Bayes model or explore other classifiers like SVM or Random Forest.

#### License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

This documentation should help guide software developers in understanding and using the code effectively.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/professorlearncode/email-classifier-model

Awesome Lists containing this project

README