https://github.com/lorenanicole/python-naive-bayes-spam-classifier

Python 2 and Python 3 naive bayes spam classifier trained with nltk.
jupyter-notebook naive-bayes-classifier nltk-python notebook python spam virtualenv

Host: GitHub
URL: https://github.com/lorenanicole/python-naive-bayes-spam-classifier
Owner: lorenanicole
Created: 2015-06-24T18:56:00.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2018-11-20T22:17:08.000Z (over 7 years ago)
Last Synced: 2024-04-14T23:02:05.049Z (about 2 years ago)
Topics: jupyter-notebook, naive-bayes-classifier, nltk-python, notebook, python, spam, virtualenv
Language: Python
Homepage:
Size: 15.5 MB
Stars: 5
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: readme.md

### Basic Naive Bayes Classifier in Python

This approach makes use of pre-labeled data provided by the [Kaggle Classroom spam detection challenge](https://inclass.kaggle.com/c/adcg-ss14-challenge-02-spam-mails-detection/data).

### `naive-bayes` Python 2 Classifier

The project code in `naive-bayes` is written for Python 2.7.

To set up, create a virtualenv and install the requirements:

```
virtualenv nbenv
source nbenv/bin/activate
pip install -r path/to/naive-bayes/requirements.txt
```

To run the Naive Bayes classifier:

```
cd naive-bayes
python spam_detector.py
```
### Python 3 Jupyter Notebook

The Python 2.7 project has been ported to Python 3 and can be run in the Jupyter notebook.

First, create a Python 3 virtualenv:

```
python3 -m venv python3env # Uses whichever Python 3 version `python3` points to
source python3env/bin/activate # Name your env whatever you like!
pip3 install -r requirements.txt
```
Then start the notebook!

```
jupyter notebook
```

### Notes on Python Naive Bayes Implementation

You can have the detector train and evaluate itself against the pre-labeled data (training on 90% of it and evaluating on the remaining 10%) with:

```
detector.train_and_evaluate()
```
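
As a rough illustration of the 90/10 split described above, here is a minimal sketch. The helper names `split_train_eval` and `accuracy`, the shuffling step, and the fake data are all assumptions made for illustration; the repository's `train_and_evaluate` may partition and score the emails differently.

```python
import random


def split_train_eval(labeled_emails, train_fraction=0.9, seed=0):
    """Shuffle labeled emails and split them into training and evaluation sets."""
    shuffled = list(labeled_emails)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


def accuracy(predicted, actual):
    """Fraction of predicted labels that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / float(len(actual))


# Hypothetical stand-in data: (email_text, label) pairs, 1 = ham, 0 = spam.
emails = [("email %d" % i, i % 2) for i in range(2500)]
train_set, eval_set = split_train_eval(emails)
print(len(train_set), len(eval_set))
```

With 2500 labeled emails this leaves 2250 for training and 250 for evaluation.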

Or you can train against the entire labeled data set (2500 emails) and classify on the unlabeled data (1827 emails).

```
detector.train()
detector.classify(1827) # Number of emails to classify
```

Ham has a label of 1 while Spam has a label of 0.

### How Naive Bayes Is Implemented

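This solution makes use of [Python 2.7's `decimal` module](https://docs.python.org/2/library/decimal.html) for decimal floating point arithmetic. Multiplying many small word probabilities together can shrink the product below what a built-in float can represent; `Decimal` values have a much wider exponent range, which prevents this [floating point underflow](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).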
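
To see why underflow matters, the sketch below multiplies the same small probability 100 times using both a built-in float and a `Decimal`. The probability value and word count are made-up numbers chosen only to trigger the effect; they are not taken from the repository.

```python
from decimal import Decimal

word_count = 100  # hypothetical number of words in one email

float_product = 1.0
decimal_product = Decimal(1)
for _ in range(word_count):
    float_product *= 1e-5                # hypothetical per-word likelihood
    decimal_product *= Decimal("1e-5")   # same likelihood as a Decimal

print(float_product)    # the float has collapsed to zero
print(decimal_product)  # the Decimal still holds a nonzero value
```

Once the float product reaches zero, every class scores the same and the classifier can no longer tell spam from ham. A common alternative is to sum log probabilities instead of multiplying raw ones.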

Inside the `NaiveBayes#train` method, common stop words are removed from each document using [NLTK](http://www.nltk.org/install.html). The words are not yet [stemmed](http://stackoverflow.com/questions/24647400/what-is-the-best-stemming-method-in-python); stemming is a forthcoming feature.
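
The stop word step might look roughly like the sketch below. To keep it runnable without downloading any corpus, it uses a tiny hand-written `STOP_WORDS` set as a stand-in; with NLTK installed, the list would more likely come from `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`. The function name and the whitespace tokenization are assumptions, not the repository's code.

```python
# Stand-in stop word list (assumption); NLTK's stopwords corpus is much fuller.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "you", "for"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


tokens = remove_stop_words("Claim the prize reserved for you")
print(tokens)
```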

Only the corpus of words is used as the set of selectors for determining whether an email is spam or ham.

To prevent words with zero frequency from distorting the results, Laplace smoothing is applied, incrementing each zero-frequency word count to 1.
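
For reference, here is a minimal sketch of the textbook add-one form of Laplace smoothing, where 1 is added to every word count and the vocabulary size is added to the denominator. This is an assumption about the general technique; the repository's variant, as described above, may only bump zero counts. The function name and the sample counts are hypothetical.

```python
from collections import Counter


def smoothed_likelihoods(word_counts, vocabulary):
    """Return P(word | class) for each vocabulary word with add-one smoothing."""
    total = sum(word_counts[word] for word in vocabulary)
    denominator = float(total + len(vocabulary))
    # Adding 1 to each count means no word ends up with probability zero.
    return {word: (word_counts[word] + 1) / denominator for word in vocabulary}


# Hypothetical counts of words seen in spam; "meeting" never appeared in spam.
spam_counts = Counter({"free": 3, "prize": 1})
vocabulary = ["free", "prize", "meeting"]
likelihoods = smoothed_likelihoods(spam_counts, vocabulary)
print(likelihoods["meeting"])
```

Without smoothing, a single unseen word would give a likelihood of zero and wipe out the whole product for that class.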
