https://github.com/belsabbagh/naive-bayes-model

A naive bayes model implemented from scratch

Host: GitHub
URL: https://github.com/belsabbagh/naive-bayes-model
Owner: belsabbagh
Created: 2023-03-04T16:48:49.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2024-04-13T16:47:45.000Z (about 1 year ago)
Last Synced: 2025-01-09T02:07:09.465Z (6 months ago)
Language: Python
Size: 563 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Classifying Emails as Spam or Not Spam Using Naive Bayes

## Introduction

This project is a simple implementation of a Naive Bayes classifier that classifies emails as spam or not spam. The dataset, the SMS Spam Collection from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection), contains 5574 messages, 749 of which are spam.

Making the data usable for the classifier required parsing each message into words and vectorizing those words into a feature vector. The classifier was then trained on the feature vectors and tested using k-fold cross-validation.
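The pipeline described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: it assumes a multinomial Naive Bayes with add-one smoothing over raw word counts, and the names `tokenize`, `NaiveBayes`, and `k_fold_accuracy` are hypothetical. It also assumes `k` is at least 2 and no larger than the number of documents.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of each known word;
            # words never seen in training are skipped.
            score = math.log(n_docs / total_docs)
            for word in tokenize(doc):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def k_fold_accuracy(docs, labels, k=5):
    """Average accuracy over k folds; fold i holds out every k-th item starting at i."""
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train = [(d, y) for j, (d, y) in enumerate(zip(docs, labels)) if j not in test_idx]
        model = NaiveBayes().fit(*zip(*train))
        hits = sum(model.predict(docs[j]) == labels[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.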

## Vectorizer Algorithms

### TFIDF (Term Frequency Inverse Document Frequency)

- Term Frequency: The number of times a word appears in a document, divided by the total number of words in that document. It is computed separately for every word in every document.
- Inverse Document Frequency: The log of the total number of documents divided by the number of documents that contain the word. It is computed once per word across the whole dataset.
- TFIDF: The product of the term frequency and the inverse document frequency, so words that are frequent in one document but rare across the dataset get the highest weights.
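The three definitions above translate almost directly into code. The following is a minimal sketch using only the standard library; the function name `tf_idf` and the choice of the natural logarithm are assumptions, and the repository's vectorizer may differ (library implementations often use smoothed variants of the IDF term).

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # term frequency * inverse document frequency
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["free", "prize", "free"], ["meeting", "at", "noon"], ["free", "lunch", "at", "noon"]]
weights = tf_idf(docs)
```

Here `weights[0]["free"]` is `(2/3) * log(3/2)`, because "free" makes up two of the three words in the first document and appears in two of the three documents. A word that occurs in every document gets a weight of zero.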

With TFIDF, accuracy increased as the number of features increased. However, it scored lower than chi^2.

### CHI^2 (Chi-squared Test)

- Chi-squared test: A statistical hypothesis test that is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
- The chi^2 algorithm selects the k best features based on the chi-squared test.
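As an illustration of selecting the k best features, the sketch below scores each word with the chi-squared statistic of a 2x2 contingency table (word present or absent versus spam or not spam) and keeps the top k. This presence-based formulation is an assumption; the repository may compute the statistic from word counts instead, and the names `chi2_scores` and `select_k_best` are hypothetical.

```python
from collections import Counter


def chi2_scores(documents, labels, positive="spam"):
    """Chi-squared statistic of each word's presence against the class label."""
    n = len(documents)
    n_pos = sum(1 for label in labels if label == positive)
    present, present_pos = Counter(), Counter()
    for doc, label in zip(documents, labels):
        for word in set(doc):
            present[word] += 1
            if label == positive:
                present_pos[word] += 1
    scores = {}
    for word, n_word in present.items():
        a = present_pos[word]  # positive documents containing the word
        b = n_word - a         # negative documents containing the word
        c = n_pos - a          # positive documents without the word
        d = n - n_pos - b      # negative documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Closed form of the chi-squared statistic for a 2x2 table.
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_k_best(scores, k):
    """Return the k words with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Words that appear almost exclusively in one class score highest, while words spread evenly across both classes score near zero and are dropped first.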

Accuracy was highest when the number of features was 3000.
