https://github.com/belsabbagh/naive-bayes-model
A naive bayes model implemented from scratch
- Host: GitHub
- URL: https://github.com/belsabbagh/naive-bayes-model
- Owner: belsabbagh
- Created: 2023-03-04T16:48:49.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-13T16:47:45.000Z (7 months ago)
- Last Synced: 2024-05-20T22:09:53.428Z (6 months ago)
- Language: Python
- Size: 563 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Classifying Emails as Spam or Not Spam Using Naive Bayes
## Introduction
This project is a simple implementation of a Naive Bayes classifier to classify emails as spam or not spam. The dataset contains 5574 emails, 749 of which are spam. The dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
Making the data usable for the classifier required three steps: parsing each email into words, vectorizing those words into a feature vector, and training the classifier on the feature vectors. The classifier was then tested using k-fold cross-validation.
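The pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the function names (`tokenize`, `train`, `predict`, `k_fold_accuracy`), the regex tokenizer, the Laplace smoothing, and the interleaved fold assignment are all assumptions made for the example, and it uses raw word counts rather than either vectorizer below.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set().union(*word_counts.values())
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


def k_fold_accuracy(messages, labels, k=5):
    """Average accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(messages), k))
        train_x = [m for i, m in enumerate(messages) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train(train_x, train_y)
        correct = sum(predict(model, messages[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied. The sketch assumes the dataset has at least `k` messages so that no fold is empty.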
## Vectorizer Algorithms
### TFIDF (Term Frequency Inverse Document Frequency)
- Term Frequency: The number of times a word appears in a document, divided by the total number of words in that document. Each word has its own term frequency in each document.
- Inverse Document Frequency: The logarithm of the total number of documents divided by the number of documents that contain the word w. Each word has a single inverse document frequency across the whole dataset.
- TFIDF: The product of the term frequency and the inverse document frequency.

Accuracy increased as the number of features increased. However, TFIDF scored lower than chi^2.
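The three definitions above translate directly into code. The following is a minimal sketch, assuming documents are already tokenized into lists of words; the function name `tf_idf` is hypothetical, and no smoothing is applied, so the repository's version may differ in detail.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    documents: list of token lists. Uses tf = count / len(doc) and
    idf = log(N / df), as defined above (no smoothing).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

Under these definitions, a word that appears in every document has an inverse document frequency of log(1) = 0, so it contributes nothing to the feature vector.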
### CHI^2 (Chi-squared Test)
- Chi-squared test: A statistical hypothesis test that is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
- The chi^2 algorithm selects the k best features based on the chi-squared test.

Accuracy was highest when the number of features was 3000.
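One common way to apply the test to text is to build a 2x2 contingency table per word (word present or absent, spam or not spam) and keep the k words with the highest statistic. The sketch below follows that approach; it is an assumption about how the selection could work, not the repository's implementation, and the names `chi2_score` and `select_k_best` are hypothetical.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: spam docs with the word,    b: non-spam docs with the word,
    c: spam docs without the word, d: non-spam docs without the word.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:  # a row or column is empty: no evidence either way
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_k_best(documents, labels, k, positive="spam"):
    """Return the k words with the highest chi-squared scores."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(documents, labels):
        # Count each word once per document (presence, not frequency).
        (pos_df if label == positive else neg_df).update(set(doc))
    scores = {}
    for word in set(pos_df) | set(neg_df):
        a, b = pos_df[word], neg_df[word]
        scores[word] = chi2_score(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

A word that appears equally often in both classes scores 0 and is dropped first, while a word that appears in only one class scores highest.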