https://github.com/shivam5992/spam_classifier

Naive Bayes Email Spam Classification

Host: GitHub
URL: https://github.com/shivam5992/spam_classifier
Owner: shivam5992
Created: 2014-08-04T08:45:11.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2019-02-19T23:58:15.000Z (almost 6 years ago)
Last Synced: 2024-04-16T03:51:50.659Z (8 months ago)
Language: Python
Homepage:
Size: 3.06 MB
Stars: 3
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md

README

#Bayesian spam filtering

Bayesian spam filtering is a statistical technique for e-mail classification. In general, it is a naive Bayes classifier over a lexicon lookup table, with the lexicons (words) serving as features to identify spam e-mail.

###Training of Spam and Ham pickle

This classifier uses the public Enron e-mail dataset. With over 17000 spam mails and 15000 ham mails, the Enron corpus is used to train a naive Bayes classifier. The pivotal feature of this approach is the words: a bag of words is built with the frequency distribution of each word in ham and spam mails. The trained model is then dumped to a pickle file.
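The training step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`tokenize`, `train`, `save_model`, `load_model`), the regex tokenizer, and the layout of the pickled dictionary are all assumptions made for the example.

```python
import pickle
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def train(spam_mails, ham_mails):
    """Build bag-of-words frequency counts separately for spam and ham."""
    spam_counts, ham_counts = Counter(), Counter()
    for mail in spam_mails:
        spam_counts.update(tokenize(mail))
    for mail in ham_mails:
        ham_counts.update(tokenize(mail))
    # Hypothetical model layout: per-class word counts plus mail totals.
    return {
        "spam": spam_counts,
        "ham": ham_counts,
        "n_spam": len(spam_mails),
        "n_ham": len(ham_mails),
    }


def save_model(model, path):
    """Dump the trained counts to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load previously trained counts from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Storing raw counts rather than precomputed probabilities keeps the pickle reusable if the probability estimate (for example, the smoothing scheme) changes later.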

###Spamicity

Spamicity is calculated using the Bayesian formula:

prob_SnW = prob_WnS/(prob_WnS + prob_WnH)

prob_SnW = P(Spam | Word)
prob_WnS = P(Word | Spam)
prob_WnH = P(Word | Ham)

This form of Bayes' rule assumes equal prior probabilities for spam and ham.
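As a rough sketch of the per-word calculation, the snippet below estimates the two conditional probabilities from word counts and applies the formula above. The helper names and the add-one (Laplace) smoothing are assumptions for illustration; the README does not say how unseen words are handled.

```python
def word_probability(word, counts, total_words, vocab_size):
    """Estimate P(Word | class) from counts.

    Add-one smoothing is an assumption here; it keeps the estimate
    above zero for words never seen in this class.
    """
    return (counts.get(word, 0) + 1) / (total_words + vocab_size)


def word_spamicity(prob_WnS, prob_WnH):
    """P(Spam | Word) = prob_WnS / (prob_WnS + prob_WnH), equal priors."""
    total = prob_WnS + prob_WnH
    if total == 0:
        return 0.5  # no evidence either way (assumed neutral default)
    return prob_WnS / total


# Toy counts, invented for the example.
spam_counts = {"cheap": 3, "pills": 1}
ham_counts = {"meeting": 4}
vocab_size = 3  # distinct words across both classes

p_spam = word_probability("cheap", spam_counts, 4, vocab_size)
p_ham = word_probability("cheap", ham_counts, 4, vocab_size)
cheap_spamicity = word_spamicity(p_spam, p_ham)
```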

This term is then calculated for each word in the test mail. The combined probability is calculated using this formula:

X = SUMMATION[log(1 - prob_SnW)]
Y = SUMMATION[log(prob_SnW)]
Spamicity = 1/(EXP(X-Y) + 1)
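Working with sums of logarithms instead of a direct product avoids floating-point underflow when a mail contains many words. A minimal sketch of the combination formula above, assuming the per-word values are passed in as a list; the function name, the `eps` clamp, and the overflow guard are additions for the example:

```python
import math


def combined_spamicity(word_probs, eps=1e-9):
    """Combine per-word P(Spam | Word) values into one spamicity score."""
    x = 0.0  # SUMMATION[log(1 - prob_SnW)]
    y = 0.0  # SUMMATION[log(prob_SnW)]
    for p in word_probs:
        # Clamp away from 0 and 1 so both logarithms stay finite (assumption).
        p = min(max(p, eps), 1 - eps)
        x += math.log(1 - p)
        y += math.log(p)
    diff = x - y
    if diff > 700:
        # math.exp would overflow; the score is effectively 0.
        return 0.0
    return 1.0 / (math.exp(diff) + 1.0)
```

Algebraically this equals `prod(p) / (prod(p) + prod(1 - p))`, so a mail whose words all have spamicity 0.5 scores exactly 0.5.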

With the classifier trained on 80% of the spam data, the average spamicity is 0.9.

With the classifier trained on 80% of the ham data, the average hamicity is 0.09.

###Classification

The spamicity of the test mail is calculated. If it is less than 0.5, the mail is more likely a ham mail; otherwise it is more likely a spam mail.
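The decision rule can be illustrated end to end with a small sketch. The lookup table of per-word spamicity values, the whitespace tokenizer, and the neutral 0.5 default for unknown words are all hypothetical; table values are assumed to lie strictly between 0 and 1.

```python
import math


def classify_mail(text, spamicity_table, threshold=0.5, unknown=0.5):
    """Return ("SPAM" or "HAM", spamicity) for a test mail."""
    words = text.lower().split()
    # Unknown words get a neutral value (an assumption for this sketch).
    probs = [spamicity_table.get(w, unknown) for w in words]
    x = sum(math.log(1 - p) for p in probs)
    y = sum(math.log(p) for p in probs)
    diff = x - y
    # Guard against overflow in math.exp for very ham-like mails.
    spamicity = 0.0 if diff > 700 else 1.0 / (math.exp(diff) + 1.0)
    label = "HAM" if spamicity < threshold else "SPAM"
    return label, spamicity


# Toy per-word spamicity values, invented for the example.
table = {"cheap": 0.9, "pills": 0.95, "meeting": 0.1, "agenda": 0.05}
spam_label, spam_score = classify_mail("cheap pills", table)
ham_label, ham_score = classify_mail("meeting agenda", table)
```

Following the rule as stated, a score of exactly 0.5 falls on the spam side of the threshold.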

Training data can be downloaded from here.