https://github.com/gakas14/spam_classifier

classification-algorithm logistic-regression nltk-python

# spam_classifier
In this project, we build a spam classifier.

The project follows these steps:

- Split the datasets into a training set and a test set.
- Write a data preparation pipeline to convert each email into a feature vector. The preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
- Add hyperparameters to the pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
- Finally, try out several classifiers and build a great spam classifier, with both high recall and high precision.
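
The preparation steps above can be illustrated with a minimal, standard-library-only sketch. It is not the repository's actual pipeline: the function names (`preprocess`, `vectorize`, `precision_recall`), the regular expressions, and the keyword flags are all assumptions chosen for illustration. It returns a dense list rather than a sparse vector, and it leaves out header stripping and stemming.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not taken from the repository).
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text, lowercase=True, replace_urls=True,
               replace_numbers=True, remove_punctuation=True):
    """Return the word tokens of `text` after the enabled cleanup steps."""
    if lowercase:
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = PUNCT_RE.sub(" ", text)
    return text.split()


def vectorize(tokens, vocabulary, binary=False):
    """Map tokens to one entry per vocabulary word: counts, or 0/1 if binary."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


def precision_recall(y_true, y_pred):
    """Compute precision and recall, treating label 1 as spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The four-word example from the list above.
vocabulary = ["hello", "how", "are", "you"]
tokens = preprocess("Hello you Hello Hello you")
presence_vector = vectorize(tokens, vocabulary, binary=True)  # [1, 0, 0, 1]
count_vector = vectorize(tokens, vocabulary)                  # [3, 0, 0, 2]
```

Each keyword flag of `preprocess` plays the role of one pipeline hyperparameter, so different combinations can be compared on the test set using `precision_recall`.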