https://github.com/gakas14/spam_classifier
In this project we build a spam classifier.
- Host: GitHub
- URL: https://github.com/gakas14/spam_classifier
- Owner: gakas14
- Created: 2022-04-27T04:43:38.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-04-27T04:51:17.000Z (over 2 years ago)
- Last Synced: 2024-11-08T09:44:39.494Z (2 months ago)
- Topics: classification-algorithm, logistic-regression, nltk-python
- Language: Jupyter Notebook
- Homepage:
- Size: 11.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# spam_classifier
In this project we build a spam classifier. The work covers the following steps:
- Split the datasets into a training set and a test set.
- Write a data preparation pipeline to convert each email into a feature vector. The preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
- Add hyperparameters to the pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
- Finally, try out several classifiers and build a great spam classifier, with both high recall and high precision.
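
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names `preprocess`, `to_vector`, and `precision_recall`, the regular expressions, and the flag names are all assumptions made for this example. Header stripping and stemming are left out, and the vectors are dense lists rather than sparse ones to keep the sketch short.

```python
import re
from collections import Counter

# Hypothetical patterns for this sketch; real emails may need stricter rules.
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text, lowercase=True, replace_urls=True,
               replace_numbers=True, remove_punctuation=True):
    """Turn raw email text into tokens; each flag acts as a hyperparameter."""
    if lowercase:
        text = text.lower()
    if replace_urls:
        # Done before number replacement so digits inside URLs are not touched.
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = PUNCT_RE.sub(" ", text)
    return text.split()


def to_vector(tokens, vocabulary, binary=True):
    """Map tokens to a vector over `vocabulary` (presence flags or counts)."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


def precision_recall(y_true, y_pred):
    """Compute precision and recall, treating label 1 as spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The four-word example from the list above.
vocabulary = ["Hello", "how", "are", "you"]
tokens = preprocess("Hello you Hello Hello you", lowercase=False)
presence = to_vector(tokens, vocabulary, binary=True)
occurrences = to_vector(tokens, vocabulary, binary=False)
```

Here `presence` is the presence/absence vector and `occurrences` is the count vector for the example email. Vectors built this way for every email in the training set can then be passed to a classifier (for instance logistic regression, one of the repository's topics), and `precision_recall` can compare its predictions on the test set against the true labels.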