https://github.com/zoobereq/clickbait_detector
A Naive Bayes classifier to detect clickbait headlines
bag-of-words clickbait clickbait-detection naive-bayes naive-bayes-classifier natural-language-processing nlp
- Host: GitHub
- URL: https://github.com/zoobereq/clickbait_detector
- Owner: zoobereq
- License: mit
- Created: 2022-09-21T20:31:15.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-23T21:58:33.000Z (over 1 year ago)
- Last Synced: 2024-10-29T18:44:04.263Z (2 months ago)
- Topics: bag-of-words, clickbait, clickbait-detection, naive-bayes, naive-bayes-classifier, natural-language-processing, nlp
- Language: Python
- Homepage:
- Size: 706 KB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE
## Clickbait Detector
### Motivation
There is an overwhelming amount of news information available online. Some news headlines are known as clickbait – they aim to attract users to click on a link but the articles that they link to may not be of value or interest to the reader. This program automatically distinguishes between clickbait and non-clickbait headlines.

### Data
Two corpora of clickbait and non-clickbait headlines are included. Each corpus contains 16,000 headlines, for a total of 32,000 headlines.

### Code
The code is informed by [the paper](https://arxiv.org/pdf/1610.09786.pdf) by Chakraborty et al. (2016). The program loads the data, extracts sets of features as frequency-count vectors, and uses them to train a Naive Bayes classifier. Classifier accuracy is estimated with 10-fold cross-validation and reported for each feature set individually. The program extracts the following features:
- **Stop words:** counts for each function word (from the NLTK stopwords list)
- **Syntactic:** counts for the following 10 common POS tags: `['NN', 'NNP', 'DT', 'IN', 'JJ', 'NNS', 'CC', 'PRP', 'VB', 'VBG']`
- **Lexical:** counts for the 30 most common unigrams in the entire corpus
- **Punctuation:** counts for each punctuation mark in `string.punctuation`
- **Complexity:**
  - Average number of characters per word
  - Type-to-token ratio (the number of unique words / the total number of words)
  - Count of *long* words - words with at least 6 letters
- **Interrogative words:** counts of the common English interrogatives

### Accuracy
*Clickbait Detector* achieves 91% accuracy for all features combined, meaning that it correctly identifies over 9 out of 10 headlines.
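
The punctuation, complexity, and interrogative feature sets listed under *Code* can be sketched with the standard library alone. This is a minimal illustration and not the repository's actual implementation: the function names, the whitespace tokenizer, and the `INTERROGATIVES` list are assumptions made for the example.

```python
import string
from collections import Counter

# Assumption: an illustrative list of interrogatives; the repository's list may differ.
INTERROGATIVES = ("who", "what", "when", "where", "why", "which", "how")
LONG_WORD_MIN_LETTERS = 6  # "long" words have at least 6 letters


def tokenize(headline):
    """Simplified tokenizer: split on whitespace, strip edge punctuation, lowercase."""
    tokens = (tok.strip(string.punctuation).lower() for tok in headline.split())
    return [tok for tok in tokens if tok]


def punctuation_counts(headline):
    """One count per mark in string.punctuation, in a fixed order."""
    counts = Counter(ch for ch in headline if ch in string.punctuation)
    return [counts[mark] for mark in string.punctuation]


def complexity_features(headline):
    """Return [average characters per word, type-to-token ratio, long-word count]."""
    tokens = tokenize(headline)
    if not tokens:
        return [0.0, 0.0, 0]
    avg_chars = sum(len(tok) for tok in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    long_words = sum(
        1 for tok in tokens
        if sum(ch.isalpha() for ch in tok) >= LONG_WORD_MIN_LETTERS
    )
    return [avg_chars, type_token_ratio, long_words]


def interrogative_counts(headline):
    """One count per interrogative word, in the order of INTERROGATIVES."""
    counts = Counter(tokenize(headline))
    return [counts[word] for word in INTERROGATIVES]


def extract_features(headline):
    """Concatenate the three feature sets into a single vector."""
    return (
        punctuation_counts(headline)
        + complexity_features(headline)
        + interrogative_counts(headline)
    )


if __name__ == "__main__":
    vector = extract_features("Why You Won't Believe What Happens Next!")
    print(len(vector), vector[-len(INTERROGATIVES):])
```

Vectors like these could then be passed to a Naive Bayes implementation and scored with 10-fold cross-validation, for example scikit-learn's `MultinomialNB` together with `cross_val_score(..., cv=10)`. Whether the repository uses that library is an assumption; the README only states that a Naive Bayes classifier and 10-fold cross-validation are used.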