https://github.com/hodgesmr/biden_nlp
Jupyter Notebook that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python
https://github.com/hodgesmr/biden_nlp
compression data-science machine-learning natural-language-processing nlp zstandard zstd
Last synced: 6 days ago
JSON representation
Jupyter Notebook that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python
- Host: GitHub
- URL: https://github.com/hodgesmr/biden_nlp
- Owner: hodgesmr
- License: bsd-3-clause
- Created: 2023-10-01T23:57:43.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-17T20:38:34.000Z (over 2 years ago)
- Last Synced: 2024-01-27T17:41:38.101Z (about 2 years ago)
- Topics: compression, data-science, machine-learning, natural-language-processing, nlp, zstandard, zstd
- Language: Jupyter Notebook
- Homepage:
- Size: 627 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# BIDEN: Binary Inference Dictionaries for Electoral NLP

This is a [Jupyter Notebook](https://github.com/hodgesmr/biden_nlp/blob/main/Binary_Inference_Dictionaries_Electoral_NLP.ipynb) that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python.
It is largely built on the strategies presented by [FTCC](https://github.com/cyrilou242/ftcc), which in turn, was a reaction to [Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors](https://github.com/bazingagin/npc_gzip) (the gzip method). Like FTCC, **BIDEN** is built atop of [Zstandard](https://facebook.github.io/zstd/) (Zstd), which leverages [dictionary compression](https://facebook.github.io/zstd/#small-data). Zstd dictionary compression seeds a compressor with sample data, so that it can efficiently compress _small data_ (~1 KB) of similar composition. Seeding the compressor dictionaries acts as our "training" method for the model.
The **BIDEN** model was trained on the [ElectionEmails 2020](https://electionemails2020.org) data set — a database of over 900,000 political campaign emails from the 2020 US election cycle. **In compliance with the data set's [terms](https://electionemails2020.org/downloads/corpus_documentation_v1.0.pdf), the training data is NOT provided with this repository.** If you would like to train the **BIDEN** model yourself, you can [request a copy of the data for free](https://docs.google.com/forms/d/e/1FAIpQLSdcgjZo-D1nNON4d90H2j0VLtTdxiHK6Y8HPJSpdRu4w5YILw/viewform). The **BIDEN** model was trained on `corpus_v1.0`.
It also demonstrates success at fast partisan classification for tweets and samples from the [campaign email database](https://political-emails.herokuapp.com/emails) maintained by [Derek Willis](https://www.thescoop.org).
The idea of classification by compression is not new; Russell and Norvig wrote about it in 1995 in the venerable [Artificial Intelligence: A Modern Approach](https://aima.cs.berkeley.edu/3rd-ed/):

More recently, the ["gzip beats BERT" paper](https://aclanthology.org/2023.findings-acl.426/) got a lot of attention. What the **BIDEN** model demonstrates is that this technique is effective and likely generalizable on modern partisan texts.
## License
All code is provided under the [BSD 3-Clause license](https://github.com/hodgesmr/biden_nlp/blob/main/LICENSE).
## A Matt Hodges project
This project is maintained by [@MattHodges](https://mastodon.social/@MattHodges).
_Please use it for good, not evil._