{"id":40950865,"url":"https://github.com/hodgesmr/biden_nlp","last_synced_at":"2026-01-22T05:12:08.467Z","repository":{"id":197684621,"uuid":"699110303","full_name":"hodgesmr/biden_nlp","owner":"hodgesmr","description":"Jupyter Notebook that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python","archived":false,"fork":false,"pushed_at":"2023-10-17T20:38:34.000Z","size":642,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-27T17:41:38.101Z","etag":null,"topics":["compression","data-science","machine-learning","natural-language-processing","nlp","zstandard","zstd"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hodgesmr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-10-01T23:57:43.000Z","updated_at":"2023-11-12T22:23:52.000Z","dependencies_parsed_at":"2023-10-02T01:36:18.226Z","dependency_job_id":"c178e5d5-3b15-417b-bd86-fcc987e46397","html_url":"https://github.com/hodgesmr/biden_nlp","commit_stats":{"total_commits":13,"total_committers":1,"mean_commits":13.0,"dds":0.0,"last_synced_commit":"d6eb55d9049e6fc126f22ddf9f65ecffd01d4d4f"},"previous_names":["hodgesmr/biden_nlp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hodgesmr/biden_nlp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hodgesmr%2Fbiden
_nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hodgesmr%2Fbiden_nlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hodgesmr%2Fbiden_nlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hodgesmr%2Fbiden_nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hodgesmr","download_url":"https://codeload.github.com/hodgesmr/biden_nlp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hodgesmr%2Fbiden_nlp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28655305,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T01:17:37.254Z","status":"online","status_checked_at":"2026-01-22T02:00:07.137Z","response_time":144,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","data-science","machine-learning","natural-language-processing","nlp","zstandard","zstd"],"created_at":"2026-01-22T05:12:07.820Z","updated_at":"2026-01-22T05:12:08.459Z","avatar_url":"https://github.com/hodgesmr.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BIDEN: Binary Inference Dictionaries for Electoral NLP\n\n![BIDEN](BIDEN.png)\n\nThis is a [Jupyter Notebook](https://github.com/hodgesmr/biden_nlp/blob/main/Binary_Inference_Dictionaries_Electoral_NLP.ipynb) that introduces BIDEN: Binary 
Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python.\n\nIt is largely built on the strategies presented by [FTCC](https://github.com/cyrilou242/ftcc), which, in turn, was a reaction to [Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors](https://github.com/bazingagin/npc_gzip) (the gzip method). Like FTCC, **BIDEN** is built atop [Zstandard](https://facebook.github.io/zstd/) (Zstd), which leverages [dictionary compression](https://facebook.github.io/zstd/#small-data). Zstd dictionary compression seeds a compressor with sample data, so that it can efficiently compress _small data_ (~1 KB) of similar composition. Seeding the compressor dictionaries acts as our \"training\" method for the model.\n\nThe **BIDEN** model was trained on the [ElectionEmails 2020](https://electionemails2020.org) data set — a database of over 900,000 political campaign emails from the 2020 US election cycle. **In compliance with the data set's [terms](https://electionemails2020.org/downloads/corpus_documentation_v1.0.pdf), the training data is NOT provided with this repository.** If you would like to train the **BIDEN** model yourself, you can [request a copy of the data for free](https://docs.google.com/forms/d/e/1FAIpQLSdcgjZo-D1nNON4d90H2j0VLtTdxiHK6Y8HPJSpdRu4w5YILw/viewform). 
The **BIDEN** model was trained on `corpus_v1.0`.\n\nIt also demonstrates success at fast partisan classification for tweets and samples from the [campaign email database](https://political-emails.herokuapp.com/emails) maintained by [Derek Willis](https://www.thescoop.org).\n\nThe idea of classification by compression is not new; Russell and Norvig wrote about it in 1995 in the venerable [Artificial Intelligence: A Modern Approach](https://aima.cs.berkeley.edu/3rd-ed/):\n\n![Classification by data compression](aiama.png)\n\nMore recently, the [\"gzip beats BERT\" paper](https://aclanthology.org/2023.findings-acl.426/) got a lot of attention. What the **BIDEN** model demonstrates is that this technique is effective and likely generalizable to modern partisan texts.\n\n## License\n\nAll code is provided under the [BSD 3-Clause license](https://github.com/hodgesmr/biden_nlp/blob/main/LICENSE).\n\n## A Matt Hodges project\n\nThis project is maintained by [@MattHodges](https://mastodon.social/@MattHodges).\n\n_Please use it for good, not evil._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhodgesmr%2Fbiden_nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhodgesmr%2Fbiden_nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhodgesmr%2Fbiden_nlp/lists"}