{"id":23021151,"url":"https://github.com/alea-institute/kl3m-data","last_synced_at":"2025-09-02T10:43:25.997Z","repository":{"id":261827140,"uuid":"859325637","full_name":"alea-institute/kl3m-data","owner":"alea-institute","description":"KL3M training data collection and preprocessing","archived":false,"fork":false,"pushed_at":"2025-04-14T12:10:02.000Z","size":13133,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-23T07:51:33.822Z","etag":null,"topics":["ai","alea","kl3m","training-data"],"latest_commit_sha":null,"homepage":"https://aleainstitute.ai/data/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alea-institute.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-18T13:27:07.000Z","updated_at":"2025-04-14T12:10:06.000Z","dependencies_parsed_at":"2025-02-03T19:25:26.567Z","dependency_job_id":"fbd69133-b3bb-45ed-b172-a88d54fc11f7","html_url":"https://github.com/alea-institute/kl3m-data","commit_stats":null,"previous_names":["alea-institute/kl3m-data"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alea-institute/kl3m-data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alea-institute%2Fkl3m-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alea-institute%2Fkl3m-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alea-institute%2Fkl3m-data/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/alea-institute%2Fkl3m-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alea-institute","download_url":"https://codeload.github.com/alea-institute/kl3m-data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alea-institute%2Fkl3m-data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273272549,"owners_count":25075981,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","alea","kl3m","training-data"],"created_at":"2024-12-15T12:16:45.543Z","updated_at":"2025-09-02T10:43:25.941Z","avatar_url":"https://github.com/alea-institute.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KL3M Training Data\n## Collection and Preprocessing of Training Data for KL3M\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\n## Description\n\nThis [ALEA](https://aleainstitute.ai/) project contains the complete source code to collect and preprocess\nall training data related to the [KL3M embedding and generative models](https://kl3m.ai/). 
The KL3M Data Project \nprovides a comprehensive, copyright-clean dataset for training large language models, addressing legal risks in \nAI data collection.\n\n### Key Features\n- Over 132 million documents spanning trillions of tokens\n- Verifiably public domain or appropriately licensed sources\n- Complete source code for document acquisition and processing\n- Multi-stage data access with original formats, extracted content, and pre-tokenized representations\n\n\n## Paper\n[The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models](https://arxiv.org/html/2504.07854v1)\n\n## Dataset\n[Hugging Face Dataset: kl3m-data-snapshot-20250324](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)\n\n## Citation\n```bibtex\n@misc{bommarito2025kl3mdata,\n  title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},\n  author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},\n  year={2025},\n  eprint={2504.07854},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL}\n}\n```\n\n## Primary Sources\n\n### Summary\nTODO: Table\n\n\n### US\n\n* [x] us/dockets: PACER/RECAP docket sheets via archive.org\n* [x] us/dotgov: filtered .gov TLD domains via direct retrieval\n* [x] us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API\n* [x] us/edgar: SEC EDGAR data via SEC feed\n* [x] us/fdlp: US Federal Depository Library Program (FDLP) via GPO\n* [x] us/fr: Federal Register data via NARA/GPO API\n* [x] us/govinfo: US Government Publishing Office (GPO) data via GovInfo API\n* [x] us/recap: RECAP raw documents via S3\n* [x] us/recap_docs: RECAP attached docs (Word, WordPerfect, PDF, MP3) via S3\n* [x] us/reg_docs: Documents associated with regulations.gov dockets via regulations.gov API\n* [x] us/usc: US Code releases via Office of the Law Revision Counsel (OLRC)\n* [x] us/uspto_patents: USPTO patent grants via USPTO bulk data\n\n\n### EU (\"Federal\")\n\n * [x] 
eu/eurlex_oj: EU Official Journal via Cellar/Europa\n\n### UK\n\n * [x] uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download\n\n\n### Germany\n\n * [ ] de/bundesgesetzblatt: Bundesgesetzblatt (BGBl) 2023- from recht.bund.de\n\n\n### Australia\n\n### Canada\n\n### India\n\n## Tasks\n\n### Extraction\n\n\n### Summarization\n\n\n### Transform and Convert\n\n\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/alea-institute/kl3m-data.git\ncd kl3m-data\n\n# Install dependencies using Poetry\npoetry install\n```\n\n## Usage\n\n### Accessing the Dataset\nThe KL3M dataset is available through multiple channels:\n\n1. **Hugging Face**:\n   ```python\n   from datasets import load_dataset\n   dataset = load_dataset(\"alea-institute/kl3m-data-snapshot-20250324\")\n   ```\n\n2. **S3 Bucket**:\n   ```bash\n   aws s3 ls s3://data.kl3m.ai/\n   ```\n\n3. **Project Website**:\n   Visit [https://gallery.kl3m.ai/](https://gallery.kl3m.ai/) for more information.\n\n## License\n\nThe source code for this ALEA project is released under the MIT License. See the [LICENSE](LICENSE) file for details.\n\nTop-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. See `poetry show --tree` for details.\n\n## Support\n\nIf you encounter any issues or have questions about using this ALEA project, please [open an issue](https://github.com/alea-institute/kl3m-data/issues) on GitHub.\n\n## Learn More\n\nTo learn more about ALEA and our KL3M models and data, visit the [ALEA website](https://aleainstitute.ai/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falea-institute%2Fkl3m-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falea-institute%2Fkl3m-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falea-institute%2Fkl3m-data/lists"}