https://github.com/alea-institute/kl3m-data
KL3M training data collection and preprocessing
- Host: GitHub
- URL: https://github.com/alea-institute/kl3m-data
- Owner: alea-institute
- License: mit
- Created: 2024-09-18T13:27:07.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-08T16:34:14.000Z (about 1 month ago)
- Last Synced: 2024-11-08T17:29:04.644Z (about 1 month ago)
- Topics: ai, alea, kl3m, training-data
- Language: Python
- Homepage: https://aleainstitute.ai/data/
- Size: 255 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# KL3M Training Data
## Collection and Preprocessing of Training Data for KL3M
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## Description
This [ALEA](https://aleainstitute.ai/) project contains the complete source code to collect and preprocess
all training data related to the [KL3M embedding and generative models](https://kl3m.ai/).
## Paper
Pending arXiv submission
## Citation
Pending arXiv submission
## Primary Sources
### Summary
TODO: Table
### US
* [x] us/dockets: PACER/RECAP docket sheets via archive.org
* [x] us/dotgov: filtered .gov TLD domains via direct retrieval
* [x] us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API
* [x] us/edgar: SEC EDGAR data via SEC feed
* [x] us/fdlp: US Federal Depository Library Program (FDLP) via GPO
* [x] us/fr: Federal Register data via NARA/GPO API
* [x] us/govinfo: US Government Publishing Office (GPO) data via GovInfo API
* [x] us/recap: RECAP raw documents via S3
* [x] us/recap_docs: RECAP attached docs (Word, WordPerfect, PDF, MP3) via S3
* [x] us/reg_docs: Documents associated with regulations.gov dockets via regulations.gov API
* [x] us/usc: US Code releases via Office of the Law Revision Counsel (OLRC)
* [x] us/uspto_patents: USPTO patent grants via USPTO bulk data
### EU ("Federal")
* [x] eu/eurlex_oj: EU Official Journal via Cellar/Europa
### UK
* [x] uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download
### Germany
* [ ] de/bundesgesetzblatt: Bundesgesetzblatt (BGBl) 2023- from recht.bund.de
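The checklists in the jurisdiction sections above (US, EU, UK, Germany) follow a regular `* [x] <jurisdiction>/<name>: <description>` pattern, so a per-jurisdiction summary can be derived from them mechanically. The following is a minimal sketch and is not part of this repository: the function names `parse_checklist` and `summarize` are hypothetical, the embedded text is only an excerpt of the lists, and it assumes that a checked box means the source is already collected.

```python
import re
from collections import Counter

# Excerpt of the checklist lines from this README (not the full list).
CHECKLIST = """\
* [x] us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API
* [x] us/fr: Federal Register data via NARA/GPO API
* [x] eu/eurlex_oj: EU Official Journal via Cellar/Europa
* [x] uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download
* [ ] de/bundesgesetzblatt: Bundesgesetzblatt (BGBl) 2023- from recht.bund.de
"""

# Matches "* [x] <jurisdiction>/<name>: <description>" (or "[ ]" when unchecked).
ITEM_RE = re.compile(
    r"^\* \[(?P<mark>[ x])\] (?P<jur>[a-z]+)/(?P<name>\w+): (?P<desc>.+)$"
)


def parse_checklist(text):
    """Return one dict per checklist line; lines that do not match are skipped."""
    items = []
    for line in text.splitlines():
        m = ITEM_RE.match(line.strip())
        if m:
            items.append({
                "jurisdiction": m["jur"],
                "name": m["name"],
                # Assumption: a checked box means the source is collected.
                "done": m["mark"] == "x",
                "description": m["desc"],
            })
    return items


def summarize(items):
    """Map each jurisdiction to a (checked, unchecked) pair of counts."""
    done, pending = Counter(), Counter()
    for item in items:
        (done if item["done"] else pending)[item["jurisdiction"]] += 1
    return {j: (done[j], pending[j]) for j in sorted(set(done) | set(pending))}


if __name__ == "__main__":
    for jur, (n_done, n_pending) in summarize(parse_checklist(CHECKLIST)).items():
        print(f"{jur}: {n_done} checked, {n_pending} unchecked")
```

Replacing the excerpt with the full text of the lists gives counts for every source named in this README.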
### Australia
### Canada
### India
## Tasks
### Extraction
### Summarization
### Transform and Convert
## Installation
TODO
## Usage
TODO
## License
The source code for this ALEA project is released under the MIT License. See the [LICENSE](LICENSE) file for details.
Top-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. See `poetry show --tree` for details.
## Support
If you encounter any issues or have questions about using this ALEA project, please [open an issue](https://github.com/alea-institute/kl3m-data/issues) on GitHub.
## Learn More
To learn more about ALEA and our KL3M models and data, visit the [ALEA website](https://aleainstitute.ai/).