https://github.com/alea-institute/alea-preprocess
Accessible, efficient data preprocessing library for pretrain and SFT datasets, including KL3M
- Host: GitHub
- URL: https://github.com/alea-institute/alea-preprocess
- Owner: alea-institute
- Created: 2024-09-25T18:39:15.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-11T16:08:05.000Z (11 months ago)
- Last Synced: 2025-02-08T09:26:07.734Z (8 months ago)
- Topics: ai, alea, kl3m, preprocessing, pretraining
- Language: HTML
- Homepage:
- Size: 3.19 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# alea-preprocess
[PyPI version](https://badge.fury.io/py/alea-preprocess)
[License: MIT](https://opensource.org/licenses/MIT)
[PyPI project](https://pypi.org/project/alea-preprocess/)

## Description
Efficient, accessible preprocessing routines for pretrain, SFT, and DPO training data preparation.

This library is part of ALEA's open source large language model training pipeline, used in the research and development of the [KL3M](https://kl3m.ai/) project.

## Installation
This project is a work in progress and relies on compiled Rust code, so installing the package from GitHub source is recommended until a stable release is available.

You can install the latest release from PyPI using pip:
```
pip install alea-preprocess
```

You can install a development version of the package by running the following command:
```
poetry run maturin develop
```

## Examples
Example use cases are currently available under the `tests/` directory.

Additional documentation and examples will be provided in the future.
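
Before running the examples in `tests/`, it can help to confirm that the package is visible in the active Python environment. The following is a minimal sketch, not part of this project's API: it uses only the standard library's `importlib.metadata` and the distribution name `alea-preprocess` from the `pip install` command above. The helper name `installed_version` is hypothetical and chosen for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "alea-preprocess") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper; it does not import or call
    anything from alea-preprocess itself.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("alea-preprocess is not installed in this environment")
    else:
        print(f"alea-preprocess {version} is installed")
```

Because the check reads installed package metadata rather than importing the module, it makes no assumption about the package's import name or about the compiled Rust extension having loaded.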
## License
This ALEA project is released under the MIT License. See the [LICENSE](LICENSE) file for details.
## Support
If you encounter any issues or have questions about using this ALEA project, please [open an issue](https://github.com/alea-institute/alea-preprocess/issues) on GitHub.
## Learn More
To learn more about ALEA and its software and research projects like KL3M, visit the [ALEA website](https://aleainstitute.ai/).