https://github.com/dross20/tuatara
Generates high-quality fine-tuning pairs for large language models (LLMs) from unstructured documents.
dataset-generation fine-tuning graph knowledge-extraction llm nlp ocr python sft synthetic-data
- Host: GitHub
- URL: https://github.com/dross20/tuatara
- Owner: dross20
- License: mit
- Created: 2025-09-05T02:22:10.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-10-05T21:18:18.000Z (8 months ago)
- Last Synced: 2025-10-05T23:29:04.740Z (8 months ago)
- Topics: dataset-generation, fine-tuning, graph, knowledge-extraction, llm, nlp, ocr, python, sft, synthetic-data
- Language: Python
- Homepage:
- Size: 72.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
---
"Artificial intelligence is only as good as the data it learns from."
- Unknown
## 🦎 What is Tuatara?
Tuatara is a library for generating fine-tuning pairs for large language model (LLM) post-training.
## 🤔 Why Tuatara?
Fine-tuning large language models requires high-quality training pairs that are well grounded in their source documents. Creating these pairs manually is laborious and error-prone, and existing tools often lack flexibility or fail to scale across document types and domains. Tuatara addresses these challenges by generating the pairs directly from unstructured documents.
## 📦 Installation
Run the following command to install Tuatara:
```sh
pip install git+https://github.com/dross20/tuatara
```
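To confirm that the install succeeded, a quick check like the one below may help. It is a minimal sketch that relies only on the Python standard library; the helper name `is_installed` is hypothetical and not part of Tuatara.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# "tuatara" is the import name used in the Quickstart example.
print("tuatara installed:", is_installed("tuatara"))
```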
## 🚀 Quickstart
The following example demonstrates how to use Tuatara's preconfigured pipeline to create fine-tuning pairs from multiple documents. By default, `default_pipeline` uses the OpenAI API for LLM inference and reads your OpenAI API key from the environment variables.
```python
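from tuatara import default_pipeline

# Paths to the source documents (PDF and plain text in this example).
documents = [
    "./document1.pdf",
    "./document2.pdf",
    "./document3.txt",
]

# Build the preconfigured pipeline with the chosen OpenAI model.
pipeline = default_pipeline(model="gpt-4o")

# Run the pipeline; it returns the generated pairs and a history value.
pairs, history = pipeline(documents)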
```
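Because the default pipeline depends on an API key in the environment and on readable input files, it can be useful to check both before starting a run. The sketch below is one way to do that. It assumes the key is stored under `OPENAI_API_KEY` (the OpenAI SDK's conventional variable name; the README does not state which name Tuatara reads), and `preflight` is a hypothetical helper, not part of Tuatara.

```python
import os
from pathlib import Path


def preflight(documents, env_var="OPENAI_API_KEY", environ=None):
    """Return a list of problems found; an empty list means nothing was flagged.

    `environ` defaults to os.environ and can be replaced with a dict for testing.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # Assumption: the API key lives under OPENAI_API_KEY.
    if not environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")

    # Every input path should point at an existing file.
    for doc in documents:
        if not Path(doc).is_file():
            problems.append(f"document not found: {doc}")

    return problems
```

Calling `preflight(documents)` before `pipeline(documents)` and stopping when the returned list is non-empty avoids starting a run that would fail on a missing key or file.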
## 📜 License
This project is licensed under the [MIT license](https://github.com/dross20/tuatara/blob/2ab8b458f0d6d3109d7e5381c58961c9df992449/LICENSE).