https://github.com/dross20/tuatara

Generates high-quality fine-tuning pairs for large language models (LLMs) from unstructured documents.

dataset-generation fine-tuning graph knowledge-extraction llm nlp ocr python sft synthetic-data

Host: GitHub
URL: https://github.com/dross20/tuatara
Owner: dross20
License: mit
Created: 2025-09-05T02:22:10.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-10-05T21:18:18.000Z (10 months ago)
Last Synced: 2025-10-05T23:29:04.740Z (10 months ago)
Topics: dataset-generation, fine-tuning, graph, knowledge-extraction, llm, nlp, ocr, python, sft, synthetic-data
Language: Python
Homepage:
Size: 72.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![Static Badge](https://img.shields.io/badge/python-3.9+-green)
![GitHub license](https://img.shields.io/badge/license-MIT-brown.svg)
![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)

---

"Artificial intelligence is only as good as the data it learns from."

- Unknown

## 🦎 What is Tuatara?

Tuatara is a library for generating fine-tuning pairs for large language model (LLM) post-training.

## 🤔 Why Tuatara?

Fine-tuning large language models requires high-quality training data pairs that are well grounded in their source documents. Creating these pairs manually is laborious and error-prone, and existing tools often lack flexibility or fail to scale across different document types and domains. Tuatara addresses these challenges directly.

## 📦 Installation
Run the following command to install Tuatara:

```sh
pip install git+https://github.com/dross20/tuatara
```
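
To confirm that the install succeeded, a quick check along these lines may help. It is a minimal sketch that assumes the package is importable under the name `tuatara`, as the quickstart import suggests; `is_importable` is a hypothetical helper, not part of Tuatara.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no matching top-level module is installed.
    return importlib.util.find_spec(module_name) is not None


# Assumption: the installed package exposes the import name "tuatara".
print("tuatara importable:", is_importable("tuatara"))
```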

## 🚀 Quickstart
The following example uses Tuatara's preconfigured pipeline to create fine-tuning pairs from multiple documents. By default, `default_pipeline` uses the OpenAI API for LLM inference and looks for your OpenAI API key in your environment variables.

```python
from tuatara import default_pipeline

documents = [
    "./document1.pdf",
    "./document2.pdf",
    "./document3.txt",
]

pipeline = default_pipeline(model="gpt-4o")
pairs, history = pipeline(documents)
```
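
Because the default pipeline relies on an API key from the environment, it can be useful to fail early with a clear message when the key is missing. The sketch below assumes the key lives in `OPENAI_API_KEY`, the name the OpenAI Python SDK conventionally reads; this README does not state which variable Tuatara checks. `require_env_var` is a hypothetical helper, not part of Tuatara.

```python
import os


def require_env_var(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable, or raise if it is unset or empty."""
    # Assumption: OPENAI_API_KEY is the variable the pipeline ends up reading.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before building the pipeline."
        )
    return value
```

Calling `require_env_var()` just before `default_pipeline(...)` surfaces a missing key before any documents are processed.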

## 📜 License
This project is licensed under the [MIT license](https://github.com/dross20/tuatara/blob/2ab8b458f0d6d3109d7e5381c58961c9df992449/LICENSE).
