Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/flairNLP/fabricator
[EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.
- Host: GitHub
- URL: https://github.com/flairNLP/fabricator
- Owner: flairNLP
- License: apache-2.0
- Created: 2023-05-24T11:26:07.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-16T10:55:39.000Z (6 months ago)
- Last Synced: 2024-06-30T08:05:09.445Z (5 months ago)
- Language: Python
- Homepage:
- Size: 458 KB
- Stars: 98
- Watchers: 6
- Forks: 12
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
README
![Fabricator Logo](resources/logo_fabricator.drawio_dark.png#gh-dark-mode-only)
![Fabricator Logo](resources/logo_fabricator.drawio_white.png#gh-light-mode-only)

A flexible open-source framework to generate datasets with large language models.
[Installation](#installation) | [Basic Concepts](#basic-concepts) | [Examples](#examples) | [Tutorials](tutorials/TUTORIAL-1_OVERVIEW.md) |
[Paper](https://arxiv.org/abs/2309.09582) | [Citation](#citation)
## News
- **[10/23]** We released the first version of this repository on PyPI. You can install it via `pip install fabricator-ai`.
- **[10/23]** Our paper got accepted at EMNLP 2023. You can find the preprint [here](https://arxiv.org/abs/2309.09582). You can find the experimental scripts under release v0.1.0.
- **[09/23]** Support for `gpt-3.5-turbo-instruct` added in the new [Haystack](https://github.com/deepset-ai/haystack) release!
- **[08/23]** Added several experimental scripts to investigate the generation and annotation abilities of `gpt-3.5-turbo` on various downstream tasks, as well as the influence of few-shot examples on performance.
- **[07/23]** Refactored major classes - you can now simply use our `BasePrompt` class to create your own customized prompts for every downstream task!
- **[07/23]** Added dataset transformations for token classification to prompt LLMs with textual spans rather than with lists of tags.
- **[06/23]** Initial version of fabricator supporting text classification and question answering tasks.

## Overview
This repository:
- is an easy-to-use open-source library to generate datasets with large language models. If you want to train
a model on a specific domain / label distribution / downstream task, you can use this framework to generate
a dataset for it.
- builds on top of deepset's haystack and huggingface's datasets libraries. Thus, we support a wide range
of language models, and you can load and use the generated datasets just as you know them from the Datasets library
for your model training.
- is highly flexible and offers various adaptation options such as
prompt customization, integration and sampling of fewshot examples, or annotation of unlabeled datasets.

## Installation
Using conda:
```
git clone [email protected]:flairNLP/fabricator.git
cd fabricator
conda create -y -n fabricator python=3.10
conda activate fabricator
pip install fabricator-ai
```

If you want to install in editable mode, you can use the following command:
```
pip install -e .
```

## Basic Concepts
This framework is based on the idea of using large language models to generate datasets for specific tasks. To do so,
we need four basic modules: a dataset, a prompt, a language model and a generator:
- Dataset: We use [huggingface's datasets library](https://github.com/huggingface/datasets) to load fewshot or
unlabeled datasets and store the generated or annotated datasets with their `Dataset` class. Once
created, you can share the dataset with others via the hub or use it for your model training.
- Prompt: A prompt is the instruction given to the language model. It can be a simple sentence or a more complex
template with placeholders. We provide an easy interface for custom dataset generation prompts in which you can specify
label options for the LLM to choose from, provide fewshot examples to support the prompt, or annotate an unlabeled
dataset in a specific way (see the sketch after this list).
- LLM: We use [deepset's haystack library](https://github.com/deepset-ai/haystack) as our LLM interface. Haystack
supports a wide range of LLMs, including OpenAI models, all models from the HuggingFace model hub, and many more.
- Generator: The generator is the core of this framework. It takes a dataset, a prompt and an LLM, and generates a
dataset based on your specifications.
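As a minimal sketch of how the dataset and prompt modules fit together, the snippet below builds a tiny fewshot dataset and a classification prompt with label options. The keyword arguments `label_options`, `generate_data_for_column` and `fewshot_example_columns` are illustrative assumptions; see the [tutorial](tutorials/TUTORIAL-1_OVERVIEW.md) for the exact interface.

```python
from datasets import Dataset
from fabricator.prompts import BasePrompt

# A tiny fewshot dataset built with huggingface's datasets library.
fewshot_dataset = Dataset.from_dict({
    "text": ["This movie was absolutely great!", "A boring, predictable plot."],
    "label": ["positive", "negative"],
})

# A classification prompt with label options for the LLM to choose from.
# Note: the keyword arguments shown here are illustrative assumptions; the
# tutorial documents the exact interface.
prompt = BasePrompt(
    task_description="Annotate the sentiment of the following movie review as one of: {}.",
    label_options=["positive", "negative"],
    generate_data_for_column="label",
    fewshot_example_columns="text",
)
```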
## Examples

With our library, you can generate datasets for any task you want. You can start as simply as this:

### Generate a dataset from scratch
```python
import os
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

prompt = BasePrompt(
    task_description="Generate a short movie review.",
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)

generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)

generated_dataset.push_to_hub("your-first-generated-dataset")
```

In our tutorial, we show how to create classification datasets with label options to choose from, how to include
fewshot examples, and how to annotate unlabeled data into predefined categories. An annotation setup of this kind is
sketched below.
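The following sketch shows roughly what annotating an unlabeled dataset into predefined categories can look like. Argument names such as `fewshot_dataset`, `unlabeled_dataset` and `fewshot_sampling_column` are illustrative assumptions; check the [tutorial](tutorials/TUTORIAL-1_OVERVIEW.md) for the exact interface.

```python
import os

from datasets import Dataset
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

# Fewshot examples and unlabeled data as huggingface Datasets.
fewshot_dataset = Dataset.from_dict({
    "text": ["A beautiful, moving film.", "Two hours I will never get back."],
    "label": ["positive", "negative"],
})
unlabeled_dataset = Dataset.from_dict({
    "text": ["The acting was solid, but the story fell flat."],
})

# Classification prompt with label options; keyword arguments are illustrative.
prompt = BasePrompt(
    task_description="Annotate the sentiment of the following movie review as one of: {}.",
    label_options=["positive", "negative"],
    generate_data_for_column="label",
    fewshot_example_columns="text",
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

# The generator annotates the unlabeled dataset, supported by sampled fewshot
# examples. The generate() arguments below are assumptions, not a definitive API.
generator = DatasetGenerator(prompt_node)
annotated_dataset = generator.generate(
    prompt_template=prompt,
    fewshot_dataset=fewshot_dataset,
    fewshot_sampling_column="label",
    unlabeled_dataset=unlabeled_dataset,
    max_prompt_calls=10,
)
```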
## Citation

If you find this repository useful, please cite our work.
```
@inproceedings{golde2023fabricator,
title = "Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher {LLM}s",
author = "Golde, Jonas and Haller, Patrick and Hamborg, Felix and Risch, Julian and Akbik, Alan",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-demo.1",
pages = "1--11",
}
```