https://github.com/radi-cho/datasetGPT

A command-line interface to generate textual and conversational datasets with LLMs.

cli dataset-generation large-language-models python3

Host: GitHub
URL: https://github.com/radi-cho/datasetGPT
Owner: radi-cho
Created: 2023-03-12T18:14:37.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-08-25T16:39:10.000Z (over 2 years ago)
Last Synced: 2025-04-05T11:12:12.653Z (8 months ago)
Topics: cli, dataset-generation, large-language-models, python3
Language: Python
Homepage:
Size: 59.6 KB
Stars: 294
Watchers: 4
Forks: 19
Open Issues: 4
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-gpt4 - datasetGPT - A command-line interface to generate textual and conversational datasets with LLMs. (Tools / Open-source projects)
jimsghstars - radi-cho/datasetGPT - A command-line interface to generate textual and conversational datasets with LLMs. (Python)
awesome-langchain - datasetGPT - A command-line interface to generate textual and conversational datasets with LLMs. (Tools / Services)
awesome_ai_agents - Datasetgpt - A command-line interface to generate textual and conversational datasets with LLMs. (Building / Datasets)
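awesome-gpt4-zh-CN - datasetGPT - A command-line interface for generating textual and conversational datasets. (Tips - [Meeting scheduling](https://github.com/stephanj/Scheduling-using-GPT4) using GPT-4. / Open-source projects)
awesome-langchain-zh - datasetGPT - A command-line interface for generating textual and conversational datasets with LLMs. (Tools / Services)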
awesome-cli-apps-in-a-csv - datasetGPT - A command-line interface and a Python library for inferencing Large Language Models to generate textual datasets. (Data management)
awesome-cli-apps - datasetGPT - A command-line interface and a Python library for inferencing Large Language Models to generate textual datasets. (Data management)

README

# datasetGPT

`datasetGPT` is a command-line interface and a Python library for running inference on large language models (LLMs) to generate textual datasets.

Possible use cases may include:

- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over a large number of input texts. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging the APIs of very large LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model on them.

> This tool is distributed freely and doesn't imply any restrictions on the downstream use cases.
> However, you should make sure to follow the **Terms of use** of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.

## Installation

```
pip install datasetGPT
```

Most of the generation features rely on third-party APIs. Install their respective packages:

```
pip install openai cohere petals
```

## Usage examples

### Inference LLMs at scale

```bash
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."

datasetGPT texts \
--prompt "If {country} was a planet in the Star Wars universe it would be called" \
--backend "openai|text-davinci-003" \
--backend "cohere|medium" \
--temperature 0.9 \
--option country Germany \
--option country France \
--max-length 50 \
--num-samples 1 \
--single-file
```

The command above should produce a dataset file with 4 texts: every combination of option values is used with each backend LLM (2 backends × 2 values of `country` × 1 temperature × 1 maximum length × 1 sample). Check out the [CLI reference](#cli-reference) for more details. A dataset file looks like this:

```json
[
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
},
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "France",
"prompt": "If France was a planet in the Star Wars universe it would be called",
"output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
},
{
"sample_id": 0,
"backend": "cohere|medium",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
},
...
]
```
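The way the parameter grid expands into dataset items can be sketched in plain Python. This is only an illustration of the combination logic described above, not datasetGPT's actual implementation: `expand_options` and `enumerate_items` are hypothetical helper names, and no API calls are made.

```python
from itertools import product

PROMPT = "If {country} was a planet in the Star Wars universe it would be called"


def expand_options(pairs):
    """Group (name, value) pairs by name; return one dict per value combination."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    names = list(grouped)
    return [dict(zip(names, values)) for values in product(*grouped.values())]


def enumerate_items(prompt, backends, temperatures, max_lengths, options, num_samples):
    """List every dataset item the parameter grid implies (hypothetical helper)."""
    items = []
    grid = product(backends, max_lengths, temperatures, expand_options(options))
    for backend, max_length, temperature, chosen in grid:
        for sample_id in range(num_samples):
            items.append({
                "sample_id": sample_id,
                "backend": backend,
                "max_length": max_length,
                "temperature": temperature,
                **chosen,
                # Fill the {OPTION_NAME} placeholders in the prompt template.
                "prompt": prompt.format(**chosen),
            })
    return items


items = enumerate_items(
    PROMPT,
    backends=["openai|text-davinci-003", "cohere|medium"],
    temperatures=[0.9],
    max_lengths=[50],
    options=[("country", "Germany"), ("country", "France")],
    num_samples=1,
)
print(len(items))  # 2 backends x 2 countries x 1 temperature x 1 length x 1 sample = 4
```

Running a sketch like this before launching a real job is a cheap way to estimate how many API requests a command will trigger.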

Alternatively, you can use our [`TextsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/texts.py#L27) to produce texts in Python.

### Generate conversations with the ChatGPT API

```bash
export OPENAI_API_KEY="..."
datasetGPT conversations \
--length 4 \
--length 5 \
--agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
--agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
--temperature 0.1 \
--temperature 0.2 \
--option n 2 \
--option n 3 \
--path dataset
```

The command above should produce a dataset directory with 16 conversations saved as JSON files. To save all of them in a single file instead, pass `--single-file`. Why 16? `num_samples` dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options). Here, 2 lengths × 2 temperatures × 2 values of `n` give 8 combinations, so 16 conversations correspond to 2 samples per combination. A dataset item looks like this:

```json
{
"length": 5,
"temperature": 0.1,
"n": "2",
"agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
"agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
"utterances": [
[
"agent1",
"Hello! How can I assist you today?"
],
[
"agent2",
"Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
],
[
"agent1",
"Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
],
[
"agent2",
"I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
],
...
]
}
```
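A conversation item in this shape is straightforward to post-process. The sketch below assumes only the structure shown above (an `utterances` list of `[speaker, text]` pairs) and uses a shortened copy of the item; `to_transcript` and `utterances_per_agent` are hypothetical helper names, not part of datasetGPT.

```python
import json
from collections import Counter

# Shortened copy of the dataset item shown above.
ITEM_JSON = """
{
  "length": 5,
  "temperature": 0.1,
  "n": "2",
  "utterances": [
    ["agent1", "Hello! How can I assist you today?"],
    ["agent2", "Hi! I'm interested in buying two pets."]
  ]
}
"""


def to_transcript(item):
    """Render the utterances as 'speaker: text' lines."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in item["utterances"])


def utterances_per_agent(item):
    """Count how many utterances each agent contributed."""
    return Counter(speaker for speaker, _ in item["utterances"])


item = json.loads(ITEM_JSON)
print(to_transcript(item))
print(dict(utterances_per_agent(item)))
```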

You can also use our [`ConversationsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/conversations.py#L43) to produce conversations with the `gpt-3.5-turbo` API programmatically.

```python
from datasetGPT import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter()  # single_file=True

generator_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
    agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
    num_samples=2,
    interruption="length",
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
    model="gpt-3.5-turbo",  # or model="gpt-4"
    # Specific models per agent can be set. If one of the parameters is set,
    # the other must also be provided, otherwise both agents use the "model" value.
    model_agent_one="gpt-3.5-turbo",
    model_agent_two="gpt-4",
)

conversations_generator = ConversationsGenerator(generator_config)

# Each iteration yields one generated conversation; save it as it arrives.
for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)
```

## Contributing

> Still under active development.

Contributions will be highly appreciated. Currently these features are under development:

- [x] `datasetGPT conversations` - Make two ChatGPT agents talk with one another and record the conversation history.
- [x] `datasetGPT texts` - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
- [ ] `datasetGPT transformations` - Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
- [ ] Support more backend LLMs.

To set up a local development environment:

```bash
git clone https://github.com/radi-cho/datasetGPT/
cd datasetGPT
pip install -e .
```

## CLI Reference

```
datasetGPT [OPTIONS] COMMAND [ARGS]...

  Command line interface that generates datasets with LLMs.

Options:
  --help  Show this message and exit.

Commands:
  conversations  Produce conversations between two gpt-3.5-turbo agents...
  texts          Inference multiple LLMs at scale.
```

```
datasetGPT texts [OPTIONS]

  Inference multiple LLMs at scale.

Options:
  -p, --prompt TEXT          Input prompt. [required]
  -b, --backend TEXT         LLM APIs to use as backends. Use
                             "backend|model_name" notation. For example:
                             "openai|text-davinci-003".
  -l, --max-length INTEGER   Maximum number of tokens to generate for each
                             prompt.
  -t, --temperature FLOAT    Possible temperature values for the backend
                             language model.
  -n, --num-samples INTEGER  Number of conversations for each configuration.
  -o, --option ...           Values for additional options denoted in your
                             prompts by {OPTION_NAME}.
  -f, --path PATH            Where to save the dataset. Either a file or a
                             directory (folder).
  -s, --single-file          Either save the whole dataset to a single file
                             or create multiple files.
  --help                     Show this message and exit.
```

- You can specify multiple variants for the following options: `--max-length`, `--temperature`, `--num-samples`, `--option`. A dataset item will be generated for each possible combination of the supplied values.
- Each `--option` provided must be formatted as follows: `--option option_name "Some option value"`.
- Currently supported backends: GPT-3 and GPT-4 model variants by [OpenAI](https://openai.com/blog/openai-api), the language models by [Cohere](https://pypi.org/project/cohere/), BLOOM through the [Petals API](https://petals.ml/).
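The `"backend|model_name"` notation can be split with a few lines of Python. The helper below is a hypothetical sketch of how such a value might be validated, not the parser datasetGPT itself uses.

```python
def parse_backend(spec):
    """Split a 'backend|model_name' string into (backend, model_name).

    Hypothetical helper for illustration; raises ValueError when either
    part is missing.
    """
    backend, separator, model_name = spec.partition("|")
    if not separator or not backend or not model_name:
        raise ValueError(f'expected "backend|model_name", got {spec!r}')
    return backend, model_name


for spec in ["openai|text-davinci-003", "cohere|medium"]:
    backend, model_name = parse_backend(spec)
    print(f"backend={backend} model={model_name}")
```

Using `str.partition` rather than `str.split` keeps any further `|` characters inside the model name, which is a design choice of this sketch rather than documented behavior.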

```
datasetGPT conversations [OPTIONS]

  Produce conversations between two gpt-3.5-turbo agents with given roles.

Options:
  -k, --openai-api-key TEXT       OpenAI API key.
  -a, --agent1 TEXT               Agent role description. [required]
  -b, --agent2 TEXT               Agent role description. [required]
  -i, --interruption [length|end_phrase]
                                  Interruption mode.
  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by
                                  one of the agents.
  -d, --end-agent [agent1|agent2|both]
                                  In which agent's messages to look for the
                                  end phrase.
  -l, --length INTEGER            Maximum number of utterances for each agent.
                                  A conversation sample will be generated for
                                  each length.
  -t, --temperature FLOAT         Possible temperature values for the backend
                                  language model.
  -m, --model [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use. GPT-4 is only
                                  supported if provided API key has access to
                                  GPT4. Defaults to GPT-3.5-Turbo.
  -m1, --model-agent1 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent1. GPT-4
                                  is only supported if given API key has
                                  access to GPT4. Defaults to GPT-3.5-Turbo.
                                  If set, --model-agent2 must also be
                                  provided, otherwise --model value will be
                                  used.
  -m2, --model-agent2 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent2. GPT-4
                                  is only supported if given API key has
                                  access to GPT4. Defaults to GPT-3.5-Turbo If
                                  set, --model-agent1 must also be provided,
                                  otherwise --model value will be used.
  -n, --num-samples INTEGER       Number of conversations for each
                                  configuration.
  -o, --option ...                Values for additional options denoted in
                                  your prompts by {OPTION_NAME}.
  -f, --path PATH                 Where to save the dataset. Either a file or
                                  a directory (folder).
  -s, --single-file               Either save the whole dataset to a single
                                  file or create multiple files.
  --help                          Show this message and exit.
```

- The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.
- You can specify either `length` (default) or `end_phrase` as an interruption strategy. When using `end_phrase`, a conversation is interrupted once the `--end-phrase` appears in the messages of the `--end-agent` (which can be `agent1`, `agent2`, or `both`). In this case, the lengths provided are treated as maximum conversation lengths.
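The two interruption strategies can be sketched as a single stopping check. This is a minimal illustration of the rules described above, not the library's code: `should_stop` is a hypothetical function, and the case-sensitive substring match and the `"Goodbye"` default are assumptions.

```python
def should_stop(utterances, length, interruption="length",
                end_phrase="Goodbye", end_agent="both"):
    """Decide whether a conversation should end (hypothetical sketch).

    utterances: list of (speaker, text) pairs produced so far.
    length: maximum number of utterances for each agent.
    """
    # In both modes the length acts as an upper bound: two agents,
    # `length` utterances each.
    if len(utterances) >= 2 * length:
        return True
    if interruption == "end_phrase" and utterances:
        speaker, text = utterances[-1]
        # Assumption: a plain, case-sensitive substring match.
        if end_agent in ("both", speaker) and end_phrase in text:
            return True
    return False


history = [
    ("agent1", "Hello! How can I assist you today?"),
    ("agent2", "Thanks, that is all. Goodbye"),
]
print(should_stop(history, length=4))                             # length mode
print(should_stop(history, length=4, interruption="end_phrase"))  # end-phrase mode
```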
