{"id":13643782,"url":"https://github.com/radi-cho/datasetgpt","last_synced_at":"2025-04-05T11:12:33.676Z","repository":{"id":142208136,"uuid":"613044182","full_name":"radi-cho/datasetGPT","owner":"radi-cho","description":"A command-line interface to generate textual and conversational datasets with LLMs.","archived":false,"fork":false,"pushed_at":"2023-08-25T16:39:10.000Z","size":61,"stargazers_count":293,"open_issues_count":4,"forks_count":19,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-29T10:09:30.949Z","etag":null,"topics":["cli","dataset-generation","large-language-models","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/radi-cho.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-12T18:14:37.000Z","updated_at":"2025-03-12T06:26:27.000Z","dependencies_parsed_at":"2024-01-06T20:54:51.650Z","dependency_job_id":null,"html_url":"https://github.com/radi-cho/datasetGPT","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radi-cho%2FdatasetGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radi-cho%2FdatasetGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radi-cho%2FdatasetGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radi-cho%2FdatasetGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/radi-cho","download_url":"https://codeload.githu
b.com/radi-cho/datasetGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247325695,"owners_count":20920714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","dataset-generation","large-language-models","python3"],"created_at":"2024-08-02T01:01:52.518Z","updated_at":"2025-04-05T11:12:33.653Z","avatar_url":"https://github.com/radi-cho.png","language":"Python","funding_links":[],"categories":["CLIs"],"sub_categories":[],"readme":"# datasetGPT\n\n`datasetGPT` is a command-line interface and a Python library for inferencing Large Language Models to generate textual datasets.\n\nPossible use cases may include:\n\n- Constructing textual corpora to train/fine-tune detectors for content written by AI.\n- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.\n- Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.\n- Leveraging APIs of especially big LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model on them.\n\n\u003e This tool is distributed freely and doesn't imply any restrictions on the downstream use cases.\n\u003e However, you should make sure to follow the **Terms of use** of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.\n\n## Installation\n\n```\npip install datasetGPT\n```\n\nMost of the generation features rely on third-party APIs. 
Install their respective packages:\n\n```\npip install openai cohere petals\n```\n\n## Usage examples\n\n### Inference LLMs at scale\n\n```bash\nexport OPENAI_API_KEY=\"...\"\nexport COHERE_API_KEY=\"...\"\n\ndatasetGPT texts \\\n    --prompt \"If {country} was a planet in the Star Wars universe it would be called\" \\\n    --backend \"openai|text-davinci-003\" \\\n    --backend \"cohere|medium\" \\\n    --temperature 0.9 \\\n    --option country Germany \\\n    --option country France \\\n    --max-length 50 \\\n    --num-samples 1 \\\n    --single-file\n```\n\nThe command above should produce a dataset file with 4 texts. Each possible combination of options is used for each of the backend LLMs. Check out the [CLI reference](#cli-reference) for more details. A dataset file looks like this:\n\n```json\n[\n    {\n        \"sample_id\": 0,\n        \"backend\": \"openai|text-davinci-003\",\n        \"max_length\": 50,\n        \"temperature\": 0.9,\n        \"country\": \"Germany\",\n        \"prompt\": \"If Germany was a planet in the Star Wars universe it would be called\",\n        \"output\": \" Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be\"\n    },\n    {\n        \"sample_id\": 0,\n        \"backend\": \"openai|text-davinci-003\",\n        \"max_length\": 50,\n        \"temperature\": 0.9,\n        \"country\": \"France\",\n        \"prompt\": \"If France was a planet in the Star Wars universe it would be called\",\n        \"output\": \" The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. 
It would have a strong military and\"\n    },\n    {\n        \"sample_id\": 0,\n        \"backend\": \"cohere|medium\",\n        \"max_length\": 50,\n        \"temperature\": 0.9,\n        \"country\": \"Germany\",\n        \"prompt\": \"If Germany was a planet in the Star Wars universe it would be called\",\n        \"output\": \" the Hoth of the universe.\\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930\"\n    },\n    ...\n]\n```\n\nAlternatively, you can use our [`TextsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/texts.py#L27) to produce texts in Python.\n\n### Generate conversations with the ChatGPT API\n\n```bash\nexport OPENAI_API_KEY=\"...\"\ndatasetGPT conversations \\\n    --length 4 \\\n    --length 5 \\\n    --agent1 \"You're a shop assistant in a pet store. Answer to customer questions politely.\" \\\n    --agent2 \"You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.\" \\\n    --temperature 0.1 \\\n    --temperature 0.2 \\\n    --option n 2 \\\n    --option n 3 \\\n    --path dataset\n```\n\nThe command above should produce a dataset directory with 16 conversations saved as JSON files. You can specify if you want all of them to be saved in a single file. But why 16? Because `num_samples` dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options). A dataset item looks like this:\n\n```json\n{\n    \"length\": 5,\n    \"temperature\": 0.1,\n    \"n\": \"2\",\n    \"agent1\": \"You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \\\"Goodbye\\\".\",\n    \"agent2\": \"You're a customer in a pet store. You should behave like a human. 
You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \\\"Goodbye\\\".\",\n    \"utterances\": [\n        [\n            \"agent1\",\n            \"Hello! How can I assist you today?\"\n        ],\n        [\n            \"agent2\",\n            \"Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?\"\n        ],\n        [\n            \"agent1\",\n            \"Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?\"\n        ],\n        [\n            \"agent2\",\n            \"I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?\"\n        ],\n        ...\n    ]\n}\n```\n\nYou can also use our [`ConversationsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/conversations.py#L43) to generate conversations with the `gpt-3.5-turbo` API programmatically.\n\n```python\nfrom datasetGPT import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter\n\ndataset_writer = DatasetWriter() # single_file=True\n\ngenerator_config = ConversationsGeneratorConfig(agent1=\"You're a shop assistant in a pet store. Answer to customer questions politely.\",\n                                                agent2=\"You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. 
Ask questions about the pets in the store.\",\n                                                num_samples=2,\n                                                interruption=\"length\",\n                                                lengths=[4, 5],\n                                                temperatures=[0.1, 0.2],\n                                                options=[(\"n\", \"2\"), (\"n\", \"3\")],\n                                                model=\"gpt-3.5-turbo\", # or model=\"gpt-4\"\n                                                # Specific models per agent can be set. If one of the parameters is set, the other must also be provided, otherwise both agents use the \"model\" value.\n                                                model_agent_one=\"gpt-3.5-turbo\",\n                                                model_agent_two=\"gpt-4\"\n\n                                             )\n\nconversations_generator = ConversationsGenerator(generator_config)\n\nfor conversation in conversations_generator:\n    dataset_writer.save_intermediate_result(conversation)\n```\n\n## Contributing\n\n\u003e Still under active development.\n\nContributions will be highly appreciated. Currently these features are under development:\n\n- [x] `datasetGPT conversations` - Make two ChatGPT agents talk with one another and record the conversation history.\n- [x] `datasetGPT texts` - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.\n- [ ] `datasetGPT transformations` - Apply a list of transformations to a list of texts. 
For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.\n- [ ] Support more backend LLMs.\n\nTo set up a local development environment:\n\n```bash\ngit clone https://github.com/radi-cho/datasetGPT/\ncd datasetGPT\npip install -e .\n```\n\n## CLI Reference\n\n```\ndatasetGPT [OPTIONS] COMMAND [ARGS]...\n\n  Command line interface that generates datasets with LLMs.\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  conversations  Produce conversations between two gpt-3.5-turbo agents...\n  texts          Inference multiple LLMs at scale.\n```\n\n```\ndatasetGPT texts [OPTIONS]\n\n  Inference multiple LLMs at scale.\n\nOptions:\n  -p, --prompt TEXT            Input prompt.  [required]\n  -b, --backend TEXT           LLM APIs to use as backends. Use\n                               \"backend|model_name\" notation. For example:\n                               \"openai|text-davinci-003\".\n  -l, --max-length INTEGER     Maximum number of tokens to generate for each\n                               prompt.\n  -t, --temperature FLOAT      Possible temperature values for the backend\n                               language model.\n  -n, --num-samples INTEGER    Number of conversations for each configuration.\n  -o, --option \u003cTEXT TEXT\u003e...  Values for additional options denoted in your\n                               prompts by {OPTION_NAME}.\n  -f, --path PATH              Where to save the dataset. Either a file or a\n                               directory (folder).\n  -s, --single-file            Either save the whole dataset to a single file\n                               or create multiple files.\n  --help                       Show this message and exit.\n```\n\n- You can specify multiple variants for the following options: `--max-length`, `--temperature`, `--num-samples`, `--option`. 
A dataset item will be generated for each possible combination of the supplied values.\n- Each `--option` provided must be formatted as follows: `--option option_name \"Some option value\"`.\n- Currently supported backends: GPT-3 and GPT-4 model variants by [OpenAI](https://openai.com/blog/openai-api), the language models by [Cohere](https://pypi.org/project/cohere/), BLOOM through the [Petals API](https://petals.ml/).\n\n```\ndatasetGPT conversations [OPTIONS]\n\n  Produce conversations between two gpt-3.5-turbo agents with given roles.\n\nOptions:\n  -k, --openai-api-key TEXT       OpenAI API key.\n  -a, --agent1 TEXT               Agent role description.  [required]\n  -b, --agent2 TEXT               Agent role description.  [required]\n  -i, --interruption [length|end_phrase]\n                                  Interruption mode.\n  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by\n                                  one of the agents.\n  -d, --end-agent [agent1|agent2|both]\n                                  In which agent's messages to look for the\n                                  end phrase.\n  -l, --length INTEGER            Maximum number of utterances for each agent.\n                                  A conversation sample will be generated for\n                                  each length.\n  -t, --temperature FLOAT         Possible temperature values for the backend\n                                  language model.\n  -m, --model [gpt-3.5-turbo|gpt-4]\n                                  OpenAI Chat model to use. GPT-4 is only\n                                  supported if provided API key has access to\n                                  GPT4. Defaults to GPT-3.5-Turbo.\n  -m1, --model-agent1 [gpt-3.5-turbo|gpt-4]\n                                  OpenAI Chat model to use for agent1. GPT-4\n                                  is only supported if given API key has\n                                  access to GPT4. 
Defaults to GPT-3.5-Turbo.\n                                  If set, --model-agent2 must also be\n                                  provided, otherwise --model value will be\n                                  used.\n  -m2, --model-agent2 [gpt-3.5-turbo|gpt-4]\n                                  OpenAI Chat model to use for agent2. GPT-4\n                                  is only supported if given API key has\n                                  access to GPT4. Defaults to GPT-3.5-Turbo If\n                                  set, --model-agent1 must also be provided,\n                                  otherwise --model value will be used.\n  -n, --num-samples INTEGER       Number of conversations for each\n                                  configuration.\n  -o, --option \u003cTEXT TEXT\u003e...     Values for additional options denoted in\n                                  your prompts by {OPTION_NAME}.\n  -f, --path PATH                 Where to save the dataset. Either a file or\n                                  a directory (folder).\n  -s, --single-file               Either save the whole dataset to a single\n                                  file or create multiple files.\n  --help                          Show this message and exit.\n```\n\n- The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.\n- You can specify either `length` (default) or `end_phrase` as an interruption strategy. When using `end_phrase` a conversation will be interrupted once the `--end-phrase` has appeared in the messages of the `--end-agent` (could be both). 
In this case, the lengths provided will be treated as maximum conversation lengths.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradi-cho%2Fdatasetgpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fradi-cho%2Fdatasetgpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradi-cho%2Fdatasetgpt/lists"}