Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Brandon82/llm-dataset-gen
Using LLMs (OpenAI API) to generate and add data to datasets
https://github.com/Brandon82/llm-dataset-gen
dataset-generation datasets openai openai-api
Last synced: 3 days ago
JSON representation
Using LLMs (OpenAI API) to generate and add data to datasets
- Host: GitHub
- URL: https://github.com/Brandon82/llm-dataset-gen
- Owner: Brandon82
- License: mit
- Created: 2023-12-04T02:17:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-05T08:21:11.000Z (about 1 year ago)
- Last Synced: 2024-01-05T09:30:44.356Z (about 1 year ago)
- Topics: dataset-generation, datasets, openai, openai-api
- Language: Python
- Homepage:
- Size: 198 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Llm-Dataset-Gen - Using LLMs (OpenAI API) to generate and add data to datasets (Building / Datasets)
- awesome_ai_agents - Llm-Dataset-Gen - Using LLMs (OpenAI API) to generate and add data to datasets (Building / Datasets)
README
# llm-dataset-gen
Provides a `LLMDataset` class for generating and adding data to `.csv` datasets using LLMs (OpenAI API)## Installation
Install the following packages:
`pip install openai==1.3.5 pandas==2.1.3 python-dotenv==1.0.0`## Usage
**1. Create a .env file in the root directory of the project and add your OpenAI API key to it:**
```
OPENAI_API_KEY=
```
**2. Create an empty dataset file using the `create_dataset.py` script**
> You can skip this step if you already have a dataset file**3. Create an instance of the `LLMDataset` class and provide a `dataset_path`:**
```python
from llm_dataset_gen import LLMDataset
data_filepath = "./data/Dataset.csv"
dataset = LLMDataset(dataset_path=data_filepath)
```
**4. Call the `add_data` method by providing the `context` and `num_samples` parameters:**
```python
dataset_context="For Context, this dataset represents requirements engineering excerpts and their corresponding Language Construct (LC) and Language Quality (LQ) codings"
dataset.add_data(context=dataset_context, num_samples=20)
```
- The `add_data` method will automatically overwrite/save the dataset file after appending the new data
- The `context` parameter is the prompt that will be used to generate the data
- The `num_samples` parameter is the number of data samples to generate and add to the dataset### How It Works
The `LLMDataset` class is designed to manage a dataset and interact with the OpenAI API to generate new data entries. By using the JSON Mode of the OpenAI API and the `gpt-4-1106-preview` or `gpt-3.5-turbo-1106` model, it can generate new data entries (as JSON Objects) that match the structure of a given dataset, and easily append them to the dataset.When calling the API, two messages are sent to the model: a `dataset_description`, and a `context`
- The `dataset_description` is automatically generated by the `LLMDataset` class and describes the column names in the dataset, the number of data entries to generate, and how to format the data entries. This ensures that the generated data is consistent with the structure of the dataset.
- The `context` is the prompt that is used to describe the data entries. This is provided by the user as a parameter in the `add_data` method.
- If the dataset contains an `ID` column, the `LLMDataset` will ignore the LLM's generated ID and instead use the next available ID in the dataset.