https://github.com/zeke320/llm-dataset-generator
The LLM Dataset Generator is an open source tool for generating text data compatible with various language models supported by LangChain. You can customize it to meet your specific needs, making it a valuable resource for researchers, developers, and organizations working on NLP applications.
https://github.com/zeke320/llm-dataset-generator
cc0 dataset-generator langchain language-model ollama phi-3
Last synced: about 1 month ago
JSON representation
The LLM Dataset Generator is an open source tool for generating text data compatible with various language models supported by LangChain. You can customize it to meet your specific needs, making it a valuable resource for researchers, developers, and organizations working on NLP applications.
- Host: GitHub
- URL: https://github.com/zeke320/llm-dataset-generator
- Owner: ZEKE320
- License: cc0-1.0
- Created: 2024-06-17T07:57:18.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-18T21:24:28.000Z (over 1 year ago)
- Last Synced: 2025-09-15T08:13:19.328Z (9 months ago)
- Topics: cc0, dataset-generator, langchain, language-model, ollama, phi-3
- Language: Jupyter Notebook
- Homepage:
- Size: 650 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# LLM Dataset Generator





## Overview
The LLM Dataset Generator is an open-source tool designed to facilitate the creation of text datasets using various language models. This repository provides a framework for generating and collecting text data, supporting research and development in natural language processing (NLP) and related fields.
## License
This project is released under the CC0 1.0 Universal license. It is freely available for use by anyone for research, personal development, or commercial purposes without restriction. For more details, please refer to the [LICENSE](LICENSE) file.
## Examples
Check the following for actual output examples:
- [article_1](phi-3-text-generator/out.example/article_1.txt)
- [article_2](phi-3-text-generator/out.example/article_2.txt)
- [article_3](phi-3-text-generator/out.example/article_3.txt)
For Jupyter Notebook execution results, see:
- [phi-3-text-generator.ipynb](phi-3-text-generator/phi-3-text-generator.ipynb)
# Setup Guide for Ollama and Phi-3 Text Generator
You can check the code from [`phi-3-text-generator/`](phi-3-text-generator).
## 1. Creating and Starting the Ollama Container
To start the Ollama container and install the Phi-3:14B model, follow these steps:
### Prerequisites
- Docker Desktop must be installed.
### Steps
1. Open a terminal and pull the Ollama Docker image from Docker Hub:
```sh
docker pull ollama/ollama:latest
```
2. Run the Ollama container with the following command:
```sh
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
3. Verify that the container is running:
```sh
docker ps
```
Example output:
```sh
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aa492e7068d7 ollama/ollama:latest "/bin/ollama serve" 9 seconds ago Up 8 seconds 0.0.0.0:11434->11434/tcp ollama
```
4. Check if Ollama is running correctly:
```sh
curl localhost:11434
```
You should see a message indicating that Ollama is running.
5. Pull the Phi-3:14B model:
```sh
docker exec -it ollama ollama pull phi3:medium
```
## 2. How to Use the Project
Before running `phi-3-text-generator.py`, you can modify several constants in the file.
### Constants to Set
- **OLLAMA_BASE_URL**:
- If running inside the Docker container: `"http://host.docker.internal:11434"`
- If running locally: `"http://localhost:11434"`
- **TARGET_SENTENCE_COUNT**: Specify the number of sentences to generate.
- **OUTPUT_FILE_COUNT**: Specify the number of output files to generate.
- **OUTPUT_DIRECTORY**: Specify the directory where the generated files will be saved.
### Execution Steps
1. Open the `phi-3-text-generator.py` file and set the constants as needed.
2. Install poetry
```sh
pip install poetry
```
3. Install the project dependencies. Go to the root directory of the project, and run:
```sh
poetry install
```
4. Activate the virtual environment:
```sh
poetry shell
```
5. Run the script with the following command:
```sh
python phi-3-text-generator.py
```
## 3. How to Check the Generated Files
By default, the generated files will be saved in an `out` directory created in the same directory as the script.
### Steps to Verify
1. Navigate to the directory where the script was executed.
2. Check for the `out` directory:
```sh
ls out
```