https://github.com/tradle/chunkk

recursive dataset gen for finetuning pre-trained GPT models from large text

Host: GitHub
URL: https://github.com/tradle/chunkk
Owner: tradle
Created: 2023-04-09T20:13:20.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-04-17T18:04:33.000Z (about 3 years ago)
Last Synced: 2025-04-14T23:52:20.203Z (about 1 year ago)
Language: JavaScript
Homepage:
Size: 48.8 KB
Stars: 3
Watchers: 6
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# chunkk
Recursively generates a dataset for fine-tuning pre-trained GPT models from a large text file, such as a book or documentation.

## Usage
```
node index.js --input [inputFilePath] --output [outputFilePath] --numIterations [number] --numTokens [number] --model [chatGptModel]
```
or
```
node index.js -i [inputFilePath] -o [outputFilePath] -n [number] -t [number] -m [chatGptModel]
```
- **input** _(required)_ - path to the input `txt` file (for example, a book or documentation)
- **output** - path for the generated JSON file (default: `output.json`)
- **numIterations** - how many times to request questions for each chunk (default: `3`)
- **numTokens** - maximum number of tokens for the ChatGPT model of your choice (default: `2000`)
- **model** - ChatGPT model (default: `gpt-3.5-turbo`)
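
As a rough illustration of how the options and defaults listed above fit together, here is a minimal sketch using Node's built-in `util.parseArgs`. The function name `parseCliOptions` is hypothetical, and this is not necessarily how `index.js` parses its arguments.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical helper: option names and defaults come from the list above,
// but the parsing approach is an assumption, not the project's actual code.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      input: { type: "string", short: "i" },
      output: { type: "string", short: "o" },
      numIterations: { type: "string", short: "n" },
      numTokens: { type: "string", short: "t" },
      model: { type: "string", short: "m" },
    },
  });
  if (!values.input) {
    throw new Error("--input is required");
  }
  // Apply the documented defaults for anything not passed on the command line.
  return {
    input: values.input,
    output: values.output ?? "output.json",
    numIterations: Number(values.numIterations ?? 3),
    numTokens: Number(values.numTokens ?? 2000),
    model: values.model ?? "gpt-3.5-turbo",
  };
}
```

In a script this would typically be called as `parseCliOptions(process.argv.slice(2))`.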

#### Example
```
node index.js --input '../Downloads/TedChiang-The truth of fact the truth of feeling.txt' --numIterations 5 --output '../Downloads/Ted.json' --numTokens 2500 --model 'gpt-4'
```

### Here is how it works
- Takes a big text file
- Splits it into chunks of up to `numTokens` tokens
- For each chunk:
  - Asks GPT to create a set of questions. The same request is repeated `numIterations` times in total. Every request returns about 8-10 questions, so the number of questions per chunk is roughly `numIterations * 10`
  - All these questions are then fed as a prompt to ChatGPT for answers
  - The last request asks for a summary of this chunk of text
- Summaries are concatenated into a new text, and the process repeats recursively until just one chunk is left
- All questions, answers, and summaries are recorded in JSON format in the file given by `--output` (`output.json` by default)
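
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `index.js`: `splitIntoChunks`, `buildDataset`, and `askModel` are hypothetical names, the prompts are placeholders, tokens are approximated by whitespace-separated words, and the check that stops recursion when the summaries do not shrink is an added safeguard.

```javascript
// Rough stand-in for token counting: one whitespace-separated word = one token.
// A real tokenizer would count differently; this is only for illustration.
function splitIntoChunks(text, numTokens) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += numTokens) {
    chunks.push(words.slice(i, i + numTokens).join(" "));
  }
  return chunks;
}

// `askModel` is a hypothetical async function (prompt) => string that wraps
// whatever chat-completion client is in use.
async function buildDataset(text, options, records = []) {
  const { numTokens, numIterations, askModel } = options;
  const chunks = splitIntoChunks(text, numTokens);
  const summaries = [];
  for (const chunk of chunks) {
    // Ask for questions numIterations times for the same chunk.
    const questions = [];
    for (let i = 0; i < numIterations; i++) {
      questions.push(await askModel(`Write questions about this text:\n${chunk}`));
    }
    // Feed all collected questions back as one prompt to get answers.
    const answers = await askModel(
      `Answer these questions using the text.\nText:\n${chunk}\nQuestions:\n${questions.join("\n")}`
    );
    // The last request for the chunk is its summary.
    const summary = await askModel(`Summarize this text:\n${chunk}`);
    records.push({ questions, answers, summary });
    summaries.push(summary);
  }
  const nextText = summaries.join("\n");
  // Stop when one chunk (or none) is left. The length check is an extra guard
  // against endless recursion if the summaries are not shorter than the input.
  if (chunks.length <= 1 || nextText.length >= text.length) {
    return records;
  }
  return buildDataset(nextText, options, records);
}
```

The returned `records` array could then be written out with `JSON.stringify`; the exact shape of the real output file is not specified here.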

## TODO
- Add streaming. For now this will not work for huge files, because the input file is read with `fs.readFileSync`
- Add quizzes.
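
One possible direction for the streaming item is sketched below: an async generator that reads the file with `fs.createReadStream` and yields chunks as they fill up, so the whole file never sits in memory. The name `streamChunks` is hypothetical, and counting whitespace-separated words is only a rough stand-in for real tokenization.

```javascript
const fs = require("node:fs");

// Hypothetical sketch: yields chunks of up to `numTokens` words while reading
// the file incrementally instead of loading it with fs.readFileSync.
async function* streamChunks(filePath, numTokens) {
  const stream = fs.createReadStream(filePath, { encoding: "utf8" });
  let words = [];
  let carry = ""; // partial word cut off at a read boundary
  for await (const piece of stream) {
    const parts = (carry + piece).split(/\s+/);
    carry = parts.pop(); // the last part may be an incomplete word
    for (const word of parts) {
      if (!word) continue;
      words.push(word);
      if (words.length === numTokens) {
        yield words.join(" ");
        words = [];
      }
    }
  }
  // Flush whatever is left after the stream ends.
  if (carry) words.push(carry);
  if (words.length > 0) yield words.join(" ");
}
```

Each yielded chunk can be consumed with `for await (const chunk of streamChunks(path, 2000))` and processed before the next one is read.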
