https://github.com/tradle/chunkk
recursive dataset gen for finetuning pre-trained GPT models from large text
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/tradle/chunkk
- Owner: tradle
- Created: 2023-04-09T20:13:20.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-04-17T18:04:33.000Z (about 3 years ago)
- Last Synced: 2025-04-14T23:52:20.203Z (about 1 year ago)
- Language: JavaScript
- Homepage:
- Size: 48.8 KB
- Stars: 3
- Watchers: 6
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# chunkk
Recursively generates a dataset for fine-tuning pre-trained GPT models from a large text file, such as a book or documentation
## Usage
```
node index.js --input [inputFilePath] --output [outputFilePath] --numIterations [number] --numTokens [number] --model [chatGptModel]
```
or
```
node index.js -i [inputFilePath] -o [outputFilePath] -n [number] -t [number] -m [chatGptModel]
```
**input** _(required)_ - path to the input `txt` file (for example a book or documentation)
**output** - path for the generated JSON file // default `output.json`
**numIterations** - how many times to ask for questions for each chunk // default 3
**numTokens** - max number of tokens per chunk, chosen to fit the ChatGPT model of your choice // default 2000
**model** - ChatGPT model // default **gpt-3.5-turbo**
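As an illustration of how the options and defaults above fit together, here is a minimal sketch using Node's built-in `util.parseArgs`. It is not taken from this project's source: the helper name `parseCliOptions` is hypothetical, and the short alias `-m` for `--model` is an assumption.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical helper (not from the chunkk source): maps the documented
// flags to an options object, applying the documented defaults.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      input: { type: 'string', short: 'i' },
      output: { type: 'string', short: 'o' },
      numIterations: { type: 'string', short: 'n' },
      numTokens: { type: 'string', short: 't' },
      model: { type: 'string', short: 'm' }, // short alias is an assumption
    },
  });
  // `input` is the only required option.
  if (!values.input) throw new Error('--input is required');
  return {
    input: values.input,
    output: values.output ?? 'output.json',
    numIterations: Number(values.numIterations ?? 3),
    numTokens: Number(values.numTokens ?? 2000),
    model: values.model ?? 'gpt-3.5-turbo',
  };
}
```

A script could call it as `parseCliOptions(process.argv.slice(2))`.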
#### Example
```
node index.js --input '../Downloads/TedChiang-The truth of fact the truth of feeling.txt' --numIterations 5 --output '../Downloads/Ted.json' --numTokens 2500 --model 'gpt-4'
```
### Here is how it works
- Takes a big text file
- Splits it into chunks of at most `numTokens` tokens
- For each chunk:
  - Asks GPT to create a set of questions. The same request is repeated `numIterations` times in total. Every request returns about 8-10 questions, so each chunk yields up to about `numIterations * 10` questions
  - All these questions are then fed as a prompt to ChatGPT for answers
  - The last request asks for a summary of this chunk of text
- Summaries are concatenated into a new text, and the process repeats recursively until just one chunk is left
- All questions, answers and summaries are recorded in JSON format in the output file (`output.json` unless you specified the `--output` option)
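The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `askModel` is a hypothetical stand-in for the ChatGPT requests, tokens are approximated by whitespace-separated words, the record shape is invented for the example, and the final single chunk is assumed to be processed once before the recursion stops.

```javascript
// Split text into chunks of at most `numTokens` approximate tokens.
// Assumption: one whitespace-separated word counts as one token.
function splitIntoChunks(text, numTokens) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += numTokens) {
    chunks.push(words.slice(i, i + numTokens).join(' '));
  }
  return chunks;
}

// Process one level of chunks, then recurse on the concatenated summaries.
// `askModel(kind, chunk, questions)` is a hypothetical async callback.
async function processText(text, opts, records = []) {
  const { numTokens, numIterations, askModel } = opts;
  const chunks = splitIntoChunks(text, numTokens);
  const summaries = [];
  for (const chunk of chunks) {
    const questions = [];
    // Repeat the same question request `numIterations` times.
    for (let i = 0; i < numIterations; i++) {
      questions.push(...(await askModel('questions', chunk)));
    }
    const answers = await askModel('answers', chunk, questions);
    const summary = await askModel('summary', chunk);
    records.push({ questions, answers, summary });
    summaries.push(summary);
  }
  // Stop once the whole level fits in a single chunk.
  if (chunks.length <= 1) return records;
  const next = summaries.join('\n');
  // Safety guard added for this sketch: stop if summaries did not shrink.
  if (next.length >= text.length) return records;
  return processText(next, opts, records);
}
```

The recursion only terminates naturally if each summary is shorter than its chunk, which is why the sketch includes the extra length guard.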
## TODO
- Add streaming. For now this will not work for huge files, since the file is read with `fs.readFileSync`
- Add quizzes.
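For the streaming item, one possible direction is to read the input incrementally with `fs.createReadStream` and emit chunks as soon as enough words have arrived. The sketch below is only an assumption about how that could look: `streamChunks` is a hypothetical helper, and tokens are again approximated by whitespace-separated words.

```javascript
const fs = require('node:fs');

// Hypothetical helper: yields chunks of up to `numTokens` words without
// loading the whole file into memory.
async function* streamChunks(filePath, numTokens) {
  const stream = fs.createReadStream(filePath, { encoding: 'utf8' });
  const words = [];
  let tail = ''; // partial word carried over between reads
  for await (const piece of stream) {
    const parts = (tail + piece).split(/\s+/);
    tail = parts.pop(); // the last part may be an incomplete word
    words.push(...parts.filter(Boolean));
    while (words.length >= numTokens) {
      yield words.splice(0, numTokens).join(' ');
    }
  }
  if (tail) words.push(tail);
  // Flush whatever is left after the stream ends.
  while (words.length > 0) {
    yield words.splice(0, numTokens).join(' ');
  }
}
```

Chunks can then be consumed with `for await (const chunk of streamChunks(path, 2000)) { ... }`.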