https://github.com/togethercomputer/opendatahub
- Host: GitHub
- URL: https://github.com/togethercomputer/opendatahub
- Owner: togethercomputer
- Created: 2023-02-13T18:22:17.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-04-26T23:49:43.000Z (about 3 years ago)
- Last Synced: 2025-06-05T03:26:30.847Z (about 1 year ago)
- Size: 40 KB
- Stars: 128
- Watchers: 9
- Forks: 29
- Open Issues: 14
Metadata Files:
- Readme: README.md
# OpenDataHub
This repository contains the current snapshot of the OpenChatKit bot. You can find all training data in `data`,
the hyperparameters used for training in `training.yaml`, the training log in `training_log`,
and the pointer to the model in `model.yaml`.
Different branches hold different specialized versions of this bot.
You can make it better by contributing data!
## Data Model
How should we think about the training data for OpenChatKit bots? A _training set_ is a _set_ of _slices_,
where each _slice_ contains a set of (input, output) pairs. Each slice corresponds to one file
in the `data` folder.
For example, if the data folder contains
```
data
|- pile.yaml
|- soda.yaml
```
then, during training, the training set will contain the union of `pile` and `soda`.
Note that different slices can be weighted differently; the weights are specified in
`training.yaml` (see "Model Training" for details).
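To make the data model concrete, here is a minimal Python sketch of a training set as a weighted union of slices. It is an illustration only, not the project's training code: the `slices` contents, the `mixture` dictionary, and the `sample_batch` helper are all hypothetical, and the sampling scheme (pick a slice by weight, then a pair uniformly) is an assumption about how weights could be applied.

```python
import random

# Hypothetical in-memory slices: slice name -> list of (input, output) pairs.
slices = {
    "pile": [("p-in-1", "p-out-1"), ("p-in-2", "p-out-2")],
    "soda": [("s-in-1", "s-out-1")],
}

# Hypothetical per-slice weights, mirroring the idea of a data mixture.
mixture = {"pile": 0.5, "soda": 0.5}


def sample_batch(slices, mixture, batch_size, seed=0):
    """Pick a slice according to its weight, then a pair uniformly from it."""
    rng = random.Random(seed)
    names = sorted(mixture)
    weights = [mixture[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append((name, rng.choice(slices[name])))
    return batch


batch = sample_batch(slices, mixture, batch_size=4)
```

Each element of `batch` records which slice the pair came from, which makes it easy to check that the observed proportions roughly follow the weights.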
### Data Format
You can provide data in various formats.
1. You can provide a collection of input/output pairs
```
IOPairs:
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
...
```
or pure text
```
Text:
- text: TEXT STRING
- text: TEXT STRING
...
```
2. You can provide a link to your dataset on HuggingFace
```
HuggingFace:
- link: LINK TO YOUR DATASET
```
3. You can prepare your dataset in the OpenAI JSONL format (https://platform.openai.com/docs/guides/fine-tuning)
and host it at a link that we can fetch with `wget` or `curl`
```
OpenAIJsonl:
- link: LINK TO YOUR DATASET
```
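As a rough illustration of how these formats relate to the (input, output) pairs of the data model, the sketch below converts already-parsed slice documents into pairs. Everything here is an assumption rather than the repository's loader: the function names are hypothetical, pure text is mapped to a pair with an empty input, and the JSONL helper assumes one JSON object per line with `prompt` and `completion` fields. Link-based slices are not fetched.

```python
import json


def pairs_from_slice(doc):
    """Turn one parsed slice document (a dict) into (input, output) pairs."""
    if "IOPairs" in doc:
        return [(item["input"], item["output"]) for item in doc["IOPairs"]]
    if "Text" in doc:
        # Assumption: pure text has no input side, so the input is left empty.
        return [("", item["text"]) for item in doc["Text"]]
    # HuggingFace and OpenAIJsonl slices only carry a link; fetching is out of scope.
    raise ValueError("unsupported or remote slice: " + ", ".join(doc))


def pairs_from_openai_jsonl(text):
    """Parse prompt/completion JSONL text that has already been downloaded."""
    pairs = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            pairs.append((record["prompt"], record["completion"]))
    return pairs
```

For example, `pairs_from_slice({"IOPairs": [{"input": "hi", "output": "hello"}]})` yields a single pair, while a document containing only a `HuggingFace` link raises `ValueError`.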
## Model Training
Each merged pull request triggers (currently manually) the training of a model.
Hyperparameters, including the specific mixture of data, are specified in `training.yaml`:
```
Training:
- lr: 0.0001
- momentum: 0.99
Mixture:
- pile: 0.5
- soda: 0.5
```
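Before training, it can be useful to check that every entry under `Mixture` names an existing slice. The sketch below is one hypothetical way to do that in Python; `check_mixture` is not part of this repository, it takes the already-parsed `Mixture` list plus the set of available slice names, and normalizing the weights so they sum to 1 is an assumption, not a documented rule.

```python
def check_mixture(mixture, available_slices):
    """Return {slice_name: normalized_weight}; raise ValueError on problems."""
    weights = {}
    for entry in mixture:  # each entry is a one-key mapping, e.g. {"pile": 0.5}
        for name, weight in entry.items():
            if name not in available_slices:
                raise ValueError(f"no data file for slice {name!r}")
            if weight <= 0:
                raise ValueError(f"weight for {name!r} must be positive")
            weights[name] = float(weight)
    if not weights:
        raise ValueError("mixture is empty")
    total = sum(weights.values())
    # Assumption: weights are relative, so rescale them to sum to 1.
    return {name: weight / total for name, weight in weights.items()}


normalized = check_mixture([{"pile": 0.5}, {"soda": 0.5}], {"pile", "soda"})
```

The set of available names could be built from the file stems in the `data` folder, so that `data/pile.yaml` contributes the name `pile`.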
After training, a file `training_log` will be committed to the repository, and a file
`model.yaml` will be made available in the repository, specifying where to find the model
and (optionally) a Together API endpoint for querying it.
## How to Contribute?
You can help us to make OpenChatKit better in three ways.
### Finding "Bugs"
If you notice that the bot is not performing well, please open an issue, specifying
your input, the bot's output, and a description of what is wrong with it (potentially with the right answer).
### Fixing "Bugs"
If you have data that you believe could be useful to fix some of the issues, please
add your data into the `data` folder and make a pull request associated with the issue
that you think this will fix.
We will review these pull requests, train a model, and merge them.
### Specialization
You don't always have to merge into the main branch. If you have specific things to
try out (e.g., a `text2sql` bot), feel free to open a new branch and work there!
Let's work together to make the best open-source bot!