https://github.com/togethercomputer/opendatahub

Host: GitHub
URL: https://github.com/togethercomputer/opendatahub
Owner: togethercomputer
Created: 2023-02-13T18:22:17.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-04-26T23:49:43.000Z (about 3 years ago)
Last Synced: 2025-06-05T03:26:30.847Z (about 1 year ago)
Size: 40 KB
Stars: 128
Watchers: 9
Forks: 29
Open Issues: 14
Metadata Files:
- Readme: README.md

# OpenDataHub

This repository contains the current snapshot of the OpenChatKit bot. All training data is in `data`,
the hyperparameters used for training are in `training.yaml`, the training log is in `training_log`,
and a pointer to the model is in `model.yaml`.

Specialized versions of this bot live in separate branches.

You can make it better by contributing data!

## Data Model

How should we think about the training data for OpenChatKit bots? A _training set_ is a _set_ of _slices_,
where each _slice_ contains a set of (input, output) pairs. Each slice corresponds to one file
in the `data` folder.

For example, if the data folder contains
```
data
|- pile.yaml
|- soda.yaml
```
then during training the training set will be the union of `pile` and `soda`.
Different slices can be weighted differently; the weights are specified in
`training.yaml` (see "Model Training" for details).
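
As a rough illustration of this data model, the sketch below treats every `.yaml` file in a data folder as one slice and builds the training set as the union of their pairs. The function names `list_slices` and `build_training_set` are hypothetical and are not part of this repository; the actual training pipeline may discover and combine slices differently.

```python
from pathlib import Path


def list_slices(data_dir):
    """Return the slice names (file stems) of every .yaml file in data_dir, sorted."""
    return sorted(p.stem for p in Path(data_dir).glob("*.yaml"))


def build_training_set(slices):
    """Union the (input, output) pairs of all slices.

    `slices` maps a slice name to its list of pairs. Each pair is tagged with
    its slice name so that per-slice weights can still be applied later.
    """
    return [(name, pair) for name, pairs in slices.items() for pair in pairs]
```

With a folder containing `pile.yaml` and `soda.yaml`, `list_slices` would return `["pile", "soda"]`.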

### Data Format

You can provide data in various formats.

1. You can provide a collection of input/output pairs
```
IOPairs:
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
...
```
or pure text
```
Text:
- text: TEXT STRING
- text: TEXT STRING
...
```

2. You can provide a link to your dataset on Hugging Face
```
HuggingFace:
- link: LINK TO YOUR DATASET
```

3. You can prepare your dataset in the OpenAI JSONL format (https://platform.openai.com/docs/guides/fine-tuning)
and host it at a link that we can fetch with `wget` or `curl`
```
OpenAIJsonl:
- link: LINK TO YOUR DATASET
```
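
To make the relationship between these formats concrete, here is a minimal sketch that converts an already parsed slice file (for example, the dictionary a YAML parser would return) into (input, output) pairs and then into JSON Lines. Two things here are assumptions rather than anything this README specifies: pure text is mapped to a pair with an empty input, and the JSONL records use `prompt` and `completion` keys, as in the legacy OpenAI fine-tuning layout. The function names are hypothetical.

```python
import json


def normalize_slice(doc):
    """Turn a parsed slice document into a list of (input, output) pairs."""
    if "IOPairs" in doc:
        return [(item["input"], item["output"]) for item in doc["IOPairs"]]
    if "Text" in doc:
        # Assumption: pure text has no prompt, so the input is left empty.
        return [("", item["text"]) for item in doc["Text"]]
    # HuggingFace and OpenAIJsonl entries only hold a link; fetching is out of scope here.
    raise ValueError("remote formats need a download step before normalization")


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, one record per line.

    Assumption: records use `prompt`/`completion` keys.
    """
    return "\n".join(json.dumps({"prompt": i, "completion": o}) for i, o in pairs)
```

For example, `normalize_slice({"IOPairs": [{"input": "Hi", "output": "Hello"}]})` yields `[("Hi", "Hello")]`.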

## Model Training

Each merged pull request will trigger (currently manually) the training of a model.
Hyper-parameters, including the specific mixture of data, will be specified in `training.yaml`:
```
Training:
- lr: 0.0001
- momentum: 0.99
Mixture:
- pile: 0.5
- soda: 0.5
```
After training, a file `training_log` will be committed to the repository, and a file
`model.yaml` will be made available in the repository specifying where to find the model
and (optionally) a Together API endpoint for querying it.
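
One way to read the `Mixture` section is as per-slice sampling probabilities. The sketch below follows that reading, which is an assumption: this README does not say how the weights are applied during training. The names `mixture_weights` and `sample_slices` are hypothetical.

```python
import random

# Parsed form of the `Mixture` section: a list of single-key mappings.
mixture = [{"pile": 0.5}, {"soda": 0.5}]


def mixture_weights(mixture):
    """Flatten the list of single-key mappings into {slice_name: weight}."""
    weights = {}
    for entry in mixture:
        weights.update(entry)
    return weights


def sample_slices(weights, k, seed=0):
    """Draw k slice names, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)
```

Calling `sample_slices(mixture_weights(mixture), k=1000)` would pick each of `pile` and `soda` about half the time; the next training example would then be drawn from the chosen slice.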

## How to Contribute?

You can help us to make OpenChatKit better in three ways.

### Finding "Bugs"

If you notice that the bot is not performing well, please open an issue, specifying
your input, the bot's output, and a description of what is wrong with it (ideally with the right answer).

### Fixing "Bugs"

If you have data that you believe could be useful to fix some of the issues, please
add your data into the `data` folder and make a pull request associated with the issue
that you think this will fix.

We will review these pull requests, train a model, and merge them.

### Specialization

You don't always have to merge into the main branch. If you have specific things to
try out (e.g., a `text2sql` bot), feel free to open a new branch and work there!

Let's work together to make the best open-source bot!
