https://github.com/CogStack/OpenGPT

A framework for creating grounded, instruction-based datasets and training conversational domain-expert Large Language Models (LLMs).

chatgpt gpt-4 health healthcare huggingface llm medicine nlp opengpt



Host: GitHub
URL: https://github.com/CogStack/OpenGPT
Owner: CogStack
License: apache-2.0
Created: 2023-05-09T21:50:40.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-05-30T18:39:01.000Z (about 2 years ago)
Last Synced: 2025-04-28T07:03:21.701Z (2 months ago)
Topics: chatgpt, gpt-4, health, healthcare, huggingface, llm, medicine, nlp, opengpt
Language: Jupyter Notebook
Homepage:
Size: 5.3 MB
Stars: 351
Watchers: 9
Forks: 45
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

StarryDivineSky - CogStack/OpenGPT

README

# OpenGPT

A framework for creating grounded, instruction-based datasets and training conversational domain-expert Large Language Models (LLMs).

Learn more in our blog: [AI for Healthcare | Introducing OpenGPT](https://aiforhealthcare.substack.com/p/a-large-language-model-for-healthcare).

## NHS-LLM
A conversational model for healthcare trained using OpenGPT. All the medical datasets used to train this model were created using OpenGPT and are available below.

## Available datasets
- NHS UK Q/A: 24,665 question-and-answer pairs. Prompt used: f53cf99826. Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv).
- NHS UK Conversations: 2,354 unique conversations. Prompt used: f4df95ec69. Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_conversations.csv).
- Medical Task/Solution: 4,688 pairs generated via OpenGPT using GPT-4. Prompt used: 5755564c19. Download [here](./data/medical_tasks_gpt4/prepared_generated_data_for_medical_tasks.csv).

All datasets are in the `/data` folder.
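
To check a downloaded dataset against the counts listed above, a short inspection script can help. The sketch below uses only Python's standard library and reports the column names rather than assuming them, since they are not documented here. The helper name `summarize_csv` is hypothetical and UTF-8 encoding is an assumption.

```python
import csv


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    # newline="" lets the csv module handle quoted fields that span lines.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        n_rows = sum(1 for _ in reader)
    return columns, n_rows


if __name__ == "__main__":
    # Path taken from the dataset list above, relative to the repository root.
    cols, n = summarize_csv(
        "./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv"
    )
    print("columns:", cols)
    print("rows:", n)
```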

## Installation
```
pip install opengpt
```
If you are working with LLaMA models, you will also need some extra requirements:
```
pip install -r ./llama_train_requirements.txt
```

## Tutorials

- Making a mini conversational LLM for healthcare, [Google Colab - OpenGPT | The making of Dum-E](https://colab.research.google.com/drive/1GQj9dwBSCmzEh1PmbRlQQYlojCvOG-qG?usp=sharing)

## How to

1. Start by collecting a base dataset in a certain domain. For example, collect definitions of all diseases (e.g. from [NHS UK](https://www.nhs.uk/conditions/)). You can find a small sample dataset [here](https://github.com/CogStack/OpenGPT/blob/main/data/nhs_conditions_small_sample/original_data.csv). The collected dataset must have a column named `text`, where each row of the CSV contains one disease definition.

2. Find a prompt matching your use case in the [prompt database](https://github.com/CogStack/OpenGPT/blob/main/data/prompts.json), or create a new prompt using the [Prompt Creation Notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Prompt%20Creation.ipynb). The prompt will be used to generate tasks/solutions based on the `context` (the dataset collected in step 1).
   - Edit the config file for dataset generation and add the appropriate prompts and datasets ([example config file](https://github.com/CogStack/OpenGPT/blob/main/configs/example_config_for_detaset_creation.yaml)).
   - Run the Dataset Generation notebook ([link](https://github.com/CogStack/OpenGPT/blob/main/experiments/Dataset%20Generation.ipynb)).

3. Edit the [train_config](https://github.com/CogStack/OpenGPT/blob/main/configs/example_train_config.yaml) file and add the datasets you want to use for training.
4. Use the [train notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Supervised%20Training.ipynb) or run the training scripts to train a model on the new dataset you created.
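
Step 1 asks for a CSV with a `text` column holding one definition per row. The following is a minimal sketch of writing and checking such a file with Python's standard library. The helper names, the output file name, and the placeholder rows are illustrative assumptions and are not part of OpenGPT.

```python
import csv


def write_base_dataset(texts, path):
    """Write one definition per row under a single `text` column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for t in texts:
            writer.writerow({"text": t})


def check_base_dataset(path):
    """Return the number of non-empty `text` rows; raise if the column is missing."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if "text" not in (reader.fieldnames or []):
            raise ValueError(f"{path} has no 'text' column")
        return sum(1 for row in reader if (row["text"] or "").strip())


if __name__ == "__main__":
    # Placeholder rows; replace them with the definitions you collected.
    samples = ["<definition of condition A>", "<definition of condition B>"]
    write_base_dataset(samples, "original_data.csv")  # hypothetical file name
    print(check_base_dataset("original_data.csv"))
```

Running the check before dataset generation catches a missing or misnamed column early.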

**If you have any questions, please check out [Discourse](https://discourse.cogstack.org/).**

## More Examples
