Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/CogStack/OpenGPT
A framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs).
chatgpt gpt-4 health healthcare huggingface llm medicine nlp opengpt
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/CogStack/OpenGPT
- Owner: CogStack
- License: apache-2.0
- Created: 2023-05-09T21:50:40.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-05-30T18:39:01.000Z (over 1 year ago)
- Last Synced: 2024-11-04T20:06:29.085Z (about 2 months ago)
- Topics: chatgpt, gpt-4, health, healthcare, huggingface, llm, medicine, nlp, opengpt
- Language: Jupyter Notebook
- Homepage:
- Size: 5.3 MB
- Stars: 337
- Watchers: 9
- Forks: 39
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - CogStack/OpenGPT
README
# OpenGPT
A framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs).
Learn more in our blog: [AI for Healthcare | Introducing OpenGPT](https://aiforhealthcare.substack.com/p/a-large-language-model-for-healthcare).
## NHS-LLM
A conversational model for healthcare trained using OpenGPT. All the medical datasets used to train this model were created using OpenGPT and are available below.
## Available datasets
- NHS UK Q/A: 24,665 question and answer pairs. Prompt used: f53cf99826. Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv)
- NHS UK Conversations: 2,354 unique conversations. Prompt used: f4df95ec69. Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_conversations.csv)
- Medical Task/Solution: 4,688 pairs. Prompt used: 5755564c19. Generated via OpenGPT using GPT-4. Download [here](./data/medical_tasks_gpt4/prepared_generated_data_for_medical_tasks.csv)

All datasets are in the `/data` folder.
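The column layout of the prepared CSV files is not documented here, so a quick inspection before training can help. The following is a minimal sketch using only the Python standard library; `summarize_csv` is a hypothetical helper (not part of OpenGPT), and the path assumes the script runs from the root of a repository clone.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames is read from the header row; it is None for an empty file
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    # Path taken from the dataset list above, relative to the repository root.
    dataset = Path("data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv")
    if dataset.exists():
        columns, row_count = summarize_csv(dataset)
        print(f"{row_count} rows, columns: {columns}")
    else:
        print(f"{dataset} not found; run this from a clone of the repository")
```

The same helper can be pointed at the conversations or medical tasks file to compare their layouts.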
## Installation
```
pip install opengpt
```
If you are working with LLaMA models, you will also need some extra requirements:
```
pip install -r ./llama_train_requirements.txt
```
## Tutorials
- Making a mini conversational LLM for healthcare, [Google Colab - OpenGPT | The making of Dum-E](https://colab.research.google.com/drive/1GQj9dwBSCmzEh1PmbRlQQYlojCvOG-qG?usp=sharing)
## How to
1. Start by collecting a base dataset in a certain domain. For example, collect definitions of all diseases (e.g. from [NHS UK](https://www.nhs.uk/conditions/)). You can find a small sample dataset [here](https://github.com/CogStack/OpenGPT/blob/main/data/nhs_conditions_small_sample/original_data.csv). The collected dataset must have a column named `text`, where each row of the CSV holds one disease definition.
2. Find a prompt matching your use case in the [prompt database](https://github.com/CogStack/OpenGPT/blob/main/data/prompts.json), or create a new prompt using the [Prompt Creation Notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Prompt%20Creation.ipynb). The prompt will be used to generate tasks/solutions based on the `context` (the dataset collected in step 1).
   - Edit the config file for dataset generation and add the appropriate prompts and datasets ([example config file](https://github.com/CogStack/OpenGPT/blob/main/configs/example_config_for_detaset_creation.yaml)).
   - Run the Dataset Generation notebook ([link](https://github.com/CogStack/OpenGPT/blob/main/experiments/Dataset%20Generation.ipynb)).
3. Edit the [train_config](https://github.com/CogStack/OpenGPT/blob/main/configs/example_train_config.yaml) file and add the datasets you want to use for training.
4. Use the [train notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Supervised%20Training.ipynb) or run the training scripts to train a model on the new dataset you created.

**If you have any questions, please check out [discourse](https://discourse.cogstack.org/).**
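Step 1 requires a CSV with a `text` column holding one definition per row. The sketch below shows one way to write such a file and check it before running dataset generation. Both function names are hypothetical helpers (not part of OpenGPT), the output filename is arbitrary, and the example strings are placeholders, not real medical content.

```python
import csv
from pathlib import Path


def write_base_dataset(texts, path):
    """Write one document per row under a single `text` column."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text"])  # the column name required in step 1
        for text in texts:
            writer.writerow([text.strip()])


def check_base_dataset(path):
    """Return the number of non-empty rows; raise if `text` is missing."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if "text" not in (reader.fieldnames or []):
            raise ValueError(f"{path} has no 'text' column")
        return sum(1 for row in reader if (row["text"] or "").strip())


if __name__ == "__main__":
    # Placeholder strings stand in for the collected definitions.
    placeholders = [
        "Placeholder definition for condition A.",
        "Placeholder definition for condition B.",
    ]
    write_base_dataset(placeholders, "my_base_dataset.csv")
    print(check_base_dataset("my_base_dataset.csv"), "usable rows")
```

Running a check like this first catches a missing or misnamed column early, before any generation requests are made.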
## More Examples