{"id":13754192,"url":"https://github.com/CogStack/OpenGPT","last_synced_at":"2025-05-09T22:31:34.071Z","repository":{"id":163465182,"uuid":"638680528","full_name":"CogStack/OpenGPT","owner":"CogStack","description":"A framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs).","archived":false,"fork":false,"pushed_at":"2023-05-30T18:39:01.000Z","size":5561,"stargazers_count":351,"open_issues_count":5,"forks_count":45,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-28T07:03:21.701Z","etag":null,"topics":["chatgpt","gpt-4","health","healthcare","huggingface","llm","medicine","nlp","opengpt"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CogStack.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-05-09T21:50:40.000Z","updated_at":"2025-04-14T09:29:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"d0f63889-eda0-45cf-a6f5-97c8b221ce2b","html_url":"https://github.com/CogStack/OpenGPT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogStack%2FOpenGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogStack%2FOpenGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogStack%2FOpenGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogStack%2FOpenGPT/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/CogStack","download_url":"https://codeload.github.com/CogStack/OpenGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335684,"owners_count":21892713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","gpt-4","health","healthcare","huggingface","llm","medicine","nlp","opengpt"],"created_at":"2024-08-03T09:01:49.014Z","updated_at":"2025-05-09T22:31:29.275Z","avatar_url":"https://github.com/CogStack.png","language":"Jupyter Notebook","readme":"# OpenGPT\n\nA framework for creating grounded instruction based datasets and training conversational domain expert Large Language Models (LLMs).\n\nLearn more in our blog: [AI for Healthcare | Introducing OpenGPT](https://aiforhealthcare.substack.com/p/a-large-language-model-for-healthcare).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg height='400px' src='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc199b9-3aec-4c80-83c6-9a64886919dc_1318x868.png' /\u003e\n\u003c/p\u003e\n\n## NHS-LLM\nA conversational model for healthcare trained using OpenGPT. All the medical datasets used to train this model were created using OpenGPT and are available below.\n\n## Available datasets\n- NHS UK Q/A, 24,665 question and answer pairs, Prompt used: f53cf99826, Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). 
Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv)\n- NHS UK Conversations, 2,354 unique conversations, Prompt used: f4df95ec69, Generated via OpenGPT using data available on the [NHS UK Website](https://www.nhs.uk/conditions/). Download [here](./data/nhs_uk_full/prepared_generated_data_for_nhs_uk_conversations.csv)\n- Medical Task/Solution, 4,688 pairs generated via OpenGPT using GPT-4, prompt used: 5755564c19. Download [here](./data/medical_tasks_gpt4/prepared_generated_data_for_medical_tasks.csv)\n\nAll datasets are in the `/data` folder.\n\n## Installation\n```\npip install opengpt\n```\nIf you are working with LLaMA models, you will also need some extra requirements:\n```\npip install -r ./llama_train_requirements.txt\n```\n\n## Tutorials\n\n- Making a mini conversational LLM for healthcare, [Google Colab - OpenGPT | The making of Dum-E](https://colab.research.google.com/drive/1GQj9dwBSCmzEh1PmbRlQQYlojCvOG-qG?usp=sharing) \n\n\n## How to\n\n1. Start by collecting a base dataset in your domain. For example, collect definitions of all diseases (e.g. from [NHS UK](https://www.nhs.uk/conditions/)). You can find a small sample dataset [here](https://github.com/CogStack/OpenGPT/blob/main/data/nhs_conditions_small_sample/original_data.csv). The collected dataset must have a column named `text`, with one disease definition per row of the CSV.\n\n2. Find a prompt matching your use case in the [prompt database](https://github.com/CogStack/OpenGPT/blob/main/data/prompts.json), or create a new prompt using the [Prompt Creation Notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Prompt%20Creation.ipynb). 
The prompt is used to generate tasks/solutions based on the `context` (the dataset collected in step 1).\n  - Edit the config file for dataset generation and add the appropriate prompts and datasets ([example config file](https://github.com/CogStack/OpenGPT/blob/main/configs/example_config_for_detaset_creation.yaml)).\n  - Run the Dataset Generation notebook ([link](https://github.com/CogStack/OpenGPT/blob/main/experiments/Dataset%20Generation.ipynb)).\n\n3. Edit the [train_config](https://github.com/CogStack/OpenGPT/blob/main/configs/example_train_config.yaml) file and add the datasets you want to use for training.\n4. Use the [train notebook](https://github.com/CogStack/OpenGPT/blob/main/experiments/Supervised%20Training.ipynb) or run the training scripts to train a model on the new dataset you created.\n\n**If you have any questions, please check out [discourse](https://discourse.cogstack.org/)**\n\n## More Examples\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width='600px' src='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3916352d-d1c9-451d-92db-652171f471e0_1318x1842.png' /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width='600px' src='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe47dc8e1-d26c-4312-a7a4-8a32bf5375b9_1318x1168.png' /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width='600px' src='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ab1ebe-2fab-4c94-80e7-69d4b95c8098_1318x854.png' /\u003e\n\u003c/p\u003e\n\n\n","funding_links":[],"categories":["A01_文本生成_文本对话","Jupyter 
Notebook"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCogStack%2FOpenGPT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCogStack%2FOpenGPT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCogStack%2FOpenGPT/lists"}