{"id":16148243,"url":"https://github.com/chanwit/openchatkit","last_synced_at":"2025-04-06T21:45:00.560Z","repository":{"id":152641972,"uuid":"613752852","full_name":"chanwit/OpenChatKit","owner":"chanwit","description":null,"archived":false,"fork":false,"pushed_at":"2023-03-14T12:02:51.000Z","size":81,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-13T03:48:29.537Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chanwit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-14T07:48:50.000Z","updated_at":"2023-04-02T03:33:24.000Z","dependencies_parsed_at":"2023-08-09T10:01:16.380Z","dependency_job_id":null,"html_url":"https://github.com/chanwit/OpenChatKit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanwit%2FOpenChatKit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanwit%2FOpenChatKit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanwit%2FOpenChatKit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanwit%2FOpenChatKit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chanwit","download_url":"https://codeload.github.com/chanwit/OpenChatKit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":247557797,"owners_count":20958047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T00:32:07.577Z","updated_at":"2025-04-06T21:45:00.542Z","avatar_url":"https://github.com/chanwit.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OpenChatKit\n\nOpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai). Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions. 
\n\nIn this repo, you'll find code for:\n- Training an OpenChatKit model\n- Testing inference using the model\n- Augmenting the model with additional context from a retrieval index\n\n# Contents\n\n- [Requirements](#requirements)\n- [Pre-trained Weights](#pre-trained-weights)\n- [Datasets](#datasets)\n  * [Data Contributions](#data-contributions)\n- [Pretrained Base Model](#pretrained-base-model)\n- [Training and Finetuning](#training-and-finetuning)\n  * [(Optional) 8bit Adam](#optional-8bit-adam)\n  * [Train GPT-NeoX-Chat-Base-20B](#train-gpt-neox-chat-base-20b)\n- [Converting Weights to Huggingface Format](#converting-weights-to-huggingface-format)\n- [Inference](#inference)\n- [Monitoring](#monitoring)\n  * [Loguru](#loguru)\n  * [Weights \u0026 Biases](#weights--biases)\n- [Experimental: Retrieval-Augmented Models](#experimental-retrieval-augmented-models)\n- [License](#license)\n- [Citing OpenChatKit](#citing-openchatkit)\n- [Acknowledgements](#acknowledgements)\n\n# Requirements\n\nBefore you begin, you need to install PyTorch and other dependencies.\n\n1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) from their website.\n2. Create an environment called OpenChatKit using the `environment.yml` file at the root of this repo.\n\n```shell\nconda env create -f environment.yml\n```\n\nThis repo also uses [Git LFS](https://git-lfs.com/) to manage some files. Install it using the instructions on their site then run:\n\n```shell\ngit lfs install\n```\n\n# Pre-trained Weights\n\nGPT-NeoXT-Chat-Base-20B is a 20B-parameter variant of GPT-NeoX, fine-tuned on conversational datasets. 
We are releasing pre-trained weights for this model as [togethercomputer/GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.\n\nMore details can be found on the model card for [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.\n\n# Datasets\n\nThe chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Huggingface, run the command below from the root of the repo.\n\n```shell\npython data/OIG/prepare.py\n```\n\nOnce the command completes, the data will be in the `data/OIG/files` directory.\n\n## Data Contributions\n\nYou can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.\n\n# Pretrained Base Model\n\nAs mentioned above, the chat model is a fine-tuned variant of GPT-NeoX-20B from Eleuther AI. To download GPT-NeoX-20B and prepare it for fine-tuning, run this command from the root of the repo.\n\n```shell\npython pretrained/GPT-NeoX-20B/prepare.py\n```\n\nThe weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b` directory.\n\n# Training and Finetuning\n\n## (Optional) 8bit Adam\n\nTo use 8bit-adam during training, install the `bitsandbytes` package.\n\n```shell\npip install bitsandbytes # optional, to use 8bit-adam\n```\n\n## Train GPT-NeoX-Chat-Base-20B\n\nThe `training/finetune_GPT-NeoXT-Chat-Base-20B.sh` script configures and runs the training loop. 
After downloading the dataset and the base model, run:\n\n```shell\nbash training/finetune_GPT-NeoXT-Chat-Base-20B.sh\n```\n\nThe script launches 8 processes with a pipeline-parallel degree of 8 and a data-parallel degree of 1.\n\nAs the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.\n\nPlease see [the training README](training/README.md) for more details about customizing the training run.\n\n# Converting Weights to Huggingface Format\n\nBefore you can use this model to perform inference, it must be converted to the Huggingface format.\n\n```shell\nmkdir huggingface_models \\\n\u0026\u0026 python tools/convert_to_hf_gptneox.py \\\n     --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 \\\n     --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B \\\n     --n-stages 8 \\\n     --n-layer-per-stage 6\n```\n\n# Inference\n\nTo help you test the model, we provide a simple command-line test harness to interact with the bot.\n\n```shell\npython inference/bot.py\n```\n\nBy default the script will load the model named GPT-NeoXT-Chat-Base-20B under the `huggingface_models` directory, but you can override that behavior by specifying `--model`.\n\nFor example, if you want to load the base model from our Huggingface repo, you can run the following command, which downloads the weights from Huggingface.\n\n```shell\npython inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B\n```\n\nOnce the model has loaded, enter text at the prompt and the model will reply.\n\n```shell\n$ python inference/bot.py \nLoading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...\nWelcome to OpenChatKit shell.   Type /help or /? 
to list commands.\n\n\u003e\u003e\u003e Hello.\nSetting `pad_token_id` to `eos_token_id`:0 for open-end generation.\nHello human.\n\n\u003e\u003e\u003e \n```\n\nCommands are prefixed with a `/`, and the `/quit` command exits.\n\n# Monitoring\n\nBy default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using [loguru](https://github.com/Delgan/loguru) or report them to Weights \u0026 Biases.\n\n## Loguru\n\nAdd the flag `--train-log-backend loguru` to your training script to log to `./logs/file_{time}.log`.\n\n## Weights \u0026 Biases\n\nTo use Weights \u0026 Biases, first log in with your Weights \u0026 Biases token.\n\n```shell\nwandb login\n```\n\nThen set `--train-log-backend wandb` in the training script to enable logging to Weights \u0026 Biases.\n\n# Experimental: Retrieval-Augmented Models\n\n*Note: Retrieval is still experimental.*\n\nThe code in `/retrieval` implements a Python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever.\n\n1. Download the Wikipedia index.\n\n```shell\npython data/wikipedia-3sentence-level-retrieval-index/prepare.py\n```\n\n2. Run the bot with the `--retrieval` flag.\n\n```shell\npython inference/bot.py --retrieval\n```\n\nAfter starting, the bot will load both the chat model and the retrieval index, which takes a long time. Once the model and the index are loaded, all queries will be augmented with extra context.\n\n\n```shell\n$ python inference/bot.py --retrieval\nLoading /OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...\nLoading retrieval index...\nWelcome to OpenChatKit shell.   Type /help or /? 
to list commands.\n\n\u003e\u003e\u003e Where is Zurich?\nSetting `pad_token_id` to `eos_token_id`:0 for open-end generation.\nWhere is Zurich?\nZurich is located in Switzerland.\n\n\u003e\u003e\u003e\n```\n\n# License\n\nAll code in this repository was developed by Together Computer except where otherwise noted.  Copyright (c) 2023, Together Computer.  All rights reserved. The code is licensed under the Apache 2.0 license.\n\n\n```\nCopyright 2023 Together Computer\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n\nThis repository also contains code written by a number of other authors. Such contributions are marked and the relevant licensing is included where appropriate.\n\nFor full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing, please [contact us](https://www.together.xyz/contact).\n\n# Citing OpenChatKit\n\n```bibtex\n@software{openchatkit,\n  title = {{OpenChatKit: An Open Toolkit and Base Model for Dialogue-style Applications}},\n  author = {Together Computer},\n  url = {https://github.com/togethercomputer/OpenChatKit},\n  month = {3},\n  year = {2023},\n  version = {0.15},\n}\n```\n\n# Acknowledgements\n\nOur model is a fine-tuned version of [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b), a large language model trained by [Eleuther AI](https://www.eleuther.ai). 
We evaluated our model on [HELM](https://crfm.stanford.edu/helm/latest/) provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu). We also collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.\n\nWe collaborated with [LAION](https://laion.ai/) and [Ontocord.ai](https://www.ontocord.ai/) to build the training data used to fine-tune this model.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanwit%2Fopenchatkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchanwit%2Fopenchatkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanwit%2Fopenchatkit/lists"}