{"id":15175670,"url":"https://github.com/jina-ai/rungpt","last_synced_at":"2026-04-02T18:32:14.701Z","repository":{"id":164330765,"uuid":"623356510","full_name":"jina-ai/rungpt","owner":"jina-ai","description":"An open-source cloud-native of large multi-modal models (LMMs) serving framework.","archived":false,"fork":false,"pushed_at":"2023-09-05T03:32:18.000Z","size":5545,"stargazers_count":165,"open_issues_count":5,"forks_count":22,"subscribers_count":20,"default_branch":"main","last_synced_at":"2026-01-30T08:37:43.882Z","etag":null,"topics":["flamingo","gpt-4","large-language-models","large-multimadality-models","llama","llm-hosting","llm-serve","lmm-serve","multi-modality","opengpt","self-hosting","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jina-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-04-04T07:57:35.000Z","updated_at":"2026-01-09T23:53:16.000Z","dependencies_parsed_at":"2023-09-04T05:57:43.268Z","dependency_job_id":"f583b94e-85ad-4112-a8be-ffdb861e7119","html_url":"https://github.com/jina-ai/rungpt","commit_stats":null,"previous_names":["jina-ai/rungpt","jina-ai/opengpt","jina-ai/open_gpt"],"tags_count":25,"template":false,"template_full_name":null,"purl":"pkg:github/jina-ai/rungpt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jina-ai%2Frungpt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jina-ai%2Frungpt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jina-ai%2Frungpt/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/jina-ai%2Frungpt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jina-ai","download_url":"https://codeload.github.com/jina-ai/rungpt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jina-ai%2Frungpt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29024557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T23:43:56.073Z","status":"ssl_error","status_checked_at":"2026-02-02T23:43:49.438Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flamingo","gpt-4","large-language-models","large-multimadality-models","llama","llm-hosting","llm-serve","lmm-serve","multi-modality","opengpt","self-hosting","transformers"],"created_at":"2024-09-27T12:39:52.362Z","updated_at":"2026-04-02T18:32:14.669Z","avatar_url":"https://github.com/jina-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ☄️ RunGPT\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/jina-ai/rungpt\"\u003e\u003cimg src=\"https://github.com/jina-ai/rungpt/blob/main/.github/images/logo.png\" alt=\"RunGPT: An open-source cloud-native large-scale multimodal model serving framework\" 
width=\"300px\"\u003e\u003c/a\u003e\n\u003cbr\u003e\n\u003c/p\u003e\n\n\u003e \"A playful and whimsical vector art of a Stochastic Tigger, wearing a t-shirt with a \"GPT\" text printed logo, surrounded by colorful geometric shapes.  –ar 1:1 –upbeta\"\n\u003e\n\u003e — Prompts and logo art were produced with [PromptPerfect](https://promptperfect.jina.ai/) \u0026 [Stable Diffusion X](https://clipdrop.co/stable-diffusion)\n\n\n![](https://img.shields.io/badge/Made%20with-JinaAI-blueviolet?style=flat)\n[![PyPI](https://img.shields.io/pypi/v/run_gpt_torch)](https://pypi.org/project/run_gpt_torch/)\n[![PyPI - License](https://img.shields.io/pypi/l/run_gpt_torch)](https://pypi.org/project/run_gpt_torch/)\n\n**RunGPT** is an open-source, _cloud-native_ serving framework for **_large language models_** (LLMs). \nIt is designed to simplify the deployment and management of large language models on a distributed cluster of GPUs.\nWe aim to make it a centralized, accessible, one-stop place to gather techniques for optimizing LLMs and to make them easy for everyone to use.\n\n\n## Table of contents\n\n- [Features](#features)\n- [Get started](#get-started)\n- [Build a model serving in one line](#build-a-model-serving-in-one-line)\n- [Cloud-native deployment](#cloud-native-deployment)\n- [Roadmap](#roadmap)\n\n## Features\n\nRunGPT provides the following features to make it easy to deploy and serve **large language models** (LLMs) at scale:\n\n- Scalable architecture for handling high traffic loads\n- Optimized for low-latency inference\n- Automatic model partitioning and distribution across multiple GPUs\n- Centralized model management and monitoring\n- REST API for easy integration with existing applications\n\n## Updates\n\n- **2023-08-22**: OpenGPT has been renamed to RunGPT. We have also released the first version of RunGPT, `v0.1.0`. You can install it with `pip install rungpt`.\n- **2023-05-12**: 🎉 We have released the first version of OpenGPT, `v0.0.1`. 
You can install it with `pip install open_gpt_torch`.\n\n\n## Get Started\n\n### Installation\n\nInstall the package with `pip`:\n\n```bash\npip install rungpt\n```\n\n### Quickstart\n\n```python\nimport run_gpt\n\nmodel = run_gpt.create_model(\n    'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp16'\n)\n\nprompt = \"The quick brown fox jumps over the lazy dog.\"\n\noutput = model.generate(\n    prompt,\n    max_length=100,\n    temperature=0.9,\n    top_k=50,\n    top_p=0.95,\n    repetition_penalty=1.2,\n    do_sample=True,\n    num_return_sequences=1,\n)\n```\n\nWe use [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b) as the example model because it is openly available, relatively small, and fast to download.\n\n\u003e **Warning**\n\u003e In the above example, we use `precision='fp16'` to reduce memory usage and speed up inference, at the cost of some accuracy on text generation tasks. \n\u003e You can use `precision='fp32'` instead if you prefer higher accuracy over speed and memory savings. \n\n\u003e **Note**\n\u003e The first run usually takes a while (several minutes) because the model has to be downloaded and loaded into memory.\n\n\nLarge models often do not fit on a single GPU. To solve this problem, we also provide a `device_map` option (supported by the `accelerate` package) that automatically partitions the model and distributes it across multiple GPUs:\n\n```python\nmodel = run_gpt.create_model(\n    'stabilityai/stablelm-tuned-alpha-3b', precision='fp16', device_map='balanced'\n)\n```\n\nIn the above example, `device_map=\"balanced\"` splits the model evenly across all available GPUs, making it possible to serve models that are too large for one device.\n\n\u003e **Note**\n\u003e The `device_map` option is supported by the [accelerate](https://github.com/huggingface/accelerate) package. 
\n\n\nSee [examples on how to use rungpt with different models.](./examples) 🔥\n\n\n## Build a model serving in one line\n\nTo do so, you can use the `serve` command:\n\n```bash\nrungpt serve stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced\n```\n\n💡 **Tip**: you can inspect the available options with `rungpt serve --help`.\n\nThis will start a gRPC and HTTP server listening on port `51000` and `52000` respectively. \nOnce the server is ready, as shown below:\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\u003cimg src=\"https://github.com/jina-ai/rungpt/blob/main/.github/images/serve_ready.png\" width=\"600px\"\u003e\n\u003c/details\u003e\n\nYou can then send requests to the server:\n\n```python\nimport requests\n\nprompt = \"Once upon a time,\"\n\nresponse = requests.post(\n    \"http://localhost:51000/generate\",\n    json={\n        \"prompt\": prompt,\n        \"max_length\": 100,\n        \"temperature\": 0.9,\n        \"top_k\": 50,\n        \"top_p\": 0.95,\n        \"repetition_penalty\": 1.2,\n        \"do_sample\": True,\n        \"num_return_sequences\": 1,\n    },\n)\n```\n\nWhat's more, we also provide a [Python client](https://github.com/jina-ai/inference-client/) (`inference-client`) for you to easily interact with the server:\n\n```python\nfrom run_gpt import Client\n\nclient = Client()\n\n# connect to the model server\nmodel = client.get_model(endpoint='grpc://0.0.0.0:51000')\n\nprompt = \"Once upon a time,\"\n\noutput = model.generate(\n    prompt,\n    max_length=100,\n    temperature=0.9,\n    top_k=50,\n    top_p=0.95,\n    repetition_penalty=1.2,\n    do_sample=True,\n    num_return_sequences=1,\n)\n```\n\nThe output has the same format as the one from the OpenAI's Python API:\n\n```\n{ \"id\": \"18d92585-7b66-4b7c-b818-71287c122c50\", \n  \"object\": \"text_completion\", \n  \"created\": 1692610173, \n  \"choices\": [{\"text\": \"Once upon a time, there was an old man who lived in the 
forest. He had no children\", \n              \"finish_reason\": \"length\", \n              \"index\": 0.0}], \n  \"prompt\": \"Once upon a time,\", \n  \"usage\": {\"completion_tokens\": 21, \"total_tokens\": 27, \"prompt_tokens\": 6}}\n```\n\nFor the streaming output, you can install `sseclient-py` first:\n```bash\npip install sseclient-py\n```\n\nAnd send the request to `http://localhost:51000/generate_stream` with the same payload.\n\n```python\nimport sseclient\nimport requests\n\nprompt = \"Once upon a time,\"\n\nresponse = requests.post(\n    \"http://localhost:51000/generate_stream\",\n    json={\n        \"prompt\": prompt,\n        \"max_length\": 100,\n        \"temperature\": 0.9,\n        \"top_k\": 50,\n        \"top_p\": 0.95,\n        \"repetition_penalty\": 1.2,\n        \"do_sample\": True,\n        \"num_return_sequences\": 1,\n    },\n    stream=True,\n)\nclient = sseclient.SSEClient(response)\nfor event in client.events():\n    print(event.data)\n```\n\nAnd the output will be streamed back to you (only show 3 iterations here):\n\n```\n{ \"id\": \"18d92585-7b66-4b7c-b818-71287c122c51\", \n  \"object\": \"text_completion\", \n  \"created\": 1692610173, \n  \"choices\": [{\"text\": \" there\", \"finish_reason\": None, \"index\": 0.0}], \n  \"prompt\": \"Once upon a time,\", \n  \"usage\": {\"completion_tokens\": 1, \"total_tokens\": 7, \"prompt_tokens\": 6}},\n{ \"id\": \"18d92585-7b66-4b7c-b818-71287c122c52\", \n  \"object\": \"text_completion\", \n  \"created\": 1692610173, \n  \"choices\": [{\"text\": \"was\", \"finish_reason\": None, \"index\": 0.0}], \n  \"prompt\": None, \n  \"usage\": {\"completion_tokens\": 2, \"total_tokens\": 9, \"prompt_tokens\": 7}},\n{ \"id\": \"18d92585-7b66-4b7c-b818-71287c122c53\", \n  \"object\": \"text_completion\", \n  \"created\": 1692610173, \n  \"choices\": [{\"text\": \"an\", \"finish_reason\": None, \"index\": 0.0}], \n  \"prompt\": None, \n  \"usage\": {\"completion_tokens\": 3, \"total_tokens\": 11, 
\"prompt_tokens\": 8}}\n```\n\nWe also support chat mode, which is useful for interactive applications. The inputs for `chat` should be a list of \ndictionaries which contain role and content. For example:\n\n```python\nimport requests\n\nmessages = [\n    {\"role\": \"user\", \"content\": \"Hello!\"},\n]\n\nresponse = requests.post(\n    \"http://localhost:51000/chat\",\n    json={\n        \"messages\": messages,\n        \"max_length\": 100,\n        \"temperature\": 0.9,\n        \"top_k\": 50,\n        \"top_p\": 0.95,\n        \"repetition_penalty\": 1.2,\n        \"do_sample\": True,\n        \"num_return_sequences\": 1,\n    },\n)\n```\n\nThe response will be:\n\n```\n{\"id\": \"18d92585-7b66-4b7c-b818-71287c122c57\", \n  \"object\": \"chat.completion\", \n  \"created\": 1692610173, \n  \"choices\": [{\"message\": {\n                            \"role\": \"assistant\",\n                            \"content\": \"\\n\\nHello there, how may I assist you today?\",\n                        }, \n              \"finish_reason\": \"stop\", \"index\": 0.0}], \n  \"prompt\": \"Hello there!\", \n  \"usage\": {\"completion_tokens\": 12, \"total_tokens\": 15, \"prompt_tokens\": 3}}\n```\n\nYou can also replace the `chat` with `chat_stream` to get the streaming output.\n\n\n\n## Cloud-native deployment\n\nYou can also deploy the server to a cloud provider like Jina Cloud or AWS.\nTo do so, you can use `deploy` command:\n\n### Jina Cloud\n\nusing predefined executor\n\n```bash\nrungpt deploy stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced --cloud jina --replicas 1\n```\n\nIt will give you a HTTP url and a gRPC url by default:\n```bash\nhttps://{random-host-name}-http.wolf.jina.ai\ngrpcs://{random-host-name}-grpc.wolf.jina.ai\n```\n\n\n### AWS\n\nTBD\n\n\n## Benchmark\nWe have done some benchmarking on different model architectures and different configurations (whether to use \nquantization, torch.compile and page attention ...), regards to the 
latency, throughput (prefill stage \u0026\u0026 the whole \ndecoding process) and perplexity. \n\nThe script for benchmarking locates at `scripts/benchmark.py`. You can run the scripts to get the benchmarking results.\n\n### Environment Setting\nWe use a single RTX3090 (cuda version is 11.8) for all benchmarking except for Llama-2-13b (2*RTX3090). We use:\n```\ntorch==2.0.1 (without torch.compile) / torch==2.1.0.dev20230803 (with torch.compile)\nbitsandbytes==0.41.0\ntransformers==4.31.0\ntriton==2.0.0\n```\n\n### Model Candidates\n|             Model_Name             |\n|:----------------------------------:|\n|      meta-llama/Llama-2-7b-hf      |\n|           mosaicml/mpt-7b          |\n| stabilityai/stablelm-base-alpha-7b |\n|         EleutherAI/gpt-j-6B        |\n\n\n### Benchmarking Results\n\n- **Latency/throughput for different models** (precision: fp16)\n\n|             Model_Name             | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |\n|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|\n|      meta-llama/Llama-2-7b-hf      |                 49                |                20.619               |               49.4               |               20.054               |\n|      meta-llama/Llama-2-13b-hf     |                175                |                5.727                |              188.27              |                4.836               |\n|           mosaicml/mpt-7b          |                 27                |                37.527               |               28.04              |               35.312               |\n| stabilityai/stablelm-base-alpha-7b |                 50                |                20.09                |               45.73              |               21.878               |\n|      
   EleutherAI/gpt-j-6B        |                 75                |                13.301               |               76.15              |               11.181               |\n\n\n- **Latency/throughput for different models using torch.compile** (precision: fp16)\n\n\u003e **Warning**\n\u003e torch.compile doesn't support Flash-Attention based model like MPT. Also, it cannot be used in multi-GPUs environment.\n\n|             Model_Name             | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |\n|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|\n|      meta-llama/Llama-2-7b-hf      |                 25                |                40.644               |               26.54              |                37.75               |\n|      meta-llama/Llama-2-13b-hf     |                 -                 |                  -                  |                 -                |                  -                 |\n|           mosaicml/mpt-7b          |                 -                 |                  -                  |                 -                |                  -                 |\n| stabilityai/stablelm-base-alpha-7b |                 44                |                22.522               |               42.97              |               21.413               |\n|         EleutherAI/gpt-j-6B        |                 32                |                31.488               |               33.89              |               25.105               |\n\n- **Latency/throughput for different models using quantization** (precision: fp16 / bit8 / bit4)\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eprefill latency (ms/token)\u003c/th\u003e\n    
\u003cth colspan=\"3\"\u003eprefill throughput (tokens/s)\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003edecode latency (ms/token)\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003edecode throughput (tokens/s)\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emeta-llama/Llama-2-7b-hf\u003c/td\u003e\n    \u003ctd\u003e49\u003c/td\u003e\n    \u003ctd\u003e301\u003c/td\u003e\n    \u003ctd\u003e125\u003c/td\u003e\n    \u003ctd\u003e20.619\u003c/td\u003e\n    \u003ctd\u003e3.325\u003c/td\u003e\n    \u003ctd\u003e8.015\u003c/td\u003e\n    \u003ctd\u003e49.4\u003c/td\u003e\n    \u003ctd\u003e256.44\u003c/td\u003e\n    \u003ctd\u003e112.22\u003c/td\u003e\n    \u003ctd\u003e20.054\u003c/td\u003e\n    \u003ctd\u003e3.9\u003c/td\u003e\n    \u003ctd\u003e8.918\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emeta-llama/Llama-2-13b-hf\u003c/td\u003e\n    \u003ctd\u003e175\u003c/td\u003e\n    \u003ctd\u003e974\u003c/td\u003e\n    \u003ctd\u003e376\u003c/td\u003e\n    \u003ctd\u003e5.727\u003c/td\u003e\n    \u003ctd\u003e1.027\u003c/td\u003e\n    \u003ctd\u003e2.662\u003c/td\u003e\n    \u003ctd\u003e182.27\u003c/td\u003e\n    \u003ctd\u003e796.32\u003c/td\u003e\n    \u003ctd\u003e349.93\u003c/td\u003e\n    \u003ctd\u003e4.836\u003c/td\u003e\n    \u003ctd\u003e1.144\u003c/td\u003e\n    \u003ctd\u003e2.662\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    
\u003ctd\u003emosaicml/mpt-7b\u003c/td\u003e\n    \u003ctd\u003e27\u003c/td\u003e\n    \u003ctd\u003e139\u003c/td\u003e\n    \u003ctd\u003e86\u003c/td\u003e\n    \u003ctd\u003e37.527\u003c/td\u003e\n    \u003ctd\u003e7.222\u003c/td\u003e\n    \u003ctd\u003e11.6\u003c/td\u003e\n    \u003ctd\u003e28.04\u003c/td\u003e\n    \u003ctd\u003e141.04\u003c/td\u003e\n    \u003ctd\u003e94.22\u003c/td\u003e\n    \u003ctd\u003e35.312\u003c/td\u003e\n    \u003ctd\u003e7.021\u003c/td\u003e\n    \u003ctd\u003e10.507\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003estabilityai/stablelm-base-alpha-7b\u003c/td\u003e\n    \u003ctd\u003e50\u003c/td\u003e\n    \u003ctd\u003e164\u003c/td\u003e\n    \u003ctd\u003e156\u003c/td\u003e\n    \u003ctd\u003e20.09\u003c/td\u003e\n    \u003ctd\u003e6.134\u003c/td\u003e\n    \u003ctd\u003e6.408\u003c/td\u003e\n    \u003ctd\u003e45.73\u003c/td\u003e\n    \u003ctd\u003e148.53\u003c/td\u003e\n    \u003ctd\u003e147.56\u003c/td\u003e\n    \u003ctd\u003e21.878\u003c/td\u003e\n    \u003ctd\u003e6.947\u003c/td\u003e\n    \u003ctd\u003e6.994\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEleutherAI/gpt-j-6B\u003c/td\u003e\n    \u003ctd\u003e75\u003c/td\u003e\n    \u003ctd\u003e368\u003c/td\u003e\n    \u003ctd\u003e162\u003c/td\u003e\n    \u003ctd\u003e13.301\u003c/td\u003e\n    \u003ctd\u003e2.724\u003c/td\u003e\n    \u003ctd\u003e6.195\u003c/td\u003e\n    \u003ctd\u003e76.15\u003c/td\u003e\n    \u003ctd\u003e365.51\u003c/td\u003e\n    \u003ctd\u003e138.44\u003c/td\u003e\n    \u003ctd\u003e11.181\u003c/td\u003e\n    \u003ctd\u003e2.327\u003c/td\u003e\n    \u003ctd\u003e5.642\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n\n- **Perplexity for different models using quantization** (precision: fp16 / bit8 / bit4)\n\n\u003e **Notice**\n\u003e From this benchmark we see that quantization doesn't affect the perplexity of the model too much.\n\n\u003ctable\u003e\n\u003cthead\u003e\n  
\u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003ewikitext2\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eptb \u003c/th\u003e\n    \u003cth colspan=\"3\"\u003ec4\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n    \u003ctd\u003efp16\u003c/td\u003e\n    \u003ctd\u003ebit8\u003c/td\u003e\n    \u003ctd\u003ebit4\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emeta-llama/Llama-2-7b-hf\u003c/td\u003e\n    \u003ctd\u003e5.4721\u003c/td\u003e\n    \u003ctd\u003e5.506\u003c/td\u003e\n    \u003ctd\u003e5.6437\u003c/td\u003e\n    \u003ctd\u003e22.9483\u003c/td\u003e\n    \u003ctd\u003e23.8797\u003c/td\u003e\n    \u003ctd\u003e25.0556\u003c/td\u003e\n    \u003ctd\u003e6.9727\u003c/td\u003e\n    \u003ctd\u003e7.0098\u003c/td\u003e\n    \u003ctd\u003e7.1623\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emeta-llama/Llama-2-13b-hf\u003c/td\u003e\n    \u003ctd\u003e4.8837\u003c/td\u003e\n    \u003ctd\u003e4.9229\u003c/td\u003e\n    \u003ctd\u003e4.9811\u003c/td\u003e\n    \u003ctd\u003e27.6802\u003c/td\u003e\n    \u003ctd\u003e27.9665\u003c/td\u003e\n    \u003ctd\u003e28.8417\u003c/td\u003e\n    \u003ctd\u003e6.4677\u003c/td\u003e\n    \u003ctd\u003e6.4884\u003c/td\u003e\n    \u003ctd\u003e6.566\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emosaicml/mpt-7b\u003c/td\u003e\n    \u003ctd\u003e7.6829\u003c/td\u003e\n    \u003ctd\u003e7.7256\u003c/td\u003e\n    \u003ctd\u003e7.9869\u003c/td\u003e\n    \u003ctd\u003e10.6002\u003c/td\u003e\n    \u003ctd\u003e10.6743\u003c/td\u003e\n    \u003ctd\u003e10.9486\u003c/td\u003e\n    \u003ctd\u003e9.6001\u003c/td\u003e\n    
\u003ctd\u003e9.6457\u003c/td\u003e\n    \u003ctd\u003e9.879\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003estabilityai/stablelm-base-alpha-7b\u003c/td\u003e\n    \u003ctd\u003e14.1886\u003c/td\u003e\n    \u003ctd\u003e14.268\u003c/td\u003e\n    \u003ctd\u003e15.9817\u003c/td\u003e\n    \u003ctd\u003e19.2968\u003c/td\u003e\n    \u003ctd\u003e19.4904\u003c/td\u003e\n    \u003ctd\u003e21.3513\u003c/td\u003e\n    \u003ctd\u003e48.222\u003c/td\u003e\n    \u003ctd\u003e48.3384\u003c/td\u003e\n    \u003ctd\u003e57.022\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eEleutherAI/gpt-j-6B\u003c/td\u003e\n    \u003ctd\u003e8.8563\u003c/td\u003e\n    \u003ctd\u003e8.8786\u003c/td\u003e\n    \u003ctd\u003e9.0301\u003c/td\u003e\n    \u003ctd\u003e13.5946\u003c/td\u003e\n    \u003ctd\u003e13.6137\u003c/td\u003e\n    \u003ctd\u003e13.784\u003c/td\u003e\n    \u003ctd\u003e11.7114\u003c/td\u003e\n    \u003ctd\u003e11.7293\u003c/td\u003e\n    \u003ctd\u003e11.8929\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n- **Latency/throughput for different models using vllm** (precision: fp16)\n\n\u003e **Warning**\n\u003e vllm brings a significant improvement in latency and throughput, but it is not compatible with streaming output, so we don't release it yet.\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eprefill latency (ms/token)\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eprefill throughput (tokens/s)\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003edecode latency (ms/token)\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003edecode throughput (tokens/s)\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eusing vllm\u003c/td\u003e\n    \u003ctd\u003ebaseline\u003c/td\u003e\n    \u003ctd\u003eusing vllm\u003c/td\u003e\n    
\u003ctd\u003ebaseline\u003c/td\u003e\n    \u003ctd\u003eusing vllm\u003c/td\u003e\n    \u003ctd\u003ebaseline\u003c/td\u003e\n    \u003ctd\u003eusing vllm\u003c/td\u003e\n    \u003ctd\u003ebaseline\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003emeta-llama/Llama-2-7b-hf\u003c/td\u003e\n    \u003ctd\u003e29\u003c/td\u003e\n    \u003ctd\u003e49\u003c/td\u003e\n    \u003ctd\u003e34.939\u003c/td\u003e\n    \u003ctd\u003e20.619\u003c/td\u003e\n    \u003ctd\u003e20.34\u003c/td\u003e\n    \u003ctd\u003e49.40\u003c/td\u003e\n    \u003ctd\u003e48.67\u003c/td\u003e\n    \u003ctd\u003e20.054\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n## Contributing\n\nWe welcome contributions from the community! To contribute, please submit a pull request following our contributing guidelines.\n\n## License\n\nRunGPT is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjina-ai%2Frungpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjina-ai%2Frungpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjina-ai%2Frungpt/lists"}