{"id":15012787,"url":"https://github.com/microsoft/batch-inference","last_synced_at":"2025-04-30T11:29:26.686Z","repository":{"id":154984601,"uuid":"628882200","full_name":"microsoft/batch-inference","owner":"microsoft","description":"Dynamic batching library for Deep Learning inference. Tutorials for LLM, GPT scenarios.","archived":false,"fork":false,"pushed_at":"2024-08-14T08:35:20.000Z","size":277,"stargazers_count":90,"open_issues_count":2,"forks_count":3,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-02-02T01:06:22.410Z","etag":null,"topics":["deep-learning","dynamic-batching","gpt","inference","llm","performance-optimization","python"],"latest_commit_sha":null,"homepage":"https://microsoft.github.io/batch-inference/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-17T07:15:38.000Z","updated_at":"2025-01-25T07:33:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"4d1f5cae-6657-43a0-b5d4-1974a0814c60","html_url":"https://github.com/microsoft/batch-inference","commit_stats":{"total_commits":32,"total_committers":7,"mean_commits":4.571428571428571,"dds":0.5625,"last_synced_commit":"a8f8ce447899b2b5fdc40b60035e37bc9c6f4dbe"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fbatch-inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fbatc
h-inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fbatch-inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fbatch-inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/batch-inference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237894349,"owners_count":19383170,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","dynamic-batching","gpt","inference","llm","performance-optimization","python"],"created_at":"2024-09-24T19:43:13.363Z","updated_at":"2025-02-09T02:07:57.382Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"# Batch Inference Toolkit\n\nBatch Inference Toolkit (batch-inference) is a Python package that dynamically batches model input tensors coming from multiple requests, executes the model, un-batches the output tensors, and returns each result to its originating request. This improves system throughput through better compute parallelism and better cache locality, and the entire process is transparent to developers.
\n\n## When to use\n\nWhen you want to host Deep Learning model inference on cloud servers, especially on GPUs.\n\n## Why to use\n\nIt can improve your server throughput severalfold.\n\n## Advantages of batch-inference\n\n* Platform-independent, lightweight Python library\n* Only a few lines of code change are needed to onboard with the built-in [batching algorithms](https://microsoft.github.io/batch-inference/batcher/what_is_batcher.html)\n* Flexible APIs to support customized batching algorithms and input types\n* Supports a [multi-process remote mode](https://microsoft.github.io/batch-inference/remote_model_host.html) to avoid the Python GIL bottleneck\n* Tutorials and benchmarks on popular models:\n\n| Model | Throughput vs. Baseline | Links |\n| :-----| :---- | :---- |\n| Bert Embedding | 4.7x | [Tutorial](https://microsoft.github.io/batch-inference/examples/bert_embedding.html) |\n| GPT Completion | 16x | [Tutorial](https://microsoft.github.io/batch-inference/examples/gpt_completion.html) |\n\n## Installation\n\nInstall from pip:\n\n```bash\npython -m pip install batch-inference --upgrade\n```\n\nBuild and install from source _(for developers)_:\n\n```bash\ngit clone https://github.com/microsoft/batch-inference.git\npython -m pip install -e .[docs,testing]\n\n# if you want to format the code before committing\npip install pre-commit\npre-commit install\n\n# run unit tests\npython -m unittest discover tests\n```\n\n## Example\n\nLet's start with a toy model to learn the APIs. First, define a **predict_batch** method in your model class, then add the **batching** decorator to the class.\n\nThe **batching** decorator adds a **host()** method that creates a **ModelHost** object. The **predict** method of ModelHost takes a single query as input, and merges multiple queries into a batch before calling the **predict_batch** method.
The predict method also splits the outputs of **predict_batch** back into per-request results before returning.\n\n```python\nimport numpy as np\nfrom batch_inference import batching\nfrom batch_inference.batcher.concat_batcher import ConcatBatcher\n\n@batching(batcher=ConcatBatcher(), max_batch_size=32)\nclass MyModel:\n    def __init__(self, k, n):\n        self.weights = np.random.randn(k, n).astype(\"f\")\n\n    # shape of x: [batch_size, m, k]\n    def predict_batch(self, x):\n        y = np.matmul(x, self.weights)\n        return y\n\n# initialize MyModel with k=3 and n=3\nhost = MyModel.host(3, 3)\nhost.start()\n\n# shape of x: [1, 3, 3]\ndef process_request(x):\n    y = host.predict(x)\n    return y\n\nhost.stop()\n```\n\nA **Batcher** is responsible for merging queries and splitting outputs. In this case, ConcatBatcher concatenates the input tensors into one batched tensor along the first dimension. We provide a set of built-in Batchers for common scenarios, and you can also implement your own. See [What is Batcher](https://microsoft.github.io/batch-inference/batcher/what_is_batcher.html) for more information.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fbatch-inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fbatch-inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fbatch-inference/lists"}