{"id":14958377,"url":"https://github.com/rodrigobaron/quick-deploy","last_synced_at":"2025-03-08T10:32:54.560Z","repository":{"id":43830522,"uuid":"424075001","full_name":"rodrigobaron/quick-deploy","owner":"rodrigobaron","description":"Optimize, convert and deploy machine learning models as fast inference API using Triton and ORT. Currently support Hugging Face transformers, PyToch, Tensorflow, SKLearn and XGBoost models.","archived":false,"fork":false,"pushed_at":"2022-02-16T23:28:54.000Z","size":20454,"stargazers_count":6,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-11T08:21:09.538Z","etag":null,"topics":["deep-learning","huggingface-transformers","inference","machine-learning","mlops","onnx","pytorch","sklearn","tensorflow","triton","xgboost"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/quick-deploy/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rodrigobaron.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-11-03T03:05:12.000Z","updated_at":"2022-11-11T12:28:06.000Z","dependencies_parsed_at":"2022-08-24T07:50:45.477Z","dependency_job_id":null,"html_url":"https://github.com/rodrigobaron/quick-deploy","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rodrigobaron%2Fquick-deploy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rodrigobaron%2Fquick-deploy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rodrig
obaron%2Fquick-deploy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rodrigobaron%2Fquick-deploy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rodrigobaron","download_url":"https://codeload.github.com/rodrigobaron/quick-deploy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221055851,"owners_count":16747982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","huggingface-transformers","inference","machine-learning","mlops","onnx","pytorch","sklearn","tensorflow","triton","xgboost"],"created_at":"2024-09-24T13:16:54.002Z","updated_at":"2024-10-22T03:26:16.033Z","avatar_url":"https://github.com/rodrigobaron.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Quick-Deploy\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/rodrigobaron/quick-deploy/actions/workflows/build.yaml\"\u003e\n        \u003cimg alt=\"Build\" src=\"https://github.com/rodrigobaron/quick-deploy/actions/workflows/build.yaml/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/rodrigobaron/quick-deploy/blob/main/LICENSE\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/github/license/rodrigobaron/quick-deploy.svg?color=blue\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/rodrigobaron/quick-deploy/releases\"\u003e\n        \u003cimg alt=\"GitHub release\" src=\"https://img.shields.io/github/release/rodrigobaron/quick-deploy.svg\"\u003e\n    
\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\n    Optimize and deploy machine learning models as fast and easily as possible.\n\u003c/h3\u003e\n\nquick-deploy provides tools to optimize, convert and deploy machine learning models as a fast inference API (low latency and high throughput) served by [Triton Inference Server](https://github.com/triton-inference-server/server) with the [ONNX Runtime](https://github.com/microsoft/onnxruntime) backend. It supports 🤗 transformers, PyTorch, TensorFlow, scikit-learn and XGBoost models.\n\n\n## Get Started\n\nLet's walk through a quick example: deploying a BERT transformer for GPU inference. quick-deploy already supports 🤗 transformers, so we can specify either the path to a pretrained model or just its name on the Hub:\n\n```bash\n$ quick-deploy transformers \\\n    -n my-bert-base \\\n    -p text-classification \\\n    -m bert-base-uncased \\\n    -o ./models \\\n    --model-type bert \\\n    --seq-len 128 \\\n    --cuda\n```\n\nThe command above creates the deployment artifacts by optimizing the model and converting it to ONNX. 
Next, just run the inference server:\n```bash\n$ docker run -it --rm \\\n    --gpus all \\\n    --shm-size 256m \\\n    -p 8000:8000 \\\n    -p 8001:8001 \\\n    -p 8002:8002 \\\n    -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \\\n    tritonserver --model-repository=/models\n```\n\nNow we can use tritonclient, which makes HTTP calls here, to consume our model:\n```python\nimport numpy as np\nimport tritonclient.http\nfrom scipy.special import softmax\nfrom transformers import BertTokenizer, TensorType\n\n\ntokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\nmodel_name = \"my_bert_base\"\nurl = \"127.0.0.1:8000\"\nmodel_version = \"1\"\nbatch_size = 1\n\ntext = \"The goal of life is [MASK].\"\ntokens = tokenizer(text=text, return_tensors=TensorType.NUMPY)\n\ntriton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)\nassert triton_client.is_model_ready(\n    model_name=model_name, model_version=model_version\n), f\"model {model_name} not yet ready\"\n\ninput_ids = tritonclient.http.InferInput(name=\"input_ids\", shape=(batch_size, 9), datatype=\"INT64\")\ntoken_type_ids = tritonclient.http.InferInput(name=\"token_type_ids\", shape=(batch_size, 9), datatype=\"INT64\")\nattention = tritonclient.http.InferInput(name=\"attention_mask\", shape=(batch_size, 9), datatype=\"INT64\")\nmodel_output = tritonclient.http.InferRequestedOutput(name=\"output\", binary_data=False)\n\n# Repeat the single tokenized example along the batch axis\ninput_ids.set_data_from_numpy(np.repeat(tokens['input_ids'], batch_size, axis=0))\ntoken_type_ids.set_data_from_numpy(np.repeat(tokens['token_type_ids'], batch_size, axis=0))\nattention.set_data_from_numpy(np.repeat(tokens['attention_mask'], batch_size, axis=0))\n\nresponse = triton_client.infer(\n    model_name=model_name,\n    model_version=model_version,\n    inputs=[input_ids, token_type_ids, attention],\n    outputs=[model_output],\n)\n\ntoken_logits = response.as_numpy(\"output\")\nprint(token_logits)\n```\n\n**Note:** This deploys only the model; the tokenizer and post-processing should be done on 
the client side. The full transformers deployment is coming soon.\n\nFor more use cases please check the [examples](examples) page.\n\n## Install\n\nBefore installing, choose the extra that matches your target model, e.g. \"torch\", \"sklearn\" or \"all\". There are two ways to use quick-deploy. Either run the Docker container:\n```bash\n$ docker run --rm -it rodrigobaron/quick-deploy:0.1.1-all --help\n```\n\nor install the Python library `quick-deploy`:\n\n```bash\n$ pip install quick-deploy[all]\n```\n\n**Note:** This installs the full version, `all`.\n\n## Contributing\n\nPlease follow the [Contributing](CONTRIBUTING.md) guide.\n\n## License\n\n[Apache License 2.0](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodrigobaron%2Fquick-deploy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frodrigobaron%2Fquick-deploy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodrigobaron%2Fquick-deploy/lists"}