{"id":15130016,"url":"https://github.com/els-rd/transformer-deploy","last_synced_at":"2025-05-14T12:10:02.797Z","repository":{"id":37097913,"uuid":"423126761","full_name":"ELS-RD/transformer-deploy","owner":"ELS-RD","description":"Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀","archived":false,"fork":false,"pushed_at":"2024-10-23T17:41:45.000Z","size":34756,"stargazers_count":1683,"open_issues_count":56,"forks_count":151,"subscribers_count":26,"default_branch":"main","last_synced_at":"2025-05-08T10:15:40.974Z","etag":null,"topics":["deep-learning","deployment","inference","machine-learning","natural-language-processing","server"],"latest_commit_sha":null,"homepage":"https://els-rd.github.io/transformer-deploy/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ELS-RD.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-31T11:20:14.000Z","updated_at":"2025-04-30T10:06:05.000Z","dependencies_parsed_at":"2023-01-30T02:15:52.823Z","dependency_job_id":"5e8eecab-c4ce-4641-9535-735ae1a379bd","html_url":"https://github.com/ELS-RD/transformer-deploy","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Ftransformer-deploy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Ftransformer-deploy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Ftransformer-deploy/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELS-RD%2Ftransformer-deploy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ELS-RD","download_url":"https://codeload.github.com/ELS-RD/transformer-deploy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140760,"owners_count":22021219,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deployment","inference","machine-learning","natural-language-processing","server"],"created_at":"2024-09-26T02:26:57.970Z","updated_at":"2025-05-14T12:09:57.779Z","avatar_url":"https://github.com/ELS-RD.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hugging Face Transformer submillisecond inference️ and deployment to production: 🤗 → 🤯\n\n[![Documentation](https://img.shields.io/website?label=documentation\u0026style=for-the-badge\u0026up_message=online\u0026url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests\u0026style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.6](https://img.shields.io/badge/python-3.8-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-380/) [![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange\u0026style=for-the-badge)](https://twitter.com/pommedeterre33)\n\n### 
Optimize and deploy in **production** 🤗 Hugging Face Transformer models in a single command line.  \n\n=\u003e Up to 10X faster inference! \u003c=\n\n#### Why this tool?\n\n\u003c!--why-start--\u003e\n\nAt [Lefebvre Dalloz](https://www.lefebvre-dalloz.fr/) we run *semantic search engines* in production in the legal domain. \nIn non-marketing language, that is a re-ranker, and ours is based on `Transformer`.  \nIn these setups, latency is key to a good user experience, and relevancy inference is done online for hundreds of snippets per user query.  \nWe have tested many solutions, and below is what we found:\n\n[`Pytorch`](https://pytorch.org/) + [`FastAPI`](https://fastapi.tiangolo.com/) = 🐢  \nMost tutorials on `Transformer` deployment in production are built on Pytorch and FastAPI.\nBoth are great tools, but neither is very performant at inference (actual measurements below).  \n\n[`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = 🏃💨  \nThen, with some effort, you can build something on top of ONNX Runtime and Triton inference server.\nYou will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool!  \n\n[`Nvidia TensorRT`](https://github.com/NVIDIA/TensorRT/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ⚡️🏃💨💨  \nHowever, if you want best-in-class performance on GPU, there is only a single possible combination: Nvidia TensorRT and Triton.\nYou will usually get 5X faster inference compared to vanilla Pytorch.  \nSometimes the speedup reaches **10X faster inference**.  \nBuuuuttt... TensorRT takes some effort to master and requires tricks that are not easy to come up with; we implemented them for you!  
\n\n[Detailed tool comparison table](https://els-rd.github.io/transformer-deploy/compare/)\n\n## Features\n\n* heavily optimize transformer models for inference (CPU and GPU) -\u003e between 5X and 10X speedup\n* deploy models on `Nvidia Triton` inference servers (enterprise grade), 6X faster than `FastAPI`\n* add quantization support for both CPU and GPU\n* simple to use: optimization done in a single command line!\n* supported models: any model that can be exported to ONNX (-\u003e most of them)\n* supported tasks: document classification, token classification (NER), question answering, feature extraction (aka sentence-transformers dense embeddings), text generation\n\n\u003e Want to understand how it works under the hood?  \n\u003e Read [🤗 Hugging Face Transformer inference UNDER 1 millisecond latency 📖](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link\u0026sk=cd880e05c501c7880f2b9454830b8915)  \n\u003e \u003cimg src=\"resources/rabbit.jpg\" width=\"120\"\u003e\n\n## Want to check by yourself in 3 minutes?\n\nTo get a rough idea of the acceleration you will get on your own model, you can try the Docker-only run below.\nFor a GPU run, you need Nvidia drivers and the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) installed on your machine.\n\n**5 tasks are covered** below: \n\n* classification/reranking, \n* token classification (NER), \n* question answering, \n* feature extraction (text to dense embeddings), \n* text generation (GPT-2 style).  
\n\nMoreover, we have added a GPU `quantization` notebook that you can open directly in `Docker` and play with.\n\nFirst, clone the repo, as some commands below expect to find the `demo` folder:\n\n```shell\ngit clone git@github.com:ELS-RD/transformer-deploy.git\ncd transformer-deploy\n# pulling the docker image may take a few minutes\ndocker pull ghcr.io/els-rd/transformer-deploy:0.6.0\n```\n\n### Classification/reranking (encoder model)\n\nClassification is a common task in NLP, and large language models have shown great results.  \nThis task is also used by search engines to provide Google-like relevancy (cf. [arxiv](https://arxiv.org/abs/1901.04085)).\n\n#### Optimize existing model\n\nThis will optimize models, generate Triton configuration and Triton folder layout in a single command:\n\n```shell\ndocker run -it --rm --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m \\\"philschmid/MiniLM-L6-H384-uncased-sst2\\\" \\\n    --backend tensorrt onnx \\\n    --seq-len 16 128 128\"\n\n# output:  \n# ...\n# Inference done on NVIDIA GeForce RTX 3090\n# latencies:\n# [Pytorch (FP32)] mean=5.43ms, sd=0.70ms, min=4.88ms, max=7.81ms, median=5.09ms, 95p=7.01ms, 99p=7.53ms\n# [Pytorch (FP16)] mean=6.55ms, sd=1.00ms, min=5.75ms, max=10.38ms, median=6.01ms, 95p=8.57ms, 99p=9.21ms\n# [TensorRT (FP16)] mean=0.53ms, sd=0.03ms, min=0.49ms, max=0.61ms, median=0.52ms, 95p=0.57ms, 99p=0.58ms\n# [ONNX Runtime (FP32)] mean=1.57ms, sd=0.05ms, min=1.49ms, max=1.90ms, median=1.57ms, 95p=1.63ms, 99p=1.76ms\n# [ONNX Runtime (optimized)] mean=0.90ms, sd=0.03ms, min=0.88ms, max=1.23ms, median=0.89ms, 95p=0.95ms, 99p=0.97ms\n# Each infence engine output is within 0.3 tolerance compared to Pytorch output\n```\n\nIt will output mean latency and other statistics.  \n`Nvidia TensorRT` is usually the fastest option, and `ONNX Runtime` is a strong second option.  
\nOn ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.  \n`Pytorch` is never competitive on transformer inference, including mixed precision, whatever the model size.  \n\n#### Run Nvidia Triton inference server\n\nNote that we install `transformers` at run time.  \nFor production, it's advised to build your own 3-line Docker image with `transformers` pre-installed.\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \\\n  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install transformers \u0026\u0026 tritonserver --model-repository=/models\"\n\n# output:\n# ...\n# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\n# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\n# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n```\n\n#### Query inference\n\nQuery ONNX models (replace `transformer_onnx_inference` by `transformer_tensorrt_inference` to query TensorRT engine):\n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \\\n  --data-binary \"@demo/infinity/query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 161\"\n\n# output:\n# {\"model_name\":\"transformer_onnx_inference\",\"model_version\":\"1\",\"parameters\":{\"sequence_id\":0,\"sequence_start\":false,\"sequence_end\":false},\"outputs\":[{\"name\":\"output\",\"datatype\":\"FP32\",\"shape\":[1,2],\"data\":[-3.431640625,3.271484375]}]}\n```\n\nModel output is at the end of the Json (`data` field).\n[More information about how to query the server from `Python`, and other languages](https://els-rd.github.io/transformer-deploy/run/).\n\nTo get very low latency inference in your Python code (no inference server): [click here](https://els-rd.github.io/transformer-deploy/python/)\n\n### Token-classification (NER) (encoder 
model)\n\nToken classification assigns a label to individual tokens in a sentence.\nOne of the most common token classification tasks is Named Entity Recognition (NER). \nNER attempts to find a label for each entity in a sentence, such as a person, location, or organization.\n\n#### Optimize existing model\n\nThis will optimize models, generate Triton configuration and Triton folder layout in a single command:\n\n```shell\ndocker run -it --rm --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m \\\"kamalkraj/bert-base-cased-ner-conll2003\\\" \\\n    --backend tensorrt onnx \\\n    --seq-len 16 128 128 \\\n    --task token-classification\"\n\n# output:  \n# ...\n# Inference done on Tesla T4\n# latencies:\n# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms\n# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms\n# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms\n# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms\n# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms\n# Each infence engine output is within 0.3 tolerance compared to Pytorch output\n```\n\nIt will output mean latency and other statistics.  \nUsually `Nvidia TensorRT` is the fastest option and `ONNX Runtime` is usually a strong second option.  \nOn ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.  \n`Pytorch` is never competitive on transformer inference, including mixed precision, whatever the model size.  \n\n#### Run Nvidia Triton inference server\n\nNote that we install `transformers` at run time.  
\nFor production, it's advised to build your own 3-line Docker image with `transformers` pre-installed.\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \\\n  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html \u0026\u0026 \\\n  tritonserver --model-repository=/models\"\n\n# output:\n# ...\n# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\n# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\n# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n```\n\n#### Query inference \n\nQuery ONNX models (replace `transformer_onnx_inference` by `transformer_tensorrt_inference` to query TensorRT engine):\n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \\\n  --data-binary \"@demo/infinity/query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 161\"\n\n# output:\n# {\"model_name\":\"transformer_onnx_inference\",\"model_version\":\"1\",\"outputs\":[{\"name\":\"output\",\"datatype\":\"BYTES\",\"shape\":[],\"data\":[\"[{\\\"entity_group\\\": \\\"ORG\\\", \\\"score\\\": 0.9848777055740356, \\\"word\\\": \\\"Infinity\\\", \\\"start\\\": 45, \\\"end\\\": 53}]\"]}]}\n```\n\n### Question Answering (encoder model)\n\nQuestion Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.\n\n#### Optimize existing model\n\nThis will optimize models, generate Triton configuration and Triton folder layout in a single command:\n\n```shell\ndocker run -it --rm --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m \\\"distilbert-base-cased-distilled-squad\\\" \\\n    --backend tensorrt onnx 
\\\n    --seq-len 16 128 384 \\\n    --task question-answering\"\n\n# output:  \n# ...\n# Inference done on Tesla T4\n# latencies:\n# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms\n# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms\n# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms\n# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms\n# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms\n# Each infence engine output is within 0.3 tolerance compared to Pytorch output\n```\n\nIt will output mean latency and other statistics.  \nUsually `Nvidia TensorRT` is the fastest option and `ONNX Runtime` is usually a strong second option.  \nOn ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.  \n`Pytorch` is never competitive on transformer inference, including mixed precision, whatever the model size.  \n\n#### Run Nvidia Triton inference server\n\nNote that we install `transformers` at run time.  
\nFor production, it's advised to build your own 3-line Docker image with `transformers` pre-installed.\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 1024m \\\n  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html \u0026\u0026 \\\n  tritonserver --model-repository=/models\"\n\n# output:\n# ...\n# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\n# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\n# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n```\n\n#### Query inference \n\nQuery ONNX models (replace `transformer_onnx_inference` with `transformer_tensorrt_inference` to query the TensorRT engine):\n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \\\n  --data-binary \"@demo/question-answering/query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 276\"\n\n# output:\n# {\"model_name\":\"transformer_onnx_inference\",\"model_version\":\"1\",\"outputs\":[{\"name\":\"output\",\"datatype\":\"BYTES\",\"shape\":[],\"data\":[\"{\\\"score\\\": 0.9925152659416199, \\\"start\\\": 34, \\\"end\\\": 40, \\\"answer\\\": \\\"Berlin\\\"}\"]}]}\n```\n\nCheck out `demo/question-answering/query_bin_gen.ipynb` to see how to generate the `query_body.bin` file.\nMore inference examples can be found in `demo/question-answering/`.\n\n\n### Feature extraction / dense embeddings\n\nFeature extraction in NLP is the task of converting text to dense embeddings.  \nIt has gained some traction as a robust way to improve search engine relevancy (increase recall).  
\nThis project supports models from [sentence-transformers](https://github.com/UKPLab/sentence-transformers) and it requires \na version \u003e= V2.2.0 of sentence-transformers library.\n#### Optimize existing model\n\n```shell\ndocker run -it --rm --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m \\\"sentence-transformers/msmarco-distilbert-cos-v5\\\" \\\n    --backend tensorrt onnx \\\n    --task embedding \\\n    --seq-len 16 128 128\"\n\n# output:\n# ...\n# Inference done on NVIDIA GeForce RTX 3090\n# latencies:\n# [Pytorch (FP32)] mean=5.19ms, sd=0.45ms, min=4.74ms, max=6.64ms, median=5.03ms, 95p=6.14ms, 99p=6.26ms\n# [Pytorch (FP16)] mean=5.41ms, sd=0.18ms, min=5.26ms, max=8.15ms, median=5.36ms, 95p=5.62ms, 99p=5.72ms\n# [TensorRT (FP16)] mean=0.72ms, sd=0.04ms, min=0.69ms, max=1.33ms, median=0.70ms, 95p=0.78ms, 99p=0.81ms\n# [ONNX Runtime (FP32)] mean=1.69ms, sd=0.18ms, min=1.62ms, max=4.07ms, median=1.64ms, 95p=1.86ms, 99p=2.44ms\n# [ONNX Runtime (optimized)] mean=1.03ms, sd=0.09ms, min=0.98ms, max=2.30ms, median=1.00ms, 95p=1.15ms, 99p=1.41ms\n# Each infence engine output is within 0.3 tolerance compared to Pytorch output\n```\n\n#### Run Nvidia Triton inference server\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \\\n  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install transformers \u0026\u0026 tritonserver --model-repository=/models\"\n\n# output:\n# ...\n# I0207 11:04:33.761517 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\n# I0207 11:04:33.761844 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\n# I0207 11:04:33.803373 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n\n```\n\n#### Query inference \n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \\\n  --data-binary 
\"@demo/infinity/query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 161\"\n\n# output:\n# {\"model_name\":\"transformer_onnx_inference\",\"model_version\":\"1\",\"parameters\":{\"sequence_id\":0,\"sequence_start\":false,\"sequence_end\":false},\"outputs\":[{\"name\":\"output\",\"datatype\":\"FP32\",\"shape\":[1,768],\"data\":[0.06549072265625,-0.04327392578125,0.1103515625,-0.007320404052734375,...\n```\n\n### Generate text (decoder model)\n\nText generation seems to be the way to go for NLP.  \nUnfortunately, they are slow to run, below we will accelerate the most famous of them: GPT-2.\n\n#### GPT example\nWe will start with GPT-2 model example, then in the next section we will use T5-model.\n\n#### Optimize existing model\n\nLike before, command below will prepare Triton inference server stuff.  \nOne point to have in mind is that Triton run:\n- inference engines (`ONNX Runtime` and `TensorRT`)\n- `Python` code in charge of the `decoding` part. `Python` code delegate to Triton server the model management.\n\n`Python` code is in `./triton_models/transformer_tensorrt_generate/1/model.py`\n\n```shell\ndocker run -it --rm --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m gpt2 \\\n    --backend tensorrt onnx \\\n    --seq-len 6 256 256 \\\n    --task text-generation\"\n\n# output:\n# ...\n# Inference done on NVIDIA GeForce RTX 3090\n# latencies:\n# [Pytorch (FP32)] mean=9.43ms, sd=0.59ms, min=8.95ms, max=15.02ms, median=9.33ms, 95p=10.38ms, 99p=12.46ms\n# [Pytorch (FP16)] mean=9.92ms, sd=0.55ms, min=9.50ms, max=15.06ms, median=9.74ms, 95p=10.96ms, 99p=12.26ms\n# [TensorRT (FP16)] mean=2.19ms, sd=0.18ms, min=2.06ms, max=3.04ms, median=2.10ms, 95p=2.64ms, 99p=2.79ms\n# [ONNX Runtime (FP32)] mean=4.99ms, sd=0.38ms, min=4.68ms, max=9.09ms, median=4.78ms, 95p=5.72ms, 99p=5.95ms\n# [ONNX Runtime (optimized)] mean=3.93ms, sd=0.40ms, min=3.62ms, max=6.53ms, 
median=3.81ms, 95p=4.49ms, 99p=5.79ms\n# Each infence engine output is within 0.3 tolerance compared to Pytorch output\n```\n\nTwo detailed notebooks are available:\n\n* GPT-2: \u003chttps://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb\u003e\n* T5: \u003chttps://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5.ipynb\u003e\n\n#### Optimize existing large model\n\nTo optimize models which typically don't fit twice onto a single GPU, run the script as follows:\n\n```shell\ndocker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m gpt2-medium \\\n    --backend tensorrt onnx \\\n    --seq-len 6 256 256 \\\n    --fast \\\n    --atol 3 \\\n    --task text-generation\"\n```\n\nThe larger the model gets, the more likely it is that you need to also increase the absolute tolerance of the script.\nAdditionally, some models may return a message similar to: `Converted FP32 value in weights (either FP32 infinity or FP32 value outside FP16 range) to corresponding FP16 infinity`. It is best to test and evaluate the model afterwards to understand the implications of this conversion.\n\nDepending on model size this may take really long. 
GPT Neo 2.7B can easily take 1 hour of conversion or more.\n\n#### Run Nvidia Triton inference server\n\nTo run decoding algorithm server side, we need to install `Pytorch` on `Triton` docker image.\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \\\n  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html \u0026\u0026 \\\n  tritonserver --model-repository=/models\"\n\n# output:\n# ...\n# I0207 10:29:19.091191 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\n# I0207 10:29:19.091417 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\n# I0207 10:29:19.132902 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n```\n\n#### Query inference\n\nReplace `transformer_onnx_generate` by `transformer_tensorrt_generate` to query `TensorRT` engine.\n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_generate/versions/1/infer \\\n  --data-binary \"@demo/infinity/query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 161\"\n\n# output:\n# {\"model_name\":\"transformer_onnx_generate\",\"model_version\":\"1\",\"outputs\":[{\"name\":\"output\",\"datatype\":\"BYTES\",\"shape\":[],\"data\":[\"This live event is great. I will sign-up for Infinity.\\n\\nI'm going to be doing a live stream of the event.\\n\\nI\"]}]}\n```\n\nOk, the output is not very interesting (💩 in -\u003e 💩 out) but you get the idea.  \nSource code of the generative model is in `./triton_models/transformer_tensorrt_generate/1/model.py`.  
\nYou may want to tweak it to suit your needs (the default is greedy search with an output of 64 tokens).\n\n#### Python code\n\nYou may be interested in running optimized text generation in Python directly, without using any inference server:  \n\n```shell\ndocker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root\"\n```\n\n#### T5-small example\n\nIn this section we will present the t5-small model conversion.\n\n#### Optimize existing large model\n\nTo optimize the model, run the script as follows:\n\n```shell\ndocker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \\\n  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 \\\n    convert_model -m t5-small \\\n    --backend onnx \\\n    --seq-len 16 256 256 \\\n    --task text-generation \\\n    --nb-measures 100 \\\n    --generative-model t5 \\\n    --output triton_models\"\n```\n\n#### Run Nvidia Triton inference server\n\nTo run the decoding algorithm server side, we need to install `Pytorch` in the `Triton` docker image.\n\n```shell\ndocker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \\\n  -v $PWD/triton_models/:/models nvcr.io/nvidia/tritonserver:22.07-py3 \\\n  bash -c \"pip install onnx onnxruntime-gpu transformers==4.21.3 git+https://github.com/ELS-RD/transformer-deploy torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html \u0026\u0026 \\\n  tritonserver --model-repository=/models\"\n```\n\nTo test text generation, you can try this request:\n\n```shell\ncurl -X POST http://localhost:8000/v2/models/t5_model_generate/versions/1/infer --data-binary \"@demo/generative-model/t5_query_body.bin\" --header \"Inference-Header-Content-Length: 181\"\n\n# output:\n# 
{\"model_name\":\"t5_model_generate\",\"model_version\":\"1\",\"outputs\":[{\"name\":\"OUTPUT_TEXT\",\"datatype\":\"BYTES\",\"shape\":[],\"data\":[\"Mein Name mein Wolfgang Wolfgang und ich wohne in Berlin.\"]}]}\n```\n\n#### Query inference\n\nReplace `transformer_onnx_generate` with `transformer_tensorrt_generate` to query the `TensorRT` engine.\n\n```shell\ncurl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \\\n  --data-binary \"@demo/infinity/seq2seq_query_body.bin\" \\\n  --header \"Inference-Header-Content-Length: 176\"\n```\n\n### Model quantization on GPU\n\nQuantization is a generic method to get a 2X speedup on top of other inference optimizations.  \nGPU quantization on transformers is almost never used because it requires modifying the model source code.  \n\nThis library implements a mechanism that patches the Hugging Face transformers library to support quantization, which makes it easy to use.\n\nTo play with it, open this notebook:\n\n```shell\ndocker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \\\n  bash -c \"cd /project \u0026\u0026 jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root\"\n```\n\n\u003c!--why-end--\u003e\n\n## See our [documentation](https://els-rd.github.io/transformer-deploy/) for detailed instructions on how to use the package, including setup, GPU quantization support and Nvidia Triton inference server deployment.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fels-rd%2Ftransformer-deploy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fels-rd%2Ftransformer-deploy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fels-rd%2Ftransformer-deploy/lists"}