Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rodrigobaron/quick-deploy
Optimize, convert and deploy machine learning models as a fast inference API using Triton and ORT. Currently supports Hugging Face transformers, PyTorch, TensorFlow, SKLearn and XGBoost models.
deep-learning huggingface-transformers inference machine-learning mlops onnx pytorch sklearn tensorflow triton xgboost
Last synced: 1 day ago
- Host: GitHub
- URL: https://github.com/rodrigobaron/quick-deploy
- Owner: rodrigobaron
- License: apache-2.0
- Created: 2021-11-03T03:05:12.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-16T23:28:54.000Z (over 2 years ago)
- Last Synced: 2024-09-13T18:58:56.353Z (12 days ago)
- Topics: deep-learning, huggingface-transformers, inference, machine-learning, mlops, onnx, pytorch, sklearn, tensorflow, triton, xgboost
- Language: Python
- Homepage: https://pypi.org/project/quick-deploy/
- Size: 19.5 MB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# Quick-Deploy
Optimize and deploy machine learning models as quickly and easily as possible.

quick-deploy provides tools to optimize, convert and deploy machine learning models as a fast inference API (low latency and high throughput) with [Triton Inference Server](https://github.com/triton-inference-server/server) using the [ONNX Runtime](https://github.com/microsoft/onnxruntime) backend. It supports 🤗 transformers, PyTorch, TensorFlow, SKLearn and XGBoost models.
## Get Started
Let's see a quick example by deploying BERT for GPU inference. quick-deploy already supports 🤗 transformers, so we can pass either the path of a pretrained model or just its name from the Hub:
```bash
$ quick-deploy transformers \
-n my-bert-base \
-p text-classification \
-m bert-base-uncased \
-o ./models \
--model-type bert \
--seq-len 128 \
--cuda
```

The command above creates the deployment artifacts by optimizing the model and converting it to ONNX.
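The artifacts land in `./models` as a Triton model repository. The layout below is a sketch of the usual Triton repository convention, not verbatim quick-deploy output:

```bash
models/
└── my-bert-base/
    ├── config.pbtxt    # Triton model configuration
    └── 1/
        └── model.onnx  # optimized ONNX model, version 1
```

Next, just run the inference server: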
```bash
$ docker run -it --rm \
--gpus all \
--shm-size 256m \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
-v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
  tritonserver --model-repository=/models
```
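Before sending requests, it's worth checking that the server and model are ready. Triton serves the KServe v2 HTTP API on port 8000, so a readiness probe is a plain HTTP call (endpoints per Triton's v2 API; the model name matches what we deployed above):

```bash
# 200 means the server is up and ready
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# per-model readiness check
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/my-bert-base/ready
```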
Now we can use tritonclient to consume our model; this example uses the HTTP endpoint on port 8000:
```python
import numpy as np
import tritonclient.http
from scipy.special import softmax
from transformers import BertTokenizer, TensorType

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model_name = "my-bert-base"  # must match the name passed to `quick-deploy -n` above
url = "127.0.0.1:8000"
model_version = "1"
batch_size = 1

text = "The goal of life is [MASK]."
tokens = tokenizer(text=text, return_tensors=TensorType.NUMPY)

triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
    model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"

input_ids = tritonclient.http.InferInput(name="input_ids", shape=(batch_size, 9), datatype="INT64")
token_type_ids = tritonclient.http.InferInput(name="token_type_ids", shape=(batch_size, 9), datatype="INT64")
attention = tritonclient.http.InferInput(name="attention_mask", shape=(batch_size, 9), datatype="INT64")
model_output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)

# with batch_size == 1, multiplying leaves the arrays unchanged
input_ids.set_data_from_numpy(tokens['input_ids'] * batch_size)
token_type_ids.set_data_from_numpy(tokens['token_type_ids'] * batch_size)
attention.set_data_from_numpy(tokens['attention_mask'] * batch_size)

response = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[input_ids, token_type_ids, attention],
    outputs=[model_output],
)

token_logits = response.as_numpy("output")
print(token_logits)
```

**Note:** This deploys only the model; the tokenizer and post-processing should run on the client side. Full transformers deployment is coming soon.
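As a minimal sketch of that client-side post-processing, continuing from the snippet above and assuming the exported model returns masked-LM logits of shape `(batch_size, seq_len, vocab_size)` in the `output` tensor, the already-imported `softmax` can rank candidate tokens for the `[MASK]` position:

```python
# locate the [MASK] token in the encoded input
mask_index = int(np.where(tokens["input_ids"][0] == tokenizer.mask_token_id)[0][0])

# softmax over the vocabulary at that position, then take the top-5 candidates
probs = softmax(token_logits[0, mask_index])
top_ids = np.argsort(probs)[-5:][::-1]
print(tokenizer.convert_ids_to_tokens(top_ids))
```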
For more use cases please check the [examples](examples) page.
## Install
Before installing, make sure to pick just the extra for your target framework, e.g. "torch", "sklearn" or "all". There are two options for using quick-deploy: as a Docker container:
```bash
$ docker run --rm -it rodrigobaron/quick-deploy:0.1.1-all --help
```

or install the Python library `quick-deploy`:
```bash
$ pip install quick-deploy[all]
```

**Note:** This will install the full version, `all`.
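To pull in only a single framework's dependencies, install the matching extra instead (the extra name here assumes it mirrors the frameworks listed above):

```bash
$ pip install "quick-deploy[torch]"
```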
## Contributing
Please follow the [Contributing](CONTRIBUTING.md) guide.
## License
[Apache License 2.0](LICENSE)