{"id":18903991,"url":"https://github.com/dudeperf3ct/15-mlserver-deploy","last_synced_at":"2025-08-03T22:37:30.435Z","repository":{"id":191864844,"uuid":"464073410","full_name":"dudeperf3ct/15-mlserver-deploy","owner":"dudeperf3ct","description":null,"archived":false,"fork":false,"pushed_at":"2022-02-27T08:01:34.000Z","size":7,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-31T10:17:45.955Z","etag":null,"topics":["deployment","huggingface-transformers","mlserver","seldon-core"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dudeperf3ct.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-02-27T07:55:41.000Z","updated_at":"2024-08-14T17:58:58.000Z","dependencies_parsed_at":"2023-09-01T07:12:33.266Z","dependency_job_id":"90865c37-1b31-421f-a4f5-1c88613179f0","html_url":"https://github.com/dudeperf3ct/15-mlserver-deploy","commit_stats":null,"previous_names":["dudeperf3ct/15-mlserver-deploy"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dudeperf3ct%2F15-mlserver-deploy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dudeperf3ct%2F15-mlserver-deploy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dudeperf3ct%2F15-mlserver-deploy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dudeperf3ct%2F15-mlserver-deploy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dudeperf3c
t","download_url":"https://codeload.github.com/dudeperf3ct/15-mlserver-deploy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239888555,"owners_count":19713692,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deployment","huggingface-transformers","mlserver","seldon-core"],"created_at":"2024-11-08T09:07:03.047Z","updated_at":"2025-02-20T17:46:36.415Z","avatar_url":"https://github.com/dudeperf3ct.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MLServer\n\n[MLServer](https://mlserver.readthedocs.io/en/latest/) aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with [KFServing's V2 Dataplane spec](https://kserve.github.io/website/modelserving/inference_api/). 
The list of cool features includes:\n\n* [Adaptive batching](https://mlserver.readthedocs.io/en/latest/user-guide/adaptive-batching.html), to group inference requests together on the fly.\n* [Parallel Inference Serving](https://mlserver.readthedocs.io/en/latest/user-guide/parallel-inference.html), for vertical scaling across multiple models through a pool of inference workers.\n* Multi-model serving, to run multiple models within the same process.\n* Support for the standard [V2 Inference Protocol](https://kserve.github.io/website/modelserving/inference_api/) on both the gRPC and REST flavours.\n* Scalability with deployment in Kubernetes-native frameworks, including [Seldon Core](https://docs.seldon.io/projects/seldon-core/en/latest/graph/protocols.html#v2-kfserving-protocol) and [KServe](https://kserve.github.io/website/modelserving/v1beta1/sklearn/v2/), where MLServer is the core Python inference server used to serve machine learning models.\n\n[Inference runtimes](https://github.com/SeldonIO/MLServer/blob/master/docs/runtimes/index.md) allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. MLServer also provides out-of-the-box inference runtimes for many frameworks, such as:\n\n1. [Scikit-Learn](https://github.com/SeldonIO/MLServer/blob/master/runtimes/sklearn)\n2. [XGBoost](https://github.com/SeldonIO/MLServer/blob/master/runtimes/xgboost)\n3. [Spark MLlib](https://github.com/SeldonIO/MLServer/blob/master/runtimes/mllib)\n4. [LightGBM](https://github.com/SeldonIO/MLServer/blob/master/runtimes/lightgbm)\n5. [Tempo](https://github.com/SeldonIO/tempo)\n6. [MLflow](https://github.com/SeldonIO/MLServer/blob/master/runtimes/mlflow)\n7. [Writing custom runtimes](https://github.com/SeldonIO/MLServer/blob/master/docs/runtimes/custom.md)\n\nIn this exercise, we will deploy a Hugging Face transformer model for sentiment analysis. 
Since MLServer does not provide out-of-the-box support for PyTorch or Transformer models, we will write a custom inference runtime to deploy this model.\n\n```bash\npip install mlserver\n# to install one of the out-of-the-box runtimes\npip install mlserver-sklearn # or any of the frameworks supported above\n```\n\n## Custom Inference Runtime\n\nIt's very easy to extend MLServer to a framework other than the supported ones by writing a custom inference runtime. To add support for our framework, we extend the `mlserver.MLModel` abstract class and override two main methods:\n\n* `load(self) -\u003e bool`: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).\n* `predict(self, payload: InferenceRequest) -\u003e InferenceResponse`: Responsible for using a model to perform inference on an incoming data point.\n\n```python\nimport json\nfrom collections import defaultdict\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom mlserver import MLModel, types\nfrom mlserver.codecs import StringCodec\nfrom mlserver.utils import get_model_uri\nfrom transformers import DistilBertForSequenceClassification, DistilBertTokenizer\n\n\nclass SentimentModel(MLModel):\n    \"\"\"\n    Implementation of the MLModel interface to load and serve custom Hugging Face transformer models.\n    \"\"\"\n\n    # load the model weights and tokenizer from the configured model URI\n    async def load(self) -\u003e bool:\n\n        self.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n        model_uri = await get_model_uri(self._settings)\n\n        self.model_name = model_uri\n        self.model = DistilBertForSequenceClassification.from_pretrained(\n            self.model_name\n        )\n        self.model.eval()\n        self.model.to(self.device)\n        self.tokenizer = DistilBertTokenizer.from_pretrained(self.model_name)\n\n        self.ready = True\n        return self.ready\n\n    # output predictions\n    async def predict(self, payload: types.InferenceRequest) -\u003e types.InferenceResponse:\n        input_id, attention_mask = self._preprocess_inputs(payload)\n        prediction = self._model_predict(input_id, attention_mask)\n\n        return types.InferenceResponse(\n            model_name=self.name,\n            model_version=self.version,\n            outputs=[\n               
 types.ResponseOutput(\n                    name=\"predictions\",\n                    shape=list(prediction.shape),\n                    datatype=\"FP32\",\n                    data=np.asarray(prediction).tolist(),\n                )\n            ],\n        )\n\n    # decode the JSON-encoded string inputs and tokenize the `text` input\n    def _preprocess_inputs(self, payload: types.InferenceRequest):\n        inp_text = defaultdict()\n        for inp in payload.inputs:\n            inp_text[inp.name] = json.loads(\n                \"\".join(self.decode(inp, default_codec=StringCodec))\n            )\n        inputs = self.tokenizer(inp_text['text'], return_tensors=\"pt\")\n        input_id = inputs[\"input_ids\"]\n        attention_mask = inputs[\"attention_mask\"]\n        return input_id, attention_mask\n\n    # run inference and return the class probabilities for the first input\n    def _model_predict(self, input_id, attention_mask):\n        with torch.no_grad():\n            # move the inputs to the same device as the model\n            outputs = self.model(\n                input_id.to(self.device), attention_mask.to(self.device)\n            )\n            # bring the result back to the CPU before converting to NumPy\n            probs = F.softmax(outputs.logits, dim=1).cpu().numpy()[0]\n        return probs\n```\n\n### Settings files\n\nThe next step is to create two configuration files:\n\n* `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).\n* `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.).\n\n## Run\n\n### Locally\n\nTest the sentiment classifier model:\n\n```bash\ndocker build -t sentiment -f sentiment/Dockerfile.sentiment sentiment/\ndocker run --rm -it sentiment\n```\n\nTest MLServer locally:\n\n```bash\n# download trained models\nbash get_models.sh\n# create a docker image\nmlserver build . 
-t 'sentiment-app:1.0.0'\ndocker run -it --rm -p 8080:8080 -p 8081:8081 sentiment-app:1.0.0\n```\n\nIn a separate terminal, run the test scripts:\n\n```bash\n# test inference request (REST)\npython3 test_local_http_endpoint.py\n# test inference request (gRPC)\npython3 test_local_http_endpoint.py\n```\n\n### Additional Exercise\n\n* Deploy the MLServer application on [Seldon Core](https://docs.seldon.io/projects/seldon-core/en/latest/) or [KServe](https://kserve.github.io/website/modelserving/v1beta1/sklearn/v2/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdudeperf3ct%2F15-mlserver-deploy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdudeperf3ct%2F15-mlserver-deploy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdudeperf3ct%2F15-mlserver-deploy/lists"}