{"id":20503182,"url":"https://github.com/curt-park/serving-codegen-gptj-triton","last_synced_at":"2025-08-22T07:41:59.546Z","repository":{"id":170797071,"uuid":"647033356","full_name":"Curt-Park/serving-codegen-gptj-triton","owner":"Curt-Park","description":"Serving Example of CodeGen-350M-Mono-GPTJ on Triton Inference Server with Docker and Kubernetes","archived":false,"fork":false,"pushed_at":"2023-05-30T16:07:17.000Z","size":5734,"stargazers_count":20,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-25T04:51:41.979Z","etag":null,"topics":["codegen","docker","fastertransformer","huggingface-transformers","kubernetes","pytorch","triton-inference-server"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Curt-Park.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-29T23:21:45.000Z","updated_at":"2024-03-23T02:39:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"9a200cfd-c3c2-46f6-8a97-89e66b21ccd2","html_url":"https://github.com/Curt-Park/serving-codegen-gptj-triton","commit_stats":null,"previous_names":["curt-park/serving-codegen-gptj-triton"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Curt-Park/serving-codegen-gptj-triton","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Curt-Park%2Fserving-codegen-gptj-triton","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Curt-Park%2Fserving-codegen-gptj-triton/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Curt-Park%2Fserving-codegen-gptj-triton/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Curt-Park%2Fserving-codegen-gptj-triton/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Curt-Park","download_url":"https://codeload.github.com/Curt-Park/serving-codegen-gptj-triton/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Curt-Park%2Fserving-codegen-gptj-triton/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271606068,"owners_count":24788969,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["codegen","docker","fastertransformer","huggingface-transformers","kubernetes","pytorch","triton-inference-server"],"created_at":"2024-11-15T19:29:37.175Z","updated_at":"2025-08-22T07:41:59.502Z","avatar_url":"https://github.com/Curt-Park.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Serving codegen-350M-mono-gptj on Triton Inference Server\n\n![](assets/codegen.png)\n\n## Contents\n- PyTorch model conversion to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) (See 
[Artifacts](https://huggingface.co/curt-park/codegen-350M-mono-gptj)).\n- [Triton](https://github.com/triton-inference-server/server) serving with [FasterTransformer Backend](https://github.com/triton-inference-server/fastertransformer_backend).\n- Load test on Triton server (Locust)\n- A simple chatbot with [Gradio](https://github.com/gradio-app/gradio).\n- Docker compose for the server and client.\n- Kubernetes helm charts for the server and client.\n- Monitoring on K8s (Promtail + Loki \u0026 Prometheus \u0026 Grafana).\n- Autoscaling Triton (gRPC) on K8s (Triton Metrics \u0026 Traefik)\n\n## How to Run\n\n### Option1. Docker\n```bash\ndocker compose up   # Run the server \u0026 client.\n```\n\nOpen http://localhost:7860\n\n### Option2. Kubernetes (K3S)\n\nBefore you start,\n- Install [Helm](https://helm.sh/docs/intro/install/)\n\n#### Create a Service Cluster\n```bash\nmake cluster\nmake charts\n```\n\nAfter a while, `kubectl get pods` will show:\n```bash\nNAME                                                     READY   STATUS    RESTARTS   AGE\ndcgm-exporter-ltftk                                      1/1     Running   0          2m26s\nprometheus-kube-prometheus-operator-7958587c67-wxh8c     1/1     Running   0          96s\nprometheus-prometheus-node-exporter-vgx65                1/1     Running   0          96s\ntraefik-677c7d64f8-8zlh9                                 1/1     Running   0          115s\nprometheus-grafana-694f868865-58c2k                      3/3     Running   0          96s\nalertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          94s\nprometheus-kube-state-metrics-85c858f4b-8rkzv            1/1     Running   0          96s\nclient-codegen-client-5d6df644f5-slcm8                   1/1     Running   0          87s\nprometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          94s\nclient-codegen-client-5d6df644f5-tms9j                   1/1     Running   0          
72s\ntriton-57d47d448c-hkf57                                  1/1     Running   0          88s\ntriton-prometheus-adapter-674d9855f-g9d6j                1/1     Running   0          88s\nloki-0                                                   1/1     Running   0          113s\npromtail-qzvrz                                           1/1     Running   0          112s\n```\n\n#### Access to Client\nOpen http://localhost:7860\n\n#### Access to Grafana\n```bash\nkubectl port-forward svc/prometheus-grafana 3000:80\n```\nOpen http://localhost:3000\n- id: admin\n- pw: prom-operator\n\nIf you want to configure loki as data sources to monitor the service logs:\n1. Configuration -\u003e Data sources -\u003e Add data sources\n2. Select Loki\n3. Add URL: http://loki.default.svc.cluster.local:3100\n4. Click Save \u0026 test on the bottom.\n5. Explore -\u003e Select Loki\n6. job -\u003e default/client-codegen-client -\u003e Show logs\n\n#### Triton Auto-Scaling\nTo enable auto-scaling, you need to increase `maxReplicas` in `charts/triton/values.yaml`.\n```bash\n# For example,\nautoscaling:\n  minReplicas: 1\n  maxReplicas: 2\n```\nBy default, the autoscaling metric is average queuing time 50 ms for 30 seconds.\nYou can set the target value as you need.\n```bash\nautoscaling:\n  ...\n  metrics:\n    - type: Pods\n      pods:\n        metric:\n          name: avg_time_queue_us\n        target:\n          type: AverageValue\n          averageValue: 50000  # 1,000 us == 1 ms\n```\n\n#### Finalization\n```bash\nmake remove-charts\nmake finalize\n```\n\n## Artifacts\n- CodeGen-350M-mono-gptj (for Triton): https://huggingface.co/curt-park/codegen-350M-mono-gptj\n\n## For Developer\n```bash\nmake setup      # Install packages for execution.\nmake setup-dev  # Install packages for development.\nmake format     # Format the code.\nmake lint       # Lint the code.\nmake load-test  # Load test (`make setup-dev` is required).\n```\n\n## Experiments: Load Test\n\nDevice Info:\n- CPU: AMD 
EPYC Processor (with IBPB)\n- GPU: A100-SXM4-80GB x 1\n- RAM: 1.857TB\n\nExperimental Setups:\n- Single Triton instance.\n- Dynamic batching.\n- Triton docker server.\n- Output Length: 8 vs 32 vs 128 vs 512\n\n### Output Length: 8\n![](assets/loadtest_output_len_08_00.png)\n![](assets/loadtest_output_len_08_01.png)\n```bash\n# metrics\nnv_inference_count{model=\"ensemble\",version=\"1\"} 391768\nnv_inference_count{model=\"postprocessing\",version=\"1\"} 391768\nnv_inference_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 391768\nnv_inference_count{model=\"preprocessing\",version=\"1\"} 391768\n\nnv_inference_exec_count{model=\"ensemble\",version=\"1\"} 391768\nnv_inference_exec_count{model=\"postprocessing\",version=\"1\"} 391768\nnv_inference_exec_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 20439\nnv_inference_exec_count{model=\"preprocessing\",version=\"1\"} 391768\n\nnv_inference_compute_infer_duration_us{model=\"ensemble\",version=\"1\"} 6368616649\nnv_inference_compute_infer_duration_us{model=\"postprocessing\",version=\"1\"} 51508744\nnv_inference_compute_infer_duration_us{model=\"codegen-350M-mono-gptj\",version=\"1\"} 6148437063\nnv_inference_compute_infer_duration_us{model=\"preprocessing\",version=\"1\"} 168281250\n```\n\n- RPS (Response per Second) reaches around 1,715.\n- The average response time is 38 ms.\n- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)\n- Preprocessing spends 2.73% of the model inference time.\n- Postprocessing spends 0.83% of the model inference time.\n\n### Output Length: 32\n![](assets/loadtest_output_len_32_00.png)\n![](assets/loadtest_output_len_32_01.png)\n```bash\n# metrics\nnv_inference_count{model=\"ensemble\",version=\"1\"} 118812\nnv_inference_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 118812\nnv_inference_count{model=\"postprocessing\",version=\"1\"} 118812\nnv_inference_count{model=\"preprocessing\",version=\"1\"} 
118812\n\nnv_inference_exec_count{model=\"ensemble\",version=\"1\"} 118812\nnv_inference_exec_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 6022\nnv_inference_exec_count{model=\"postprocessing\",version=\"1\"} 118812\nnv_inference_exec_count{model=\"preprocessing\",version=\"1\"} 118812\n\nnv_inference_compute_infer_duration_us{model=\"ensemble\",version=\"1\"} 7163210716\nnv_inference_compute_infer_duration_us{model=\"codegen-350M-mono-gptj\",version=\"1\"} 7090601211\nnv_inference_compute_infer_duration_us{model=\"postprocessing\",version=\"1\"} 18416946\nnv_inference_compute_infer_duration_us{model=\"preprocessing\",version=\"1\"} 54073590\n```\n- RPS (Response per Second) reaches around 500.\n- The average response time is 122 ms.\n- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)\n- Preprocessing spends 0.76% of the model inference time.\n- Postprocessing spends 0.26% of the model inference time.\n\n### Output Length: 128\n![](assets/loadtest_output_len_128_00.png)\n![](assets/loadtest_output_len_128_01.png)\n```bash\nnv_inference_count{model=\"ensemble\",version=\"1\"} 14286\nnv_inference_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 14286\nnv_inference_count{model=\"preprocessing\",version=\"1\"} 14286\nnv_inference_count{model=\"postprocessing\",version=\"1\"} 14286\n\nnv_inference_exec_count{model=\"ensemble\",version=\"1\"} 14286\nnv_inference_exec_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 1121\nnv_inference_exec_count{model=\"preprocessing\",version=\"1\"} 14286\nnv_inference_exec_count{model=\"postprocessing\",version=\"1\"} 14286\n\nnv_inference_compute_infer_duration_us{model=\"ensemble\",version=\"1\"} 4509635072\nnv_inference_compute_infer_duration_us{model=\"codegen-350M-mono-gptj\",version=\"1\"} 4498667310\nnv_inference_compute_infer_duration_us{model=\"preprocessing\",version=\"1\"} 7348176\nnv_inference_compute_infer_duration_us{model=\"postprocessing\",version=\"1\"} 
3605100\n```\n- RPS (Response per Second) reaches around 65.\n- The average response time is 620 ms.\n- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)\n- Preprocessing spends 0.16% of the model inference time.\n- Postprocessing spends 0.08% of the model inference time.\n\n### Output Length: 512\n![](assets/loadtest_output_len_512_00.png)\n![](assets/loadtest_output_len_512_01.png)\n```bash\nnv_inference_count{model=\"ensemble\",version=\"1\"} 7183\nnv_inference_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 7183\nnv_inference_count{model=\"preprocessing\",version=\"1\"} 7183\nnv_inference_count{model=\"postprocessing\",version=\"1\"} 7183\n\nnv_inference_exec_count{model=\"ensemble\",version=\"1\"} 7183\nnv_inference_exec_count{model=\"codegen-350M-mono-gptj\",version=\"1\"} 465\nnv_inference_exec_count{model=\"preprocessing\",version=\"1\"} 7183\nnv_inference_exec_count{model=\"postprocessing\",version=\"1\"} 7183\n\nnv_inference_compute_infer_duration_us{model=\"ensemble\",version=\"1\"} 5764391176\nnv_inference_compute_infer_duration_us{model=\"codegen-350M-mono-gptj\",version=\"1\"} 5757320649\nnv_inference_compute_infer_duration_us{model=\"preprocessing\",version=\"1\"} 3678517\nnv_inference_compute_infer_duration_us{model=\"postprocessing\",version=\"1\"} 3384699\n```\n- RPS (Response per Second) reaches around 40.\n- The average response time is 1,600 ms.\n- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)\n- Preprocessing spends 0.06% of the model inference time.\n- Postprocessing spends 0.06% of the model inference time.\n\n## NOTE\n#### NVIDIA-Docker Configurations for K8s\nSet `default-runtime` in `/etc/docker/daemon.json`.\n```bash\n{\n\t\"default-runtime\": \"nvidia\",\n\t\"runtimes\": {\n\t  \"nvidia\": {\n\t      \"path\": \"/usr/bin/nvidia-container-runtime\",\n\t      \"runtimeArgs\": []\n\t  }\n\t}\n}\n```\nAfter configuring, restart docker: `sudo 
systemctl restart docker`\n\n## References\n- https://github.com/NVIDIA/FasterTransformer\n- https://github.com/triton-inference-server/fastertransformer_backend\n- https://github.com/triton-inference-server/python_backend\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcurt-park%2Fserving-codegen-gptj-triton","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcurt-park%2Fserving-codegen-gptj-triton","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcurt-park%2Fserving-codegen-gptj-triton/lists"}