{"id":20435726,"url":"https://github.com/clam004/triton-ft-api","last_synced_at":"2025-08-27T15:05:08.466Z","repository":{"id":64134612,"uuid":"572712447","full_name":"clam004/triton-ft-api","owner":"clam004","description":"tutorial on how to deploy a scalable autoregressive causal language model transformer using nvidia triton server ","archived":false,"fork":false,"pushed_at":"2022-12-03T21:26:39.000Z","size":54,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T15:06:51.637Z","etag":null,"topics":["fastapi","fastertransformer","gpt","huggingface","nvidia","nvidia-docker","nvidia-gpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clam004.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-30T21:44:23.000Z","updated_at":"2023-06-29T20:54:08.000Z","dependencies_parsed_at":"2023-01-14T23:45:45.680Z","dependency_job_id":null,"html_url":"https://github.com/clam004/triton-ft-api","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/clam004/triton-ft-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clam004%2Ftriton-ft-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clam004%2Ftriton-ft-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clam004%2Ftriton-ft-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clam004%2Ftriton-ft-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/clam004","download_url":"https://codeload.github.com/clam004/triton-ft-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clam004%2Ftriton-ft-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272342402,"owners_count":24917620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-27T02:00:09.397Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","fastertransformer","gpt","huggingface","nvidia","nvidia-docker","nvidia-gpu"],"created_at":"2024-11-15T08:37:09.117Z","updated_at":"2025-08-27T15:05:08.431Z","avatar_url":"https://github.com/clam004.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# triton-fastertransformer-api\n\ntutorial on how to deploy a scalable autoregressive causal language model transformer as a fastAPI endpoint using nvidia triton server and fastertransformer backend\n\nthe primary value added is that in addition to simplifying and explaining for the beginner machine learning engineer what is happening in the [NVIDIA blog on triton inference server with faster transformer backend](https://developer.nvidia.com/blog/deploying-gpt-j-and-t5-with-fastertransformer-and-triton-inference-server/) we also do a controlled before and after comparison on a realistic RESTful API that you can put into 
production\n\n## Step By Step Instructions\n\nenter into your terminal within your desired directory:\n\n```\n1. git clone https://github.com/triton-inference-server/fastertransformer_backend.git\n2. cd fastertransformer_backend\n```\n\nyou may not need `sudo` for the docker commands below\n\nin `docker build` you can choose a different triton version, for example `--build-arg TRITON_VERSION=22.05`. you can also change the tag `triton_with_ft:22.07` to whatever image name you want, as long as you pass that same name to `docker run` (the `docker run` commands in this tutorial use `triton_ft:v1`)\n\nto rebuild the image from scratch when you already have another similar image, add `--no-cache` to `docker build`\n\nif you have multiple GPUs, in the `docker run` commands (step 4 and later) you can use `--gpus device=1` instead of `--gpus=all` to place this triton server only on the 2nd GPU instead of distributing it across all GPUs\n\nport 8000 is used to send HTTP requests to triton, 8001 is used for gRPC requests, which are apparently faster than HTTP requests, and 8002 is used for monitoring. I use 8001. to expose gRPC on port 2001 of your VM use `-p 2001:8001`; you may need to remap like this just to land on an available port.\n\n```\n3. sudo docker build --rm --no-cache --build-arg TRITON_VERSION=22.07 -t triton_with_ft:22.07 -f docker/Dockerfile .\n4. sudo docker run -it --rm --gpus=all --shm-size=4G  -v /path/to/fastertransformer_backend:/ft_workspace -p 8001:8001 -p 8002:8002 triton_ft:v1 bash\n```\n\nexample:\n```\nsudo docker run -it --rm --gpus device=1 --shm-size=4G  -v /home/carson/fastertransformer_backend:/ft_workspace -p 2001:8001 -p 2002:8002 triton_ft:v1 bash\n```\n\nthe next steps are run from within the bash session started in step 4\n\n```\n6. cd /ft_workspace\n7. git clone https://github.com/NVIDIA/FasterTransformer.git\n8. cd FasterTransformer\n```\n\nopen `CMakeLists.txt` (for example with vim) and change `set(PYTHON_PATH \"python\" CACHE STRING \"Python path\")` to `set(PYTHON_PATH \"python3\")`. in step 11, replace `xx` in `-DSM=xx` with the compute capability of your GPU (for example 80 for an A100)\n\n```\n9. mkdir -p build\n10. cd build\n11. cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..\n12. make -j32\n13. 
cd /ft_workspace\n14. mkdir models\n15. cd models\n16. python3\n17. from transformers import GPT2LMHeadModel\n18. model = GPT2LMHeadModel.from_pretrained('gpt2')\n19. model.save_pretrained('./gpt2')\n20. exit()\n```\n\nyou may have to run the next step again if the first attempt fails, in the example below we create a folder of binaries for each layer in `/ft_workspace/all_models/gpt/fastertransformer/1/1-gpu`\n\n```\n21. cd /ft_workspace\n22. python3 ./FasterTransformer/examples/pytorch/gpt/utils/huggingface_gpt_convert.py -o ./all_models/gpt/fastertransformer/1/ -i ./models/gpt2 -i_g 1\n23. cd /ft_workspace/all_models/gpt/fastertransformer\n24. vim config.pbtxt\n```\n\nusing the information in `/ft_workspace/all_models/gpt/fastertransformer/1/1-gpu/config.ini`, update config.pbtxt, for example\n\n```\nname: \"fastertransformer\"\nbackend: \"fastertransformer\"\ndefault_model_filename: \"gpt2-med\"\nmax_batch_size: 1024\n\nmodel_transaction_policy {\n  decoupled: False\n}\n\ninput [\n  {\n    name: \"input_ids\"\n    data_type: TYPE_UINT32\n    dims: [ -1 ]\n  },\n  {\n    name: \"input_lengths\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n  },\n  {\n    name: \"request_output_len\"\n    data_type: TYPE_UINT32\n    dims: [ -1 ]\n  },\n  {\n    name: \"runtime_top_k\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"runtime_top_p\"\n    data_type: TYPE_FP32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"beam_search_diversity_rate\"\n    data_type: TYPE_FP32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"temperature\"\n    data_type: TYPE_FP32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"len_penalty\"\n    data_type: TYPE_FP32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"repetition_penalty\"\n   
 data_type: TYPE_FP32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"random_seed\"\n    data_type: TYPE_UINT64\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"is_return_log_probs\"\n    data_type: TYPE_BOOL\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"beam_width\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"start_id\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"end_id\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"stop_words_list\"\n    data_type: TYPE_INT32\n    dims: [ 2, -1 ]\n    optional: true\n  },\n  {\n    name: \"bad_words_list\"\n    data_type: TYPE_INT32\n    dims: [ 2, -1 ]\n    optional: true\n  },\n  {\n    name: \"prompt_learning_task_name_ids\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"request_prompt_embedding\"\n    data_type: TYPE_FP16\n    dims: [ -1, -1 ]\n    optional: true\n  },\n  {\n    name: \"request_prompt_lengths\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  },\n  {\n    name: \"request_prompt_type\"\n    data_type: TYPE_UINT32\n    dims: [ 1 ]\n    reshape: { shape: [ ] }\n    optional: true\n  }\n]\noutput [\n  {\n    name: \"output_ids\"\n    data_type: TYPE_UINT32\n    dims: [ -1, -1 ]\n  },\n  {\n    name: \"sequence_length\"\n    data_type: TYPE_UINT32\n    dims: [ -1 ]\n  },\n  {\n    name: \"cum_log_probs\"\n    data_type: TYPE_FP32\n    dims: [ -1 ]\n  },\n  {\n    name: \"output_log_probs\"\n    data_type: TYPE_FP32\n    dims: [ -1, -1 ]\n  }\n]\ninstance_group [\n  {\n    count: 2\n    kind : KIND_CPU\n  }\n]\nparameters {\n  key: \"tensor_para_size\"\n  
value: {\n    string_value: \"1\"\n  }\n}\nparameters {\n  key: \"pipeline_para_size\"\n  value: {\n    string_value: \"1\"\n  }\n}\nparameters {\n  key: \"data_type\"\n  value: {\n    string_value: \"fp16\"\n  }\n}\nparameters {\n  key: \"model_type\"\n  value: {\n    string_value: \"GPT\"\n  }\n}\nparameters {\n  key: \"model_checkpoint_path\"\n  value: {\n    string_value: \"/ft_workspace/all_models/gpt/fastertransformer/1/1-gpu/\"\n  }\n}\nparameters {\n  key: \"int8_mode\"\n  value: {\n    string_value: \"0\"\n  }\n}\nparameters {\n  key: \"enable_custom_all_reduce\"\n  value: {\n    string_value: \"0\"\n  }\n}\n```\n\nrun the triton server\n\n```\n24. CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver --log-warning false --model-repository=/ft_workspace/all_models/gpt/\n```\n\nyou can also start the server from outside the container. first exit the bash session:\n\n```\nexit\n```\n\nyou are now back on the host, no longer in the bash session. from within `/path/to/fastertransformer_backend/` run the command below, or replace `$(pwd)` with the full path `/path/to/fastertransformer_backend/` and run it from anywhere:\n\n```\n24. sudo docker run -it --rm --gpus=all --shm-size=4G  -v $(pwd):/ft_workspace -p 2001:8001 -p 2002:8002 triton_ft:v1 /opt/tritonserver/bin/tritonserver --log-warning false --model-repository=/ft_workspace/all_models/gpt/\n```\n\nfor example\n\n```\n24. 
sudo docker run -it --rm --gpus device=1 --shm-size=4G  -v /home/carson/fastertransformer_backend:/ft_workspace -p 2001:8001 -p 2002:8002 triton_ft:v1 /opt/tritonserver/bin/tritonserver --log-warning false --model-repository=/ft_workspace/all_models/gpt/\n```\n\nkeep this terminal open and do not exit it. a successful deployment ends with output like:\n\n```\nI1203 05:18:03.283727 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001\nI1203 05:18:03.283981 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000\nI1203 05:18:03.326952 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002\n```\n\nin another terminal, if you run `nvidia-smi` you will see that the model has been loaded onto your GPU or GPUs. if you exit the server above, the model will be unloaded, so this step is meant to keep running for as long as you need the triton server. with docker compose and `restart: \"unless-stopped\"` you can keep it running for as long as the VM is on\n\nthe docker compose equivalent of step 24 is below, where `HOST_FT_MODEL_REPO` is an environment variable holding the host path of `fastertransformer_backend`:\n\n```\nversion: \"2.3\"\nservices:\n\n  fastertransformer:\n    restart: \"unless-stopped\"\n    image: triton_with_ft:22.07\n    runtime: nvidia\n    ports:\n      - 2001:8001\n    shm_size: 4gb\n    volumes:\n      - ${HOST_FT_MODEL_REPO}:/ft_workspace\n    command: /opt/tritonserver/bin/tritonserver --log-warning false --model-repository=/ft_workspace/all_models/gpt/\n    deploy:\n      resources:\n        reservations:\n          devices:\n          - driver: nvidia\n            device_ids: ['0']\n            capabilities: [gpu]\n```\n\nto set up the python environment for the client code in this repo:\n\n```\npython3 -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip\npip install -r requirements.txt\n```\n\nif an input's data type does not match what the model expects, the client fails with an error like:\n\n```\nTraceback (most recent call last):\n  File \"utils.py\", line 90, in \u003cmodule\u003e\n    print(generate_text('the first rule of robotics is','20.112.126.140:2001'))\n  File \"utils.py\", line 84, in generate_text\n    result = client.infer(MODEl_GPT, inputs)\n  File 
\"/home/carson/triton-ft-api/.venv/lib/python3.8/site-packages/tritonclient/grpc/__init__.py\", line 1431, in infer\n    raise_error_grpc(rpc_error)\n  File \"/home/carson/triton-ft-api/.venv/lib/python3.8/site-packages/tritonclient/grpc/__init__.py\", line 62, in raise_error_grpc\n    raise get_error_grpc(rpc_error) from None\ntritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] inference input data-type is 'UINT64', model expects 'INT32' for 'ensemble'\n```\n\n20.112.126.140:1337 in this example is the gRPC port \n\n```\nprint(generate_text('the first rule of robotics is','20.112.126.140:1337'))\n```\n\n```\ntritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] [request id: \u003cid_unknown\u003e] inference input data-type is 'INT32', model expects 'UINT64' for 'ensemble'\n```\n\nthe above error refers to the type of these lines\n\n```\nrandom_seed = (100 * np.random.rand(input0_data.shape[0], 1)).astype(np.uint64)\n#random_seed = (100 * np.random.rand(input0_data.shape[0], 1)).astype(np.int32) \n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclam004%2Ftriton-ft-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclam004%2Ftriton-ft-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclam004%2Ftriton-ft-api/lists"}