# Inference Benchmark

A model-server-agnostic inference benchmarking tool that can be used to
benchmark LLMs running on different infrastructure, such as GPUs and TPUs. It
can also be run on a GKE cluster as a container.

## Run the benchmark

1. Create a Python virtualenv.

2. Install all the prerequisite packages.

```
pip install -r requirements.txt
```

3. Set your Hugging Face token as an environment variable.

```
export HF_TOKEN=<your-huggingface-token>
```

4. Download the ShareGPT dataset.

```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

5. Run the benchmarking script directly with a specific request rate.

```
python3 benchmark_serving.py --save-json-results --host=$IP --port=$PORT --dataset=$PROMPT_DATASET_FILE --tokenizer=$TOKENIZER --request-rate=$REQUEST_RATE --backend=$BACKEND --num-prompts=$NUM_PROMPTS --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH --file-prefix=$FILE_PREFIX
```

6. Generate a full latency profile, which produces latency and throughput data
   at different request rates.

```
./latency_throughput_curve.sh
```

## Run on a Kubernetes cluster

1. Build a container to run the benchmark directly on a Kubernetes cluster
using the provided Dockerfile.

```
docker build -t inference-benchmark .
```

2. Create a repository in Artifact Registry so you can push the image there and use it on your cluster.

```
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

3. Push the image to that repository.

```
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```

4. Update the image name in `deploy/deployment.yaml` to `us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark`.

5. Deploy and run the benchmark.

```
kubectl apply -f deploy/deployment.yaml
```

6. Get the benchmarking data by looking at the logs of the deployment.

```
kubectl logs deployment/latency-profile-generator
```

7. To download the full report, list the files in the container and copy the report out.
If you specify a GCS bucket, the report will be uploaded there automatically.

```
kubectl exec <latency-profile-generator-pod-name> -- ls
kubectl cp <latency-profile-generator-pod-name>:benchmark-<timestamp>.json report.json
```

8. Delete the benchmarking deployment.

```
kubectl delete -f deploy/deployment.yaml
```

## Configuring the Benchmark

The benchmarking script accepts the following flags. All of them are also exposed as environment variables that you can configure in `deploy/deployment.yaml`.

* `--backend`:
    * Type: `str`
    * Default: `"vllm"`
    * Choices: `["vllm", "tgi", "naive_transformers", "tensorrt_llm_triton", "sax", "jetstream"]`
    * Description: Specifies the backend model server to benchmark.
* `--file-prefix`:
    * Type: `str`
    * Default: `"benchmark"`
    * Description: Prefix for output files.
* `--endpoint`:
    * Type: `str`
    * Default: `"generate"`
    * Description: The endpoint to send requests to.
* `--host`:
    * Type: `str`
    * Default: `"localhost"`
    * Description: The host address of the server.
* `--port`:
    * Type: `int`
    * Default: `7080`
    * Description: The port number of the server.
* `--dataset`:
    * Type: `str`
    * Description: Path to the dataset. The default dataset used is ShareGPT from Hugging Face.
* `--models`:
    * Type: `str`
    * Description: Comma-separated list of models to benchmark.
* `--traffic-split`:
    * Type: parsed traffic split (comma-separated list of floats that sum to 1.0)
    * Default: `None`
    * Description: Comma-separated list of traffic-split proportions for the models, e.g. `'0.9,0.1'`. The values must sum to 1.0.
* `--stream-request`:
    * Action: `store_true`
    * Description: Whether to stream the request. Required for the TTFT metric.
* `--request-timeout`:
    * Type: `float`
    * Default: `3.0 * 60.0 * 60.0` (3 hours)
    * Description: Timeout for an individual request.
* `--tokenizer`:
    * Type: `str`
    * Required: `True`
    * Description: Name or path of the tokenizer. You can specify a model's Hugging Face model ID to use that model's tokenizer.
* `--num-prompts`:
    * Type: `int`
    * Default: `1000`
    * Description: Number of prompts to process.
* `--max-input-length`:
    * Type: `int`
    * Default: `1024`
    * Description: Maximum number of input tokens, used when filtering the benchmark dataset.
* `--max-output-length`:
    * Type: `int`
    * Default: `1024`
    * Description: Maximum number of output tokens.
* `--request-rate`:
    * Type: `float`
    * Default: `float("inf")`
    * Description: Number of requests per second. If this is `inf`, all requests are sent at time 0. Otherwise, a Poisson process is used to synthesize the request arrival times.
* `--save-json-results`:
    * Action: `store_true`
    * Description: Whether to save benchmark results to a JSON file.
* `--output-bucket`:
    * Type: `str`
    * Default: `None`
    * Description: Specifies the Google Cloud Storage bucket to which JSON-format results will be uploaded. If not provided, no upload occurs.
* `--output-bucket-filepath`:
    * Type: `str`
    * Default: `None`
    * Description: Specifies the destination path within the bucket provided by `--output-bucket` for uploading the JSON results. This argument requires `--output-bucket` to be set. If not specified, results are uploaded to the root of the bucket. If the filepath doesn't exist, it will be created for you.
* `--additional-metadata-metrics-to-save`:
    * Type: `str`
    * Description: Additional metadata about the workload. Should be a dictionary in the form of a string.
* `--scrape-server-metrics`:
    * Action: `store_true`
    * Description: Whether to scrape server metrics.
* `--pm-namespace`:
    * Type: `str`
    * Default: `default`
    * Description: Namespace of the pod-monitoring object; ignored if `--scrape-server-metrics` is not set.
* `--pm-job`:
    * Type: `str`
    * Default: `vllm-podmonitoring`
    * Description: Name of the pod-monitoring object; ignored if `--scrape-server-metrics` is not set.
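
As a concrete illustration of combining these flags, the command below sketches a run that benchmarks two models with a 90/10 traffic split and uploads results to a GCS bucket. The model names, tokenizer, bucket, and `$IP`/`$PORT` values are placeholders, not values prescribed by this tool.

```
python3 benchmark_serving.py --save-json-results \
  --host=$IP --port=$PORT \
  --dataset=ShareGPT_V3_unfiltered_cleaned_split.json \
  --tokenizer=<tokenizer-model-id> \
  --models='model-a,model-b' --traffic-split='0.9,0.1' \
  --request-rate=5 --num-prompts=1000 \
  --output-bucket=<your-gcs-bucket>
```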
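
The `--traffic-split` rule (comma-separated floats that must sum to 1.0) can be sketched in Python. `parse_traffic_split` is a hypothetical helper for illustration, not part of the tool's actual code:

```
def parse_traffic_split(spec: str) -> list[float]:
    """Parse a spec like '0.9,0.1' into per-model weights.

    Raises ValueError if the weights do not sum to 1.0 (within a small
    tolerance, since the values arrive as decimal strings).
    """
    weights = [float(w) for w in spec.split(",")]
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError(f"traffic split must sum to 1.0, got {sum(weights)}")
    return weights
```

For example, `parse_traffic_split('0.9,0.1')` yields `[0.9, 0.1]`, while `'0.5,0.2'` is rejected.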
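
The arrival-time behavior described for `--request-rate` can be sketched as follows. This is an illustrative reimplementation of the stated semantics (exponential inter-arrival gaps give a Poisson process; an infinite rate sends everything at time 0), not the script's actual code:

```
import random


def poisson_arrival_times(num_prompts: int, request_rate: float,
                          seed: int = 0) -> list[float]:
    """Synthesize request send times, in seconds from the start of the run."""
    rng = random.Random(seed)
    times = []
    t = 0.0
    for _ in range(num_prompts):
        times.append(t)
        if request_rate != float("inf"):
            # Exponential gaps with mean 1 / request_rate => Poisson arrivals.
            t += rng.expovariate(request_rate)
    return times
```

With `request_rate=float("inf")` every entry is `0.0`; with a finite rate the times are strictly increasing and average `1 / request_rate` apart.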