{"id":13653165,"url":"https://github.com/ray-project/llmperf","last_synced_at":"2025-05-15T00:12:10.586Z","repository":{"id":199536208,"uuid":"703131946","full_name":"ray-project/llmperf","owner":"ray-project","description":"LLMPerf is a library for validating and benchmarking LLMs","archived":false,"fork":false,"pushed_at":"2024-12-09T01:52:28.000Z","size":224,"stargazers_count":864,"open_issues_count":49,"forks_count":152,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-14T01:50:00.946Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ray-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-10T16:49:58.000Z","updated_at":"2025-04-13T11:23:57.000Z","dependencies_parsed_at":"2023-10-10T23:17:25.071Z","dependency_job_id":"e4089b30-530f-44d2-a782-40f24b4d3dd1","html_url":"https://github.com/ray-project/llmperf","commit_stats":null,"previous_names":["ray-project/llmval","ray-project/llmperf"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fllmperf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fllmperf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fllmperf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ray-project%2Fllmperf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/ray-project","download_url":"https://codeload.github.com/ray-project/llmperf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248809024,"owners_count":21164895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:06.716Z","updated_at":"2025-04-14T01:50:09.946Z","avatar_url":"https://github.com/ray-project.png","language":"Python","readme":"# LLMPerf\n\nA tool for evaluating the performance of LLM APIs.\n\n# Installation\n```bash\ngit clone https://github.com/ray-project/llmperf.git\ncd llmperf\npip install -e .\n```\n\n# Basic Usage\n\nWe implement 2 tests for evaluating LLMs: a load test to check for performance and a correctness test to check for correctness.\n\n## Load test\n\nThe load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput per request and across concurrent requests. The prompt that is sent with each request is of the format:\n\n```\nRandomly stream lines from the following text. Don't generate eos tokens:\nLINE 1,\nLINE 2,\nLINE 3,\n...\n```\n\nwhere the lines are randomly sampled from a collection of lines from Shakespeare sonnets. Tokens are counted using the `LlamaTokenizer` regardless of which LLM API is being tested. 
This is to ensure that the prompts are consistent across different LLM APIs.\n\nTo run the most basic load test you can run the token_benchmark_ray.py script.\n\n\n### Caveats and Disclaimers\n\n- The endpoint provider's backend may vary widely, so these results do not reflect how the software runs on any particular hardware.\n- The results may vary with time of day.\n- The results may vary with the load.\n- The results may not correlate with users’ workloads.\n\n### OpenAI Compatible APIs\n```bash\nexport OPENAI_API_KEY=secret_abcdefg\nexport OPENAI_API_BASE=\"https://api.endpoints.anyscale.com/v1\"\n\npython token_benchmark_ray.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api openai \\\n--additional-sampling-params '{}'\n\n```\n\n### Anthropic\n```bash\nexport ANTHROPIC_API_KEY=secret_abcdefg\n\npython token_benchmark_ray.py \\\n--model \"claude-2\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api anthropic \\\n--additional-sampling-params '{}'\n\n```\n\n### TogetherAI\n\n```bash\nexport TOGETHERAI_API_KEY=\"YOUR_TOGETHER_KEY\"\n\npython token_benchmark_ray.py \\\n--model \"together_ai/togethercomputer/CodeLlama-7b-Instruct\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api \"litellm\" \\\n--additional-sampling-params '{}'\n\n```\n\n### Hugging Face\n\n```bash\nexport HUGGINGFACE_API_KEY=\"YOUR_HUGGINGFACE_API_KEY\"\nexport 
HUGGINGFACE_API_BASE=\"YOUR_HUGGINGFACE_API_ENDPOINT\"\n\npython token_benchmark_ray.py \\\n--model \"huggingface/meta-llama/Llama-2-7b-chat-hf\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api \"litellm\" \\\n--additional-sampling-params '{}'\n\n```\n\n### LiteLLM\n\nLLMPerf can use LiteLLM to send prompts to LLM APIs. For the environment variables to set for the provider and the arguments to set for model and additional-sampling-params, see the [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers).\n\n```bash\npython token_benchmark_ray.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api \"litellm\" \\\n--additional-sampling-params '{}'\n\n```\n\n### Vertex AI\n\nHere, --model is used for logging, not for selecting the model. 
The model is specified in the Vertex AI Endpoint ID.\n\nThe GCLOUD_ACCESS_TOKEN needs to be refreshed regularly, as the token generated by `gcloud auth print-access-token` expires after 15 minutes or so.\n\nVertex AI doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.\n\n```bash\n\ngcloud auth application-default login\ngcloud config set project YOUR_PROJECT_ID\n\nexport GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)\nexport GCLOUD_PROJECT_ID=YOUR_PROJECT_ID\nexport GCLOUD_REGION=YOUR_REGION\nexport VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID\n\npython token_benchmark_ray.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--mean-input-tokens 550 \\\n--stddev-input-tokens 150 \\\n--mean-output-tokens 150 \\\n--stddev-output-tokens 10 \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n--llm-api \"vertexai\" \\\n--additional-sampling-params '{}'\n\n```\n\n### SageMaker\n\nSageMaker doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.\n\n```bash\n\nexport AWS_ACCESS_KEY_ID=\"YOUR_ACCESS_KEY_ID\"\nexport AWS_SECRET_ACCESS_KEY=\"YOUR_SECRET_ACCESS_KEY\"\nexport AWS_SESSION_TOKEN=\"YOUR_SESSION_TOKEN\"\nexport AWS_REGION_NAME=\"YOUR_ENDPOINTS_REGION_NAME\"\n\npython llm_correctness.py \\\n--model \"llama-2-7b\" \\\n--llm-api \"sagemaker\" \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\nSee `python token_benchmark_ray.py --help` for more details on the arguments.\n\n## Correctness Test\n\nThe correctness test spawns a number of concurrent requests to the LLM API with the following format:\n\n```\nConvert the following sequence of words into a number: {random_number_in_word_format}. 
Output just your final answer.\n```\n\nwhere random_number_in_word_format could be, for example, \"one hundred and twenty three\". The test then checks that the response contains that number in digit format, which in this case would be 123.\n\nThe test does this for a number of randomly generated numbers and reports the number of responses that contain a mismatch.\n\nTo run the most basic correctness test you can run the llm_correctness.py script.\n\n### OpenAI Compatible APIs\n\n```bash\nexport OPENAI_API_KEY=secret_abcdefg\nexport OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1\n\npython llm_correctness.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--max-num-completed-requests 150 \\\n--timeout 600 \\\n--num-concurrent-requests 10 \\\n--results-dir \"result_outputs\"\n```\n\n### Anthropic\n\n```bash\nexport ANTHROPIC_API_KEY=secret_abcdefg\n\npython llm_correctness.py \\\n--model \"claude-2\" \\\n--llm-api \"anthropic\"  \\\n--max-num-completed-requests 5 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\"\n```\n\n### TogetherAI\n\n```bash\nexport TOGETHERAI_API_KEY=\"YOUR_TOGETHER_KEY\"\n\npython llm_correctness.py \\\n--model \"together_ai/togethercomputer/CodeLlama-7b-Instruct\" \\\n--llm-api \"litellm\" \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\n### Hugging Face\n\n```bash\nexport HUGGINGFACE_API_KEY=\"YOUR_HUGGINGFACE_API_KEY\"\nexport HUGGINGFACE_API_BASE=\"YOUR_HUGGINGFACE_API_ENDPOINT\"\n\npython llm_correctness.py \\\n--model \"huggingface/meta-llama/Llama-2-7b-chat-hf\" \\\n--llm-api \"litellm\" \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\n### LiteLLM\n\nLLMPerf can use LiteLLM to send prompts to LLM APIs. 
For the environment variables to set for the provider and the arguments to set for model and additional-sampling-params, see the [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers).\n\n```bash\npython llm_correctness.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--llm-api \"litellm\" \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\nSee `python llm_correctness.py --help` for more details on the arguments.\n\n\n### Vertex AI\n\nHere, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.\n\nThe GCLOUD_ACCESS_TOKEN needs to be refreshed regularly, as the token generated by `gcloud auth print-access-token` expires after 15 minutes or so.\n\nVertex AI doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.\n\n\n```bash\n\ngcloud auth application-default login\ngcloud config set project YOUR_PROJECT_ID\n\nexport GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)\nexport GCLOUD_PROJECT_ID=YOUR_PROJECT_ID\nexport GCLOUD_REGION=YOUR_REGION\nexport VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID\n\npython llm_correctness.py \\\n--model \"meta-llama/Llama-2-7b-chat-hf\" \\\n--llm-api \"vertexai\" \\\n--max-num-completed-requests 2 \\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\n### SageMaker\n\nSageMaker doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.\n\n```bash\n\nexport AWS_ACCESS_KEY_ID=\"YOUR_ACCESS_KEY_ID\"\nexport AWS_SECRET_ACCESS_KEY=\"YOUR_SECRET_ACCESS_KEY\"\nexport AWS_SESSION_TOKEN=\"YOUR_SESSION_TOKEN\"\nexport AWS_REGION_NAME=\"YOUR_ENDPOINTS_REGION_NAME\"\n\npython llm_correctness.py \\\n--model \"llama-2-7b\" \\\n--llm-api \"sagemaker\" \\\n--max-num-completed-requests 2 
\\\n--timeout 600 \\\n--num-concurrent-requests 1 \\\n--results-dir \"result_outputs\" \\\n\n```\n\n## Saving Results\n\nThe results of the load test and correctness test are saved in the results directory specified by the `--results-dir` argument. The results are saved in 2 files, one with the summary metrics of the test, and one with metrics from each individual request that is returned.\n\n# Advanced Usage\n\nThe correctness tests were implemented with the following workflow in mind:\n\n```python\nimport ray\nfrom transformers import LlamaTokenizerFast\n\nfrom llmperf.ray_clients.openai_chat_completions_client import (\n    OpenAIChatCompletionsClient,\n)\nfrom llmperf.models import RequestConfig\nfrom llmperf.requests_launcher import RequestsLauncher\n\n\n# Copying the environment variables and passing them to ray.init() is necessary\n# For making any clients work.\nray.init(runtime_env={\"env_vars\": {\"OPENAI_API_BASE\" : \"https://api.endpoints.anyscale.com/v1\",\n                                   \"OPENAI_API_KEY\" : \"YOUR_API_KEY\"}})\n\nbase_prompt = \"hello_world\"\ntokenizer = LlamaTokenizerFast.from_pretrained(\n    \"hf-internal-testing/llama-tokenizer\"\n)\nbase_prompt_len = len(tokenizer.encode(base_prompt))\nprompt = (base_prompt, base_prompt_len)\n\n# Create a client for spawning requests\nclients = [OpenAIChatCompletionsClient.remote()]\n\nreq_launcher = RequestsLauncher(clients)\n\nreq_config = RequestConfig(\n    model=\"meta-llama/Llama-2-7b-chat-hf\",\n    prompt=prompt\n    )\n\nreq_launcher.launch_requests(req_config)\nresult = req_launcher.get_next_ready(block=True)\nprint(result)\n\n```\n\n# Implementing New LLM Clients\n\nTo implement a new LLM client, you need to implement the base class `llmperf.ray_llm_client.LLMClient` and decorate it as a ray actor.\n\n```python\n\nfrom llmperf.ray_llm_client import LLMClient\nimport ray\n\n\n@ray.remote\nclass CustomLLMClient(LLMClient):\n\n    def llm_request(self, request_config: RequestConfig) 
-\u003e Tuple[Metrics, str, RequestConfig]:\n        \"\"\"Make a single completion request to an LLM API\n\n        Returns:\n            Metrics about the performance characteristics of the request.\n            The text generated by the request to the LLM API.\n            The request_config used to make the request. This is mainly for logging purposes.\n\n        \"\"\"\n        ...\n\n```\n\n# Legacy Codebase\nThe old LLMPerf code base can be found in the [llmperf-legacy](https://github.com/ray-project/llmval-legacy) repo.\n","funding_links":[],"categories":["Datasets-or-Benchmark","Projects","Evaluation and Monitoring","Python","A01_文本生成_文本对话","Models and Projects"],"sub_categories":["推理速度","📋 Others","大语言对话模型及数据","Ray + LLM"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Fllmperf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fray-project%2Fllmperf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fray-project%2Fllmperf/lists"}