{"id":14701446,"url":"https://github.com/jdf-prog/LLM-Engines","last_synced_at":"2025-09-10T09:31:06.145Z","repository":{"id":247537115,"uuid":"826132891","full_name":"jdf-prog/LLM-Engines","owner":"jdf-prog","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-07T03:11:52.000Z","size":124,"stargazers_count":50,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-02T21:06:53.172Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jdf-prog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-09T06:36:20.000Z","updated_at":"2025-06-25T16:51:49.000Z","dependencies_parsed_at":"2024-08-27T01:33:53.582Z","dependency_job_id":"c54e6a81-c87e-4262-ba55-086b45dbd342","html_url":"https://github.com/jdf-prog/LLM-Engines","commit_stats":null,"previous_names":["jdf-prog/llm-engines"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/jdf-prog/LLM-Engines","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdf-prog%2FLLM-Engines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdf-prog%2FLLM-Engines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdf-prog%2FLLM-Engines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdf-prog%2FLLM-Engines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jdf-prog","download
_url":"https://codeload.github.com/jdf-prog/LLM-Engines/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdf-prog%2FLLM-Engines/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274440640,"owners_count":25285735,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-13T12:00:48.395Z","updated_at":"2025-09-10T09:31:06.132Z","avatar_url":"https://github.com/jdf-prog.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","推理 Inference"],"sub_categories":["大语言对话模型及数据"],"readme":"# LLM-Engines\n\n[Author: Dongfu Jiang](https://jdf-prog.github.io/), [Twitter](https://x.com/DongfuJiang/status/1833730295696334925), [PyPI Package](https://pypi.org/project/llm-engines/)\n\nA unified inference engine for large language models (LLMs) including open-source models (VLLM, SGLang, Together) and commercial models (OpenAI, Mistral, Claude).\n\nThe correctness of the inference has been verified by comparing the outputs of the models with different engines when `temperature=0.0` and `max_tokens=None`.\nFor example, the outputs of a single model using 3 engines (VLLM, SGLang, Together) will be the same when `temperature=0.0` and `max_tokens=None`.\nTry examples below to see the outputs of 
different engines.\n\n## News\n- 2025-03-03: Support `sleep` for vllm models. See [Sleep Mode](#sleep-mode) for more details.\n- 2025-02-23: Support for vision input for all engines. See [Vision Input](#vision-input) for more details.\n- 2025-02-19: Add support for `fireworks` api services, which serve deepseek-r1 models at high speed.\n- 2025-02-18: Add support for `grok` models.\n\n## Installation\nWe recommend using `uv` to manage the environment due to its fast installation speed.  \n```bash\npip install llm-engines # or \n# pip install git+https://github.com/jdf-prog/LLM-Engines.git\npip install flash-attn --no-build-isolation\n```\nIf you want to use SGLang, you need to install it separately:\n```bash\npip install \"sglang[all]\u003e=0.4.6.post5\"\n```\n\nFor development:\n```bash\npip install -e . # for development\n```\n\n## Usage\n\n### Engines\n- use vllm or sglang \n```python\nfrom llm_engines import LLMEngine\nmodel_name=\"Qwen/Qwen2.5-0.5B-Instruct\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    num_workers=1, # number of workers\n    num_gpu_per_worker=1, # tensor parallelism size for each worker\n    engine=\"vllm\", # or \"sglang\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- use together\n```python\n# export TOGETHER_API_KEY=\"your_together_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"meta-llama/Llama-3-8b-chat-hf\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name, \n    engine=\"together\", # or \"openai\", \"mistral\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- openai models\n```python\n# export OPENAI_API_KEY=\"your_openai_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"gpt-3.5-turbo\"\nllm = LLMEngine()\nllm.load_model(\n    
model_name=model_name, \n    engine=\"openai\", # or \"vllm\", \"together\", \"mistral\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- grok models\n```python\n# export XAI_API_KEY=\"your_xai_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"grok-2-latest\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    engine=\"grok\", # or \"vllm\", \"together\", \"mistral\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- mistral models\n```python\n# export MISTRAL_API_KEY=\"your_mistral_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"mistral-large-latest\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    engine=\"mistral\", # or \"vllm\", \"together\", \"openai\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- claude models\n```python\n# export ANTHROPIC_API_KEY=\"your_claude_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"claude-3-opus-20240229\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    engine=\"claude\", # or \"vllm\", \"together\", \"openai\", \"mistral\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- gemini models\n```python\n# export GEMINI_API_KEY=\"your_gemini_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"gemini-1.5-flash\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    engine=\"gemini\", # or \"vllm\", \"together\", \"openai\", \"mistral\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", 
temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n- fireworks api\n```python\n# export FIREWORKS_API_KEY=\"your_fireworks_api_key\"\nfrom llm_engines import LLMEngine\nmodel_name=\"accounts/fireworks/models/deepseek-r1\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    engine=\"fireworks\", # or \"vllm\", \"together\", \"openai\", \"mistral\", \"claude\"\n    use_cache=False\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n### Unload model\nRemember to unload the model after using it to free up resources. By default, all workers are unloaded when the program exits. If you want to use different models in the same program and the new model needs GPU resources, unload the current model before loading the new one.\n```python\nllm.unload_model(model_name) # unload all the workers named model_name\nllm.unload_model() # unload all the workers\n```\n\n### Multi-turn conversation\n```python\nfrom llm_engines import LLMEngine\nmodel_name=\"Qwen/Qwen2.5-0.5B-Instruct\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=\"Qwen/Qwen2.5-0.5B-Instruct\", \n    num_workers=1, # number of workers\n    num_gpu_per_worker=1, # tensor parallelism size for each worker\n    engine=\"vllm\", # or \"sglang\"\n    use_cache=False\n)\nmessages = [\n    \"Hello\", # user message \n    \"Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?\", # previous model response\n    \"What is the capital of France?\" # user message\n]\n# or you can use openai's multi-turn conversation format. \nmessages = [\n    {\"role\": \"user\", \"content\": \"Hello\"}, # user message \n    {\"role\": \"assistant\", \"content\": \"Hello! It's nice to meet you. 
Is there something I can help you with, or would you like to chat?\"}, # previous model response\n    {\"role\": \"user\", \"content\": \"What is the capital of France?\"} # user message\n]\nresponse = llm.call_model(model_name, messages, temperature=0.0, max_tokens=None)\nprint(response)\n```\nThe messages should be in one of these formats:\n- `[user_message, model_response, user_message, model_response, ...]`\n- openai's multi-turn conversation format.\n\n### Vision Input\n```python\nfrom llm_engines import LLMEngine\nfrom PIL import Image\nimport requests\nfrom io import BytesIO\nresponse = requests.get(\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\")\nimage = Image.open(BytesIO(response.content)).resize((256, 256))\nimage.save(\"./test.jpg\")\nmessages_with_image = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"text\",\n                \"text\": \"What's in the image?\"\n            },\n            {\n                \"type\": \"image\",\n                \"image\": image\n            }\n        ]\n    }\n] # the 'image' type is not the official format of the openai API; LLM-Engines converts it into the image_url type internally\nmessages_with_image_url = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"text\",\n                \"text\": \"What's in the image?\"\n            },\n            {\n                \"type\": \"image_url\",\n                \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"}\n            }\n        ]\n    }\n] # the 'image_url' type is the official format of the openai API\nadditional_args=[]\n# engine=\"openai\"; model_name=\"gpt-4o-mini\"\n# engine=\"claude\"; 
model_name=\"claude-3-5-sonnet-20241022\"\n# engine=\"gemini\"; model_name=\"gemini-2.0-flash\"\n# engine=\"grok\"; model_name=\"grok-2-vision-latest\"\n# engine=\"sglang\"; model_name=\"meta-llama/Llama-3.2-11B-Vision-Instruct\"; additional_args=[\"--chat-template=llama_3_vision\"] # refer to \nengine=\"vllm\"; model_name=\"microsoft/Phi-3.5-vision-instruct\"; additional_args=[\"--limit-mm-per-prompt\", \"image=2\", \"--max-model-len\", \"4096\"] # refer to vllm serve api\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name, \n    engine=engine, # or \"vllm\", \"together\", \"mist\n    use_cache=False,\n    additional_args=additional_args,\n)\nresponse = llm.call_model(model_name, messages_with_image, temperature=0.0, max_tokens=None)\nprint(response)\nresponse = llm.call_model(model_name, messages_with_image_url, temperature=0.0, max_tokens=None)\nprint(response)\n```\n\n### Sleep Mode\nWe support vllm's sleep mode if you want to save the GPU resources when the model is not used. 
(requires `vllm\u003e=0.7.3`)\n```python\nimport time\nfrom llm_engines import LLMEngine\nmodel_name=\"Qwen/Qwen2.5-0.5B-Instruct\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=model_name,\n    num_workers=1, # number of workers\n    num_gpu_per_worker=1, # tensor parallelism size for each worker\n    engine=\"vllm\", # or \"sglang\"\n    use_cache=False,\n    additional_args=[\"--enable-sleep-mode\"] # enable sleep mode\n)\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\nprint(response)\nllm.sleep_model(model_name) # put all the instances named model_name to sleep\ntime.sleep(20) # check your GPU usage, it should be almost 0\nllm.wake_up_model(model_name) # wake up all the instances named model_name\nresponse = llm.call_model(model_name, \"What is the capital of France?\", temperature=0.0, max_tokens=None)\n```\n\n### Batch inference\n```python\nfrom llm_engines import LLMEngine\nmodel_name=\"Qwen/Qwen2.5-0.5B-Instruct\"\nllm = LLMEngine()\nllm.load_model(\n    model_name=\"Qwen/Qwen2.5-0.5B-Instruct\", \n    num_workers=1, # number of workers\n    num_gpu_per_worker=1, # tensor parallelism size for each worker\n    engine=\"vllm\", # or \"sglang\"\n    use_cache=False\n)\nbatch_messages = [\n    \"Hello\", # user message \n    \"Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?\", # previous model response\n    \"What is the capital of France?\" # user message\n] * 100\nresponse = llm.batch_call_model(model_name, batch_messages, num_proc=32, temperature=0.0, max_tokens=None)\nprint(response)\n# List of responses [response1, response2, ...]\n```\nExample inference file: [`./examples/batch_inference_wildchat.py`](./examples/batch_inference_wildchat.py)\n```bash\npython examples/batch_inference_wildchat.py\n```\n\n**OpenAI Batch API**\nWith the above code, the batch API is used automatically for openai models. 
If you don't want to use the batch API and still want to use the normal API, set `disable_batch_api=True` when loading the model. `num_proc` will be ignored when using the batch API.\n\nOpenAI's batch API costs half the price of the normal API. The batch API is only available for the models with `max_batch_size \u003e 1`.\n\nLLM-Engines calculates the hash of the inputs and generation parameters, and only sends new batch requests if the inputs and generation parameters are different from the previous requests. You can check a list of requested batch information in the [`~/llm_engines/generation_cache/openai_batch_cache/batch_submission_status.json`](~/llm_engines/generation_cache/openai_batch_cache/batch_submission_status.json) file.\n\n### Parallel inference through huggingface dataset map\nCheck out [`./examples/mp_inference_wildchat.py`](./examples/mp_inference_wildchat.py) for parallel inference with multiple models.\n```bash\npython examples/mp_inference_wildchat.py\n```\n\n### Cache\n\nIf `use_cache=True`, all the queries and responses are cached in the `generation_cache` folder, so no duplicate queries are sent to the model.\nThe cache of each model is saved to `generation_cache/{model_name}.jsonl`\n\nExample items in the cache:\n```json\n{\"cb0b4aaf80c43c9973aefeda1bd72890\": {\"input\": [\"What is the capital of France?\"], \"output\": \"The capital of France is Paris.\"}}\n```\nThe hash key here is the hash of the concatenated inputs.\n\n### Chat template\nFor each open-source model, we use the default chat template as follows:\n```python\nprompt = self.tokenizer.apply_chat_template(\n    messages, \n    add_generation_prompt=add_generation_prompt,\n    tokenize=False,\n    chat_template=chat_template,\n)\n```\nThere will be errors if the model does not support the chat template. 
\n\n### Worker initialization parameters (`load_model`)\n- `model_name`: the model name, e.g., \"Qwen/Qwen2.5-0.5B-Instruct\" (required)\n- `worker_addrs`: the list of existing worker addresses to use; if not provided, a new worker will be launched (default: None)\n- `num_workers`: the number of workers to use for the model (default: 1)\n- `num_gpu_per_worker`: the number of GPUs to use for each worker (default: None)\n- `engine`: the engine to use, one of {vllm, sglang, together, openai, mistral, claude, gemini, grok, fireworks} (default: \"vllm\")\n- `additional_args`: list of str, additional arguments for launching the (vllm, sglang) worker, e.g., `[\"--max-model-len\", \"65536\"]` (default: [])\n- `use_cache`: whether to use the cache for the queries and responses (default: True)\n- `cache_dir`: the cache directory, env variable `LLM_ENGINES_CACHE_DIR` (default: `~/llm-engines/generation_cache`)\n- `overwrite_cache`: whether to overwrite the cache (default: False)\n- `dtype`: the data type to use (default: \"auto\"; {auto,half,float16,bfloat16,float,float32})\n- `quantization`: specify the quantization type, one of {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8} (default: None)\n- `max_retry`: the maximum number of retries for the request (default: None)\n- `completion`: whether to use the completion API; if True, `inputs` should be a single string (default: False)\n\n\n### Generation parameters (`call_model`, `batch_call_model`)\n- `inputs`: the list of inputs for the model; either a list of strings or a list of dictionaries for multi-turn conversation in openai conversation format; if `completion` is True, it should be a single string (required)\n- `top_p`: the nucleus sampling parameter, 0.0 means no sampling (default: 1.0)\n- `temperature`: the randomness of the generation, 0.0 means deterministic generation (default: 0.0)\n- `max_tokens`: the 
maximum number of tokens to generate, `None` means no limit (default: None)\n- `timeout`: the maximum time to wait for the response, `None` means no limit (default: 300)\n- `frequency_penalty`: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. (default: 0.0)\n- `presence_penalty`: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. (default: 0.0)\n- `n`: Number of completions to generate for each prompt. (**only vllm, sglang, openai have this feature**) (default: 1)\n- `stream`: Whether to stream the response or not. If True, `n` will be ignored. (default: False)\n- `conv_system_msg`: The system message for multi-turn conversation; if the messages already contain a system message, this parameter will be overridden (default: None)\n- `logprobs`: Whether to return the log probabilities of the generated tokens, True/False/None (default: None)\n- all the other parameters that are supported by different engines.\n    - for openai and sglang, check out [openai](https://platform.openai.com/docs/api-reference/chat)\n    - for extra parameters of vllm, check out [vllm](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters)\n\n### Launch a separate vllm/sglang model worker\n\n- launch a separate vllm worker\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host \"127.0.0.1\" --port 34200 --tensor-parallel-size 1 --disable-log-requests \u0026\n# address: http://127.0.0.1:34200\n```\n\n- launch a separate sglang worker\n```bash\nCUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host \"127.0.0.1\" --port 34201 --tp-size 1 
\u0026\nCUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --host \"127.0.0.1\" --port 34201 --tp-size 1 --disable-flashinfer \u0026 # disable flashinfer if it's not installed\n# address: http://127.0.0.1:34201\n```\n\n- query multiple existing workers\n```python\nfrom llm_engines import ModelWorker\ncall_worker_func = ModelWorker(\n    model_name=\"Qwen/Qwen2.5-0.5B-Instruct\", \n    worker_addrs=[\"http://127.0.0.1:34200\", \"http://127.0.0.1:34201\"], # many workers can be used, will be load balanced\n    engine=\"sglang\", \n    use_cache=False\n)\n# call the returned worker function, not the ModelWorker class itself\nresponse = call_worker_func([\"What is the capital of France?\"], temperature=0.0, max_tokens=None)\nprint(response)\n# The capital of France is Paris.\n```\n\n### Test notes\n\nWhen setting `temperature=0.0` and `max_tokens=None`, testing long generations:\n- VLLM (fp16) generates the same outputs as hugging face transformers (fp16), but not for bf16.\n- Together AI generates almost the same outputs as vllm (fp16, bf16).\n- SGLang's outputs are sometimes not consistent with the others.\n- Note that some unusual inputs can cause a model to generate indefinitely; it is better to set `timeout` (default: 300) to drop the request after a certain number of seconds.\n- Bug: [issue](https://github.com/vllm-project/vllm/issues/7196) with `vllm==0.5.4` when `num_workers \u003e 1`; use `vllm==0.5.5` instead.\n- Try not to load the same openai models with different cache directories; the current code only loads the cache from the first provided cache directory. When writing the cache, however, it writes to the different cache directories correspondingly. 
This might cause some confusion.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=jdf-prog/LLM-Engines\u0026type=Date)](https://star-history.com/#jdf-prog/LLM-Engines\u0026Date)\n\n## Citation\n```bibtex\n@misc{jiang2024llmengines,\n  title = {LLM-Engines: A unified and parallel inference engine for large language models},\n  author = {Dongfu Jiang},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/jdf-prog/LLM-Engines}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdf-prog%2FLLM-Engines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjdf-prog%2FLLM-Engines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdf-prog%2FLLM-Engines/lists"}