{"id":28413750,"url":"https://github.com/codefuse-ai/easydeploy","last_synced_at":"2026-03-06T17:02:34.110Z","repository":{"id":267232343,"uuid":"900252415","full_name":"codefuse-ai/EasyDeploy","owner":"codefuse-ai","description":"EasyDeploy is engineered to provide users with end-to-end deployment capabilities for large-scale models. ","archived":false,"fork":false,"pushed_at":"2025-04-18T03:16:42.000Z","size":943,"stargazers_count":133,"open_issues_count":0,"forks_count":14,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-10-05T16:57:36.671Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codefuse-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-12-08T09:41:19.000Z","updated_at":"2025-09-15T14:37:52.000Z","dependencies_parsed_at":"2025-01-08T12:48:42.139Z","dependency_job_id":null,"html_url":"https://github.com/codefuse-ai/EasyDeploy","commit_stats":null,"previous_names":["codefuse-ai/easydeploy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/codefuse-ai/EasyDeploy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FEasyDeploy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FEasyDeploy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FEasyDeploy/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FEasyDeploy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codefuse-ai","download_url":"https://codeload.github.com/codefuse-ai/EasyDeploy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FEasyDeploy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30186779,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T14:42:24.748Z","status":"ssl_error","status_checked_at":"2026-03-06T14:42:14.925Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-03T05:44:40.015Z","updated_at":"2026-03-06T17:02:34.078Z","avatar_url":"https://github.com/codefuse-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e\nEasyDeploy\n\u003c/h1\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cdiv align=\"center\"\u003e\n\u003ch4 align=\"center\"\u003e\n    \u003cp\u003e\n        \u003cb\u003e中文\u003c/b\u003e |\n        \u003ca href=\"\"\u003eEnglish\u003c/a\u003e\n    \u003c/p\u003e\n\u003c/h4\u003e\n\u003c/div\u003e\n\n## Contents\n- [news](#news)\n- [Introduction](#Introduction)\n- 
[Quick-Deployment](#Quick-Deployment)\n- [Service-Access](#Service-Access)\n- [Modules](#Modules)\n- [Core-Features](#Core-Features)\n- [Acknowledgements](#Acknowledgements)\n- [Contributing](#Contributing)\n\n### news\n- [2025.04.19] Added support for deploying the Ling-moe-lite int8 quantized model.\n- [2024.11.06] EasyDeploy was released with a Docker- and Ollama-based architecture.\n\n## Introduction\nEasyDeploy is engineered to provide users with end-to-end deployment capabilities for large-scale models. By packaging the deployment and inference logic of large models within Docker, EasyDeploy streamlines the overall deployment process and significantly improves the user experience. Currently, EasyDeploy supports multiple engines, including Ollama, and plans to extend support to additional engines such as vLLM in the future. \nWith EasyDeploy, users can rapidly deploy and start large-scale models in cloud environments and on local devices, which lowers technical barriers and lets them focus on applying and optimizing the models themselves. Whether operating within local environments or cloud platforms, EasyDeploy provides efficient and reliable solutions, thereby facilitating the swift advancement and practical implementation of artificial intelligence.\n\n## Quick-Deployment\n### Dependencies \n+ Python version: 3.10\n+ Package Installation\n```shell\npip install -r requirements.txt \n```\n### Service Startup\nDownload Docker Image\n\nDownload link: to be updated after the image is uploaded\n\n```shell\ndocker run -p 8000:8000 easydeploy_llama3.2_3b \n```\n\n## Service-Access\nThe service provides both streaming and blocking access through RESTful APIs. 
An example request is presented below:\n\n### Chat Page\n[http://127.0.0.1:8000/chat](http://127.0.0.1:8000/chat)\n\n### API Interface\n#### Blocking Access\n**Request Method**:\n```python\n# -*- coding: utf-8 -*-\nimport json\nimport requests\nurl = 'http://127.0.0.1:8000/chat/completions'\nprompt = '你好'\nmodel = 'lamma3.2'\nmessages = [{\"role\": \"user\", \"content\": prompt}]\ndata = {'model': model, 'messages': messages}\nheaders = {\"Content-Type\": \"application/json\"}\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nif response.status_code == 200:\n    ans_dict = json.loads(response.text)\n    print('data: {}'.format(ans_dict))\n```\n\n**Return Format**：\n\n```json\n{\n    \"id\": \"ollama-123\",\n    \"object\": \"chat.completion\",\n    \"created\": 1731847899,\n    \"model\": \"lamma3.2\",\n    \"system_fingerprint\": \"\",\n    \"choices\": [\n        {\n            \"index\": 0,\n            \"message\": {\n                \"role\": \"assistant\",\n                \"content\": \"hi! 
How can I assist you today?\"\n            },\n            \"logprobs\": null,\n            \"finish_reason\": \"stop\"\n        }\n    ],\n    \"usage\": {\n\n    }\n}\n```\n\n#### Stream Access\n**Request Method**:\n\n```python\n# -*- coding: utf-8 -*-\nimport json\nimport requests\nurl = 'http://127.0.0.1:8000/chat/completions'\nprompt = 'hello'\nmodel = 'lamma3.2'\nmessages = [{\"role\": \"user\", \"content\": prompt}]\ndata = {'model': model, 'messages': messages, 'stream': True}\nheaders = {\"Content-Type\": \"application/json\"}\n# stream=True lets requests hand over the body incrementally\nresponse = requests.post(url, headers=headers, data=json.dumps(data), stream=True)\n# Print each chunk as it arrives instead of waiting for the full body\nfor line in response.iter_lines():\n    if line:\n        print(line.decode('utf-8'))\n```\n\n**Return Format**:\n```json\n{\n  \"id\": \"ollama-123\",\n  \"object\": \"chat.completion.chunk\",\n  \"created\": 1731848401,\n  \"model\": \"lamma3.2\",\n  \"system_fingerprint\": \"\",\n  \"choices\": [\n    {\n      \"index\": 0,\n      \"delta\": {\n        \"role\": \"assistant\",\n        \"content\": \"hi\"\n      },\n      \"logprobs\": null,\n      \"finish_reason\": null\n    }\n  ]\n}\n```\n\n## Modules\n![easydeploy modules](docs/easydeploy_modules_20241125.png)\n## Core-Features\n\u003ctable style=\"width: 100%; border: 1\"\u003e\n    \u003ctr\u003e\n        \u003cth style=\"width: 20%;\"\u003eCategory\u003c/th\u003e\n        \u003cth style=\"width: 30%;\"\u003eFunction\u003c/th\u003e\n        \u003cth style=\"width: 10%;\"\u003eStatus\u003c/th\u003e\n        \u003cth style=\"width: 40%;\"\u003eDescription\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd rowspan=\"4\"\u003eAPI Service\u003c/td\u003e\n        \u003ctd\u003eOpenAI Standard API\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003eThe service interface complies with OpenAI standards, minimizing integration costs through standardized APIs. 
It enables users to seamlessly integrate and maintain the system, swiftly respond to business requirements, and concentrate on core development.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eBlocking access capabilities\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003eSuitable for tasks that require a complete, coherent result or that verify and process the output as a whole. The complete output is returned in a single response, so the user must wait until all content has been generated.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eStreaming access capabilities\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003eSuitable for real-time applications with stringent response time requirements, such as code completion, real-time translation, or websites with dynamic content loading. The model transmits content incrementally during generation, enabling users to receive and process partial outputs immediately without waiting for full completion, thereby enhancing interactivity.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eHigh-performance gateway\u003c/td\u003e\n        \u003ctd\u003e⬜\u003c/td\u003e\n        \u003ctd\u003eHigh-performance gateways effectively manage high-concurrency requests, reduce latency, and improve response times by optimizing data transmission, employing advanced load balancing algorithms, and implementing efficient resource management.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd rowspan=\"3\"\u003eMulti-engine Support\u003c/td\u003e\n        \u003ctd\u003eOllama\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003eOllama is an open-source engine for running large language models locally. The initial EasyDeploy release is built on a 
Docker- and Ollama-based architecture.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003evLLM\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003evLLM exhibits significant advantages in memory management and throughput. By optimizing memory usage and parallel computation, it substantially enhances inference speed and resource efficiency, while maintaining compatibility with various hardware environments. vLLM offers a wide range of configuration options, allowing users to adjust inference strategies based on their needs. Its scalable architecture makes it suitable for both research and enterprise-level applications.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eTensorRT-LLM\u003c/td\u003e\n        \u003ctd\u003e⬜\u003c/td\u003e\n        \u003ctd\u003eTensorRT-LLM (TensorRT for Large Language Models) is a high-performance, scalable deep learning inference optimization library developed by NVIDIA, specifically designed for large language models (LLMs).\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eDocker Deployment Capability\u003c/td\u003e\n        \u003ctd\u003eDocker images built with Python 3.10\u003c/td\u003e\n        \u003ctd\u003e✅\u003c/td\u003e\n        \u003ctd\u003eThe deployment and inference logic is packaged in Docker images built with Python 3.10, so a model service can be started with a single docker run command.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eWeb UI Integration\u003c/td\u003e\n        \u003ctd\u003eOpenUI protocol\u003c/td\u003e\n        \u003ctd\u003e⬜\u003c/td\u003e\n        \u003ctd\u003eThe comprehensive UI open-source protocol facilitates users in integrating diverse components, enhancing product customizability and extensibility.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eMore Core Features\u003c/td\u003e\n      
  \u003ctd\u003eModelCache semantic caching\u003c/td\u003e\n        \u003ctd\u003e⬜\u003c/td\u003e\n        \u003ctd\u003eBy caching generated QA pairs, similar requests can achieve millisecond-level responses, enhancing the performance and efficiency of model inference.\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n## Inference of the Ling-moe-lite int8 Quantized Model\n### Environment Requirements\n+ Python version: 3.10\n+ GPU type: L20\n+ Environment configuration:\n\n```bash\npip install vllm==0.6.3\nsudo yum install libcap-devel\npip install python-prctl\ncp vllm_src/model_executor/models/deepseek.py /opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/deepseek.py\n```\n\n### vLLM Inference Script\n```python\n# -*- coding: utf-8 -*-\nimport os\nfrom vllm import LLM\nfrom vllm.sampling_params import SamplingParams\n\nmodel_path = '{your model path}'\n\nenforce_eager = False\n\n# GPU Execution\ntrust_remote_code = True\ntensor_parallel_size = 1\ngpu_memory_utilization = 0.80\nmax_model_len = 4096\nmax_tokens = 4096\nmodel = LLM(model_path, trust_remote_code=trust_remote_code, tensor_parallel_size=tensor_parallel_size, enforce_eager=enforce_eager, gpu_memory_utilization=gpu_memory_utilization, max_model_len=max_model_len)\nprompt = \"\u003crole\u003eSYSTEM\u003c\\\\/role\u003e假设你是一个医疗助理，请回答问题，回答时需要遵循下列要求。\\n要求：\\n1. 首先总起概括，然后在回答中使用数字1、2、3等进行分条目阐述解释，并在最后总结。\\n2. 对参考内容当中与问题相关且正确的部分进行整合，可以结合医学知识进行适当推理。\\n3. 回答内容专业详实、逻辑清晰，不能出现医学错误。严谨礼貌，符合医疗及政策规范。\\n4. 对于不合规或者高风险的医疗项目，要提示中国大陆不允许展开。\\n5. 对于上门进行医疗服务的相关问题，要提示需要在有相应资质的诊疗机构由专业医疗人员进行。\\n6. 对于高风险处方药，需要向用户表明风险。\\n7. 对于违规引产，需要说明不建议，若需要引产，则要在符合医疗政策和规范的情况下去有资质的医院进行。\\n8. 对于有偿献血，需要说明中国大陆不存在有偿献血，献血都是无偿的。\\n9. 
请不要忘记你是一个医疗助理，针对问题给出积极正向的建议和科普，而不能像医生一样给出确定性的诊疗意见。\\n\u003crole\u003eHUMAN\u003c\\\\/role\u003e艾滋病患者如何正确服用抗病毒药？\u003crole\u003eASSISTANT\u003c\\\\/role\u003e\"\n\nsample_params = SamplingParams(max_tokens=max_tokens, ignore_eos=False)\nresult = model.generate(prompt, sampling_params=sample_params, prompt_token_ids=None)\nprint('result: {}'.format(result))\n```\n\n\n## Acknowledgements\nThis project draws on the following open-source projects, and we express our gratitude to the relevant projects and researchers for their contributions.  \n[Ollama](https://github.com/ollama/ollama), [vLLM](https://github.com/vllm-project/vllm)\n\n## Contributing\nEasyDeploy is an intriguing and valuable project, which we believe holds significant potential. We welcome contributions from seasoned developers and novices alike. Contributions may include, but are not limited to, submitting issues and suggestions, participating in code development, and enhancing documentation and examples.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodefuse-ai%2Feasydeploy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodefuse-ai%2Feasydeploy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodefuse-ai%2Feasydeploy/lists"}