{"id":44457969,"url":"https://github.com/laincloud/deployd","last_synced_at":"2026-02-12T18:08:40.871Z","repository":{"id":57522726,"uuid":"57937848","full_name":"laincloud/deployd","owner":"laincloud","description":"Container orchestration for LAIN","archived":false,"fork":false,"pushed_at":"2018-09-26T01:47:43.000Z","size":650,"stargazers_count":12,"open_issues_count":5,"forks_count":11,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-06-19T11:34:14.513Z","etag":null,"topics":["layer0","orchestration","swarm"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/laincloud.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-05-03T03:11:47.000Z","updated_at":"2024-06-19T11:34:14.514Z","dependencies_parsed_at":"2022-09-26T18:00:56.671Z","dependency_job_id":null,"html_url":"https://github.com/laincloud/deployd","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/laincloud/deployd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laincloud%2Fdeployd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laincloud%2Fdeployd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laincloud%2Fdeployd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laincloud%2Fdeployd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/laincloud","download_url":"https://codeload.github.com/laincloud/deployd/tar.gz/refs/heads/master","sbom_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/laincloud%2Fdeployd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29375740,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["layer0","orchestration","swarm"],"created_at":"2026-02-12T18:08:40.175Z","updated_at":"2026-02-12T18:08:40.865Z","avatar_url":"https://github.com/laincloud.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lain Deployd\n\n[![Build Status](https://travis-ci.org/laincloud/deployd.svg?branch=master)](https://travis-ci.org/laincloud/deployd) [![codecov](https://codecov.io/gh/laincloud/deployd/branch/master/graph/badge.svg)](https://codecov.io/gh/laincloud/deployd) [![MIT license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://opensource.org/licenses/MIT)\n\n\n## Introduction\n\nDeployd is the component responsible for low-level container orchestration in Lain. It maps user-level operations on concepts such as proc and portal to actual container operations, including starting, stopping, upgrading, and migrating containers. Deployd periodically inspects all containers and automatically repairs abnormal ones to keep services healthy.\n\n## Overall Design\n\n1. Deployd consists of four main parts: apiserver, engine, cluster, and storage\n1. A Pod is the scheduling unit. A Pod contains the description of the Containers to run, that is, the state the target Containers should reach, along with the preset scheduling conditions (Filter) and the dependencies (Dependency) used when scheduling the Pod\n1. A PodGroup provides Replication control for a Pod, as well as its restart settings\n1. 
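Pods can declare dependencies: a Pod can specify the dependency Pods that must run alongside it, together with a dependency Scope, for example a Namespace-level Scope or a Node-level Scope\n1. In summary, Deployd uses a Pod to describe a Container's run parameters, resource reservations (memory and CPU), dependencies, and preset scheduling conditions; a PodGroup to describe how many instances of a Pod are needed and which restart policy applies in different situations, for example never restart, restart on failure, or always restart; and a Dependency to describe the basic run parameters and resources of the depended-on Pod\n\n![deployd design](deployd-design.png)\n\n### OrcEngine\nOrcEngine uses a dedicated Worker Queue to queue the operation requests coming from the Api Server and dispatch them to the corresponding dependsController and podGroupController. It also schedules the periodic refresh operations used for self-checks. By default it reads the Spec and the known runtime information of all PodGroups and DependPods from Etcd.\n\nOrcEngine also includes RuntimeEagleView, which uses the filtering API provided by the Swarm Api to correct runtime information in real time. The runtime information stored in Etcd may be inaccurate, so RuntimeEagleView fetches the actual runtime data from Swarm to calibrate it.\n\nAll operation requests enter OrcEngine's Worker Queue. Each Operation taken off the Queue dispatches its request to the corresponding Controller, where it enters that Controller's own Worker Queue to be scheduled. (See engine/engine_ops.go for the implementation.)\n\n### podGroupController\n\npodGroupController controls and self-checks a PodGroup. It handles all scheduling for that PodGroup and periodically checks itself, adjusting according to the PodGroup's current working state in the cluster and its configuration. Each podGroupController uses its own Goroutine to arrange all of its scheduling work, which is why OrcEngine itself exposes an asynchronous operation interface. podGroupController calls the corresponding podController to perform the actual low-level Container operations (see engine/podgroup_ops.go for the implementation). Every API is split into several low-level Operation Functors that are pushed onto the Worker Queue, which lets most of the code be reused:\n\n1. Deploy: for each Instance, Deploy first asks RuntimeEagleView whether a matching Container has already been deployed. If an already-deployed Pod is found, Deploy does not reschedule the Container; it only re-reads the Container status and restores the PodGroup's runtime data. When deploying, it adds an Affinity scheduling hint where possible, for example `affinity:cc.bdp.lain.deployd.pg_name!=~hello.web.web`, so that Instances are spread across the cluster.\n1. Instance count scheduling (RescheduleInstance): the Delta in the Instance count determines whether to Deploy new Instances or Remove Instances. Deploying works the same way as the Deploy operation; removal starts from the Instances with the largest InstanceNo\n1. Spec update scheduling (RescheduleSpec): Instances are updated serially. Each Instance is removed first, then after a `10s` wait the Deploy Instance operation above is called, again using RuntimeEagleView for calibration\n1. Drift: each Instance decides whether it needs to drift. If it does, the Instance is first removed and then deployed to the specified node, or to a node chosen by Swarm\n1. Remove: each Instance is removed through podController, then RuntimeEagleView is called again to refresh the list of running Containers. Any leftover Container is removed directly, so that a failed podController operation does not pollute data and runtime state\n1. 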
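Refresh: first the runtime Container list is updated through RuntimeEagleView, then each Instance refreshes itself. If it matches the RuntimePod, nothing needs to be done. Otherwise, the following cases are currently handled:\n\t* Container Missing: the Deploy Instance operation above is called again to deploy a new instance\n\t* Deployd data format upgrade: if a Container of an old version is still running, UpgradeInstance is applied to the Instance to upgrade the Container itself to the new version, for example by adding or updating the Container's configuration Labels\n\t* If the RuntimePod version does not match the current Spec version, UpgradeInstance is called to update the Instance\n\t* If the Container is not running normally, the restart policy in PodGroupSpec decides whether to restart the Container\n\n### dependsController\n\ndependsController controls and self-checks a Dependency Pod. When OrcEngine creates a Dependency Pod, it only records the Spec and creates the corresponding dependsController; no Pod is actually deployed. A dependency must run on the same cluster Node as the PodGroup Instance that uses it, so the Dependency Pod is deployed only once a PodGroup Instance is actually running. The deployment details are controlled by DependencyPolicy, which currently has two levels: Node and Namespace. With the Namespace policy, PodGroup Instances that share a Namespace use the same Namespace-level Dependency Pod on a given host node. With the Node policy, only one Dependency Pod is deployed on that Node and everyone shares it.\n\ndependsController also calls the corresponding podController for the actual low-level Container operations, and all of its APIs are likewise split into low-level Operation Functors pushed onto the Worker Queue of a dedicated Goroutine. The difference is that dependsController uses podControllers that carry a reference count and a VerifyTime, which is how sharing at the different DependencyPolicy levels is implemented.\n\ndependsController learns about Dependency changes in the system through Events, namely add, remove, and Verify, which map to AddPod, RemovePod, and VerifyPod. See `engine/depends_ops.go` for the implementation:\n\n1. AddSpec: adds a Spec; it is simply saved to Storage\n1. UpdateSpec: updates the Spec and upgrades all currently running Pods, much like the update operation of podGroupController above\n1. RemoveSpec: removes the Spec. By default this is refused while the podControllers at the corresponding levels still have a reference count greater than 0; the PodGroups that depend on it must be removed first. If force is specified, the related Pods are stopped forcibly\n1. AddPod: on a DependencyEvent that adds a Pod, the podController is located by the Namespace and Node in the event. If nothing is deployed yet, the Pod is deployed; otherwise only the reference count is incremented and the VerifyTime is updated\n1. RemovePod: on a DependencyEvent that removes a Pod, the podController is located by the Namespace and Node in the event. The Pod is not removed immediately; only the reference count and the VerifyTime are updated. The actual removal happens during the self-check, once no PodGroup has verified for a long time that it still uses the Pod. The garbage collection timeout is currently `5m`\n1. VerifyPod: on a DependencyEvent Verify event, the podController is located by the Namespace and Node in the event and its VerifyTime is updated\n1. 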
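Refresh self-check: first refreshes the runtime list of Dependency Pods in RuntimeEagleView, then checks the podController for every Namespace on every Node. If the time elapsed since VerifyTime exceeds the Deploy startup time and also the garbage collection timeout, the Pod is collected. If the Pod is confirmed not to be garbage, there are several cases:\n   * The runtime is normal and matches the list in RuntimeEagleView: everything is fine\n   * A Container is missing: deployd tries to find the lost Container in RuntimeEagleView. If it is found, it only needs to be re-registered; if not, it is redeployed\n   * An old-version Container is found: the UpgradePod operation is called to upgrade the Pod, to meet the needs of Deployd's own data and Container updates, for example upgraded Container configuration Labels\n   * The running version differs from the version defined in the Spec: UpgradePod is also called to upgrade it\n   * The Container has died: deployd tries to restart it\n\n### constraintController\n\nconstraintController adds restriction rules that apply when pods are deployed. Its main use today is marking certain nodes as non-deployable during cluster maintenance, so that deployd does not place pods on the restricted nodes.\nThe constraint mechanism comes from swarm and is one kind of node filter; see the swarm filter documentation for details.\n\n### notifyController\n\nnotifyController manages deployd's callback list and sends notifications to it. When deployd detects a problem with a container's state, it notifies the registered callback urls.\nnotifyController currently sends a notification in the following cases:\n\n* a pod is in the exit state\n* a pod cannot be found\n* a pod has no IP after starting\n* a pod has been restarted several times within a certain period\n\n## Build and Install\n\n### Build\n**Requires: go1.5+**\n\n```sh\ngo build -o deployd\n```\n\n### Run\n**Requires: swarm, etcd**\n\n```sh\n./deployd -h # show the startup options\n\n# Example\n./deployd -web :9000 -swarm http://127.0.0.1:2376 -etcd http://127.0.0.1:2379 # listen on port 9000\n```\n\n## API Reference\n\nDeployd's internal orchestration engine, OrcEngine, uses an asynchronous execution model. The result returned by a scheduling API is therefore generally only an acknowledgement that the task was accepted, not the final result of the operation. Once accepted, a task enters OrcEngine's asynchronous execution queue; use the related GET Api to obtain the actual runtime information.\n\n### Engine Api\n\n```\nGET /api/engine/config\n# Get the engine configuration\n# Returns:\n#     OK: EngineConfig JSON data\n\nPATCH /api/engine/config\n# Modify the engine configuration\n# Parameters:\n#     Body: EngineConfig JSON data\n# Returns:\n#     OK: EngineConfig JSON data\n# Errors:\n#     BadRequest: malformed EngineConfig JSON, or a required parameter is missing\n\nPATCH /api/engine/maintenance?on=false\n# Set maintenance mode\n# Parameters:\n#     on: whether to turn maintenance mode on\n# Returns:\n#     OK: EngineConfig JSON data\n```\n\n### PodGroup Api\n\n```\nGET /api/podgroups?name={string}\u0026force_update={true|false}\n# Get the Spec and Runtime data of a PodGroup\n# Parameters:\n#     name: PodGroup name\n#     force_update: whether to force an update, true or false\n# Returns:\n#     OK: PodGroupWithSpec JSON data\n# Errors:\n#     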
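BadRequest: the name parameter is missing\n#     NotFound: no PodGroup with that name was found\n\nPOST /api/podgroups\n# Create a PodGroup to be scheduled and deploy it immediately\n# Parameters:\n#     Body: PodGroupSpec JSON data\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: malformed PodGroupSpec JSON, or a required parameter is missing\n#     NotAllowed: the cluster lacks schedulable resources, or the PodGroup already exists (use the PATCH endpoints instead)\n\nDELETE /api/podgroups?name={string}\n# Delete a PodGroup deployment\n# Parameters:\n#     name: PodGroup name\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: the name parameter is missing\n#     NotFound: no PodGroup with that name was found\n\nPATCH /api/podgroups?name={string}\u0026cmd=replica\u0026num_instances={int}\u0026restart_policy={string}\n# Change the number of Instances and the restart policy of a running PodGroup\n# Parameters:\n#     name: PodGroup name\n#     num_instances: number of instances required\n#     restart_policy(optional): restart policy, one of: never, always, onfail\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: a required parameter is missing\n#     NotAllowed: the cluster lacks schedulable resources\n#     NotFound: no PodGroup with that name was found\n\nPATCH /api/podgroups?name={string}\u0026cmd=spec\n# Change the Spec of a running PodGroup\n# Parameters:\n#     name: PodGroup name\n#     Body: the new PodSpec\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: a required parameter is missing\n#     NotAllowed: the cluster lacks schedulable resources\n#     NotFound: no PodGroup with that name was found\n\nPATCH /api/podgroups?name={string}\u0026cmd=operation\u0026optype={start/stop/restart}[\u0026instance={int}]\n# Start, stop, or restart a PodGroup or one of its instances\n# Parameters:\n#     name: PodGroup name\n#     optype: operation type: start, stop, or restart\n#     instance: the PodGroup instance to operate on; if omitted, the whole PodGroup\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: a required parameter is missing\n#     NotAllowed: the cluster lacks schedulable resources\n#     NotFound: no PodGroup with that name was found\n```\n\n### Dependency Api\n\n```\nGET /api/depends?name={string}\n# Get the Spec and Runtime data of a Dependency Pod\n# Parameters:\n#     name: Dependency Pod name\n# Returns:\n#     OK: PodSpec and Runtime JSON data\n# Errors:\n#     BadRequest: the name parameter is missing\n#     NotFound: no matching dependency Pod definition was found\n\nPOST /api/depends\n# Create a Dependency Pod; it is not deployed immediately but on demand\n# Parameters:\n#     Body: PodSpec JSON data\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: malformed PodSpec JSON, or a required parameter is missing\n#     NotAllowed: 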
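the cluster lacks schedulable resources, or the Dependency already exists (use the PUT endpoint instead)\n\nDELETE /api/depends?name={string}\u0026force={true|false}\n# Delete a Dependency deployment\n# Parameters:\n#     name: Dependency Pod name\n#     force(optional): whether to force deletion; if force=false and the Dependency Pod is still depended on by other PodGroups, it is not deleted\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: the name parameter is missing\n#     NotFound: no Dependency with that name was found\n\nPUT /api/depends\n# Update a Dependency Pod; all currently running instances are updated progressively\n# Parameters:\n#     Body: PodSpec JSON data\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: malformed PodSpec JSON, or a required parameter is missing\n#     NotFound: no matching Dependency was found\n```\n\n### Node Api\n\n```\nGET /api/nodes\n# Get the current node data of the cluster\n\nPATCH /api/nodes?cmd=drift\u0026from={string}\u0026to={string}\u0026pg={string}\u0026pg_instance={int}\u0026force={true|false}\n# Drift the related Pods\n# Parameters:\n#     from: name of the node to drift away from\n#     to(optional): name of the target node; BadRequest is returned if it equals from\n#     pg(optional): name of the specific PodGroup to drift\n#     pg_instance(optional): InstanceNo of the specific PodGroup instance to drift; requires the pg parameter\n#     force(optional): whether to ignore the PodGroup Stateful flag; if false, PodGroups marked Stateful are not drifted\n# Returns:\n#     Accepted: the task was accepted\n# Errors:\n#     BadRequest: a required parameter is missing\n```\n\n### Constraint Api\n\n```\nGET /api/constraints\n# Get the current constraints of the cluster\n\nPATCH /api/constraints?type={string}\u0026value={string}\u0026equal={true|false}\u0026soft={true|false}\n# Add or modify a constraint\n# Parameters:\n#     type: the constraint type to modify, for example node\n#     value: the value for that constraint type\n#     equal(optional): whether the constraint value is applied with == or !=; if true, == is used\n#     soft(optional): whether this constraint is strictly enforced; if true, a container cannot be deployed when the condition cannot be met\n# Returns:\n#     Accepted: the constraint was added\n# Errors:\n#     BadRequest: a required parameter is missing\n\nDELETE /api/constraints?type={string}\n# Delete the constraint of a given type\n# Parameters:\n#     type: Constraint name\n# Returns:\n#     Accepted: the constraint was deleted\n# Errors:\n#     BadRequest: a required parameter is missing\n#     NotFound: no constraint of that type was found\n```\n\n### Notify Api\n\n```\nGET /api/notifies\n# Get the current notify list of the cluster\n\nPOST /api/notifies?callback={string}\n# Add a callback url\n# Parameters:\n#     callback: the callback url to add\n# Returns:\n#     Accepted: the callback url was added\n# 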
Errors:\n#     BadRequest: a required parameter is missing or the url is malformed\n\nDELETE /api/notifies?callback={string}\n# Delete a callback url\n# Parameters:\n#     callback: callback url\n# Returns:\n#     Accepted: the callback url was deleted\n# Errors:\n#     BadRequest: a required parameter is missing\n#     NotFound: no matching callback url was found\n```\n\n### Status API\n\n```\nGET /api/status\n# Get whether the deployd engine is started or stopped\n\nPATCH /api/status\n# Start or stop the deployd engine\n# Parameters:\n#     Header: Content-Type: application/json\n#     Body: JSON such as {\"status\": \"start\"}\n```\n\n## Cluster Management Interface\nThe Cluster part currently uses Docker Swarm for cluster management. It also defines a NetworkManager interface (not yet mature) for plugging in a network manager, either Calico (deprecated and removed) or Noop. The basic interfaces are:\n\n```go\ntype NetworkManager interface {\n\tGetContainerNetInfo(nodeName string, containerId string) (ContainerNetInfo, error)\n\tPatchEnv(envlist []string, key string, value string)\n}\n\ntype Cluster interface {\n\tNetworkManager\n\tGetResources() ([]Node, error)\n\tListContainers(showAll bool, showSize bool, filters ...string) ([]adoc.Container, error)\n\tCreateContainer(cc adoc.ContainerConfig, hc adoc.HostConfig, name ...string) (string, error)\n\tStartContainer(id string) error\n\tStopContainer(id string, timeout ...int) error\n\tInspectContainer(id string) (adoc.ContainerDetail, error)\n\tRemoveContainer(id string, force bool, volumes bool) error\n\tRenameContainer(id string, name string) error\n\tMonitorEvents(filter string, callback adoc.EventCallback) int64\n\tStopMonitor(monitorId int64)\n}\n```\n\n## Storage Interface\nThe storage part currently uses an Etcd cluster for KV storage. The main interface is:\n\n```go\ntype Store interface {\n\tGet(key string, v interface{}) error\n\tSet(key string, v interface{}, force ...bool) error\n\tKeysByPrefix(prefix string) ([]string, error)\n\tRemove(key string) error\n\tTryRemoveDir(key string)\n\tRemoveDir(key string) error\n}\n```\n\n## Known Issues\n1. 
Swarm takes a lock for write-type operations such as pull image, create container, and start container. The lock is global and applies even when the targets are not on the same node, so it is a bottleneck: during large-scale redeployments or updates, the throughput and concurrency of the whole orchestration system are limited by Swarm\n\n## License\n\nDeployd is released under the [MIT license](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaincloud%2Fdeployd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flaincloud%2Fdeployd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaincloud%2Fdeployd/lists"}