{"id":15178791,"url":"https://github.com/ning1875/k8ssolutions","last_synced_at":"2026-03-05T21:33:00.413Z","repository":{"id":255012245,"uuid":"848253785","full_name":"ning1875/k8sSolutions","owner":"ning1875","description":"Enterprise-facing (ToB) infrastructure solutions for k8s/cicd/monitoring/microservices","archived":false,"fork":false,"pushed_at":"2024-08-27T12:51:22.000Z","size":2555,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-05T07:04:35.645Z","etag":null,"topics":["apisix","cicd","consul","crd","go","grafana","ingress","istio","k8s","kubernetes","nginx","operator","prometheus"],"latest_commit_sha":null,"homepage":"https://haohuo.jinritemai.com/ecommerce/trade/detail/index.html?id=3696623728575250607\u0026origin_type=604","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ning1875.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-27T12:28:04.000Z","updated_at":"2024-08-27T12:55:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"c04a3446-77c7-4071-8bb0-358538d45e81","html_url":"https://github.com/ning1875/k8sSolutions","commit_stats":{"total_commits":2,"total_committers":2,"mean_commits":1.0,"dds":0.5,"last_synced_commit":"b1f5da4fb38db664dad7a767418793bee1d12f68"},"previous_names":["ning1875/k8ssolutions"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ning1875/k8sSolutions","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ning1875%2Fk8sSolutions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/ning1875%2Fk8sSolutions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ning1875%2Fk8sSolutions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ning1875%2Fk8sSolutions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ning1875","download_url":"https://codeload.github.com/ning1875/k8sSolutions/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ning1875%2Fk8sSolutions/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281140986,"owners_count":26450555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-26T02:00:06.575Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apisix","cicd","consul","crd","go","grafana","ingress","istio","k8s","kubernetes","nginx","operator","prometheus"],"created_at":"2024-09-27T15:23:14.980Z","updated_at":"2025-10-26T17:31:12.810Z","avatar_url":"https://github.com/ning1875.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# What operations problems and needs do AI startups face\n## By end-to-end pipeline and specific module\n| Problem category | Example problem | Xiaoyi's solution |\n|---------------------|-----------------------|--------------------------------------------|\n| Building the training environment: compute platform | Use a public-cloud mlp platform or build your own | Depends on whether cheap multi-cloud compute is needed 
                          |\n| 如何搭建训练环境之算力选择       | 用什么类型的卡，多大的数据量        | 消费卡3090/4090 ，大模型大显存rdma：A800              |\n| 如何搭建训练环境之原始数据处理     | 对象存储，工具链和工作流任务        | 对象存储有成本刺客，cpu/gpu工作流数据处理任务(标注等等)           |\n| 如何搭建训练环境之训练任务如何读取数据 | 需要用到高速缓存存储组件          | 云pfs/goosefs/alluxio/rapidfs等等使用和mlp平台开发打通 |\n| 如何搭建训练环境之训练框架       | 选哪个框架，单机多机任务如何发起      | pytorch/tensorflow等等                       |\n| 如何搭建训练环境之训练存储       | 共享存储和对象存储             | cfs/nas/pfs  obs/cos/bos的sdk               |\n| 训练过程中的问题            | 训练报错：nccl、cuda、nvml等等 | 具体看                                        |\n| 训练过程中可观测性           | metrics监控、训练日志、可视化工具  | prometheus监控套件、 云日志和k8s日志、Tensorboard大盘    |\n| 算法开发环境              | gpu开发机                | 单卡/多卡/单机/多机 vs gpu虚拟化 share等等              |\n| 算法开发环境              | 镜像、本地私有pip源、公网/外网加速   | 依赖库cuda/torch/ps等等 最好都提供                   |\n| 算法开发环境              | ide                   | vscode+ssh环境                               |\n| 算法开发环境              | 研发数据存储                | nas/cfs/pfs等等                              |\n| 基础运维组件              | 账号认证体系                | 飞书扫码、ldap账号密码                              |\n| 基础运维组件              | 办公网和云环境打通             | 专线和vpn                                     |\n| 基础运维组件              | 代码托管、cicd             | gitlab、镜像构建、runner、argocd                  |\n| 内部的平台服务             | 网关服务和在线服务             | 在线k8s集群Deployment +apisix                  |\n| 模型部署产物交付            | 推理服务的部署               | 管理在线的k8s集群、artifactory制品库                  |\n\n\n## 按照宏观大方向：如果没有一个懂得运维开发会有不好的事情\n- 成本：你的it(云和idc等)成本会高出40%-50%\n- 单云：你会被单一公有云所绑架：外面有便宜的算力不会迁移，设计多云架构等\n- 安全：傻乎乎的公网暴露，被挖矿等攻击，挤占你的算力，偷走你的核心数据等等\n- 可用：自己搭建的由于架构不合理，经常宕机，影响模型开发进度，效率低\n\n\n## 找到小乙我能给你们提供什么\n- 让你的训练跑起来：设计并实施完整的AI训练推理基础环境\n- 管理你的数据：合适的存储方案\n- 压缩你的成本：把你云账单每月降低30%-40%\n- 基础环境：提供笔记本、网络相关方案\n\n# 关于合作的发票问题\n- 面向的企业的客户\n- 企业客户打款都需要对公\n- 我这边有个体户：最近搞定了发票的问题 ![im](./发票.png)\n\u003e 那么合作的流程：\n- 讨论方案和交付物\n- 确定价格\n- 根据要求开具专票\n- 第一批打款\n- 帮你来解决这些问题\n- 
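Final-payment invoice and payment\n\n# k8s operations solutions\n# k8s multi-cluster management\n# Highly available Prometheus cluster architecture\n# Canary releases\n\n\n# \u003cfont color=red\u003eFocused on infrastructure solutions for k8s/monitoring/cicd/microservices/golang\u003c/font\u003e\n- Cluster operations, component selection, and development are all available in these areas\n- If you have a need, add me on WeChat (`mxy1875`) to discuss\n\n\n# About me\n- [Douyin: 小乙运维杂货铺](https://v.douyin.com/ihNX2nKx/)\n- [github:ning1875](https://github.com/ning1875)\n- [Bilibili](https://space.bilibili.com/278569661)\n- [Zhihu](https://www.zhihu.com/people/lang-zi-yan-qing-yan-xiao-yi-62/posts)\n\n\n\n## Core strengths: k8s/prometheus/cicd/golang ops development expert\n- Expert in k8s source code: solving hard low-level k8s problems, tuning large k8s clusters, low-level containerd issues, etc.\n- Systematically studied the k8s source code from 2 angles, distilled into [2 k8s source code courses](https://haohuo.jinritemai.com/ecommerce/trade/detail/index.html?id=3669946874917421381\u0026origin_type=604) (by component, and starting from a concrete problem)\n- Independently developed 30+ k8s ecosystem projects: multi-cluster auto-guard, operators, custom schedulers, webhooks, various ds, etc.\n- No weak spots across online and offline k8s; offline training: aiOnK8s, aiInfra, volcano, etc.\n- Online clusters: stability assurance, traffic control, cluster gateway apisix, multi-lane routing, etc.\n- k8s resource utilization: overcommit, application resource profiling, online/offline co-location with tidal scheduling, etc.\n- Monitoring architecture: I can take a monitoring system to a new level; I know the source code of prometheus and its surrounding projects well, have delivered training many times, and have contributed to multiple open-source projects\n- Monitoring internals: developed 20+ exporters; maintained various tsdb, thanos, heavy-query acceleration, HA with dynamic sharding, etc.\n- cicd: independently designed a complete multi-environment, multi-lane release process; familiar with the source code of pipeline tools such as tekton, argocd, and kruise-rollout\n- golang: extensive experience building ops platforms and tools; [independently designed and developed an 8-module full-stack ops platform](https://www.bilibili.com/video/BV1j2421c7ac) (tickets, cmdb and service tree, grpc-cs task execution, monitoring, k8s, cicd, inspection, log monitoring, distributed network probing)\n\n\n\n# Success story: 8-module ops platform\n- Introduction\n```shell\n\nCourse introduction\n【Course format】2000 recorded tutorial videos (continuously updated) + live Q\u0026A\n【All front-end and back-end code of the 8-module golang ops platform, implemented single-handedly with golang+vue3】\n【40,000 lines of back-end golang code】【60+ mysql tables】\n【Details of the 8 modules】\nModule 01 - front-end and back-end foundation\nModule 02 - service tree and CMDB\nModule 03 - self-service tickets\nModule 04 - task execution center - grpc-server/agent\nModule 05 - prometheus monitoring platform\nModule 06 - k8s multi-cluster and APP management\nModule 07 - cicd platform and canary releases\nModule 08 - database and SQL management platform\n----\nPrerequisites: golang basics; no front-end experience required\n```\n- ![image](./pic/飞书卡片.png)\n- ![img.png](pic/8模块菜单01.png)\n- ![img.png](pic/cicd-灰度发布-多分支泳道.png)\n\n# Success story: gpuOnk8s, volcano, AI training and inference\n```shell\n【Let ops ride the AI large-model wave】: gpuOnk8s hands-on setup and go development, source code walkthrough, principles explained\n【Message me privately if interested】\ngpuOnk8s: gpu virtualization, gpu monitoring, gpu management across multiple k8s clusters, automatic guarding against faulty gpu cards\nHigh performance: roce networking with rdma NICs\nKernel ebpf and cilium\nvolcano scheduling and gpu virtualization, dp and scheduler development\n```\n- ![img.png](pic/卖点-aiInfra-大模型-aiOnK8s-gpu-离线训练-volcano调度.png)\n\n\n\n# Success story: k8s cluster online/offline co-location and tidal scheduling\n- Overview\n```shell\n【If you are a k8s expert, use golang to build your own k8s online/offline co-location 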
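tidal scheduling component】, message me privately if interested\n- Whole-machine time-sharing, tidal scheduling\n- Dynamic resource allocation and isolation: adjust the resources allocated to offline workloads according to the load of online workloads, and enforce resource isolation policies dynamically\n- Dynamic resource awareness: easily raise cluster cpu utilization by 20 basis points\n- Top-level cgroupv2 resource envelope for co-location\n- Resource-fluctuation eviction manager\n```\n- ![img.png](pic/卖点-k8s在离混部-潮汐调度golang开发实战.png)\n- ![img.png](pic/在离混部卖点.png)\n\n\n# Success story: apisix gateway migration to retire istio\n- ![img.png](pic/apisix网关.png)\n```shell\n【If you are a k8s expert, use golang to build your own ingress controller and cluster gateway】\n\n- The goal is not to build an extremely powerful controller: (degradation, multi-branch lanes, integration with canary/blue-green releases) is fairly difficult\n- It is more about helping you understand the ingress controller workflow at the go source-code level\n- This helps when troubleshooting low-level istio/apisix issues\n\n\n【Mastered the source code of gateways such as ingress/istio/apisix but still cannot】\n# Why the traffic gateway matters for a typical company's k8s cluster\n- When the business model is online services, building the traffic gateway is a priority\n# Core aspects of a gateway\n- Website traffic entry point, http/grpc traffic\n- Control-plane configuration, how to integrate with service discovery\n- Forwarding rules, degradation, multi-branch lanes\n- Integration with canary/blue-green releases\n- ingress-nginx/istio source code analysis\n\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fning1875%2Fk8ssolutions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fning1875%2Fk8ssolutions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fning1875%2Fk8ssolutions/lists"}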