{"id":26896495,"url":"https://github.com/withlin/veps","last_synced_at":"2026-01-11T22:03:25.669Z","repository":{"id":282806411,"uuid":"949713253","full_name":"withlin/veps","owner":"withlin","description":"VictoriaMetrics Enhancement Proposals","archived":false,"fork":false,"pushed_at":"2025-12-28T06:24:55.000Z","size":90,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-30T13:45:11.244Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/withlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-17T02:38:31.000Z","updated_at":"2025-12-28T06:24:58.000Z","dependencies_parsed_at":"2025-03-17T05:15:30.549Z","dependency_job_id":null,"html_url":"https://github.com/withlin/veps","commit_stats":null,"previous_names":["withlin/veps"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/withlin/veps","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fveps","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fveps/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fveps/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fveps/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/withlin","download
_url":"https://codeload.github.com/withlin/veps/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fveps/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28324848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-11T18:42:50.174Z","status":"ssl_error","status_checked_at":"2026-01-11T18:39:13.842Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-01T03:57:30.631Z","updated_at":"2026-01-11T22:03:25.653Z","avatar_url":"https://github.com/withlin.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# VictoriaMetrics Enhancement Proposals(VEPs): Automation Kubernetes Monitoring for vmagent\n\n**Status**: Proposed  \n**Version**: v1  \n**Last Updated**: 2025-03-17\n\n## Table of Contents\n\n- [VictoriaMetrics Enhancement Proposals(VEPs): Automation Kubernetes Monitoring for vmagent](#victoriametrics-enhancement-proposalsveps-automation-kubernetes-monitoring-for-vmagent)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n  - [Motivation](#motivation)\n  - [Goals](#goals)\n  - [Non-Goals](#non-goals)\n  - [Proposal](#proposal)\n    - [Architecture Design](#architecture-design)\n    - [Command Line Parameters Design](#command-line-parameters-design)\n    - [Built-in Collectors 
Design](#built-in-collectors-design)\n      - [1. Node Metrics Collector](#1-node-metrics-collector)\n      - [2. Container Metrics Collector](#2-container-metrics-collector)\n      - [3. Kubernetes State Metrics Collector](#3-kubernetes-state-metrics-collector)\n      - [4. Application Auto-Discovery](#4-application-auto-discovery)\n    - [Implementation](#implementation)\n      - [Collector Interface](#collector-interface)\n      - [vmagent Main Configuration](#vmagent-main-configuration)\n      - [Node Collector Implementation](#node-collector-implementation)\n      - [Container Collector Implementation](#container-collector-implementation)\n      - [Kubernetes State Collector Implementation](#kubernetes-state-collector-implementation)\n      - [Auto-Discovery Implementation](#auto-discovery-implementation)\n  - [Risks and Mitigations](#risks-and-mitigations)\n  - [Implementation Progress](#implementation-progress)\n  - [Test Plan](#test-plan)\n    - [Prerequisite testing updates](#prerequisite-testing-updates)\n    - [Unit tests](#unit-tests)\n    - [Integration tests](#integration-tests)\n    - [e2e tests](#e2e-tests)\n  - [Graduation Criteria](#graduation-criteria)\n    - [Alpha](#alpha)\n    - [Beta](#beta)\n    - [Stable](#stable)\n  - [Deprecated](#deprecated)\n  - [Disabled](#disabled)\n  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)\n    - [Upgrade Strategy](#upgrade-strategy)\n    - [Downgrade Strategy](#downgrade-strategy)\n    - [Version Skew Strategy](#version-skew-strategy)\n  - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)\n    - [Feature Enablement and Rollback](#feature-enablement-and-rollback)\n    - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)\n    - [Monitoring Requirements](#monitoring-requirements)\n    - [Dependencies](#dependencies)\n    - [Scalability](#scalability)\n    - [Troubleshooting](#troubleshooting)\n  - [Implementation 
History](#implementation-history)\n  - [Drawbacks](#drawbacks)\n  - [Alternatives](#alternatives)\n  - [References](#references)\n\n## Overview\n\nThis proposal aims to simplify Kubernetes monitoring configuration by integrating lightweight monitoring components into vmagent, replacing complex YAML configurations with a single command-line flag. This approach enables vmagent to automatically discover and collect key metrics from Kubernetes clusters without deploying multiple standalone components, significantly reducing configuration complexity and resource consumption.\n\n## Motivation\n\nAccording to the issues described in [GitHub issue #1393](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1393), current Kubernetes monitoring faces several key challenges:\n\n1. **Complex Configuration**:\n   - Kubernetes service discovery (kubernetes_sd_config) configuration is extremely complex\n   - Typical configurations include hundreds of lines of difficult-to-understand YAML\n   - Different configurations produce different metric names and labels, making it difficult to create unified dashboards and alerting rules\n\n2. **Multiple Component Dependencies**:\n   - Requires separate deployment of kube-state-metrics\n   - Requires cadvisor\n   - Requires node-exporter\n   - Requires configuration for application metrics scraping\n\n3. **Resource Waste**: Many generated metrics are never used in dashboards or alerting rules\n\n4. **Operational Complexity**: DevOps professionals often can only copy configuration snippets from the internet without understanding how the entire system works\n\nBy providing simple command-line parameters to enable standardized Kubernetes monitoring, we can greatly simplify this process while ensuring consistent metric naming and labeling conventions.\n\n## Goals\n\n1. Provide a simple command-line flag `-promscrape.kubernetes=true` to enable Kubernetes monitoring\n2. 
Support flexible selection of monitoring components through the `-promscrape.kubernetes.collectors` parameter\n3. Integrate lightweight alternative components to eliminate dependencies on node-exporter, cadvisor, and kube-state-metrics\n4. Establish unified metric naming and labeling conventions\n5. Simplify automatic discovery and collection of application metrics\n6. Reduce collection of unnecessary metrics to lower resource consumption\n7. Provide official Kubernetes monitoring dashboards and alerting rules\n\n## Non-Goals\n\n1. Completely replicate all metrics from traditional components\n2. Support all possible Kubernetes monitoring scenarios and configuration options\n3. Replace advanced custom monitoring requirements\n\n## Proposal\n\n### Architecture Design\n\nvmagent will be deployed as a DaemonSet on each node in the Kubernetes cluster, integrating the following functionalities:\n\n1. **Node Metrics Collector**: Replaces node-exporter, collecting node-level metrics\n2. **Container Metrics Collector**: Replaces cadvisor, collecting container-level metrics\n3. **Kubernetes State Metrics Collector**: Replaces kube-state-metrics, collecting cluster object states\n4. **Application Auto-Discovery**: Automatically discovers and scrapes application metrics based on annotations\n5. 
**Unified Metrics Handling**: Standardizes metric names and labels\n\n### Command Line Parameters Design\n\n```bash\n# Core switch\n-promscrape.kubernetes=true                          # Main switch: Enable Kubernetes monitoring functionality\n\n# Collector control\n-promscrape.kubernetes.collectors=node,container,kube-state,app  # Specify enabled collectors\n\n# Node collector parameters\n-promscrape.kubernetes.node.enabled=true               # Enable node collector\n-promscrape.kubernetes.node.interval=\"15s\"             # Collection interval\n-promscrape.kubernetes.node.collectors=\"all\"           # Collect all metrics (default)\n# Or select specific collectors\n-promscrape.kubernetes.node.collectors=\"cpu,meminfo,filesystem,netdev,loadavg,diskstats\"\n\n# Container collector parameters\n-promscrape.kubernetes.container.enabled=true          # Enable container collector\n-promscrape.kubernetes.container.interval=\"15s\"        # Collection interval\n-promscrape.kubernetes.container.collectors=\"all\"      # Collect all metrics (default)\n-promscrape.kubernetes.container.useCache=true         # Use cache to improve performance\n-promscrape.kubernetes.container.workers=5             # Number of parallel worker threads\n\n# Kubernetes state collector parameters\n-promscrape.kubernetes.kube-state.enabled=true         # Enable kube-state collector\n-promscrape.kubernetes.kube-state.interval=\"30s\"       # Collection interval\n-promscrape.kubernetes.kube-state.resources=\"all\"      # Collect all resources (default)\n-promscrape.kubernetes.kube-state.namespaces=\"\"        # Limit to namespaces, empty means all\n-promscrape.kubernetes.kube-state.excludeNamespaces=\"\" # Excluded namespaces\n\n# Application auto-discovery parameters\n-promscrape.kubernetes.autoDiscover.enabled=true       # Enable auto-discovery functionality\n-promscrape.kubernetes.autoDiscover.interval=\"30s\"     # Discovery refresh 
interval\n-promscrape.kubernetes.autoDiscover.roles=\"pod,service,node,endpoints,ingress\"  # Enabled service discovery roles\n-promscrape.kubernetes.autoDiscover.pod.pathAnnotation=\"victoriametrics.com/path\"  # Metrics path annotation\n-promscrape.kubernetes.autoDiscover.pod.portAnnotation=\"victoriametrics.com/port\"  # Metrics port annotation\n-promscrape.kubernetes.autoDiscover.pod.schemeAnnotation=\"victoriametrics.com/scheme\"  # Protocol annotation\n```\n\n### Built-in Collectors Design\n\n#### 1. Node Metrics Collector\n\nA lightweight node-exporter alternative that collects key node metrics:\n\n- CPU usage\n- Memory usage\n- Filesystem space\n- Network throughput\n- System load\n\nExample metrics:\n```\nvm_node_cpu_usage_percent{cpu=\"0\", mode=\"user\"} 24.5\nvm_node_memory_bytes{type=\"used\"} 8053063680\nvm_node_memory_bytes{type=\"total\"} 16106127360\nvm_node_filesystem_bytes{device=\"/dev/sda1\", mountpoint=\"/\", fstype=\"ext4\", type=\"used\"} 52034642240\nvm_node_network_bytes_total{device=\"eth0\", direction=\"receive\"} 1234567890\nvm_node_load{period=\"1m\"} 0.42\n```\n\n#### 2. Container Metrics Collector\n\nAn efficient cadvisor alternative that collects container resource usage metrics:\n\n- Container CPU usage and limits\n- Container memory usage and limits\n- Container network usage\n- Container disk I/O\n\nExample metrics:\n```\nvm_container_cpu_usage_seconds_total{container_id=\"8af9f2\", container_name=\"nginx\", pod_name=\"web-1\", namespace=\"default\"} 15.7\nvm_container_memory_usage_bytes{container_id=\"8af9f2\", container_name=\"nginx\", pod_name=\"web-1\", namespace=\"default\", type=\"used\"} 67108864\nvm_container_memory_usage_bytes{container_id=\"8af9f2\", container_name=\"nginx\", pod_name=\"web-1\", namespace=\"default\", type=\"limit\"} 134217728\nvm_container_cpu_throttling_seconds_total{container_id=\"8af9f2\", container_name=\"nginx\", pod_name=\"web-1\", namespace=\"default\"} 0.45\n```\n\n#### 3. 
Kubernetes State Metrics Collector\n\nAn efficient kube-state-metrics alternative that collects Kubernetes object states:\n\n- Pod status and counts\n- Deployment status and replica counts\n- Node status and conditions\n- Service and endpoint information\n\nKey metrics examples:\n```\nvm_kube_pod_status{namespace=\"default\", pod=\"web-1\", phase=\"Running\", host_ip=\"10.0.0.5\", pod_ip=\"10.244.0.17\"} 1\nvm_kube_deployment_status{namespace=\"default\", deployment=\"web\", status_type=\"replicas\"} 3\nvm_kube_deployment_status{namespace=\"default\", deployment=\"web\", status_type=\"available\"} 3\nvm_kube_deployment_status{namespace=\"default\", deployment=\"web\", status_type=\"ready\"} 3\nvm_kube_node_status{node=\"worker-1\", condition=\"Ready\", status=\"true\"} 1\n```\n\n#### 4. Application Auto-Discovery\n\nAuto-discovers and scrapes application metrics based on annotations:\n\n- Uses `victoriametrics.com/scrape: \"true\"` annotation to mark scrapable Pods\n- Configures scrape details through `victoriametrics.com/path`, `victoriametrics.com/port`, `victoriametrics.com/scheme` annotations\n- Supports role-based service discovery: pod, service, node, endpoints, ingress\n\n### Implementation\n\n#### Collector Interface\n\n```go\n// pkg/kubernetes/collector/collector.go\n\npackage collector\n\nimport (\n\t\"context\"\n\t\"time\"\n\n\t\"github.com/prometheus/client_golang/prometheus\"\n)\n\n// Collector defines the interface that all Kubernetes metric collectors must implement\ntype Collector interface {\n\t// Name returns the unique name of the collector\n\tName() string\n\n\t// Description returns the description of the collector\n\tDescription() string\n\n\t// Initialize initializes the collector and registers all metrics\n\tInitialize(registry prometheus.Registerer) error\n\n\t// Start starts the collector's run loop, returning a Stop function\n\tStart(ctx context.Context) (StopFunc, error)\n\n\t// CollectorType returns the collector type (node, container, 
kube-state, app)\n\tCollectorType() string\n\n\t// SetInterval sets the collection interval\n\tSetInterval(interval time.Duration)\n}\n\n// StopFunc is used to stop the collector\ntype StopFunc func()\n\n// Config represents the configuration options for a collector\ntype Config struct {\n\t// Enabled indicates whether the collector is enabled\n\tEnabled bool\n\n\t// Interval represents the collection interval\n\tInterval time.Duration\n\n\t// CollectorsEnabled represents the specific collectors enabled (e.g., cpu, memory, etc.)\n\tCollectorsEnabled []string\n\n\t// UseCache indicates whether to use caching\n\tUseCache bool\n\n\t// Workers represents the number of worker threads\n\tWorkers int\n\n\t// Resources represents the resource types to monitor\n\tResources []string\n\n\t// Namespaces represents the namespaces to monitor\n\tNamespaces []string\n\n\t// ExcludeNamespaces represents the namespaces to exclude\n\tExcludeNamespaces []string\n}\n```\n\n#### vmagent Main Configuration\n\n```go\n// cmd/vmagent/main.go\n\npackage main\n\nimport (\n\t\"flag\"\n\t\"log\"\n\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/kubernetes\"\n)\n\nvar (\n\t// Kubernetes monitoring related flags\n\tkubernetesEnabled = flag.Bool(\"promscrape.kubernetes\", false, \"Whether to enable Kubernetes monitoring\")\n\t\n\tkubernetesCollectors = flag.String(\"promscrape.kubernetes.collectors\", \"node,container,kube-state,app\", \n\t\t\"List of collectors to enable, comma-separated. 
Available values: node, container, kube-state, app, all\")\n)\n\nfunc main() {\n\t// Parse command line arguments\n\tflag.Parse()\n\n\t// Other vmagent initialization code...\n\n\t// If Kubernetes monitoring is enabled, initialize the Kubernetes monitoring module\n\tif *kubernetesEnabled {\n\t\tlog.Printf(\"Enabling Kubernetes monitoring with collectors: %s\", *kubernetesCollectors)\n\t\t\n\t\t// Create Kubernetes monitoring manager\n\t\tk8sMgr, err := kubernetes.NewManager(*kubernetesCollectors)\n\t\tif err != nil {\n\t\t\tlog.Fatalf(\"Unable to initialize Kubernetes monitoring: %s\", err)\n\t\t}\n\t\t\n\t\t// Start Kubernetes monitoring\n\t\tif err := k8sMgr.Start(); err != nil {\n\t\t\tlog.Fatalf(\"Unable to start Kubernetes monitoring: %s\", err)\n\t\t}\n\t\t\n\t\t// Stop Kubernetes monitoring when the program exits\n\t\tdefer k8sMgr.Stop()\n\t}\n\n\t// Other vmagent code...\n}\n```\n\n#### Node Collector Implementation\n\n```go\n// pkg/kubernetes/node/collector.go\n\npackage node\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"sync\"\n\t\"time\"\n\n\t\"github.com/prometheus/client_golang/prometheus\"\n\t\"github.com/shirou/gopsutil/v3/cpu\"\n\t\"github.com/shirou/gopsutil/v3/disk\"\n\t\"github.com/shirou/gopsutil/v3/mem\"\n\t\"github.com/shirou/gopsutil/v3/load\"\n\t\"github.com/shirou/gopsutil/v3/net\"\n\t\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/kubernetes/collector\"\n)\n\n// Collector implements node metrics collection\ntype Collector struct {\n\t// Context and lifecycle management\n\tctx        context.Context\n\tcancel     context.CancelFunc\n\twg         sync.WaitGroup\n\tinterval   time.Duration\n\t\n\t// Performance optimization\n\tmetricCache  map[string]float64\n\tmutex        sync.RWMutex\n\t\n\t// Enabled collectors\n\tenabledCollectors map[string]bool\n\t\n\t// Metric definitions\n\tcpuUsage      *prometheus.GaugeVec\n\tmemoryStats   *prometheus.GaugeVec\n\tdiskSpace     *prometheus.GaugeVec\n\tdiskIO        
*prometheus.GaugeVec\n\tnetworkIO     *prometheus.GaugeVec\n\tloadAvg       *prometheus.GaugeVec\n\t\n\t// Performance metrics\n\tscrapeDuration  prometheus.Histogram\n\tscrapeErrors    prometheus.Counter\n}\n\n// NewCollector creates a new node collector\nfunc NewCollector(config collector.Config) (*Collector, error) {\n\t// Set default collectors\n\tenabledCollectors := make(map[string]bool)\n\tif len(config.CollectorsEnabled) == 0 || containsString(config.CollectorsEnabled, \"all\") {\n\t\t// Enable all collectors by default\n\t\tenabledCollectors = map[string]bool{\n\t\t\t\"cpu\": true, \"meminfo\": true, \"filesystem\": true,\n\t\t\t\"netdev\": true, \"loadavg\": true, \"diskstats\": true,\n\t\t}\n\t} else {\n\t\t// Only enable specified collectors\n\t\tfor _, c := range config.CollectorsEnabled {\n\t\t\tenabledCollectors[c] = true\n\t\t}\n\t}\n\t\n\tctx, cancel := context.WithCancel(context.Background())\n\t\n\treturn \u0026Collector{\n\t\tctx:              ctx,\n\t\tcancel:           cancel,\n\t\tinterval:         config.Interval,\n\t\tmetricCache:      make(map[string]float64),\n\t\tenabledCollectors: enabledCollectors,\n\t}, nil\n}\n\n// Name returns the collector name\nfunc (c *Collector) Name() string {\n\treturn \"node-collector\"\n}\n\n// Description returns the collector description\nfunc (c *Collector) Description() string {\n\treturn \"Collects node-level system metrics, replacing node-exporter\"\n}\n\n// CollectorType returns the collector type\nfunc (c *Collector) CollectorType() string {\n\treturn \"node\"\n}\n\n// SetInterval sets the collection interval\nfunc (c *Collector) SetInterval(interval time.Duration) {\n\tc.interval = interval\n}\n\n// Initialize initializes the collector and registers metrics\nfunc (c *Collector) Initialize(registry prometheus.Registerer) error {\n\t// Create CPU metrics\n\tc.cpuUsage = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_cpu_usage_percent\",\n\t\t\tHelp: \"CPU usage percentage by 
mode\",\n\t\t},\n\t\t[]string{\"cpu\", \"mode\"},\n\t)\n\t\n\t// Create memory metrics\n\tc.memoryStats = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_memory_bytes\",\n\t\t\tHelp: \"Memory statistics in bytes\",\n\t\t},\n\t\t[]string{\"type\"},  // types: total, used, free, cached, buffers, etc.\n\t)\n\t\n\t// Create disk space metrics\n\tc.diskSpace = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_filesystem_bytes\",\n\t\t\tHelp: \"Filesystem statistics in bytes\",\n\t\t},\n\t\t[]string{\"device\", \"mountpoint\", \"fstype\", \"type\"},  // types: size, used, free, etc.\n\t)\n\t\n\t// Create disk IO metrics\n\tc.diskIO = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_disk_io_bytes_total\",\n\t\t\tHelp: \"Total disk IO bytes\",\n\t\t},\n\t\t[]string{\"device\", \"direction\"},  // direction: read, write\n\t)\n\t\n\t// Create network IO metrics\n\tc.networkIO = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_network_bytes_total\",\n\t\t\tHelp: \"Network traffic statistics in bytes\",\n\t\t},\n\t\t[]string{\"device\", \"direction\"},  // direction: receive, transmit\n\t)\n\t\n\t// Create load metrics\n\tc.loadAvg = prometheus.NewGaugeVec(\n\t\tprometheus.GaugeOpts{\n\t\t\tName: \"vm_node_load\",\n\t\t\tHelp: \"System load average\",\n\t\t},\n\t\t[]string{\"period\"},  // period: 1m, 5m, 15m\n\t)\n\t\n\t// Create performance metrics\n\tc.scrapeDuration = prometheus.NewHistogram(\n\t\tprometheus.HistogramOpts{\n\t\t\tName:    \"vm_node_collector_scrape_duration_seconds\",\n\t\t\tHelp:    \"Node metrics collection duration\",\n\t\t\tBuckets: prometheus.DefBuckets,\n\t\t},\n\t)\n\t\n\tc.scrapeErrors = prometheus.NewCounter(\n\t\tprometheus.CounterOpts{\n\t\t\tName: \"vm_node_collector_scrape_errors_total\",\n\t\t\tHelp: \"Total number of node collector errors\",\n\t\t},\n\t)\n\t\n\t// Only register metrics for enabled collectors\n\tif c.enabledCollectors[\"cpu\"] 
{\n\t\tregistry.MustRegister(c.cpuUsage)\n\t}\n\tif c.enabledCollectors[\"meminfo\"] {\n\t\tregistry.MustRegister(c.memoryStats)\n\t}\n\tif c.enabledCollectors[\"filesystem\"] {\n\t\tregistry.MustRegister(c.diskSpace)\n\t}\n\tif c.enabledCollectors[\"diskstats\"] {\n\t\tregistry.MustRegister(c.diskIO)\n\t}\n\tif c.enabledCollectors[\"netdev\"] {\n\t\tregistry.MustRegister(c.networkIO)\n\t}\n\tif c.enabledCollectors[\"loadavg\"] {\n\t\tregistry.MustRegister(c.loadAvg)\n\t}\n\t\n\t// Always register performance metrics\n\tregistry.MustRegister(c.scrapeDuration, c.scrapeErrors)\n\t\n\treturn nil\n}\n\n// Start begins the metrics collection loop\nfunc (c *Collector) Start(ctx context.Context) (collector.StopFunc, error) {\n\t// Derive a cancellable child context so the returned StopFunc actually stops this loop\n\tc.ctx, c.cancel = context.WithCancel(ctx)\n\t\n\tc.wg.Add(1)\n\tgo func() {\n\t\tdefer c.wg.Done()\n\t\tticker := time.NewTicker(c.interval)\n\t\tdefer ticker.Stop()\n\t\t\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase \u003c-ticker.C:\n\t\t\t\tif err := c.collect(); err != nil {\n\t\t\t\t\tc.scrapeErrors.Inc()\n\t\t\t\t}\n\t\t\tcase \u003c-c.ctx.Done():\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}()\n\t\n\treturn func() {\n\t\tc.cancel()\n\t\tc.wg.Wait()\n\t}, nil\n}\n\n// collect performs a complete metrics collection\nfunc (c *Collector) collect() error {\n\tstart := time.Now()\n\tdefer func() {\n\t\tc.scrapeDuration.Observe(time.Since(start).Seconds())\n\t}()\n\t\n\t// Use multiple goroutines to collect different types of metrics in parallel\n\tvar wg sync.WaitGroup\n\terrCh := make(chan error, 6)  // One slot per sub-collector so sends never block before wg.Wait returns\n\t\n\t// Collect CPU metrics\n\tif c.enabledCollectors[\"cpu\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectCPUMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// Collect memory metrics\n\tif c.enabledCollectors[\"meminfo\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectMemoryMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// 
Collect filesystem metrics\n\tif c.enabledCollectors[\"filesystem\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectFilesystemMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// Collect disk IO metrics\n\tif c.enabledCollectors[\"diskstats\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectDiskIOMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// Collect network metrics\n\tif c.enabledCollectors[\"netdev\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectNetworkMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// Collect load metrics\n\tif c.enabledCollectors[\"loadavg\"] {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tif err := c.collectLoadMetrics(); err != nil {\n\t\t\t\terrCh \u003c- err\n\t\t\t}\n\t\t}()\n\t}\n\t\n\t// Wait for all collection tasks to complete\n\twg.Wait()\n\tclose(errCh)\n\t\n\t// Check for errors\n\tvar lastErr error\n\tfor err := range errCh {\n\t\tlastErr = err\n\t}\n\t\n\treturn lastErr\n}\n\n// collectCPUMetrics collects CPU-related metrics\nfunc (c *Collector) collectCPUMetrics() error {\n\t// Get per-CPU utilization; interval 0 compares against the previous call without blocking\n\tcpuPercents, err := cpu.PercentWithContext(c.ctx, 0, true)\n\tif err != nil {\n\t\treturn err\n\t}\n\t\n\t// Update metrics in batch.\n\t// NOTE: cpu.Percent reports overall busy time per CPU, not a per-mode split, so the\n\t// \"user\" label is a placeholder; a per-mode breakdown would need cpu.TimesWithContext.\n\tfor i, percent := range cpuPercents {\n\t\tcpuID := fmt.Sprintf(\"%d\", i) // matches the documented label form, e.g. cpu=\"0\"\n\t\tc.cpuUsage.WithLabelValues(cpuID, \"user\").Set(percent)\n\t}\n\t\n\treturn nil\n}\n\n// collectMemoryMetrics collects memory-related metrics\nfunc (c *Collector) collectMemoryMetrics() error {\n\tmemStats, err := mem.VirtualMemoryWithContext(c.ctx)\n\tif err != nil {\n\t\treturn 
err\n\t}\n\t\n\tc.memoryStats.WithLabelValues(\"total\").Set(float64(memStats.Total))\n\tc.memoryStats.WithLabelValues(\"used\").Set(float64(memStats.Used))\n\tc.memoryStats.WithLabelValues(\"free\").Set(float64(memStats.Free))\n\tc.memoryStats.WithLabelValues(\"cached\").Set(float64(memStats.Cached))\n\tc.memoryStats.WithLabelValues(\"buffers\").Set(float64(memStats.Buffers))\n\t\n\treturn nil\n}\n\n// Other collector methods omitted for brevity...\n\n// containsString checks if a string slice contains a specific string\nfunc containsString(slice []string, s string) bool {\n\tfor _, item := range slice {\n\t\tif item == s {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n```\n\n#### Container Collector Implementation\n\n```go\n// pkg/kubernetes/container/collector.go\n\npackage container\n\nimport (\n\t\"context\"\n\t\"sync\"\n\t\"time\"\n\t\n\t\"github.com/prometheus/client_golang/prometheus\"\n\t\"github.com/docker/docker/client\"\n\t\"github.com/docker/docker/api/types\"\n\t\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/kubernetes/collector\"\n)\n\n// Collector implements container metrics collection\ntype Collector struct {\n\t// Context and lifecycle management\n\tctx        context.Context\n\tcancel     context.CancelFunc\n\twg         sync.WaitGroup\n\tinterval   time.Duration\n\t\n\t// Client connections\n\tdockerClient *client.Client\n\t\n\t// Cache settings\n\tuseCache      bool\n\tcontainerCache map[string]*containerInfo\n\tmutex          sync.RWMutex\n\t\n\t// Worker count\n\tworkers int\n\t\n\t// Enabled collectors\n\tenabledCollectors map[string]bool\n\t\n\t// Metric definitions\n\tcpuUsage        *prometheus.GaugeVec\n\tcpuThrottling   *prometheus.GaugeVec\n\tmemoryUsage     *prometheus.GaugeVec\n\tmemoryFailures  *prometheus.GaugeVec\n\tnetworkUsage    *prometheus.GaugeVec\n\tdiskIO          *prometheus.GaugeVec\n\tcontainerState  *prometheus.GaugeVec\n\t\n\t// Performance metrics\n\tscrapeDuration    prometheus.Histogram\n\tscrapeErrors   
   prometheus.Counter\n\tcontainersScraped prometheus.Gauge\n}\n\n// containerInfo holds cached container metadata\ntype containerInfo struct {\n\tID          string\n\tName        string\n\tPodName     string\n\tNamespace   string\n\tLastSeen    time.Time\n\tLabels      map[string]string\n}\n\n// Implementation details omitted for brevity...\n```\n\n#### Kubernetes State Collector Implementation\n\n```go\n// pkg/kubernetes/kube-state/collector.go\n\npackage kubestate\n\nimport (\n\t\"context\"\n\t\"sync\"\n\t\"time\"\n\t\n\t\"github.com/prometheus/client_golang/prometheus\"\n\tmetav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\n\t\"k8s.io/client-go/kubernetes\"\n\t\"k8s.io/client-go/rest\"\n\t\"k8s.io/client-go/tools/cache\"\n\t\"k8s.io/client-go/util/workqueue\"\n\t\"k8s.io/client-go/informers\"\n\tcorev1 \"k8s.io/api/core/v1\"\n\tappsv1 \"k8s.io/api/apps/v1\"\n\t\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/kubernetes/collector\"\n)\n\n// Collector implements Kubernetes state metrics collection\ntype Collector struct {\n\t// Context and lifecycle management\n\tctx        context.Context\n\tcancel     context.CancelFunc\n\twg         sync.WaitGroup\n\tinterval   time.Duration\n\t\n\t// Kubernetes client and caches\n\tclient          kubernetes.Interface\n\tinformerFactory informers.SharedInformerFactory\n\t\n\t// Work queue\n\tworkqueue  workqueue.RateLimitingInterface\n\tworkers    int\n\t\n\t// Monitored resource types\n\tresources        map[string]bool\n\tnamespaces       []string\n\texcludeNamespaces []string\n\t\n\t// Metric definitions\n\tpodMetrics       *prometheus.GaugeVec\n\tdeploymentMetrics *prometheus.GaugeVec\n\tnodeMetrics      *prometheus.GaugeVec\n\tserviceMetrics   *prometheus.GaugeVec\n\tpvcMetrics       *prometheus.GaugeVec\n\t\n\t// Performance metrics\n\tscrapeLatency    prometheus.Histogram\n\tapiErrors        prometheus.Counter\n\tresourcesScraped *prometheus.CounterVec  // pointer, as returned by prometheus.NewCounterVec\n}\n\n// Implementation details omitted for 
brevity...\n```\n\n#### Auto-Discovery Implementation\n\n```go\n// pkg/kubernetes/autodiscover/collector.go\n\npackage autodiscover\n\nimport (\n\t\"context\"\n\t\"sync\"\n\t\"time\"\n\t\n\t\"github.com/prometheus/client_golang/prometheus\"\n\tmetav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\n\t\"k8s.io/client-go/kubernetes\"\n\t\"k8s.io/client-go/rest\"\n\t\"k8s.io/apimachinery/pkg/util/yaml\"\n\t\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/kubernetes/collector\"\n\t\"github.com/VictoriaMetrics/VictoriaMetrics/pkg/promscrape\"\n)\n\n// Collector implements Kubernetes application auto-discovery\ntype Collector struct {\n\t// Context and lifecycle management\n\tctx        context.Context\n\tcancel     context.CancelFunc\n\twg         sync.WaitGroup\n\tinterval   time.Duration\n\t\n\t// Kubernetes client\n\tclient     kubernetes.Interface\n\t\n\t// Configuration\n\troles             []string\n\tpodConfig         RoleConfig\n\tserviceConfig     RoleConfig\n\tnodeConfig        RoleConfig\n\tendpointsConfig   RoleConfig\n\tingressConfig     RoleConfig\n\t\n\t// Performance metrics\n\tdiscoveredTargets *prometheus.GaugeVec  // pointer, as returned by prometheus.NewGaugeVec\n\tscrapeErrors      prometheus.Counter\n\tscrapeLatency     prometheus.Histogram\n}\n\n// Implementation details omitted for brevity...\n```\n\n## Risks and Mitigations\n\n| Risk | Severity | Mitigation |\n|------|----------|------------|\n| Incomplete Metrics Support | Medium | Prioritize implementing the most commonly used metrics and establish a clear path for feature extension |\n| Performance Overhead | Medium | Optimize through caching and parallel processing; provide parameters to adjust worker counts and intervals |\n| Compatibility with Existing Configurations | High | Maintain support for traditional configuration; allow using both approaches simultaneously |\n| API Permission Requirements | Medium | Clearly document RBAC permission requirements; provide minimal permission templates |\n| Cluster Version Compatibility | Low | Test 
against Kubernetes 1.16+, document version compatibility |\n\n## Implementation Progress\n\n**Phase 1 - Basic Framework** (1-2 weeks)\n- [x] Design collector interfaces\n- [ ] Implement main command-line parameters\n- [ ] Create collector manager framework\n\n**Phase 2 - Core Collectors** (2-3 weeks)\n- [ ] Implement node metrics collector\n- [ ] Implement container metrics collector\n- [ ] Implement Kubernetes state collector\n- [ ] Implement application auto-discovery\n\n**Phase 3 - Integration and Testing** (1-2 weeks)\n- [ ] Integrate all collectors into vmagent\n- [ ] Create tests for various cluster types\n- [ ] Benchmark and performance optimization\n\n**Phase 4 - Documentation and Release** (1 week)\n- [ ] Create detailed documentation\n- [ ] Create example dashboards\n- [ ] Prepare release plan\n\n## Test Plan\n\n### Prerequisite testing updates\n\nBefore implementing this feature, we need to establish baseline benchmarks for performance comparison:\n\n- Resource usage (CPU, memory) of current vmagent without Kubernetes monitoring\n- Collection latency with traditional configurations using node-exporter, cadvisor, and kube-state-metrics\n- Coverage of metrics in traditional setup\n- Reliability metrics (error rates, collection failures)\n\n### Unit tests\n\nUnit tests will cover:\n\n- Configuration parsing for various collector parameters\n- Metric registration and deregistration\n- Collector lifecycle management (initialization, start, stop)\n- Validity of generated metric names and labels\n- Error handling in collection code paths\n- Concurrency safety of collectors\n\nEach collector will have dedicated test suites:\n\n- Node collector: Tests for CPU, memory, filesystem, network metrics\n- Container collector: Tests for container metadata, resource usage metrics\n- Kubernetes state collector: Tests for object state tracking, API interactions\n- Auto-discovery: Tests for annotations parsing, role configuration\n\n### Integration tests\n\nIntegration tests 
will verify:\n\n- End-to-end metric collection in containerized environments\n- Interaction with Kubernetes API server using mock clients\n- Performance under various load conditions\n- Configuration reload and dynamic reconfiguration\n- Compatibility with various Kubernetes versions (1.16+)\n- Multiple collectors working together\n\n### e2e tests\n\nEnd-to-end tests will be run on real Kubernetes clusters to validate:\n\n- Collection accuracy compared to traditional tools\n- Resource consumption (should be lower than combined tools)\n- Scalability with large cluster sizes (100+ nodes)\n- Fault tolerance (node failures, API server unavailability)\n- Upgrade and downgrade scenarios\n- Completeness of collected metrics for dashboard rendering\n\n## Graduation Criteria\n\n### Alpha\n\nAlpha release requirements:\n\n- Complete implementation of all core collectors\n- Basic documentation and usage examples\n- Functioning end-to-end on standard Kubernetes environments\n- Unit test coverage \u003e70%\n- Performance at least equal to traditional setup\n- Support for the most common metrics (80/20 rule)\n- Clearly documented limitations and known issues\n\n### Beta\n\nBeta release requirements:\n\n- Successfully running in at least 3 production environments\n- Comprehensive documentation, including troubleshooting guides\n- Performance optimizations complete\n- Unit test coverage \u003e85%\n- Integration and e2e test coverage \u003e60%\n- Dashboard templates available for major observability platforms\n- Alert rule templates available\n- No known critical bugs\n- Graceful degradation under error conditions\n\n### Stable\n\nStable release requirements:\n\n- Production usage for 3+ months without major issues\n- Complete documentation, including reference dashboards and best practices\n- Performance benchmarks showing improvement over traditional setup\n- Unit test coverage \u003e90%, integration and e2e test coverage \u003e75%\n- Telemetry for usage and error reporting\n- 
Compatibility with all supported Kubernetes versions\n- Verified upgrade path from beta\n- Feature complete for targeted use cases\n\n## Deprecated\n\nFeatures that will be deprecated as part of this implementation:\n\n- Complex manual kubernetes_sd_config configurations (replaced by simplified parameters)\n- Manual service discovery configuration for common Kubernetes components\n- Direct dependencies on external monitoring components (node-exporter, cadvisor, kube-state-metrics)\n\nThe traditional configuration approach will still be available but marked as legacy in documentation.\n\n## Disabled\n\nThe feature will be disabled by default and requires explicit opt-in via the `-promscrape.kubernetes=true` flag. \n\nIndividual collectors can be enabled or disabled through the `-promscrape.kubernetes.collectors` parameter.\n\n## Upgrade / Downgrade Strategy\n\n### Upgrade Strategy\n\nFor upgrades to versions with this feature:\n\n1. Deploy new vmagent version with Kubernetes monitoring disabled\n2. Verify normal operation\n3. Enable Kubernetes monitoring with only non-critical collectors (node, container)\n4. Validate collected metrics and dashboard rendering\n5. Enable remaining collectors\n6. Once validated, remove redundant components (node-exporter, cadvisor, kube-state-metrics)\n\n### Downgrade Strategy\n\nFor downgrades from versions with this feature:\n\n1. Deploy traditional monitoring components alongside vmagent\n2. Verify they're working correctly\n3. Disable Kubernetes monitoring in vmagent\n4. Downgrade vmagent version\n\n### Version Skew Strategy\n\nIn environments with multiple vmagent versions:\n\n- Ensure metrics naming consistency with configuration\n- Use metric relabeling to harmonize differences if needed\n- Maintain backward compatibility in metric naming where possible\n- Document potential conflicts and mitigation\n\n## Production Readiness Review Questionnaire\n\n### Feature Enablement and Rollback\n\n1. 
**How can this feature be enabled / disabled in a live cluster?**\n   - Feature gate: `-promscrape.kubernetes=true|false`\n   - Other flags: Individual collectors can be enabled or disabled through `-promscrape.kubernetes.collectors`\n   - Can be changed at runtime? No, changing it requires restarting vmagent\n\n2. **Does enabling the feature change any default behavior?**\n   - Yes, it adds automatic discovery and collection of Kubernetes metrics\n   - No changes when disabled\n\n3. **Can the feature be disabled once it has been enabled?**\n   - Yes, by setting `-promscrape.kubernetes=false`\n   - Requires restarting vmagent\n\n4. **What happens if we disable the feature while it's in use?**\n   - Kubernetes metrics will no longer be collected\n   - Metrics already in storage will remain until the retention period expires\n\n5. **Are there any prerequisites for enabling this feature?**\n   - RBAC permissions for vmagent to access the Kubernetes API\n   - Access to container runtime statistics\n   - Access to the node filesystem for node metrics\n\n### Rollout, Upgrade and Rollback Planning\n\n1. **How can an operator determine if the feature is in use?**\n   - Check for the presence of vm_* metrics from the collectors\n   - Look for log messages indicating Kubernetes monitoring is enabled\n   - Monitor resource usage patterns typical of active collectors\n\n2. **How can an operator determine if the feature is enabled but not in use?**\n   - Monitor for log messages about failed initialization\n   - Check for the absence of expected metrics despite enabling the feature\n   - Verify error counters in vmagent's own metrics\n\n3. **What are the SLIs for this feature?**\n   - Latency: Metric collection duration\n   - Availability: Percentage of successful scrapes\n   - Errors: Rate of collection errors\n   - Resource usage: CPU and memory consumption\n\n4. 
**What are reasonable SLOs for the above SLIs?**\n   - Latency: 99% of scrapes complete within 5s\n   - Availability: 99.9% successful scrapes\n   - Errors: \u003c0.1% error rate\n   - Resource usage: \u003c200MB base + 2MB per node\n\n5. **Are there any known limitations?**\n   - Not all metrics from traditional exporters will be available\n   - Performance may degrade in very large clusters (1000+ nodes)\n   - Some specialized metrics require custom configuration\n\n### Monitoring Requirements\n\n1. **How can an operator monitor the feature?**\n   - Monitor vmagent's own metrics for collector performance\n   - Watch for collection errors in logs\n   - Monitor resource usage of vmagent pods\n   - Check for expected metric presence and freshness\n\n2. **What are the reasonable alerting thresholds?**\n   - Alert on \u003e5% error rate in collection\n   - Alert on persistent absence of critical metrics\n   - Alert on collection latency exceeding 10s\n   - Alert on vmagent resource saturation\n\n3. **Are there any missing metrics that would be useful?**\n   - Per-collector success/failure rates\n   - API request counts and latencies\n   - Metric cardinality statistics\n   - Cache hit/miss rates\n\n### Dependencies\n\n1. **Does this feature depend on any specific services running in the cluster?**\n   - Kubernetes API server\n   - Kubelet on each node\n   - Container runtime with stats API\n\n2. **Does this feature depend on any other features?**\n   - General Prometheus scraping functionality in vmagent\n   - Service account and RBAC support in Kubernetes\n\n3. **Does this feature make use of any API Extensions?**\n   - No new API extensions are required\n\n### Scalability\n\n1. **Will enabling this feature result in any new API calls?**\n   - Yes, calls to Kubernetes API for listing and watching resources\n   - Kubelet API calls for container stats\n   - Filesystem access for node stats\n\n2. 
**Will enabling this feature result in introducing new API types?**\n   - No new API types are introduced\n\n3. **Will enabling this feature result in any new calls to cloud provider?**\n   - No direct cloud provider API calls\n\n4. **Will enabling this feature result in increasing size or count of the existing API objects?**\n   - No change to API object size\n   - No creation of additional API objects\n\n5. **Will enabling this feature result in increasing time taken by any operations?**\n   - Startup time of vmagent will increase slightly\n   - No impact on Kubernetes control plane operations\n\n6. **Will enabling this feature result in any new cardinality of metrics?**\n   - Yes, new metrics with label dimensions for Kubernetes objects\n   - Controlled through resource selection and namespaces filtering\n\n### Troubleshooting\n\n1. **How does this feature react if the API server and/or etcd is unavailable?**\n   - Falls back to cached data for previously discovered resources\n   - Continues collecting metrics that don't require API server\n   - Logs errors and increments error counters\n   - Retries with backoff for API server operations\n\n2. **What are other known failure modes?**\n   - Insufficient permissions: Logs authorization errors\n   - Resource constraints: Collection slows or fails under resource pressure\n   - Configuration errors: Logs parsing errors and uses defaults\n   - Container runtime API changes: May fail to collect container metrics\n\n3. **What steps should be taken if SLOs are not being met?**\n   - Check vmagent logs for specific error messages\n   - Verify RBAC permissions are correct\n   - Consider reducing enabled collectors or increasing resources\n   - Check for Kubernetes API server performance issues\n   - Reduce collection frequency if necessary\n\n## Implementation History\n\n- **2025-03-17**: Initial VEP draft created\n\n## Drawbacks\n\nPotential drawbacks of this approach include:\n\n1. 
**Increased complexity in vmagent**: Adding built-in collectors increases code complexity and maintenance burden\n2. **Potential resource usage**: While more efficient than running multiple components, vmagent still requires resources on each node\n3. **Less flexibility**: The simplified approach may not cover all custom monitoring scenarios\n4. **More permissions required**: vmagent needs broader permissions to collect all metrics\n5. **Consistency challenges**: Metric names and labels may be hard to keep consistent during version transitions\n\n## Alternatives\n\nAlternative approaches that were considered:\n\n1. **Operator-based approach**: Create a Kubernetes operator to manage monitoring components\n   - Pros: Declarative configuration, managed lifecycle\n   - Cons: Another component to maintain, doesn't simplify collection\n\n2. **Push-based approach**: Have Kubernetes components push metrics to VictoriaMetrics\n   - Pros: Reduced scrape complexity, potentially lower latency\n   - Cons: Counter to the Prometheus pull model, requires changes to components\n\n## References\n\n1. [GitHub issue #1393: Automatically discover and scrape Prometheus targets in Kubernetes](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1393)\n2. [Prometheus Documentation: kubernetes_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config)\n3. [Kubernetes Documentation: Monitoring Architecture](https://kubernetes.io/docs/concepts/cluster-administration/monitoring/)\n4. [Node Exporter GitHub Repository](https://github.com/prometheus/node_exporter)\n5. [cAdvisor GitHub Repository](https://github.com/google/cadvisor)\n6. [kube-state-metrics GitHub Repository](https://github.com/kubernetes/kube-state-metrics)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwithlin%2Fveps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwithlin%2Fveps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwithlin%2Fveps/lists"}