{"id":50578902,"url":"https://github.com/pramodksahoo/centralized-logging-eks","last_synced_at":"2026-06-05T00:30:23.583Z","repository":{"id":357641840,"uuid":"1237899969","full_name":"pramodksahoo/centralized-logging-eks","owner":"pramodksahoo","description":"Centralized Logging Platform for Amazon EKS","archived":false,"fork":false,"pushed_at":"2026-05-13T16:22:57.000Z","size":72,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-13T18:08:23.952Z","etag":null,"topics":["eks","helm","logging"],"latest_commit_sha":null,"homepage":"","language":"Go Template","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pramodksahoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T16:04:10.000Z","updated_at":"2026-05-13T16:23:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pramodksahoo/centralized-logging-eks","commit_stats":null,"previous_names":["pramodksahoo/centralized-logging-eks"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pramodksahoo/centralized-logging-eks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pramodksahoo%2Fcentralized-logging-eks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pramodksahoo%2Fcentralized-logging-eks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pramodksahoo%2
Fcentralized-logging-eks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pramodksahoo%2Fcentralized-logging-eks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pramodksahoo","download_url":"https://codeload.github.com/pramodksahoo/centralized-logging-eks/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pramodksahoo%2Fcentralized-logging-eks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33926275,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["eks","helm","logging"],"created_at":"2026-06-05T00:30:22.829Z","updated_at":"2026-06-05T00:30:23.573Z","avatar_url":"https://github.com/pramodksahoo.png","language":"Go Template","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Centralized Logging Platform for Amazon EKS\n\nA production-ready, Helm-managed centralized logging stack for Amazon EKS.\n\nThis chart deploys the complete logging pipeline:\n\n```text\nKubernetes Pods\n  -\u003e Fluent Bit DaemonSet\n  -\u003e Apache Kafka / Strimzi\n  -\u003e Vector Aggregator StatefulSet\n  -\u003e Elasticsearch hot + warm data tiers\n  -\u003e Amazon S3 archive\n  -\u003e Kibana dashboards\n  -\u003e 
ElastAlert alerts\n```\n\nRecommended Helm release name:\n\n```bash\nplatform-logging\n```\n\n---\n\n## Table of contents\n\n- [Why this setup was chosen](#why-this-setup-was-chosen)\n- [Problems with the previous approach](#problems-with-the-previous-approach)\n- [Why the Helm-based solution is better](#why-the-helm-based-solution-is-better)\n- [Architecture diagram](#architecture-diagram)\n- [Component responsibilities](#component-responsibilities)\n- [Repository and Helm chart structure](#repository-and-helm-chart-structure)\n- [Prerequisites](#prerequisites)\n- [Step-by-step Helm deployment guide](#step-by-step-helm-deployment-guide)\n- [Recommended values to review before production](#recommended-values-to-review-before-production)\n- [Verification](#verification)\n- [Kibana access](#kibana-access)\n- [Upgrade guide](#upgrade-guide)\n- [Rollback guide](#rollback-guide)\n- [Uninstall guide](#uninstall-guide)\n- [Team onboarding guide](#team-onboarding-guide)\n  - [How logs flow from your cluster to the centralized platform](#how-logs-flow-from-your-cluster-to-the-centralized-platform)\n  - [Step 1: Register your cluster with the platform team](#step-1-register-your-cluster-with-the-platform-team)\n  - [Step 2: Set up cross-cluster network connectivity](#step-2-set-up-cross-cluster-network-connectivity)\n  - [Step 3: Obtain Kafka connection credentials](#step-3-obtain-kafka-connection-credentials)\n  - [Step 4: Deploy Fluent Bit on your cluster](#step-4-deploy-fluent-bit-on-your-cluster)\n  - [Step 5: Apply required pod labels to your workloads](#step-5-apply-required-pod-labels-to-your-workloads)\n  - [Step 6: Format your application logs as structured JSON](#step-6-format-your-application-logs-as-structured-json)\n  - [Step 7: Verify your logs are flowing](#step-7-verify-your-logs-are-flowing)\n- [Application team logging standard](#application-team-logging-standard)\n- [Production sizing notes](#production-sizing-notes)\n- [Operational best 
practices](#operational-best-practices)\n- [Troubleshooting](#troubleshooting)\n\n---\n\n## Why this setup was chosen\n\nModern Kubernetes platforms with hundreds of microservices need more than simple pod log viewing. For **400+ microservices** and **10+ application teams**, the logging platform must handle high volume, bursts, search, alerting, team separation, and long-term retention.\n\nThis setup was chosen because each component has a focused responsibility:\n\n| Layer | Tool | Why it was selected |\n|---|---|---|\n| Node-level collection | Fluent Bit | Lightweight log collector that runs as a DaemonSet on every node. It tails Kubernetes container logs with low CPU and memory overhead. |\n| Buffer and decoupling | Kafka | Absorbs spikes and protects downstream systems when Elasticsearch or S3 slows down. |\n| Processing and routing | Vector | Lightweight, fast, flexible log processing layer with disk buffering and routing to multiple destinations. |\n| Hot searchable storage | Elasticsearch | Fast full-text search and filtering for recent logs. |\n| Archive storage | Amazon S3 | Low-cost long-term storage for compliance, replay, audits, and historical investigation. |\n| Visualization | Kibana | Search UI, dashboards, filters, and troubleshooting experience for engineering teams. |\n| Alerting | ElastAlert | Rule-based alerting for error spikes, anomalies, and service-level incidents. |\n| Deployment management | Helm | One release, one values file, repeatable installation, versioned upgrades, and easy rollback. 
|\n\nThe result is a platform that is:\n\n- **Lightweight** at the node level\n- **Reliable** during traffic spikes\n- **Flexible** for multiple application teams\n- **Robust** because Kafka and Vector buffers reduce log loss risk\n- **Searchable** through Elasticsearch and Kibana\n- **Cost-aware** because S3 handles long-term retention\n- **Easy to operate** because Helm tracks all resources as one release\n\n---\n\n## Problems with the previous approach\n\nThe previous approach used separate Kubernetes manifest files and shell scripts.\n\nThat works for a first implementation, but it becomes difficult to manage in a real EKS production environment.\n\n### 1. Too many manual steps\n\nRaw manifests and scripts require engineers to remember the correct order:\n\n```text\nCreate namespace\nInstall operators\nCreate secrets\nApply Kafka\nApply Elasticsearch\nApply Fluent Bit\nApply Vector\nApply Kibana\nApply ElastAlert\nRun setup jobs\nVerify everything manually\n```\n\nThis increases the chance of mistakes during production deployment.\n\n### 2. No single release tracking\n\nWhen resources are applied with `kubectl apply` and shell scripts, Kubernetes resources exist independently.\n\nIt becomes harder to answer:\n\n- Which version is currently deployed?\n- Which values were used?\n- What changed between releases?\n- Who upgraded the stack?\n- How do we safely roll back?\n\n### 3. Harder upgrades\n\nWith raw YAML, every future version upgrade requires manual edits across many files.\n\nFor example:\n\n- Kafka version upgrade\n- Strimzi API version change\n- Elasticsearch version upgrade\n- Vector image upgrade\n- Fluent Bit configuration update\n- Storage size changes\n- Retention policy changes\n\nWithout Helm, these changes are harder to review and repeat safely.\n\n### 4. 
Environment drift\n\nDevelopment, staging, and production environments can easily drift apart when each one is deployed manually.\n\nFor example:\n\n```text\ndev has 3 Kafka brokers\nstaging has 5 Kafka brokers\nprod has 5 Kafka brokers but different storage\n```\n\nA values-based Helm chart keeps one reusable template and separates environment-specific configuration into values files.\n\n### 5. Secrets and cloud settings were harder to standardize\n\nFor EKS, production S3 access should normally use IAM Roles for Service Accounts, also known as IRSA.\n\nWith raw manifests, AWS region, bucket name, role annotation, and secret behavior are spread across files. In this Helm chart, these are configured centrally through `values.yaml`.\n\n### 6. Reusability was limited\n\nThe previous bundle was useful, but not as reusable across clusters.\n\nA Helm chart can be installed repeatedly into different clusters using different values files:\n\n```text\nvalues-dev.yaml\nvalues-staging.yaml\nvalues-eks-production.yaml\n```\n\n---\n\n## Why the Helm-based solution is better\n\nHelm provides a release-based deployment model.\n\nThat means the entire centralized logging platform can be installed, upgraded, rolled back, and uninstalled as one managed release.\n\n### Benefits of this Helm chart\n\n| Benefit | Explanation |\n|---|---|\n| One command deployment | Deploy the complete logging stack using `helm upgrade --install`. |\n| One values file | All required settings are controlled from `values-eks-production.yaml`. |\n| Repeatable environments | Use the same chart for dev, staging, and prod with different values files. |\n| Easier upgrades | Change image tags, replicas, retention, storage, and versions through values. |\n| Rollback support | Use `helm rollback` if a release has issues. |\n| Release history | Use `helm history` to see past deployments. |\n| Safer reviews | Review changes using `helm diff` or `helm template` before applying. 
|\n| Less duplication | Templates avoid copy-paste across YAML files. |\n| Better GitOps fit | Works well with Argo CD, Flux, Jenkins, GitHub Actions, or GitLab CI. |\n| Cleaner ownership | The platform team can own the chart while app teams only follow logging standards. |\n\n---\n\n## Architecture diagram\n\nThe following Mermaid diagram shows both the **runtime log flow** and the **Helm deployment ownership model**.\n\n```mermaid\nflowchart TB\n  %% =========================\n  %% Deployment control plane\n  %% =========================\n  subgraph GitHub[\"GitHub Repository\"]\n    Chart[\"Helm Chart\u003cbr/\u003ecentralized-logging-eks\"]\n    Values[\"values-eks-production.yaml\u003cbr/\u003ecluster, storage, S3, replicas, retention\"]\n  end\n\n  Engineer[\"Platform Engineer\"] --\u003e Chart\n  Engineer --\u003e Values\n\n  Chart --\u003e Helm[\"Helm Release\u003cbr/\u003eplatform-logging\"]\n  Values --\u003e Helm\n\n  %% =========================\n  %% EKS cluster\n  %% =========================\n  subgraph EKS[\"Amazon EKS Cluster\"]\n    \n    subgraph LoggingNS[\"Namespace: logging\"]\n      Strimzi[\"Strimzi Operator\u003cbr/\u003eKafka CRDs\"]\n      ECK[\"ECK Operator\u003cbr/\u003eElasticsearch and Kibana CRDs\"]\n\n      Kafka[\"Kafka Cluster\u003cbr/\u003econtrollers + brokers\"]\n      TopicRaw[\"Kafka Topic\u003cbr/\u003elogs.raw\"]\n      TopicDLQ[\"Kafka Topic\u003cbr/\u003elogs.dlq\"]\n\n      FluentBit[\"Fluent Bit DaemonSet\u003cbr/\u003eone pod per node\"]\n      Vector[\"Vector Aggregator StatefulSet\u003cbr/\u003edisk buffer + routing\"]\n\n      ES[\"Elasticsearch\u003cbr/\u003e3 master + hot data + warm data\"]\n      Kibana[\"Kibana\u003cbr/\u003edashboards and log search\"]\n\n      ElastAlert[\"ElastAlert\u003cbr/\u003eerror spike rules\"]\n      IndexJob[\"Elasticsearch Index Setup Job\u003cbr/\u003eILM + template\"]\n    end\n\n    subgraph AppNamespaces[\"Application Namespaces\"]\n      App1[\"Application Team 
A\u003cbr/\u003emicroservices\"]\n      App2[\"Application Team B\u003cbr/\u003emicroservices\"]\n      AppN[\"Application Team N\u003cbr/\u003e400+ services\"]\n    end\n\n    NodeLogs[\"/var/log/containers/*.log\u003cbr/\u003estdout and stderr\"]\n  end\n\n  %% =========================\n  %% AWS Services\n  %% =========================\n  subgraph AWS[\"AWS Services\"]\n    S3[\"Amazon S3\u003cbr/\u003elong-term archive\"]\n    IAM[\"IAM Role for Service Account\u003cbr/\u003eIRSA\"]\n  end\n\n  %% =========================\n  %% Connections\n  %% =========================\n  Helm --\u003e Strimzi\n  Helm --\u003e ECK\n  Helm --\u003e FluentBit\n  Helm --\u003e Vector\n  Helm --\u003e Kibana\n  Helm --\u003e ElastAlert\n  Helm --\u003e IndexJob\n\n  Strimzi --\u003e Kafka\n  Kafka --\u003e TopicRaw\n  Kafka --\u003e TopicDLQ\n  ECK --\u003e ES\n  ECK --\u003e Kibana\n\n  App1 --\u003e NodeLogs\n  App2 --\u003e NodeLogs\n  AppN --\u003e NodeLogs\n\n  NodeLogs --\u003e FluentBit\n  FluentBit --\u003e TopicRaw\n  TopicRaw --\u003e Vector\n\n  Vector --\u003e ES\n  Vector --\u003e S3\n  Vector -. 
failed or special routes .-\u003e TopicDLQ\n\n  ES --\u003e Kibana\n  ES --\u003e ElastAlert\n\n  IAM --\u003e Vector\n\n  %% =========================\n  %% Styling\n  %% =========================\n\n  %% Main sections\n  style GitHub fill:#1e293b,stroke:#94a3b8,stroke-width:2px,color:#ffffff\n  style EKS fill:#0f172a,stroke:#38bdf8,stroke-width:2px,color:#ffffff\n  style AWS fill:#1e1b4b,stroke:#a78bfa,stroke-width:2px,color:#ffffff\n\n  %% Logging namespace\n  style LoggingNS fill:#111827,stroke:#06b6d4,stroke-width:2px,color:#ffffff\n  style AppNamespaces fill:#1f2937,stroke:#10b981,stroke-width:2px,color:#ffffff\n\n  %% GitOps + deployment\n  style Engineer fill:#2563eb,stroke:#bfdbfe,stroke-width:2px,color:#ffffff\n  style Chart fill:#334155,stroke:#cbd5e1,stroke-width:2px,color:#ffffff\n  style Values fill:#334155,stroke:#cbd5e1,stroke-width:2px,color:#ffffff\n  style Helm fill:#7c3aed,stroke:#ddd6fe,stroke-width:2px,color:#ffffff\n\n  %% Operators\n  style Strimzi fill:#9333ea,stroke:#f3e8ff,stroke-width:2px,color:#ffffff\n  style ECK fill:#9333ea,stroke:#f3e8ff,stroke-width:2px,color:#ffffff\n\n  %% Kafka pipeline\n  style Kafka fill:#ea580c,stroke:#ffedd5,stroke-width:2px,color:#ffffff\n  style TopicRaw fill:#f97316,stroke:#ffedd5,stroke-width:2px,color:#ffffff\n  style TopicDLQ fill:#dc2626,stroke:#fecaca,stroke-width:2px,color:#ffffff\n\n  %% Log collection\n  style FluentBit fill:#0891b2,stroke:#cffafe,stroke-width:2px,color:#ffffff\n  style Vector fill:#0f766e,stroke:#ccfbf1,stroke-width:2px,color:#ffffff\n  style NodeLogs fill:#475569,stroke:#cbd5e1,stroke-width:2px,color:#ffffff\n\n  %% Elasticsearch stack\n  style ES fill:#16a34a,stroke:#dcfce7,stroke-width:2px,color:#ffffff\n  style Kibana fill:#ca8a04,stroke:#fef9c3,stroke-width:2px,color:#ffffff\n  style ElastAlert fill:#b91c1c,stroke:#fee2e2,stroke-width:2px,color:#ffffff\n  style IndexJob fill:#4f46e5,stroke:#c7d2fe,stroke-width:2px,color:#ffffff\n\n  %% Applications\n  style App1 
fill:#059669,stroke:#d1fae5,stroke-width:2px,color:#ffffff\n  style App2 fill:#059669,stroke:#d1fae5,stroke-width:2px,color:#ffffff\n  style AppN fill:#059669,stroke:#d1fae5,stroke-width:2px,color:#ffffff\n\n  %% AWS\n  style S3 fill:#7c2d12,stroke:#fed7aa,stroke-width:2px,color:#ffffff\n  style IAM fill:#4338ca,stroke:#c7d2fe,stroke-width:2px,color:#ffffff\n\n  %% Link styling\n  linkStyle default stroke:#94a3b8,stroke-width:2px\n```\n\n---\n\n## Component responsibilities\n\n### Fluent Bit\n\nFluent Bit runs as a Kubernetes DaemonSet.\n\nPurpose:\n\n- Runs one pod on each node\n- Reads container logs from `/var/log/containers/*.log`\n- Enriches logs with Kubernetes metadata\n- Sends logs to Kafka\n- Keeps node-level overhead low\n\nIn this platform, Fluent Bit should stay simple. It should collect and forward logs, not perform heavy transformations.\n\n### Kafka\n\nKafka acts as the durable buffer between log collection and processing.\n\nPurpose:\n\n- Decouples producers from consumers\n- Absorbs sudden log spikes\n- Protects against downstream slowness\n- Enables replay from the raw log topic\n- Prevents Elasticsearch from being directly overloaded by all nodes\n\nMain topics:\n\n| Topic | Purpose |\n|---|---|\n| `logs.raw` | Main raw log stream from Fluent Bit |\n| `logs.dlq` | Dead-letter or failed-processing topic |\n\n### Vector Aggregator\n\nVector is the main processing and routing layer.\n\nPurpose:\n\n- Consumes logs from Kafka\n- Normalizes fields\n- Adds standard fields such as `cluster`, `namespace`, `service`, `team`, and `environment`\n- Routes logs to Elasticsearch and S3\n- Uses disk buffers to reduce log loss during downstream issues\n- Provides a lightweight alternative to Logstash\n\nVector is deployed as a StatefulSet because persistent buffers are important for reliability.\n\n### Elasticsearch\n\nElasticsearch stores recent searchable logs.\n\nPurpose:\n\n- Fast search and filtering\n- Support for Kibana dashboards\n- Support for ElastAlert 
rules\n- Hot and warm node separation for better cost and performance control\n\nThis chart creates separate node sets:\n\n| Node set | Purpose |\n|---|---|\n| Masters | Dedicated cluster management nodes |\n| Hot data | High-speed indexing and recent log search |\n| Warm data | Older searchable logs before deletion |\n\n### Amazon S3\n\nS3 stores long-term archived logs.\n\nPurpose:\n\n- Low-cost retention\n- Audit and compliance storage\n- Replay source for future analysis\n- Separation between searchable retention and archive retention\n\nRecommended production authentication method on EKS:\n\n```text\nIRSA: IAM Role for Service Account\n```\n\n### Kibana\n\nKibana provides the UI for engineering teams.\n\nPurpose:\n\n- Search logs\n- Create dashboards\n- Filter by namespace, service, team, environment, and severity\n- Debug incidents across microservices\n\nRecommended Kibana data view:\n\n```text\nkubernetes-logs-*\n```\n\n### ElastAlert\n\nElastAlert provides rule-based alerting.\n\nPurpose:\n\n- Detect high error rates\n- Detect fatal or critical logs\n- Notify engineering teams\n- Trigger Slack, PagerDuty, email, or other alert targets when configured\n\nThe default chart includes a starter high-error-rate rule.\n\n---\n\n## Repository and Helm chart structure\n\nExpected repository layout:\n\n```text\ncentralized-logging-eks-helm/\n├── FILE_LIST.txt\n├── centralized-logging-eks-0.1.0.tgz\n└── centralized-logging-eks/\n    ├── Chart.yaml\n    ├── README.md\n    ├── values.yaml\n    ├── values-eks-production.yaml\n    └── templates/\n        ├── _helpers.tpl\n        ├── namespace.yaml\n        ├── priorityclasses.yaml\n        ├── s3-secret.yaml\n        ├── kafka.yaml\n        ├── elasticsearch.yaml\n        ├── kibana.yaml\n        ├── elasticsearch-index-job.yaml\n        ├── fluent-bit.yaml\n        ├── vector.yaml\n        ├── elastalert.yaml\n        ├── pdbs.yaml\n        ├── networkpolicies.yaml\n        ├── validate-values.yaml\n        └── 
NOTES.txt\n```\n\n### Root files\n\n| File or directory | Purpose |\n|---|---|\n| `FILE_LIST.txt` | Simple generated inventory of files included in the chart package. |\n| `centralized-logging-eks-0.1.0.tgz` | Packaged Helm chart artifact. Useful for pushing to a Helm repository or OCI registry. |\n| `centralized-logging-eks/` | Main editable Helm chart source directory. |\n\n### Chart files\n\n| File | Purpose |\n|---|---|\n| `Chart.yaml` | Helm chart metadata: chart name, version, app version, dependencies, keywords, and maintainers. |\n| `values.yaml` | Default values for all components. This is the main configuration reference. |\n| `values-eks-production.yaml` | Example production override file for Amazon EKS. Edit this before installing into your cluster. |\n| `README.md` | GitHub documentation for architecture, deployment, upgrades, and operations. |\n\n### Template files\n\n| Template | Purpose |\n|---|---|\n| `templates/_helpers.tpl` | Shared Helm helper functions for names, labels, and reusable template logic. |\n| `templates/namespace.yaml` | Optional Namespace object when `namespace.create=true`. Most teams still use `--create-namespace`. |\n| `templates/priorityclasses.yaml` | Creates priority classes for logging-critical workloads and logging agents. |\n| `templates/s3-secret.yaml` | Optionally creates the AWS credentials secret when using secret-based S3 authentication. |\n| `templates/kafka.yaml` | Creates Strimzi Kafka NodePools, Kafka cluster, and Kafka topics. |\n| `templates/elasticsearch.yaml` | Creates the ECK Elasticsearch cluster with master, hot data, and warm data node sets. |\n| `templates/kibana.yaml` | Creates the ECK Kibana resource and optional ingress. |\n| `templates/elasticsearch-index-job.yaml` | Creates ILM policy and index template for `kubernetes-logs-*`. |\n| `templates/fluent-bit.yaml` | Creates Fluent Bit ServiceAccount, RBAC, ConfigMap, and DaemonSet. 
|\n| `templates/vector.yaml` | Creates Vector ServiceAccount, ConfigMap, headless Service, StatefulSet, PVC, and optional HPA. |\n| `templates/elastalert.yaml` | Creates ElastAlert ServiceAccount, ConfigMaps, rules, and Deployment. |\n| `templates/pdbs.yaml` | Creates PodDisruptionBudgets for higher availability during node drains and upgrades. |\n| `templates/networkpolicies.yaml` | Optional NetworkPolicies when `networkPolicies.enabled=true`. |\n| `templates/validate-values.yaml` | Helm validation template to fail early when required values are missing. |\n| `templates/NOTES.txt` | Post-install Helm notes shown after installation or upgrade. |\n\n---\n\n## Prerequisites\n\n### Required local tools\n\n```bash\nhelm version\nkubectl version --client\naws --version\n```\n\n### Required EKS setup\n\nYour EKS cluster should have:\n\n- Kubernetes cluster already created\n- Worker nodes with enough CPU, memory, and disk capacity\n- EBS CSI driver installed\n- A `gp3` StorageClass, or update the chart values to match your StorageClass\n- IAM permissions to create or reference an IRSA role for S3 access\n- Network access from the EKS cluster to Amazon S3\n- Permission to install CRDs and operators, unless Strimzi and ECK are already installed separately\n\n### Recommended minimum production shape\n\nFor the default production values, use a node pool large enough for:\n\n- Kafka brokers\n- Elasticsearch hot data nodes\n- Elasticsearch warm data nodes\n- Vector aggregators\n- Kibana\n- ElastAlert\n- Fluent Bit on every node\n\nA small development cluster should reduce replicas and storage sizes before installing.\n\n---\n\n## Step-by-step Helm deployment guide\n\nThis deployment guide uses Helm for installation and release management. 
No shell scripts are required.\n\n### Step 1: Clone the repository\n\n```bash\ngit clone git@github.com:pramodksahoo/centralized-logging-eks.git centralized-logging-eks-helm\ncd centralized-logging-eks-helm\n```\n\nThe chart directory should be:\n\n```bash\ncentralized-logging-eks/\n```\n\n### Step 2: Choose the release name\n\nRecommended release name:\n\n```bash\nplatform-logging\n```\n\nRecommended namespace:\n\n```bash\nlogging\n```\n\nWhy this release name is good:\n\n- Short and clear\n- Easy to identify in `helm list`\n- Produces readable resource names\n- Works well for platform-owned infrastructure\n\nExample generated names:\n\n```text\nplatform-logging-kafka\nplatform-logging-es\nplatform-logging-vector\nplatform-logging-kibana\n```\n\n### Step 3: Review and edit the EKS values file\n\nOpen:\n\n```bash\ncentralized-logging-eks/values-eks-production.yaml\n```\n\nAt minimum, update these values:\n\n```yaml\nglobal:\n  clusterName: prod-eks\n  environment: prod\n\naws:\n  region: ap-south-1\n\ns3:\n  enabled: true\n  bucket: your-centralized-logging-bucket\n  prefix: centralized-logging\n  authMode: irsa\n  serviceAccountAnnotations:\n    eks.amazonaws.com/role-arn: arn:aws:iam::\u003caccount-id\u003e:role/platform-logging-vector-s3\n```\n\nAlso review storage settings:\n\n```yaml\nkafka:\n  controllers:\n    storageClassName: gp3\n    storageSize: 100Gi\n  brokers:\n    storageClassName: gp3\n    storageSize: 500Gi\n\nelasticsearch:\n  masters:\n    storageClassName: gp3\n    storageSize: 50Gi\n  hotData:\n    storageClassName: gp3\n    storageSize: 1Ti\n  warmData:\n    storageClassName: gp3\n    storageSize: 2Ti\n\nvector:\n  pvc:\n    storageClassName: gp3\n    storageSize: 100Gi\n```\n\n### Step 4: Choose operator installation mode\n\nThis chart can install Strimzi and ECK as Helm dependencies.\n\n#### Option A: Install operators through this chart\n\nUse this for a self-contained deployment:\n\n```yaml\noperators:\n  strimzi:\n    enabled: true\n  eck:\n    enabled: 
true\n```\n\n#### Option B: Operators already installed by platform team\n\nUse this if your organization manages operators separately:\n\n```yaml\noperators:\n  strimzi:\n    enabled: false\n  eck:\n    enabled: false\n```\n\nThe Kafka, Elasticsearch, and Kibana custom resources are still managed by this Helm release.\n\n### Step 5: Update Helm dependencies\n\nFrom the repository root:\n\n```bash\nhelm dependency update ./centralized-logging-eks\n```\n\nThis downloads chart dependencies into:\n\n```text\ncentralized-logging-eks/charts/\n```\n\n### Step 6: Render templates before installing\n\nThis is strongly recommended before the first production deployment.\n\n```bash\nhelm template platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml \\\n  \u003e rendered-platform-logging.yaml\n```\n\nReview the rendered file:\n\n```bash\nless rendered-platform-logging.yaml\n```\n\n### Step 7: Run Helm lint\n\n```bash\nhelm lint ./centralized-logging-eks \\\n  -f ./centralized-logging-eks/values-eks-production.yaml\n```\n\n### Step 8: Install the release\n\n```bash\nhelm upgrade --install platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  --create-namespace \\\n  -f ./centralized-logging-eks/values-eks-production.yaml \\\n  --wait \\\n  --timeout 30m\n```\n\n### Step 9: Check Helm release status\n\n```bash\nhelm status platform-logging -n logging\n```\n\nList releases:\n\n```bash\nhelm list -n logging\n```\n\nView values used by the release:\n\n```bash\nhelm get values platform-logging -n logging\n```\n\nView all computed values:\n\n```bash\nhelm get values platform-logging -n logging --all\n```\n\nView rendered manifests from the release:\n\n```bash\nhelm get manifest platform-logging -n logging\n```\n\n---\n\n## Recommended values to review before production\n\n### Global settings\n\n```yaml\nglobal:\n  clusterName: prod-eks\n  environment: prod\n```\n\nThese values are 
added to log events and help distinguish clusters and environments.\n\n### S3 archive\n\n```yaml\ns3:\n  enabled: true\n  bucket: your-centralized-logging-bucket\n  prefix: centralized-logging\n  authMode: irsa\n```\n\nRecommended for EKS:\n\n```yaml\ns3:\n  authMode: irsa\n```\n\nUse secret-based authentication only when IRSA is not available.\n\n### Kafka sizing\n\n```yaml\nkafka:\n  controllers:\n    replicas: 3\n  brokers:\n    replicas: 5\n  topics:\n    raw:\n      partitions: 96\n      replicas: 3\n```\n\nFor high-volume logging, Kafka partition count is important because it controls parallelism for Vector consumers.\n\n### Vector sizing\n\n```yaml\nvector:\n  replicas: 6\n  hpa:\n    enabled: true\n    minReplicas: 6\n    maxReplicas: 20\n```\n\nIncrease Vector replicas if Kafka consumer lag grows.\n\n### Elasticsearch sizing\n\n```yaml\nelasticsearch:\n  masters:\n    count: 3\n  hotData:\n    count: 6\n  warmData:\n    count: 3\n```\n\nIncrease hot data nodes if indexing latency or query latency increases.\n\n### Retention\n\n```yaml\nelasticsearch:\n  ilm:\n    hotRolloverMaxAge: 1d\n    warmMinAge: 7d\n    deleteMinAge: 30d\n```\n\nThis keeps recent logs searchable and older logs in S3 archive.\n\n---\n\n## Verification\n\nAfter installation, check the Helm release:\n\n```bash\nhelm status platform-logging -n logging\n```\n\nOptional Kubernetes checks:\n\n```bash\nkubectl get pods -n logging -o wide\nkubectl get kafka -n logging\nkubectl get kafkatopic -n logging\nkubectl get elasticsearch -n logging\nkubectl get kibana -n logging\n```\n\nExpected main workloads:\n\n```text\nFluent Bit DaemonSet\nKafka controllers and brokers\nVector Aggregator StatefulSet\nElasticsearch master, hot data, and warm data nodes\nKibana\nElastAlert\nElasticsearch index setup Job\n```\n\n---\n\n## Kibana access\n\nPort-forward Kibana:\n\n```bash\nkubectl port-forward -n logging svc/platform-logging-kibana-kb-http 
5601:5601\n```\n\nOpen:\n\n```text\nhttps://localhost:5601\n```\n\nGet the Elastic user password:\n\n```bash\nkubectl get secret platform-logging-es-es-elastic-user \\\n  -n logging \\\n  -o go-template='{{.data.elastic | base64decode}}{{\"\\n\"}}'\n```\n\nUsername:\n\n```text\nelastic\n```\n\nCreate a Kibana data view:\n\n```text\nkubernetes-logs-*\n```\n\nRecommended fields for filtering:\n\n```text\ncluster\nenvironment\nnamespace\nservice\nteam\nseverity\npod\ncontainer\ntrace_id\nspan_id\n```\n\n---\n\n## Upgrade guide\n\n### Step 1: Change values\n\nEdit:\n\n```bash\ncentralized-logging-eks/values-eks-production.yaml\n```\n\nExample changes:\n\n```yaml\nvector:\n  replicas: 8\n\nelasticsearch:\n  ilm:\n    deleteMinAge: 45d\n```\n\n### Step 2: Preview changes\n\n```bash\nhelm template platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml \\\n  \u003e rendered-upgrade.yaml\n```\n\nIf your team uses the Helm diff plugin:\n\n```bash\nhelm diff upgrade platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml\n```\n\n### Step 3: Upgrade\n\n```bash\nhelm upgrade platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml \\\n  --wait \\\n  --timeout 30m\n```\n\n### Step 4: Review release history\n\n```bash\nhelm history platform-logging -n logging\n```\n\n---\n\n## Rollback guide\n\nIf an upgrade causes issues, check the release history:\n\n```bash\nhelm history platform-logging -n logging\n```\n\nRollback to a previous revision:\n\n```bash\nhelm rollback platform-logging \u003cREVISION\u003e -n logging --wait --timeout 30m\n```\n\nExample:\n\n```bash\nhelm rollback platform-logging 2 -n logging --wait --timeout 30m\n```\n\n---\n\n## Uninstall guide\n\nUninstall the Helm release:\n\n```bash\nhelm uninstall platform-logging -n 
logging\n```\n\nImportant:\n\n- PersistentVolumeClaims may remain depending on the chart values and StorageClass reclaim policy.\n- S3 archived logs are not deleted by Helm.\n- CRDs may remain if installed by operator charts.\n- Elasticsearch data should be backed up before destructive changes.\n\nCheck remaining PVCs:\n\n```bash\nkubectl get pvc -n logging\n```\n\n---\n\n## Team onboarding guide\n\nThis guide is for application teams whose workloads are deployed in a **different EKS cluster** from the centralized logging cluster. It explains exactly what you need to do to get your pod logs flowing into the shared Elasticsearch and S3 platform.\n\nYou do not need to run your own logging stack. You only need to deploy a lightweight Fluent Bit agent on your cluster, configure it to forward logs to the central Kafka endpoint, and label your pods correctly.\n\n---\n\n### How logs flow from your cluster to the centralized platform\n\n```text\nYour EKS Cluster (spoke)\n  -\u003e Pods write logs to stdout / stderr\n  -\u003e Kubernetes writes logs to /var/log/containers/*.log on each node\n  -\u003e Fluent Bit DaemonSet (running on your cluster) tails those files\n  -\u003e Fluent Bit forwards logs over TLS to the central Kafka bootstrap endpoint\n  -\u003e Central Kafka topic: logs.raw\n  -\u003e Vector Aggregator reads from Kafka, normalises and routes logs\n  -\u003e Elasticsearch (hot/warm) for live search\n  -\u003e Amazon S3 for long-term archive\n  -\u003e Kibana for dashboards and incident investigation\n  -\u003e ElastAlert for error-rate alerts\n```\n\nFluent Bit is the only component you deploy. 
Everything downstream is managed by the platform team.\n\n```mermaid\nflowchart LR\n  subgraph SpokeA[\"Your EKS Cluster (spoke)\"]\n    PodA[\"Application Pods\\nstdout / stderr\"]\n    NodeLogsA[\"/var/log/containers/*.log\"]\n    FBA[\"Fluent Bit DaemonSet\\n(spoke agent)\"]\n  end\n\n  subgraph SpokeB[\"Another Team's EKS Cluster\"]\n    PodB[\"Application Pods\\nstdout / stderr\"]\n    FBB[\"Fluent Bit DaemonSet\\n(spoke agent)\"]\n  end\n\n  subgraph Central[\"Centralized Logging Cluster\"]\n    Kafka[\"Kafka\\nlogs.raw topic\"]\n    Vector[\"Vector Aggregator\"]\n    ES[\"Elasticsearch\\nhot + warm\"]\n    S3[\"Amazon S3\\narchive\"]\n    Kibana[\"Kibana\"]\n  end\n\n  PodA --\u003e NodeLogsA --\u003e FBA\n  PodB --\u003e FBB\n  FBA -- \"TLS / mTLS\\nKafka producer\" --\u003e Kafka\n  FBB -- \"TLS / mTLS\\nKafka producer\" --\u003e Kafka\n  Kafka --\u003e Vector\n  Vector --\u003e ES\n  Vector --\u003e S3\n  ES --\u003e Kibana\n```\n\n---\n\n### Step 1: Register your cluster with the platform team\n\nBefore deploying anything, open a request with the platform team. 
Include the following information:\n\n| Field | Example |\n|---|---|\n| Your cluster name | `payments-eks-prod` |\n| AWS region | `ap-south-1` |\n| Team name | `payments` |\n| Namespaces your workloads use | `payments`, `payments-infra` |\n| Expected daily log volume estimate | `~5 GB/day` |\n| Environment | `prod`, `staging`, or `dev` |\n| Slack channel or contact for alerts | `#team-payments-oncall` |\n\nThe platform team will:\n\n- Allocate a Kafka SASL user or issue a client TLS certificate for your cluster.\n- Confirm network path is open between your cluster and the central Kafka bootstrap endpoint.\n- Create or confirm your Kibana index filter and any team-specific dashboards.\n- Share the Kafka bootstrap endpoint and credentials with you.\n\n---\n\n### Step 2: Set up cross-cluster network connectivity\n\nFluent Bit on your cluster must be able to reach the central Kafka brokers over TCP port `9093` (TLS). The platform team controls the central cluster. Work with your network or cloud infrastructure team to establish one of the following paths:\n\n#### Option A: VPC Peering\n\nIf both clusters are in the same AWS account or across accounts with peering:\n\n```text\n1. Request VPC peering from your cluster VPC to the logging cluster VPC.\n2. Update route tables in both VPCs to add the peer CIDR.\n3. Update security groups on the central Kafka nodes to allow inbound TCP 9093\n   from your cluster's node or pod CIDR.\n4. Confirm DNS resolution for the Kafka bootstrap hostname from a pod in your cluster.\n```\n\n#### Option B: AWS Transit Gateway\n\nIf your organization uses Transit Gateway for multi-cluster connectivity:\n\n```text\n1. Attach your VPC to the Transit Gateway.\n2. Add a route in your VPC route table pointing the logging cluster CIDR via the Transit Gateway.\n3. Confirm security group rules allow TCP 9093 from your cluster to the Kafka brokers.\n```\n\n#### Option C: AWS PrivateLink (recommended for cross-account production)\n\n```text\n1. 
The platform team creates a Network Load Balancer in front of Kafka and an\n   endpoint service.\n2. Your AWS account accepts the endpoint service.\n3. Create an Interface VPC Endpoint in your VPC.\n4. Fluent Bit uses the endpoint DNS name as the Kafka bootstrap address.\n5. No VPC peering or Transit Gateway needed.\n```\n\n#### Confirming connectivity\n\nFrom any pod or node in your cluster:\n\n```bash\n# Replace with the actual Kafka bootstrap hostname provided by the platform team\nnc -zv kafka-bootstrap.logging.internal 9093\n```\n\nExpected output:\n\n```text\nConnection to kafka-bootstrap.logging.internal 9093 port [tcp/*] succeeded!\n```\n\n---\n\n### Step 3: Obtain Kafka connection credentials\n\nThe platform team supports two authentication modes. They will tell you which one applies to your cluster.\n\n#### Mode A: SASL/SCRAM-SHA-512\n\nThe platform team will share:\n\n- Kafka bootstrap endpoint\n- SASL username\n- SASL password\n\nCreate a Kubernetes Secret in the namespace where Fluent Bit will run on your cluster:\n\n```bash\nkubectl create namespace logging-agent\n\nkubectl create secret generic kafka-sasl-credentials \\\n  --namespace logging-agent \\\n  --from-literal=username=\u003cyour-team-kafka-user\u003e \\\n  --from-literal=password=\u003cyour-team-kafka-password\u003e\n```\n\n#### Mode B: Mutual TLS (mTLS)\n\nThe platform team will share:\n\n- Kafka bootstrap endpoint\n- Client certificate (`client.crt`)\n- Client private key (`client.key`)\n- CA certificate (`ca.crt`)\n\nCreate a Kubernetes Secret:\n\n```bash\nkubectl create namespace logging-agent\n\nkubectl create secret generic kafka-mtls-credentials \\\n  --namespace logging-agent \\\n  --from-file=client.crt=./client.crt \\\n  --from-file=client.key=./client.key \\\n  --from-file=ca.crt=./ca.crt\n```\n\n---\n\n### Step 4: Deploy Fluent Bit on your cluster\n\nDeploy Fluent Bit as a DaemonSet on your cluster using the configuration below. 
This is a standalone deployment, separate from any Fluent Bit that may already run in the central logging cluster.\n\n#### Create the Fluent Bit values file\n\nCreate a file named `fluent-bit-spoke-values.yaml`:\n\n```yaml\n# fluent-bit-spoke-values.yaml\n# Spoke cluster Fluent Bit: forwards all pod logs to the central Kafka cluster\n\nimage:\n  repository: cr.fluentbit.io/fluent/fluent-bit\n  tag: \"3.3.2\"\n\nserviceAccount:\n  create: true\n  name: fluent-bit\n\nrbac:\n  create: true\n  nodeAccess: true\n\ntolerations:\n  - key: node-role.kubernetes.io/master\n    operator: Exists\n    effect: NoSchedule\n  - key: node-role.kubernetes.io/control-plane\n    operator: Exists\n    effect: NoSchedule\n\nresources:\n  requests:\n    cpu: 50m\n    memory: 64Mi\n  limits:\n    cpu: 200m\n    memory: 256Mi\n\nconfig:\n  service: |\n    [SERVICE]\n        Flush         5\n        Daemon        Off\n        Log_Level     info\n        Parsers_File  parsers.conf\n        HTTP_Server   On\n        HTTP_Listen   0.0.0.0\n        HTTP_Port     2020\n\n  inputs: |\n    [INPUT]\n        Name              tail\n        Tag               kube.*\n        Path              /var/log/containers/*.log\n        multiline.parser  docker, cri\n        DB                /var/log/flb_kube.db\n        Mem_Buf_Limit     50MB\n        Skip_Long_Lines   On\n        Refresh_Interval  10\n\n  filters: |\n    [FILTER]\n        Name                kubernetes\n        Match               kube.*\n        Kube_URL            https://kubernetes.default.svc:443\n        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token\n        Kube_Tag_Prefix     kube.var.log.containers.\n        Merge_Log           On\n        Merge_Log_Key       log_processed\n        Keep_Log            Off\n        Annotations         Off\n        Labels              On\n        K8S-Logging.Exclude On\n\n    [FILTER]\n        Name   record_modifier\n   
     Match  kube.*\n        Record cluster       \u003cYOUR_CLUSTER_NAME\u003e\n        Record environment   \u003cYOUR_ENVIRONMENT\u003e\n\n  outputs: |\n    # SASL/SCRAM output: use this block for Mode A authentication\n    [OUTPUT]\n        Name                    kafka\n        Match                   *\n        Brokers                 \u003cKAFKA_BOOTSTRAP_ENDPOINT\u003e:9093\n        Topics                  logs.raw\n        rdkafka.security.protocol SASL_SSL\n        rdkafka.sasl.mechanism  SCRAM-SHA-512\n        rdkafka.sasl.username   ${KAFKA_USERNAME}\n        rdkafka.sasl.password   ${KAFKA_PASSWORD}\n        rdkafka.ssl.ca.location /etc/ssl/certs/ca-certificates.crt\n        Retry_Limit             False\n\n    # mTLS output: replace the block above with this for Mode B authentication\n    # [OUTPUT]\n    #     Name                    kafka\n    #     Match                   *\n    #     Brokers                 \u003cKAFKA_BOOTSTRAP_ENDPOINT\u003e:9093\n    #     Topics                  logs.raw\n    #     rdkafka.security.protocol SSL\n    #     rdkafka.ssl.certificate.location /fluent-bit/secrets/client.crt\n    #     rdkafka.ssl.key.location         /fluent-bit/secrets/client.key\n    #     rdkafka.ssl.ca.location          /fluent-bit/secrets/ca.crt\n    #     Retry_Limit             False\n\nextraVolumes:\n  - name: kafka-credentials\n    secret:\n      secretName: kafka-sasl-credentials   # change to kafka-mtls-credentials for mTLS\n\nextraVolumeMounts:\n  - name: kafka-credentials\n    mountPath: /fluent-bit/secrets\n    readOnly: true\n\n# The fluent/fluent-bit chart reads container environment variables from `env`.\n# Remove this block for mTLS (Mode B), where kafka-sasl-credentials does not exist.\nenv:\n  - name: KAFKA_USERNAME\n    valueFrom:\n      secretKeyRef:\n        name: kafka-sasl-credentials\n        key: username\n  - name: KAFKA_PASSWORD\n    valueFrom:\n      secretKeyRef:\n        name: kafka-sasl-credentials\n        key: password\n```\n\nReplace these placeholders before deploying:\n\n| Placeholder | What to put |\n|---|---|\n| `\u003cYOUR_CLUSTER_NAME\u003e` | Your cluster name 
as registered with the platform team, e.g. `payments-eks-prod` |\n| `\u003cYOUR_ENVIRONMENT\u003e` | `prod`, `staging`, or `dev` |\n| `\u003cKAFKA_BOOTSTRAP_ENDPOINT\u003e` | The hostname provided by the platform team |\n\n#### Install Fluent Bit using Helm\n\n```bash\nhelm repo add fluent https://fluent.github.io/helm-charts\nhelm repo update\n\nhelm upgrade --install fluent-bit fluent/fluent-bit \\\n  --namespace logging-agent \\\n  --create-namespace \\\n  -f fluent-bit-spoke-values.yaml \\\n  --wait \\\n  --timeout 10m\n```\n\n#### Verify Fluent Bit is running\n\n```bash\nkubectl get pods -n logging-agent -l app.kubernetes.io/name=fluent-bit\nkubectl logs -n logging-agent daemonset/fluent-bit --tail=50\n```\n\nLook for lines like:\n\n```text\n[2026/05/14 10:00:01] [ info] [output:kafka:kafka.0] ...\n```\n\nNo `[error]` lines from the Kafka output plugin means the connection is healthy.\n\n---\n\n### Step 5: Apply required pod labels to your workloads\n\nThe centralized platform uses Kubernetes pod labels to route logs, build Kibana dashboards per team, and drive ElastAlert rules. Without the correct labels, logs land in Elasticsearch but cannot be filtered by team, service, or platform.\n\n#### Required labels\n\nEvery pod in your cluster **must** have these labels:\n\n```yaml\nmetadata:\n  labels:\n    app.kubernetes.io/name: \u003cservice-name\u003e        # e.g. order-service\n    app.kubernetes.io/team: \u003cteam-name\u003e            # e.g. payments\n    app.kubernetes.io/part-of: \u003cplatform-name\u003e     # e.g. 
payments-platform\n    environment: \u003cprod|staging|dev\u003e                # matches your cluster env\n```\n\n#### Recommended labels\n\nThese labels are not strictly required but unlock additional Kibana filters and routing:\n\n```yaml\nmetadata:\n  labels:\n    app.kubernetes.io/version: \"1.4.2\"            # semver of the deployed image\n    app.kubernetes.io/component: \u003capi|worker|job\u003e  # logical role of the pod\n    platform: \u003cbackend|data|frontend|infra\u003e        # platform layer\n    namespace-owner: \u003cteam-name\u003e                   # team that owns this namespace\n    cost-center: \u003ccc-12345\u003e                        # for chargeback reporting\n```\n\n#### Label reference by team and namespace type\n\nUse this table as a reference for common team and namespace patterns:\n\n| Team | Namespace | `app.kubernetes.io/team` | `app.kubernetes.io/part-of` | `platform` |\n|---|---|---|---|---|\n| Payments | `payments` | `payments` | `payments-platform` | `backend` |\n| Payments | `payments-infra` | `payments` | `payments-platform` | `infra` |\n| Orders | `orders` | `orders` | `orders-platform` | `backend` |\n| Orders | `orders-workers` | `orders` | `orders-platform` | `backend` |\n| Data Engineering | `data-pipeline` | `data` | `data-platform` | `data` |\n| Data Engineering | `spark-jobs` | `data` | `data-platform` | `data` |\n| Frontend | `web` | `frontend` | `web-platform` | `frontend` |\n| Shared Infra | `infra` | `platform` | `infra-platform` | `infra` |\n| ML Platform | `ml-serving` | `ml` | `ml-platform` | `data` |\n\n#### Example Deployment manifest with full labels\n\n```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: order-service\n  namespace: orders\n  labels:\n    app.kubernetes.io/name: order-service\n    app.kubernetes.io/team: orders\n    app.kubernetes.io/part-of: orders-platform\n    app.kubernetes.io/version: \"2.1.0\"\n    app.kubernetes.io/component: api\n    environment: prod\n    platform: 
backend\n    namespace-owner: orders\nspec:\n  selector:\n    matchLabels:\n      app.kubernetes.io/name: order-service\n  template:\n    metadata:\n      labels:\n        app.kubernetes.io/name: order-service\n        app.kubernetes.io/team: orders\n        app.kubernetes.io/part-of: orders-platform\n        app.kubernetes.io/version: \"2.1.0\"\n        app.kubernetes.io/component: api\n        environment: prod\n        platform: backend\n        namespace-owner: orders\n    spec:\n      containers:\n        - name: order-service\n          image: your-registry/order-service:2.1.0\n```\n\n#### Excluding a pod from log collection\n\nIf you have a pod that should not send logs to the centralized platform (for example, a pod handling sensitive personal data that must stay local), add this annotation:\n\n```yaml\nmetadata:\n  annotations:\n    fluentbit.io/exclude: \"true\"\n```\n\n---\n\n### Step 6: Format your application logs as structured JSON\n\nFluent Bit collects whatever your pods write to `stdout` and `stderr`. Plain text logs are collected but are harder to search, filter, and alert on in Kibana. Structured JSON logs unlock full field-level search, dashboard aggregations, and precise ElastAlert rules.\n\n#### Required log fields\n\nEvery log event should include these fields:\n\n| Field | Type | Example | Purpose |\n|---|---|---|---|\n| `timestamp` | ISO 8601 string | `2026-05-14T10:00:00.000Z` | Event time. Use UTC. |\n| `level` | string | `info` | Severity. Must be one of: `debug`, `info`, `warn`, `error`, `fatal` |\n| `message` | string | `order created successfully` | Human-readable description of the event |\n| `service` | string | `order-service` | Name of the service writing the log. Should match `app.kubernetes.io/name` |\n| `team` | string | `orders` | Owning team. 
Should match `app.kubernetes.io/team` |\n| `environment` | string | `prod` | Deployment environment |\n\n#### Recommended log fields\n\nAdd these fields where relevant to your service:\n\n| Field | Type | Example | Purpose |\n|---|---|---|---|\n| `trace_id` | string | `4bf92f3577b34da6` | Distributed trace correlation (W3C or Jaeger format) |\n| `span_id` | string | `00f067aa0ba902b7` | Span within a trace |\n| `request_id` | string | `req-8d7f2a91` | HTTP or gRPC request correlation |\n| `user_id` | string | `usr-4421` | Authenticated user identifier (avoid PII) |\n| `duration_ms` | number | `142` | Request or operation duration in milliseconds |\n| `http_method` | string | `POST` | HTTP method for request logs |\n| `http_path` | string | `/api/v1/orders` | HTTP path (strip query parameters with sensitive data) |\n| `http_status` | number | `201` | HTTP response status code |\n| `error_type` | string | `ValidationError` | Error class or exception type |\n| `error_stack` | string | `...` | Stack trace for error-level logs |\n| `component` | string | `payment-processor` | Sub-component within the service |\n| `version` | string | `2.1.0` | Service version |\n\n#### Full structured log example\n\n```json\n{\n  \"timestamp\": \"2026-05-14T10:22:31.408Z\",\n  \"level\": \"error\",\n  \"message\": \"payment processing failed: card declined\",\n  \"service\": \"order-service\",\n  \"team\": \"payments\",\n  \"environment\": \"prod\",\n  \"version\": \"2.1.0\",\n  \"component\": \"payment-processor\",\n  \"trace_id\": \"4bf92f3577b34da6a3ce929d0e0e4736\",\n  \"span_id\": \"00f067aa0ba902b7\",\n  \"request_id\": \"req-8d7f2a91\",\n  \"user_id\": \"usr-4421\",\n  \"http_method\": \"POST\",\n  \"http_path\": \"/api/v1/orders\",\n  \"http_status\": 402,\n  \"error_type\": \"CardDeclinedError\",\n  \"duration_ms\": 312\n}\n```\n\n#### Minimal log example (acceptable for low-verbosity services)\n\n```json\n{\n  \"timestamp\": \"2026-05-14T10:22:31.408Z\",\n  \"level\": 
\"info\",\n  \"message\": \"order created\",\n  \"service\": \"order-service\",\n  \"team\": \"payments\",\n  \"environment\": \"prod\"\n}\n```\n\n#### Log format by language\n\n**Java (Logback + logstash-logback-encoder)**\n\n```xml\n\u003c!-- pom.xml --\u003e\n\u003cdependency\u003e\n  \u003cgroupId\u003enet.logstash.logback\u003c/groupId\u003e\n  \u003cartifactId\u003elogstash-logback-encoder\u003c/artifactId\u003e\n  \u003cversion\u003e7.4\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n```xml\n\u003c!-- logback.xml --\u003e\n\u003cappender name=\"STDOUT\" class=\"ch.qos.logback.core.ConsoleAppender\"\u003e\n  \u003cencoder class=\"net.logstash.logback.encoder.LogstashEncoder\"\u003e\n    \u003ccustomFields\u003e{\"service\":\"order-service\",\"team\":\"payments\",\"environment\":\"prod\"}\u003c/customFields\u003e\n    \u003cfieldNames\u003e\n      \u003ctimestamp\u003etimestamp\u003c/timestamp\u003e\n      \u003clevel\u003elevel\u003c/level\u003e\n      \u003cmessage\u003emessage\u003c/message\u003e\n    \u003c/fieldNames\u003e\n  \u003c/encoder\u003e\n\u003c/appender\u003e\n```\n\n**Node.js (pino)**\n\n```javascript\nconst pino = require('pino');\nconst logger = pino({\n  base: {\n    service: 'order-service',\n    team: 'payments',\n    environment: process.env.ENVIRONMENT || 'prod',\n  },\n  // pino defaults to the keys \"time\" and \"msg\"; emit the platform field names instead\n  messageKey: 'message',\n  timestamp: () =\u003e `,\"timestamp\":\"${new Date().toISOString()}\"`,\n  formatters: {\n    level: (label) =\u003e ({ level: label }),\n  },\n});\n```\n\n**Python (python-json-logger)**\n\n```python\nimport logging\nfrom pythonjsonlogger import jsonlogger\n\nlogger = logging.getLogger()\nlogger.setLevel(logging.INFO)  # the root logger defaults to WARNING and would drop info logs\nhandler = logging.StreamHandler()\nformatter = jsonlogger.JsonFormatter(\n    # fmt must name real LogRecord attributes; rename_fields maps them to platform names\n    fmt='%(levelname)s %(message)s',\n    rename_fields={'levelname': 'level'},\n    timestamp=True,  # adds a UTC \"timestamp\" field\n)\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\nlogger = logging.LoggerAdapter(logger, extra={\n    'service': 'order-service',\n    'team': 'payments',\n    'environment': 'prod',\n})\n```\n\n**Go 
(zap)**\n\n```go\nimport (\n    \"go.uber.org/zap\"\n    \"go.uber.org/zap/zapcore\"\n)\n\nfunc NewLogger() *zap.Logger {\n    cfg := zap.NewProductionConfig()\n    cfg.OutputPaths = []string{\"stdout\"}\n    // Use the platform field names instead of zap's defaults (\"ts\", \"msg\").\n    cfg.EncoderConfig.TimeKey = \"timestamp\"\n    cfg.EncoderConfig.MessageKey = \"message\"\n    // ISO 8601 instead of epoch seconds; rendered in the process time zone, so keep TZ as UTC.\n    cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder\n    logger, err := cfg.Build(\n        zap.Fields(\n            zap.String(\"service\", \"order-service\"),\n            zap.String(\"team\", \"payments\"),\n            zap.String(\"environment\", \"prod\"),\n        ),\n    )\n    if err != nil {\n        panic(err) // fail fast rather than return a nil logger\n    }\n    return logger\n}\n```\n\n#### What to avoid\n\n| Pattern | Problem | Better alternative |\n|---|---|---|\n| Plain text logs (`INFO order created`) | Cannot be field-filtered in Kibana | Emit structured JSON |\n| Logging PII (emails, phone numbers, card numbers) | Security and compliance risk | Hash or omit sensitive fields |\n| Multi-line stack traces as separate log lines | Each line becomes its own event, so the trace cannot be read as one error | Wrap in `error_stack` JSON field |\n| Logging at `debug` level in production | Inflates log volume and cost | Use `info` or above in prod |\n| Using local time in `timestamp` | Time comparisons break across regions | Always use UTC ISO 8601 |\n| Logging binary or base64 payloads | Bloats Elasticsearch index | Log metadata only, not payload |\n\n---\n\n### Step 7: Verify your logs are flowing\n\nAfter completing all the steps above, use the following checks to confirm your logs are reaching the centralized platform.\n\n#### Check Fluent Bit output on your cluster\n\n```bash\nkubectl logs -n logging-agent daemonset/fluent-bit --tail=100 | grep kafka\n```\n\nLook for successful delivery messages and no `error` lines from the Kafka output plugin.\n\n#### Ask the platform team to confirm Kafka topic ingestion\n\nThe platform team can run:\n\n```bash\nkubectl exec -n logging -it \u003ckafka-broker-pod\u003e -- \\\n  kafka-consumer-groups.sh \\\n    --bootstrap-server localhost:9092 \\\n    --describe \\\n    --group vector-consumer-group\n```\n\nThis shows the per-partition offsets and lag for Vector on `logs.raw`, which indicates whether it is keeping up with ingestion.\n\n#### Search in Kibana\n\nLog in to Kibana (the platform 
team will share the URL and credentials):\n\n```text\nData view: kubernetes-logs-*\n```\n\nFilter your logs:\n\n```text\nkubernetes.labels.app_kubernetes_io/team : \"payments\"\nAND\nkubernetes.namespace_name : \"payments\"\nAND\nenvironment : \"prod\"\n```\n\nOr using the JSON log fields directly:\n\n```text\nteam : \"payments\" AND level : \"error\"\n```\n\nExpected result: your structured log events should appear within 30 to 60 seconds of being emitted by your pods.\n\n#### Common issues at this stage\n\n| Symptom | Likely cause | Resolution |\n|---|---|---|\n| No logs in Kibana for your team | Fluent Bit not running or Kafka auth failing | Check `kubectl logs -n logging-agent daemonset/fluent-bit` for Kafka errors |\n| Logs appear but missing team fields | Pod labels not applied | Verify labels on the pod with `kubectl get pod \u003cname\u003e --show-labels` |\n| `log_processed` field is a string, not a JSON object | Application logging in plain text | Migrate to structured JSON output |\n| Logs appear under wrong index | `cluster` or `environment` field missing | Check the `record_modifier` filter in your Fluent Bit config |\n| Fluent Bit cannot connect to Kafka | Network path not open | Run `nc -zv \u003ckafka-bootstrap\u003e 9093` from a pod on your cluster |\n\n---\n\n## Application team logging standard\n\nFor best results, application teams should log structured JSON to stdout or stderr.\n\nExample — standard info log:\n\n```json\n{\n  \"timestamp\": \"2026-05-14T10:22:31.408Z\",\n  \"level\": \"info\",\n  \"message\": \"order created successfully\",\n  \"service\": \"order-service\",\n  \"team\": \"payments\",\n  \"environment\": \"prod\",\n  \"version\": \"2.1.0\",\n  \"trace_id\": \"4bf92f3577b34da6a3ce929d0e0e4736\",\n  \"span_id\": \"00f067aa0ba902b7\",\n  \"request_id\": \"req-8d7f2a91\",\n  \"duration_ms\": 142,\n  \"http_method\": \"POST\",\n  \"http_path\": \"/api/v1/orders\",\n  \"http_status\": 201\n}\n```\n\nExample — error log with stack 
trace:\n\n```json\n{\n  \"timestamp\": \"2026-05-14T10:22:31.408Z\",\n  \"level\": \"error\",\n  \"message\": \"payment processing failed: card declined\",\n  \"service\": \"order-service\",\n  \"team\": \"payments\",\n  \"environment\": \"prod\",\n  \"version\": \"2.1.0\",\n  \"trace_id\": \"4bf92f3577b34da6a3ce929d0e0e4736\",\n  \"span_id\": \"00f067aa0ba902b7\",\n  \"request_id\": \"req-8d7f2a91\",\n  \"http_method\": \"POST\",\n  \"http_path\": \"/api/v1/orders\",\n  \"http_status\": 402,\n  \"error_type\": \"CardDeclinedError\",\n  \"error_stack\": \"CardDeclinedError: card declined\\n  at PaymentProcessor.charge (/app/src/payment.js:88)\\n  at OrderService.create (/app/src/order.js:42)\",\n  \"duration_ms\": 312\n}\n```\n\nRecommended Kubernetes labels — minimal required set:\n\n```yaml\nmetadata:\n  labels:\n    app.kubernetes.io/name: order-service\n    app.kubernetes.io/team: payments\n    app.kubernetes.io/part-of: payments-platform\n    environment: prod\n```\n\nRecommended Kubernetes labels — full set with all optional fields:\n\n```yaml\nmetadata:\n  labels:\n    app.kubernetes.io/name: order-service\n    app.kubernetes.io/team: payments\n    app.kubernetes.io/part-of: payments-platform\n    app.kubernetes.io/version: \"2.1.0\"\n    app.kubernetes.io/component: api\n    environment: prod\n    platform: backend\n    namespace-owner: payments\n    cost-center: cc-12345\n```\n\nRecommended standard fields:\n\n| Field | Example | Purpose |\n|---|---|---|\n| `timestamp` | `2026-05-13T10:00:00.000Z` | Event time. Always UTC ISO 8601. |\n| `level` | `debug`, `info`, `warn`, `error`, `fatal` | Severity. Use a consistent lowercase set. |\n| `message` | `order created` | Human-readable description of the event |\n| `trace_id` | `4bf92f3577b34da6` | Distributed trace correlation |\n| `span_id` | `00f067aa0ba902b7` | Span within a trace |\n| `request_id` | `req-8d7f2a91` | HTTP or gRPC request correlation |\n| `service` | `order-service` | Service name. 
Match `app.kubernetes.io/name`. |\n| `team` | `payments` | Owning team. Match `app.kubernetes.io/team`. |\n| `environment` | `prod` | Environment |\n| `version` | `2.1.0` | Service version |\n| `duration_ms` | `142` | Request or operation duration in milliseconds |\n| `http_status` | `201` | HTTP response code |\n| `error_type` | `ValidationError` | Exception class for error-level events |\n\nRecommended Kubernetes labels — full reference:\n\n| Label | Required | Example | Purpose |\n|---|---|---|---|\n| `app.kubernetes.io/name` | Yes | `order-service` | Service identifier. Used for Kibana filters and routing. |\n| `app.kubernetes.io/team` | Yes | `payments` | Owning team. Drives team-scoped dashboards and alerts. |\n| `app.kubernetes.io/part-of` | Yes | `payments-platform` | Platform grouping for the service. |\n| `environment` | Yes | `prod` | Deployment environment. Must match `global.environment` in the central cluster. |\n| `app.kubernetes.io/version` | Recommended | `2.1.0` | Deployed image version. |\n| `app.kubernetes.io/component` | Recommended | `api`, `worker`, `job` | Logical role of the pod. |\n| `platform` | Recommended | `backend`, `data`, `frontend`, `infra` | Platform layer. Used for cost and ownership reporting. |\n| `namespace-owner` | Recommended | `payments` | Team that owns the namespace. 
|\n\nTo exclude a pod from Fluent Bit collection:\n\n```yaml\nmetadata:\n  annotations:\n    fluentbit.io/exclude: \"true\"\n```\n\n---\n\n## Production sizing notes\n\nThe default values are production-style starter values for a larger EKS environment, not a universal final sizing.\n\n### Default starter sizing\n\n| Component | Default |\n|---|---:|\n| Kafka controllers | 3 |\n| Kafka brokers | 5 |\n| Kafka topic partitions | 96 |\n| Elasticsearch masters | 3 |\n| Elasticsearch hot data nodes | 6 |\n| Elasticsearch warm data nodes | 3 |\n| Vector aggregators | 6 |\n| Kibana replicas | 2 |\n| ElastAlert replicas | 2 |\n| Fluent Bit | 1 pod per node |\n\n### Estimate daily log volume\n\nUse this formula:\n\n```text\ndaily_log_gb =\n  services\n  × average_pods_per_service\n  × average_logs_per_pod_per_second\n  × average_log_size_bytes\n  × 86400\n  / 1024^3\n```\n\nExample:\n\n```text\n400 services × 2 pods × 2 logs/sec × 800 bytes × 86400 / 1024^3\n≈ 103 GB/day raw logs\n```\n\nElasticsearch storage is usually larger than raw log size because of indexing, mappings, replicas, and metadata.\n\n### When to scale Kafka\n\nScale Kafka when:\n\n- Broker disk usage is high\n- Broker CPU is consistently high\n- Produce latency increases\n- Under-replicated partitions appear\n- Vector cannot consume fast enough even after scaling Vector\n\n### When to scale Vector\n\nScale Vector when:\n\n- Kafka consumer lag grows\n- Vector CPU remains high\n- Vector disk buffers grow continuously\n- Elasticsearch or S3 retries increase\n- End-to-end log delivery latency increases\n\n### When to scale Elasticsearch\n\nScale Elasticsearch when:\n\n- Indexing latency increases\n- JVM heap pressure is high\n- Query latency is high\n- Hot nodes are disk constrained\n- Shards are too large\n- Cluster health becomes yellow or red\n\n---\n\n## Operational best practices\n\n### 1. 
Keep Fluent Bit lightweight\n\nDo not add heavy parsing and enrichment at Fluent Bit unless necessary.\n\nRecommended split:\n\n```text\nFluent Bit = collect and forward\nVector = normalize and route\nElasticsearch = search\nS3 = archive\n```\n\n### 2. Use IRSA for S3\n\nFor EKS production, prefer:\n\n```yaml\ns3:\n  authMode: irsa\n```\n\nAvoid long-lived AWS access keys when possible.\n\n### 3. Keep Elasticsearch retention realistic\n\nFor high-volume microservices, do not keep too much data in Elasticsearch.\n\nRecommended pattern:\n\n```text\nElasticsearch = recent searchable logs\nS3 = long-term archive\n```\n\n### 4. Standardize team labels\n\nApplication teams should provide consistent labels:\n\n```yaml\napp.kubernetes.io/name\napp.kubernetes.io/team\nenvironment\n```\n\nThis enables better routing, dashboards, ownership, and alerting.\n\n### 5. Monitor the logging platform itself\n\nYou should monitor:\n\n- Fluent Bit error and retry counts\n- Kafka broker health\n- Kafka topic partition health\n- Kafka consumer lag\n- Vector buffer size\n- Vector output retries\n- Elasticsearch cluster health\n- Elasticsearch indexing latency\n- Elasticsearch JVM heap\n- Kibana availability\n- ElastAlert execution errors\n- S3 delivery failures\n\n### 6. 
Use separate node groups where possible\n\nFor production EKS, consider separate managed node groups for:\n\n- Kafka\n- Elasticsearch\n- General workloads\n- Logging/observability\n\nThis prevents noisy application workloads from affecting logging reliability.\n\n---\n\n## Troubleshooting\n\n### Helm install fails\n\nCheck rendered output:\n\n```bash\nhelm template platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml\n```\n\nCheck release status:\n\n```bash\nhelm status platform-logging -n logging\n```\n\n### Kafka resources are not created\n\nCheck whether Strimzi is installed:\n\n```bash\nkubectl get pods -n logging | grep strimzi\nkubectl get crd | grep kafka.strimzi.io\n```\n\nIf Strimzi is managed separately, set:\n\n```yaml\noperators:\n  strimzi:\n    enabled: false\n```\n\n### Elasticsearch resources are not created\n\nCheck whether ECK is installed:\n\n```bash\nkubectl get pods -n elastic-system\nkubectl get crd | grep elasticsearch.k8s.elastic.co\n```\n\nIf ECK is managed separately, set:\n\n```yaml\noperators:\n  eck:\n    enabled: false\n```\n\n### No logs appear in Kibana\n\nCheck the flow step by step:\n\n```bash\nkubectl logs -n logging daemonset/platform-logging-fluent-bit --tail=100\nkubectl get kafkatopic -n logging\nkubectl logs -n logging statefulset/platform-logging-vector --tail=100\nkubectl get elasticsearch -n logging\n```\n\nThen check Kibana data view:\n\n```text\nkubernetes-logs-*\n```\n\n### S3 archive is not receiving logs\n\nCheck:\n\n```yaml\ns3:\n  enabled: true\n  bucket: your-bucket\n  authMode: irsa\n```\n\nVerify IRSA annotation:\n\n```yaml\ns3:\n  serviceAccountAnnotations:\n    eks.amazonaws.com/role-arn: arn:aws:iam::\u003caccount-id\u003e:role/platform-logging-vector-s3\n```\n\nCheck Vector logs:\n\n```bash\nkubectl logs -n logging statefulset/platform-logging-vector --tail=200\n```\n\n### Elasticsearch is under pressure\n\nReview:\n\n- Hot data 
node CPU\n- JVM heap pressure\n- Disk usage\n- Indexing latency\n- Shard count\n- ILM rollover size\n- Retention days\n\nThen consider:\n\n```yaml\nelasticsearch:\n  hotData:\n    count: 8\n  ilm:\n    deleteMinAge: 14d\n```\n\n### Kafka consumer lag is increasing\n\nConsider increasing Vector replicas:\n\n```yaml\nvector:\n  replicas: 8\n  hpa:\n    maxReplicas: 30\n```\n\nAlso review Kafka topic partitions:\n\n```yaml\nkafka:\n  topics:\n    raw:\n      partitions: 192\n```\n\nPartition increases should be planned carefully because they affect ordering and consumer behavior.\n\n---\n\n## Recommended Git workflow\n\nUse separate values files per environment:\n\n```text\nvalues-dev.yaml\nvalues-staging.yaml\nvalues-eks-production.yaml\n```\n\nRecommended deployment flow:\n\n```text\nPull request\n  -\u003e helm lint\n  -\u003e helm template\n  -\u003e review rendered diff\n  -\u003e merge\n  -\u003e deploy through CI/CD or GitOps\n```\n\nExample CI validation commands:\n\n```bash\nhelm dependency update ./centralized-logging-eks\n\nhelm lint ./centralized-logging-eks \\\n  -f ./centralized-logging-eks/values-eks-production.yaml\n\nhelm template platform-logging ./centralized-logging-eks \\\n  --namespace logging \\\n  -f ./centralized-logging-eks/values-eks-production.yaml\n```\n\n---\n\n## Summary\n\nThis Helm chart turns the centralized logging system into a repeatable, maintainable, production-ready EKS platform component.\n\nIt improves the previous manifest/script-based approach by providing:\n\n- One Helm release\n- One values file\n- Repeatable deployments\n- Easier upgrades\n- Rollback support\n- Clear ownership\n- Better GitOps compatibility\n- Better long-term maintainability\n\nFinal recommended architecture:\n\n```text\nFluent Bit -\u003e Kafka -\u003e Vector -\u003e Elasticsearch + S3 -\u003e Kibana + ElastAlert\n```\n\nFinal recommended Helm release 
name:\n\n```bash\nplatform-logging\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpramodksahoo%2Fcentralized-logging-eks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpramodksahoo%2Fcentralized-logging-eks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpramodksahoo%2Fcentralized-logging-eks/lists"}