# Building a Resilient DevSecOps Pipeline for a Reddit Clone: Utilizing Jenkins, ArgoCD, Prometheus, Grafana, and Kibana

## Quick Introduction:-

--> This project involves **orchestrating a resilient DevSecOps pipeline** to **build a Reddit clone application**.

--> It **automates, streamlines & secures infrastructure provisioning** & **application deployment cycles** using **Jenkins and ArgoCD**.

--> The pipeline **integrates robust logging and monitoring** solutions with **Prometheus, Grafana and Kibana**, ensuring high availability, scalability, and security.

This means **we're leveraging DevSecOps to deliver a reliable, efficient and secure deployment environment** for continuous integration and continuous delivery.

πŸ‘‰ _What are we trying to achieve?_

**Optimized QA + Speedy Delivery + Security Ingrained = REAL Business Value**

## Workflow :-

**Infrastructure Provisioning** with **Terraform**
      ⬇️
**Container Orchestration** using **Elastic Kubernetes Service (EKS)**
      ⬇️
**Continuous Integration/Continuous Deployment (CI/CD)** with **Jenkins & ArgoCD**
      ⬇️
**Security Integrations** via **SonarQube, OWASP, & Trivy**
      ⬇️
**Logging, Monitoring & Data Visualization** through **Prometheus, Grafana, & EFK Stack (Elasticsearch, Fluentd, Kibana)**
      =
**Real-time Insights** into **application health and performance**

## Deep-Dive:-

> As we delve deeper into the questions, I'll provide answers and share insights as they come to mind. πŸ™‚

### 1 --> Why Jenkins? What value does it bring in?

A --> Jenkins is fundamentally an open-source "automation server".

Its primary task is **automating various aspects of the software development cycle** --> building, testing, and deploying the application. This means **we're essentially speeding up builds and shortening build times** 👍

--

### 2 --> What about Jenkins' architectural design?

> Or I'll reframe this another way: _What makes Jenkins a popular choice? How does Jenkins actually provide a truly robust and scalable platform?_

Let's walk through the key components:-

A --> It has a **Master-Agent Architecture** (historically called master-slave)

This means we've got a **Jenkins Master** :-
- Would be responsible for "managing the overall Jenkins Environment" --> This means it would schedule the builds + dispatch the build to the agents - The worker nodes
- Would also monitor the status of the build, present the build results, and serve the Jenkins UI
- Plus, it would be responsible for managing the plugins & configurations

Second, we've got the worker nodes - the Jenkins agents

"The worker nodes are *actually* responsible for running the builds that've been assigned by the Jenkins master"

1- They're responsible for executing the actual builds
2- They're dynamically provisioned / deprovisioned to handle the load
3- The master distributes tasks amongst multiple agents --> This way it incorporates parallelism and reduces build time

1. **Provisioning Application Infrastructure with Terraform:**

- We started off with provisioning the necessary infrastructure using Terraform.
- Provisioned an EC2 instance: Configured security groups and used an `install.sh` script to install essential tools, including Jenkins, Docker, SonarQube, and Trivy.

**[UPDATE (11/07/2024)]**
### Why did we deploy Jenkins on top of the EKS cluster & not on a VM?

> #### Cons of an EC2 deployment 🚩 :-
> - **Limited Scalability:-** The original setup wouldn't scale. It would require manual intervention
> - **SPOF (Single Point of Failure):-** Deploying all three tools on a single VM makes the system all the more vulnerable to failures
> - **Resource contention:-** Heavy resource contention. Tools competing for resources = potential performance bottlenecks
> - **Manual maintenance:-** An operational overhead in terms of updates, maintenance & monitoring

### πŸ’‘ Moving to an EKS Deployment.

- Deploying Jenkins on an EKS cluster means **Kubernetes would automatically scale your Jenkins pods in response to fluctuating workload requirements.** Resource utilisation & performance ++ 👍

- EKS would take care of:-
  - **Self-Healing:-** Detecting and replacing unhealthy pods automatically
  - **Availability:-** Restarting failed Jenkins pods
  - **Load Balancing:-** Fair distribution of the workload amongst the pods
  - Plus, **rolling updates**:- You can deploy new versions of Jenkins without any downtime = Service Continuity 👍
  - **Resource Management:-** K8s is capable of dynamic allocation / de-allocation of resources (something a traditional VM setup won't support)
  - Strong security. Kubernetes provides **RBAC (Role-Based Access Control), network policies, and secrets management**. This means K8s keeps our CI/CD pipeline secure and compliant with best practices.

> **Kubernetes' declarative nature means you're essentially simplifying the automation** of the pipeline's deployment. You could *version-control* these manifests too = **you're bringing in *consistency* and *reliability* across different deployments**

### What exactly is "Helm"? Is it "Yum" for Kubernetes?

Helm -- Package Manager for Kubernetes

--> Enables us to deploy + manage containerised applications on top of an EKS cluster

--> Helm offers something called Helm charts - a collection of files --> This means K8s resource + application definitions.

They're reusable templates, which means they comprise the Kubernetes manifest templates plus a `values.yaml`.

#### This means --> Reusable Kubernetes templates: `templates/deployment.yaml` + default values for the variables - `values.yaml` (you could even customise deployments by overriding the default values) = Final K8s manifests (generated by Helm post injecting these values)

> Injecting these values from the YAML file --> `values.yaml`, Helm can then generate the final K8s manifest files, based on the values that've been provided
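To make this concrete, here's a minimal sketch of how a chart wires the two together - the file contents below are illustrative (the labels and the `replicaCount` default aren't taken from this project's repos):

```yaml
# values.yaml -- illustrative defaults (override at install time with --set or -f)
replicaCount: 2
image:
  repository: tanishkamarrott/reddit
  tag: latest
---
# templates/deployment.yaml -- Helm renders this template by injecting the values above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-reddit-clone
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: reddit-clone
  template:
    metadata:
      labels:
        app: reddit-clone
    spec:
      containers:
        - name: reddit-clone
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

Running `helm install` (or overriding the defaults, e.g. `--set replicaCount=3`) produces the final manifests that Kubernetes actually applies.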

So, we'd be provisioning two EKS clusters -->

1 --> CI Tools (Jenkins, Trivy & SonarQube)
2 --> CD (ArgoCD) + the Reddit Application Clone to be deployed

## β†’ Jenkins Setup & Tool Configuration_

### **What sort of plugins & global tool configurations have we used?**

| **The plugins we've used →** | Plugin | Purpose |
|-------------------------|-----------------------------|-------------------------------------------------------------------------------------------|
| The ones for code quality & analysis | SonarQube Scanner | SQ + Jenkins --> SAST|
| | Sonar Quality Gates | It breaks the build based on the quality thresholds |
| | OWASP Dependency Check | --> Vulnerabilities in project dependencies. |
| IaC Scanning | TfSec | Scans the IaC from a security standpoint |
| Secrets Detection | truffleHog | Helps detect accidentally committed secrets |

Please check out these files:- `updated_main.tf` & `install.sh`

| **Global tools to standardise environments all across →** | Tool | Purpose |
|-------------------------|------------------------|-------------------------------------------------------------------------------------------|
| Runtime & Environment | Eclipse Temurin Installer | ➑️ Means a specific JDK version is available for all jobs. |
| | NodeJS | Once set up, the necessary runtime is available for JS applications. 👍 |

## β†’ _Pipeline Configuration_

Reddit-Clone-App-Jenkins-Pipeline

--

### _CI/CD Pipeline - Key Stages_

Summing up the stages ‡️

Workspace Preparation β†’ **Fetch** the Latest Code β†’ **Static Code Analysis** β†’
**Quality Gate** Checkpoint β†’ Installing **Dependencies** β†’ **Scanning File System & Docker Images** β†’
**Containerization** β†’ **Detecting Unwanted Secrets** β†’ **IaC Analysis** for Security

--

_Snapshots:-_

_**Dependency-Check Results**_ --> Distribution & severity of vulnerabilities in the Reddit Clone App.
![image](https://github.com/TanishkaMarrott/Orchestrating-DevSecOps-Pipeline-for-a-Cloud-Native-Architecture/assets/78227704/0e9a27ca-158a-4fa4-bfd7-4f82c5f6d30e)

--

_**Console Output**_ --> Logs for execution of the pipeline stages
![image](https://github.com/TanishkaMarrott/Orchestrating-DevSecOps-Pipeline-for-a-Cloud-Native-Architecture/assets/78227704/cc0a6fbd-245c-49c4-9144-b0a6c41e7c80)

--

_**SonarQube Dashboard**_ --> Successful Quality Gate with an overview of code analysis
Reddit-Clone-App-SonarQubeServer

--

_**Docker Hub Repo: 'tanishkamarrott/reddit'**_ – 'reddit' image β†’ ready for pushes
![image](https://github.com/TanishkaMarrott/Orchestrating-DevSecOps-Pipeline-for-a-Cloud-Native-Architecture/assets/78227704/395f374d-9dbd-4436-9f73-c18848d40ccf)

## The Continuous Deployment Part

_**Intent:-**_ Automating the deployment of code from dev to prod, post a successful build ▢️
> **Continuous Deployment** ➑️ Means a faster deployment velocity = Accelerated releases = Faster Time-to-market πŸ‘

### _On a side note :- Is testing a part of both CI and CD?_


True.
But the kind and the emphasis of testing differs...

What does this mean?
➑️ Testing in CI is primarily about running tests against the code to ensure the codebase is stable and functional throughout.

**Crucial:-**
Integrating multiple code changes into the mainline shouldn't break production. --> **_Unit testing_** & **_Integration testing_**.

> 🎯 My goal here is frequent, incremental updates - (immediate feedback = quicker iterative loops)

--

When I talk about CD, it's not only about a "bug-free" code...

We need some other types of testing too - **security testing**, **performance testing**, **UAT (User Acceptance Testing)** - making my software production-ready, considering the non-functional aspects as well.

> #### πŸ‘‰ *Is my software production ready? Can this be delivered to my users? Does it meet the overall quality standards?* CD answers such questions.

## Non-functional aspects of the application infrastructure:-

Check out my TF code here:- https://github.com/TanishkaMarrott/AWS-EKS-TF/tree/main <-- IMP

You can find the Reddit Clone Application code here :- https://github.com/TanishkaMarrott/Reddit-Clone-App

### Multi-AZ NAT gateway setup & Multi-AZ worker node deployments:-

In this architecture, I've deployed **three NAT gateways, each with its own Elastic IP**, for **high availability** and **fault tolerance**.

> If one NAT gateway becomes unavailable, we can still route outbound traffic to the internet. We're fault-tolerant to AZ Failure. There's no impact on our operational efficiency.

β–Ά **High Availability** = **Fault Tolerance** = πŸ‘

Second, **Performance Optimization**.

> I didn't want any SPOFs or performance bottlenecks in my architecture. **Multi-NAT ensures that traffic from the instances doesn't need to cross AZs to reach the internet = reducing latency** 👍.
>
> It helps with the scalability aspect as well, since resources in each AZ can scale out independently. We can add new subnets and new instances in each AZ without worrying about the NAT Gateway becoming a potential bottleneck. 💡

However, this is a **cost vs. fault tolerance trade-off**.
➑️ My decision here was prioritizing availability and performance over the cost considerations.

#### _Are we utilising both public and private subnets? Why?_

> My public subnets host the Load Balancers and NAT Gateways (the resources which are actually intended to be public), so, they'll distribute incoming internet traffic to the pods running the application.
>
> ➑️ This setup will **help us simplify and centralize traffic management** while keeping our backend pods secure.
>
> Also, **in case the applications in the private subnets wish to connect to the internet - for example, for updates, APIs etc. - this is done via the NAT deployed in each public subnet**

**Secure outbound-only internet access** πŸ‘.

### _Granular Access controls for EKS_

**I've pruned down the public-access CIDRs** allowed to reach the Kubernetes API server 🟰 Centralized control over access and management.
👇🏼
> **I'd advise tightening this down to the corporate IP address range** as an ingress rule for the node group - that's something we might need for troubleshooting or administrative access

Plus

**We've got endpoint access restrictions and secured SSH access** to the worker nodes - by specifying source security group IDs and SSH keys; we've also made sure the **IAM policies** attached to the cluster and the node group are tightly scoped, so the chances of privilege escalation stay low.

### _Terraform State Backend - S3 + DynamoDB --> State Locking_

**I wanted to eliminate potential chances of state corruption during concurrent Terraform applies.** DynamoDB is used as a durable data store for state locking.

Topping it up with **S3 Versioning** on our backend S3 bucket to keep a history of our state files ▶️ for recovery from unintended changes. + **S3 Encryption** 👍

### _How did I optimise on costs while still maintaining a level of fault-tolerance?_

Mix of both On-demand and Spot Instances πŸ‘

**We've decided to go in for:-**

**1- Two separate node groups:-
one for critical workloads (On-Demand), and a second, Spot-based group for cost optimization.**

**2- Multiple instance types have been specified ➡ increasing the chances of Spot capacity fulfillment.**

> β†ͺ️ This means we have an On-Demand capacity to handle Baseline Application Performance + a Spot Allocation strategy as a Cost Optimization strategy. πŸ‘ β˜‘οΈ

---

## _Why ArgoCD?_

πŸ‘‰ A Brilliant "declarative, GitOps CD Tool."

So, that's something I like.... What Argo does is **automatically reconcile differences** between the cluster's current state and what's in the manifest files. This means my changes are automatically **deployed and reflected in the live environment as soon as they're pushed.**

> **Every change's versioned**, just in case changes don't go as planned, you can always **rollback to a previous state**

Here's the link to my K8s manifest files:- https://github.com/TanishkaMarrott/Reddit-Clone-K8s-Manifests

---

## _Quick Dive into the K8s Manifests + Key Design Considerations_

### 1- **`Deployment.yaml`**

Blueprint for the Reddit-Clone pods we'll be creating

_**How did I harden the deployment to be available and fault tolerant?**_

☑ I've **increased the number of pod replicas** - K8s would then ensure that we have 2 instances of our application running at any given time. -> Availability, load distribution

☑ I've also **specified the CPU and memory requests and limits for the container**. Requests are guaranteed by the Kubernetes scheduler, while limits ensure that none of our pods inadvertently consumes excessive resources

▶️ **_Efficient resource utilisation + High Availability_** 🏁 👍

> **IMP:-** I'd been observing uneven scheduling of pods across the nodes. Hence, I utilised `topologySpreadConstraints` to ensure we're utilising our resources evenly, along with a `maxSkew` parameter - this means resilient scheduling of pods across nodes. A trimmed sketch follows below.
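Here's roughly what those knobs look like in the manifest - a trimmed sketch, with the resource figures as illustrative placeholders:

```yaml
# deployment.yaml -- trimmed sketch; resource numbers are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reddit-clone-deployment
spec:
  replicas: 2                      # two pod replicas for availability & load distribution
  selector:
    matchLabels:
      app: reddit-clone
  template:
    metadata:
      labels:
        app: reddit-clone
    spec:
      topologySpreadConstraints:   # spread pods evenly across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: reddit-clone
      containers:
        - name: reddit-clone
          image: tanishkamarrott/reddit:latest
          ports:
            - containerPort: 3000
          resources:
            requests:              # guaranteed by the scheduler
              cpu: 250m
              memory: 256Mi
            limits:                # hard ceiling, so no pod over-consumes
              cpu: 500m
              memory: 512Mi
```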

---

### 2- **`Service.yaml`**

We're exposing the set of pods running the containerised application through a Service of type `LoadBalancer`.
**It listens for traffic on port 80 and forwards it to port 3000** - the port the application listens on within the container.

_**The non-functional aspects I've included:-**_

β˜‘ **I've made use of K8s annotations for Cross-Zone Load Balancing = HA Configuration** - Distributes Traffic evenly across pods in multiple AZs
Network Load Balancer naturally does ensure scalability - ➑️**NLBs = Reducing latency + Improving performance πŸ‘**

β˜‘ Also, **we'll be preserving client source ips are preserved** to ensure a better security, --> this will later help us in implementing WAF NACls that could be associated with the API Gateway fronting the LB --> Enhanced Security β˜‘οΈ

---

### 3- **`Ingress.yaml`**

I'm using this alongside the service object.

> ➑️ At times, when you're having multiple services, I'd not advise creating multiple services of type `LoadBalancer` , That wouldn't be a wise decision, **Instead, I'd advise to use an Ingress Controller to distribute / route the traffic based on the path in the URL**. **You'll simplify your network setup, while saving on extra infra costs.**

β˜‘οΈ In my case, **we will be utilising an ingress controller for its advanced traffic management + SSL termination capabilities.**

> ***Why?*** I will extrapolate this concept to use some ACLs for IP whitelisting and geo-restrictions in conjunction with an AWS API Gateway / WAF, for an even better security posture. 👍

β˜‘οΈ **We've limited the connections and requests per second, helps prevent resource exhaustion and overwhelming** of backend services 🏁 βœ”

---

### 4 - **`Cluster-autoscaler.yaml`** & 5. **`HPA-manifest.yaml`**

_Scaling via Cluster Auto-Scaler and Horizontal Pod Scaler_

πŸ’‘ **We wanted something that could adapt at both at the pod and the node level... Something that can help us scale effectively in Kubernetes and manage workload fluctuations as well.** And, hence we've added `cluster-autoscaler.yaml` and `hpa-manifest.yaml`


--> Cluster Autoscaler 👉 Scales the nodes up and down when there aren't sufficient resources to schedule pods, or when nodes are underutilised.
--> Horizontal Pod Autoscaler 👉 Adjusts the number of pod replicas in a deployment, based on current demand.

☑️ **We're considering CPU utilization as our target metric here.** This helps us maintain optimal application performance irrespective of the fluctuations. A sketch of the HPA manifest follows below.

---

### 6. **`RBAC-config.yaml`**

Why? β†’ **We need to finetune Access Control to Kubernetes resources**, either through User Accounts or Service Accounts

--

### **_Rationale for RBAC:-_**

**If we're looking for a much more auditable and secure K8s environment wherein permissions are scoped, we must create a specific SA** and bind the necessary permissions to a Role 👍 (the ones the application needs for its proper functioning). This Role would then be attached to the SA...

βœ” Advantage A --> We're adhering to the **principle of Least privilege,**

βœ” Advantage B --> **We can also _scope_ permissions to a specific namespace,** if we're looking for granular access control. User Accounts too can be granted specific permissions for the resources they need to access (pods etc.)

### 7. _**K8s Network Policies** - A side note_

➑️ How you could implement it?

_Option 1_ - **You can have a `deny-all` policy, restricting any ingress to all the pods** (as specified by the selector) within a namespace.

_Option 2_ - Or have a specific network policy allowing inbound traffic only from pods of a certain application within the same namespace. This is what's predominantly done when you've got multiple applications --> **We're controlling ingress/egress, but at the pod level!**

**My current use-case doesn't require a policy restricting communication between pods running multiple applications.** And hence, I haven't created a manifest specifically for network-policies
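For reference, a minimal `deny-all` ingress policy (Option 1) looks like this - a sketch only, since it isn't part of this deployment:

```yaml
# Not deployed in this project -- reference sketch of a deny-all ingress policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: default
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed = all inbound traffic denied
```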

---

ArgoCD has been exposed via the LoadBalancer Endpoint. Here are a couple of snapshots:-

_(ArgoCD UI snapshots, accessed via the LoadBalancer endpoint)_

_ArgoCD Pods:-_

Reddit-App-Clone-ArgoCD-pods-running

--

reddit-clone-argocd-pods

--

_My Application's frontend:-_

--

Reddit-App-Clone-App-FrontEnd

---

## Helm, Prometheus & Grafana - Monitoring + Visualisation combined

> πŸ€” I'll give an acronym here, heard about Docker? What does it actually do? It packages the application code, libraries & necessary dependencies into a single package (that's called an artifact). In the same way, Helm would package all K8s resources, like deployments, services. This means it more like a directory structure, packaging all K8s manifests, templates and config values.

## **Is Helm ~ GitOps? How?**

--> Helm lets you manage complex K8s applications

--> It lets you template charts as well. That means it'll enable us to inject values and configurations at runtime.

β–Ά _**Reproducibility and Reusability of K8 manifests**_

> πŸ’‘ **Helm shares some similarity from a conceptual standpoint with GitOps Practices.**
>
> Each time I install a new chart, it creates a new "release". This is a versioned snapshot - Helm keeps track of changes to your deployments.
>
> Just in case a release doesn't go as well as planned, you can roll back to a previous stable version. GitOps extends this to a broader sense, covering both infrastructure provisioning / configuration plus the application deployment aspect...

--

We've added the Helm repo for the `kube-prometheus-stack` chart. This chart bundles all the K8s resources pertaining to Prometheus and Grafana, and lets you set these tools up in your cluster in a way that's fully integrated and easy to manage.

## Prometheus and Grafana

--> Our monitoring and observability tool suite.

Prometheus is actually a time-series database...

βœ… Step 1 --> **It collects data from a wide array of sources,** be it, infra-components, applications or services.....

βœ… Step 2 --> Exporters **--> Expose metrics in a way that can be easily consumed by Prom.**

βœ… Step 3 -->** You can then make use of **PromQL, to query the data,** or have an **Alerting manager setup / integrated with it,** to trigger off notifications for anomalies.


Infrastructure Component
↓
Exporter
(Exposes Metrics)
↓
Prometheus
↓
Querying with PromQL /
Alerting with Alert Manager
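Tying this to our cluster: with the kube-prometheus-stack operator, you declare what Prometheus should scrape via a `ServiceMonitor` object. A hedged sketch - the `release: prometheus` label, port name and metrics path are assumptions about how the chart was installed, not values from my repo:

```yaml
# ServiceMonitor (CRD shipped with kube-prometheus-stack) -- sketch
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: reddit-clone-servicemonitor
  labels:
    release: prometheus            # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: reddit-clone
  endpoints:
    - port: http                   # named port on the target Service exposing /metrics
      path: /metrics
      interval: 30s
```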

--> The Prometheus stack we installed using Helm comes with a Grafana deployment bundled in.

So, what's Grafana? ⏬

Grafana is more of a data visualisation tool 📊
**You can fetch data from any of your data sources - Prometheus in our case -** and create dashboards, graphs and heatmaps. You can have _interactive dashboards with dynamic filtering capabilities_

> **I've seen folks utilising its built-in alerting mechanism - it gels well with notification and reporting tools like email, Slack, etc.**

#### πŸ₯‡πŸ _Prometheus + Grafana = A powerful combo for monitoring and observability into application health & performance_

We've exposed these via a LoadBalancer Endpoint, not NodePort or ClusterIP

> Why? To make Prometheus and Grafana accessible from outside the Kubernetes cluster,
>
> **In my opinion, you should opt for LoadBalancer services.** NodePort can be suitable for smaller setups or environments where specific port access is manageable.
>
> Also, **I wouldn't recommend NodePort from a security perspective.** LoadBalancer, however, offers a more scalable and user-friendly way to expose services --> it distributes traffic and ensures **service reliability and availability.** 👍 A minimal values override is sketched below.
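If you expose them through the chart itself rather than hand-editing the Services, a values override along these lines should do it - the key paths follow the chart's documented layout, so treat this as a sketch rather than the exact file used here:

```yaml
# expose-values.yaml -- sketch; apply with: helm upgrade <release> prometheus-community/kube-prometheus-stack -f expose-values.yaml
prometheus:
  service:
    type: LoadBalancer   # Prometheus reachable outside the cluster
grafana:
  service:
    type: LoadBalancer   # Grafana reachable outside the cluster
```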

_Couple of snaps wrt Prometheus and Grafana:-_

Reddit-App-Clone-App-Pods-prometheus-running

--

Reddit-App-Clone-prometheus-console

--

Reddit-App-Clone-App-Prometheus-Node-Disk-Info

--

_Attached - Grafana snaps:-_

reddit-clone-grafana-dash

--

Reddit-clone-app-grafana-pod-monitoring-dash

--

reddit-clone-grafana-network-io

--

reddit-clone-grafana-completemonitoring-dashboard

--

### What kind of insights do we get from these dashboards? βš›οΈ

_Pod monitoring dashboard:-_

**Metrics + data --> specifically for individual pods.**

> _We'll then be able to gauge underlying issues with the podsπŸ‘ ➑️ This will include pod status, CPU / memory usage, network usage, the volume of logs produced, the number of restarts etc._

_Cluster Monitoring dashboard ☸️ :-_

β†’ **Overall health of the k8s cluster with all its components - nodes, services and deployments**

> Helps us cover the following:-
> - number of deployments & nodes,
> - resource utilisation,
> - health and status of nodes
> - alerting and monitoring notifications.

_Node Monitoring dashboard:-_

It's around **pod allocation to the nodes**, checks for `MemoryPressure` , `OutOfDisk` or any such conditions, **node utilisation metrics** (over-utilised / underutilised), **health and status of the nodes**

---

## _The Logging Suite - EFK Stack_

- Since we're done with the monitoring and alerting aspect, let's turn to collecting, analysing & visualising our logs

You can check out my EFK manifests here:- https://github.com/TanishkaMarrott/EFK-Stack β˜‘οΈ

--

### _Quick dive into what's EFK, and into its workflow_

_**Intent:-**_ Log collection, aggregation and visualisation.

E --> ElasticSearch --> Search & Analytics Engine + Storing, indexing and querying capabilities

F --> FluentD --> Data collector & shipper

K --> Kibana --> Data Viz Tool

**_Explore + Analyse + Visualise Log Data = Making sense of the collected log data in real-time_** 🙂

## _The EFK Workflow_


Data Sources --> They could be log files, shippers, etc

⏬

Fluentd --> Enriching it with metadata, transforming the data into a format suitable for ElasticSearch

⏬

Elasticsearch --> Storing and indexing the data

⏬

Kibana --> Visualization - gaining insights into patterns & trends

Now let's discuss our EFK manifests - applying these K8s manifests creates the EFK deployment.

> πŸ’‘ Let's clarify a few points here.

### Why did we implement the EFK Stack when Prometheus and Grafana were already in place?

Our Rationale :-

**Prometheus and Grafana help you with the "what" factor - the 'metrics'.** What's exactly happening in your system, the status of your application at this point in time.

**EFK is more of a logging suite.**
We can get into the details and context around the "root cause" of a problem through detailed logs --> specific error messages, status codes

➑️ enhanced troubleshooting & incident response πŸ‘ πŸ‘

---

## _Deep-dive into the EFK Manifests:-_

### 1 - **`Namespace.yaml`** :-

Applying this manifest would create a namespace `efklog` for the EFK stack components.

> _**I could have users across multiple teams working on a single cluster**, and **I can make use of namespace-scoped RBAC** - that means users can only access the resources they're intended to._
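The manifest itself is essentially just this (sketch):

```yaml
# namespace.yaml -- creates the namespace the EFK components live in
apiVersion: v1
kind: Namespace
metadata:
  name: efklog
```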

--

### 2 - **`ElasticSearch_Service.yaml`** :-

**We've configured the service to listen for requests on port 9200** (TCP), **and forward them to the `db` port on the `elastic-search-logging` pods**

> πŸ”Ž **_Purpose?_** Helps data sources send logs to the underlying pods --> Aggregation of logs to the ElasticSearch application

--

### 3 - **`ElasticSearch_StatefulSet.yaml`** :-

#### Security + Performance + Data Durability - How?

I'll quickly recapitulate the pointers / non-functional enhancements we've done.

βœ… We're being very specific in the permissions attached to the SA , to be assumed by the ElasticSearch Application pods -- with permissions to `get` resources like `endpoints` , `services` and `namespaces`.**
--> **limiting the operations ElasticSearch can perform.**

βœ… Next, **I wanted things to scale while still being cognizant of the maintained state** -- Remember, ElasticSearch is a distributed database.
--> So, **we've increased the number of replicas.**

➑️ Had to optimise performance as well, **had to define resource requests and limits**. This enabled me to ensure we've got sufficient resources for ElasticSearch, while not overwhelming / overconsuming system resources.

βœ… Simultaneusly, **I had to focus on Data Persistence to improve durability**. Despite pod restarts. Hence, we utilised **`PersistentVolumeClaim`** **Plus, a rolling update strategy for minimal downtime**

## The challenge we faced from a security standpoint

Not going too deep, but this is important. The `vm.max_map_count` setting defines the maximum number of memory-map areas a process may have, and it's crucial for databases like ElasticSearch. We need to set a higher `vm.max_map_count` value (at least 262144) than the usual default.

**In a K8s environment, we cannot adjust system-level settings for the nodes from within the pods by default. 🤔 However, for the application to function optimally, I need to have this modified.**

--

▶️ Interim Solution:- So, we had used an `initContainer`. The `initContainer` runs to completion before the subsequent ElasticSearch containers (this way, they'll have a properly set-up environment).

It must run in privileged mode, nearly equivalent to a root user.
**Not recommended. High risk of privilege escalation in case of a container compromise. We had to change our approach.**

--

➑ What could be the Security Risks? Well, if the container would be compromised by an attacker, he could gain root access to the node, gaining control over other containers and services running on the node. He could access any critical files & configuration settings; could deploy malware, and exfiltrate sensitive data. Endless possibilities, all boiling down to privilege escalation

## _How did we enhance security while solving this pain-point?_

### Approach 1 - Why didn't it work out?

--> Each time a pod restarts or is redeployed, the initContainer checks whether this system-level configuration, `vm.max_map_count`, is 262144 in our case.
--> Once this node-level configuration has been verified or applied, the subsequent ElasticSearch containers run.

> This means
>
> 1) Multiple `initContainers` in multiple pods 🟰 **Multiple Containers running in privileged mode** 🟰 High Risk in terms of security exposure
>
> 2) Besides a higher attack surface area, **it also means a lag in pod startup times,** 🟰 **Operationally inefficient βž• a performance bottleneck**

This means I'm propagating changes from the pod level. Not an ideal approach.

--

### Approach 2

**I'm using a DaemonSet for this purpose.**

--> The DaemonSet creates a pod on each node - current + new.

--> There's a container within this pod that's responsible for modifying the parameter.

**So, it's a one-time execution - the container runs once at node startup, and the setting persists beyond its lifecycle.**

> _So, how does it actually help us?_

βœ… **We've got a consistent application of the required system-level settings** across all nodes in a cluster

βœ… **Automatic application to new nodes in future**

βœ… We're **reducing the overhead that'll be incurred by the application pods,** due to checking or applying system-level settings, as these settings are pre-applied at the node level --> reducing startup times and complexity.

πŸ‘πŸ™‚

--

#### 4 - **`Fluentd_Config_Map.yaml`** :-

We'll be using a ConfigMap to configure FluentD.

Cluster Logs --> Fluentd - Collection & Enrichment --> Forwarded to ElasticSearch for Storage

System Configurations:- Specifications around security settings, like shared keys, and root directory to be used

Input configurations:- Position File to keep track of the logs that've been read. FluentD "tails" logs from all containers across all nodes

Some data enrichment --> JSON parsing of logs; for non-JSON logs, `regexp` parsing.
It also adds some metadata - K8s-specific info - to the log records.
--> All miscellaneous processing, like filtering and scrubbing the data, goes in here

Buffer Configuration below πŸ‘‡

## _Specifics into Buffer configuration and overflow management_

🎯 **We had to ensure there's no data loss during high-volume periods,** or in cases when the downstream system (ES in this case) is temporarily unavailable

- **Mixing both memory and file based buffers :-** πŸ’‘

Memory based buffers --> Instant data access 🟰 Faster data ingestion and subsequent processing.
File based buffers --> Can Persist Data + Higher Disk Capacity 🟰 Can sustain backpressure scenarios / system outages

- Having some **limits on the total buffer size, and the max chunk size** ‡️

One ➑️ there's an upper bound of 512 MB, of the total memory size of the buffer, prevents fluentd from consuming excessive memory resources

Two ➑️ we've also set a limit on the total chunk size, 16 MB for memory, and 2M for file based buffers. Smaller chunks makes them more manageable and can reduce latency, But we need to cognizant of the overhead that may come along. 16 MB is a good figure though.

- A **flush interval of 5 seconds** --> not too short, not too frequent, helps reduce latency in forwarding/flushing the logs

- Our configuration also employs an **exponential retry strategy**, this means that if the logs cannot be directed downstream, for any problem whatsoever, FluentD would retry at increased intervals β˜‘οΈ

- Additionally, we've configured **overflow_action of block**, meaning if the buffer reaches its capacity, Fluentd will block new data from being buffered until space becomes available.

> ➑️ This ensures that in event of buffer overflows, the action would be to block further buffering and prevent unbounded memory usage.

I had to make some provisions for the max queue length limit and the max time interval between two subsequent retries for the file buffers. A sketch of this buffer block follows below.
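For reference, the buffer block described above would look roughly like this inside the ConfigMap - the buffer path and surrounding keys are illustrative; the numbers mirror the limits just discussed:

```yaml
# Excerpt of fluentd_config_map.yaml -- sketch; connection settings trimmed
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: efklog
data:
  output.conf: |
    <match kubernetes.**>
      @type elasticsearch
      # ...connection settings trimmed...
      <buffer>
        @type file                         # survives restarts / backpressure
        path /var/log/fluentd-buffers/kubernetes.buffer
        total_limit_size 512MB             # upper bound on buffer size
        chunk_limit_size 2M                # file-buffer chunk size
        flush_interval 5s                  # not too short, not too infrequent
        retry_type exponential_backoff     # back off on downstream failures
        retry_max_interval 30s
        overflow_action block              # stop accepting data when the buffer is full
      </buffer>
    </match>
```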

The logs are now "processed"

5 - We now direct these processed logs to the ElasticSearch instance on port 9200, over HTTPS. This block configures authentication with ES, sets up buffering for outputting the logs (file-based buffering here), and customises the ElasticSearch index naming pattern - indices are named after the originating namespace and the current date.

## _How did we accelerate query and retrieval times in ES?_

We'll be customising ElasticSearch indexing names - being dynamically populated based on:-

- Name of the namespace it originates from.
- Timestamp (Dates in our case)

βœ… Time-based segmentation helps in implementing lifecycle configurations for storing these logs.

βœ… Namespace segragation --> Organised storage of logs and subsequently a faster retrieval.

**_Use-case?_**

So, in future, **if we'd wish to carry out historical data analysis, we'll be able to clearly delineate logs from different time periods and carry out Trend Analysis / Anomaly Detection**
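The index-naming slice of the same output section, sketched with fluent-plugin-elasticsearch placeholders - the exact prefix format in my repo may differ, so treat this as illustrative:

```yaml
# Index-naming excerpt of output.conf -- sketch; placeholder syntax per fluent-plugin-elasticsearch
data:
  output.conf: |
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch-logging.efklog.svc.cluster.local
      port 9200
      scheme https
      logstash_format true
      logstash_prefix ${$.kubernetes.namespace_name}   # one index family per namespace
      logstash_dateformat %Y.%m.%d                     # plus a per-day date suffix
      <buffer tag, time, $.kubernetes.namespace_name>  # chunk keys make the placeholder resolvable
        timekey 1d
      </buffer>
    </match>
```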

## _Security Enhancements we've primarily focused on_:-

- (FluentD ━ K8s API Server) & (FluentD ━ ElasticSearch) **--> Secured with TLS/SSL**

- (FluentD ━ FluentD) **--> Authentication via the `shared_key`** .

> _Why? Because FluentD forwarder instances interact with the FluentD log-aggregator instance; in such a case, we need to verify the authenticity of the connection._

- **Hostname Verifications** :- Prevent man-in-the-middle attacks

- **Filtering out Sensitive data** before logs aggregation

> ➑️ None of the sensitive secrets lie exposed in the cluster logs.

- There've been multiple measures implemented to ensure that **resource usage is bounded, and thus preventing a DDoS Attack.**

#### 5. **`Fluentd_DaemonSet.yaml`** :-

Tasks we performed here:-

1- **Created an SA that will be assumed by the FluentD application pods** to communicate with the K8s API Server

2- A **ClusterRole consisting of the permissions that'll be needed by FluentD** for collecting cluster-wide logs

3- A **ClusterRoleBinding that binds this ClusterRole to the SA**, for the pods to inherit these permissions.

4 - DaemonSet for FluentD

### Why did we deploy FluentD as a DaemonSet in this stack?

_Reason 1_ --> FluentD is a log forwarder. It doesn't need to be stateful.

_Reason 2_ --> Moreover, we need to ensure that a FluentD pod runs across all the nodes of the cluster.

_Reason 3_ --> FluentD needs to collect logs from node-specific paths like `/var/log`. Had it been a StatefulSet, we would have been constrained to use stable network IDs or persistent storage, which is out of context for the use-case at hand.

> Running FluentD as a DaemonSet means **I'm comprehensively covering nodes all across the cluster, thus ensuring cluster-wide log collection + forwarding** πŸ‘
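Trimmed down, the pieces that map to the tasks above look roughly like this - the ServiceAccount and ClusterRoleBinding are omitted for brevity, and the names and image tag are illustrative:

```yaml
# Excerpt of fluentd_daemonset.yaml -- sketch
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]   # metadata FluentD attaches to log records
    verbs: ["get", "list", "watch"]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: efklog
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd       # bound to the ClusterRole via a ClusterRoleBinding
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch7-1
          volumeMounts:
            - name: varlog
              mountPath: /var/log       # node-local log path -- why a DaemonSet fits here
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```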

#### 6. `Kibana_Deployment.yaml`

Okay, so let's focus on the **non-functional aspects we've tried to incorporate in this deployment**

πŸ’ - **Multiple replicas of Kibana pods,** β–Ά **Higher availability.** Minimized downtime πŸ‘

πŸ’ - These containers will be running as a non-root user (`runAsUser: 1000`).

**We've made it a point to explicitly set `runAsNonRoot: true`**
---> _Low PrivEsc Risks_

πŸ’ - **We used a seccomp profile to enable admins limit the system calls a container can make.**

> Seccomps profile has wide applications **in protecting against kernel-level exploits.**

πŸ’ - **Having specified CPU and memory requests and limits, helps me in a dual manner.** One, we've got sufficient resources for Kibana Containers for maintaining a stable operation, while still preventing them from over-consuming resources, affecting my other services ▢️ Efficient Resource Management

πŸ’ - **Liveness + Readiness Probes.**
Readiness = When a Kibana pod is ready to start accepting traffic, Liveliness = Checks if the pod requires a restart

πŸ’ - **Plus a PVC - persistent volume claim to preserve the application's state** across restarts.

#### 7. `Kibana_Service.yaml`

What're we essentially doing? **We're defining a Service for users to access the Kibana dashboard from outside the k8s cluster**... β˜‘οΈ

The LoadBalancer type automatically provisions an external load balancer (supported by the cloud provider) & assigns it a public endpoint that routes to Kibana (port 5601)

> ➡️ **This means users can interact with Kibana's UI by visiting `http://<EXTERNAL-IP>:5601`, where `<EXTERNAL-IP>` is the LB's address**

_Kibana Snapshots:-_

Reddit-App-clone-kibana-1

--

Reddit-App-Clone-App-Kibana-2

--

Reddit-App-Clone-App-Kibana-3

--

## Wrapping it up!

A big thank you for accompanying me on this journey. It was an absolutely amazing experience!! 😊

--

In my humble opinion, by integrating tools like Jenkins & ArgoCD, the security tool integrations, and our logging/monitoring suite --> Prometheus, Grafana and the EFK stack within the K8s ecosystem,

**we have essentially built something that:-**
1- **Holds good potential to streamline development and deployment processes**
2- **Is very well aligned, I'd say, with fast-paced business requirements**

Also, I've done my best in **improving this architectural workflow from a non-functional standpoint** ➡️ Security, scalability, performance and fault tolerance → all have been taken into account while creating this workflow 👍

**Key takeaway:-**
> #### **From an agility and security standpoint, if you actually wish to "_create value_", it is absolutely important to ingrain DevSecOps principles from the very beginning** πŸ’‘

--

I've just scratched the surface - there's a long way to go in refining and creating even better, resilient cloud solutions! 😊

### Suggestions for Potential Improvements:-

Absolutely welcome!
**You're warmly invited to contribute via a pull request or reach out directly at [email protected] for any inquiries** or collaboration opportunities. Additionally, **connect with me on LinkedIn - https://www.linkedin.com/in/tanishka-marrott/** - to stay updated on my latest projects and professional endeavors.

### Acknowledgments:
Grateful to Mudit Mathur and to Sridhar Modalavalasa for their insightful blogs on DevSecOps. A special thank you to the **AWS Well-Architected documentation** for serving as my de-facto guide throughout this journey. πŸ™