Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jorgermduarte/real-time-data-architecture-kafka-flink-dw-k8s
Real-time data processing architecture using Apache Kafka, Flink, and Kubernetes. This project demonstrates how to build a scalable and resilient pipeline for streaming data, performing ETL with Flink, and storing the processed data in a Data Warehouse for analysis.
apache big-data data-pipeline data-warehouse distributed-systems etl flink kafka kubernetes real-time streaming
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/jorgermduarte/real-time-data-architecture-kafka-flink-dw-k8s
- Owner: jorgermduarte
- License: mit
- Created: 2024-10-15T18:47:36.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-12-09T01:15:17.000Z (about 2 months ago)
- Last Synced: 2024-12-15T17:53:07.432Z (about 2 months ago)
- Topics: apache, big-data, data-pipeline, data-warehouse, distributed-systems, etl, flink, kafka, kubernetes, real-time, streaming
- Language: TypeScript
- Homepage:
- Size: 42.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Project Description
This project showcases a local deployment using **Kubernetes with Minikube** to simulate a microservices-based architecture. The environment is structured to demonstrate how various components, such as API gateways, Kafka clusters, and Flink consumers, can be orchestrated to work together for real-time data processing.
## Key Components:
1. **API Gateway**: Acts as an entry point for user requests, balancing the load across multiple backend API instances.
2. **Load Balancer**: Distributes traffic evenly across multiple backend API instances to ensure high availability and fault tolerance.
3. **Backend API**: A microservice responsible for handling business logic, producing events to Kafka, and interacting with other services.
4. **Redis Cache**: Used as an in-memory data store to optimize performance by caching frequently accessed data, reducing load on the backend services and databases.
5. **Kafka Cluster**: Handles event-driven communication, producing and consuming messages through topics like `mock-user-topic`.
6. **Flink Consumer**: Ingests data from Kafka, performing ETL (Extract, Transform, Load) tasks and pushing the processed data to a **Data Warehouse**.
7. **Node Consumer**: A Node.js alternative to the Flink consumer that likewise ingests data from Kafka, performs ETL, and pushes the processed data to the **Data Warehouse** (see the sketch after this list).
8. **Data Warehouse (DW)**: Stores processed data, making it accessible for analysis and reporting.
9. **Monitoring and Logging**: Services such as Prometheus and Grafana are used to track system health, performance metrics, and logs.
10. **Data Dictionary**: Located in the `data-dictionary` directory; it can be run locally.
11. **Scripts**: The project contains scripts to populate the Oracle DW, and scripts to back up and load the Oracle database.
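To make the Kafka-to-warehouse flow concrete, below is a minimal TypeScript sketch of a consumer loop in the style of the `node-consumer` service. It assumes the `kafkajs` client; the broker address (`kafka-service:9092`), consumer group, and event shape are illustrative placeholders, and only the `mock-user-topic` name comes from the project.

```typescript
import { Kafka } from "kafkajs";

// Hypothetical in-cluster broker address and consumer group.
const kafka = new Kafka({ clientId: "node-consumer", brokers: ["kafka-service:9092"] });
const consumer = kafka.consumer({ groupId: "dw-etl-group" });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "mock-user-topic", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;

      // Extract: parse the raw event produced by the backend API.
      const event = JSON.parse(message.value.toString());

      // Transform: reshape into a warehouse row (field names assumed).
      const row = { userId: event.id, payload: event, loadedAt: new Date().toISOString() };

      // Load: hand the row to the warehouse writer (stubbed here; the real
      // project loads into the Oracle DW).
      await writeToWarehouse(row);
    },
  });
}

// Stub standing in for the Oracle DW insert.
async function writeToWarehouse(row: unknown): Promise<void> {
  console.log("would load:", row);
}

run().catch(console.error);
```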
## Local Environment:
- **Minikube** is used to run this Kubernetes setup locally for demonstration purposes. Minikube allows all services to run within a single-node cluster, emulating a real-world microservices architecture. The environment is configured with several deployments, including Kafka, Flink, Oracle DB, Redis, Prometheus, and Grafana, to mimic a distributed system.
## Future Expansion:
- In a production-ready setup, the architecture would scale across multiple **Kubernetes clusters**. The **master node** would manage **worker nodes**, ensuring scalability, high availability, and failover. A more distributed architecture would let the platform handle real-world production loads with greater reliability and flexibility.
## Starting Minikube
```
minikube start --cpus 5 --memory 9192 --disk-size=50g --driver=docker
minikube dashboard
docker context ls
# eval $(minikube -p minikube docker-env)
```
## Publishing everything to Minikube for the first time
```
kubectl apply -f .\kubernetes\deployment\zookeper-deployment.yaml
kubectl apply -f .\kubernetes\deployment\redis-deployment.yaml
kubectl apply -f .\kubernetes\deployment\kafka-deployment.yaml
# kubectl apply -f .\kubernetes\jobs\kafka-topics-job.yaml -- not required, since the producers/consumers will create them automatically

# build & deploy the node-backend-api image
docker build -t node-backend-api:latest ./node-backend-api
minikube image load node-backend-api:latest
kubectl apply -f .\kubernetes\deployment\node-backend-api-deployment.yaml

# build & deploy the data warehouse image
kubectl apply -f .\kubernetes\deployment\oracle-db-deployment.yaml
docker build -t data-warehouse-app:latest .\datawarehouse\
minikube image load data-warehouse-app:latest
kubectl apply -f .\kubernetes\jobs\data-warehouse-job.yaml

# build & deploy the flink-consumer image
# docker build -t flink-consumer:latest ./flink-consumer
# minikube image load flink-consumer:latest
# kubectl apply -f .\kubernetes\deployment\flink-consumer-deployment.yaml

# build & deploy the node-consumer image
docker build -t node-consumer:latest ./node-consumer
minikube image load node-consumer:latest
kubectl apply -f .\kubernetes\deployment\node-consumer-deployment.yaml

# deploy cmac
kubectl apply -f .\kubernetes\deployment\cmac-deployment.yaml

# build & deploy the gateway
docker build -t gateway-app:latest .\Gateway\Gateway
minikube image load gateway-app:latest
kubectl apply -f .\kubernetes\deployment\gateway-deployment.yaml

# monitoring (optional)
# kubectl apply -f .\prometheus-deployment.yaml
# kubectl apply -f .\grafana-deployment.yaml
```
## Useful Minikube commands
```
# list pods
kubectl get pods
# forward a local port to a pod:
kubectl port-forward <pod-name> 3001:3001
```
## Connecting in Power BI
User: jorgermduarte
Password: 123456
Server: localhost:1521/jorgermduarte
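Before wiring Power BI up, the same connection details can be smoke-tested from Node. A minimal sketch, assuming the `node-oracledb` driver is installed and the Oracle pod's port 1521 has been forwarded to localhost (e.g. via `kubectl port-forward`):

```typescript
import oracledb from "oracledb";

// Connectivity check using the same credentials Power BI uses above.
async function main(): Promise<void> {
  const connection = await oracledb.getConnection({
    user: "jorgermduarte",
    password: "123456",
    connectString: "localhost:1521/jorgermduarte",
  });

  // Trivial test query against Oracle's built-in dual table.
  const result = await connection.execute("SELECT 1 FROM dual");
  console.log("connected, test query returned:", result.rows);

  await connection.close();
}

main().catch(console.error);
```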