https://github.com/julio-analyst/debezium-cdc-mirroring
https://github.com/julio-analyst/debezium-cdc-mirroring
cdc data-mirroring debezium kafka-connect postgresql real-time-replication stream-processing
Last synced: 8 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/julio-analyst/debezium-cdc-mirroring
- Owner: Julio-analyst
- License: mit
- Created: 2025-07-08T17:23:14.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-08T17:46:50.000Z (12 months ago)
- Last Synced: 2025-07-08T18:53:49.875Z (12 months ago)
- Topics: cdc, data-mirroring, debezium, kafka-connect, postgresql, real-time-replication, stream-processing
- Language: HTML
- Homepage:
- Size: 28.4 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Debezium CDC Mirroring: Real-time PostgreSQL Replication
> Log-based data replication pipeline using Debezium, Kafka, Kafka Connect, and PostgreSQL with dynamic testing capabilities.

## π§ **FINAL STATUS: COMPREHENSIVE MONITORING COMPLETE**
β
**ALL SCRIPTS VERIFIED & WORKING** - Semua script telah ditest dan berfungsi sempurna
β
**REAL-TIME DATA CONFIRMED** - Tidak ada data hardcoded, semua live dari sistem
β
**CONTAINER STATS INTEGRATED** - CPU, Memory, Network I/O, PIDs untuk semua container
β
**DETAILED OUTPUT IMPLEMENTED** - Semua metrics ditampilkan di terminal dalam format readable
### **π― ENHANCED OUTPUT FEATURES:**
- **Container Analysis**: Detailed per-phase stats untuk setiap container (CPU%, Memory%, Network I/O, PIDs)
- **System Resources**: CPU/Memory trends dengan change indicators, Disk usage, Network throughput
- **Database Performance**: Record counts, sync analysis, database size, connections, transactions
- **Kafka & Connect**: Topics, consumer groups, connector status, task details
- **Latency Analysis**: Individual test results, averages, quality assessment (EXCELLENT/GOOD/FAIR/POOR)
- **Error & Log Analysis**: Per-container error counts, recent error previews
- **Replication Monitoring**: WAL position, replication slots status
- **Session Summary**: Monitoring times, processing duration, sync percentage
---
---
## π Overview
This project demonstrates a **real-time data replication** architecture using **Debezium** and **Apache Kafka** to capture changes (CDC) from a PostgreSQL source database and mirror them into a PostgreSQL target database.
### π Context:
* **Source DB**: `inventory`
* **Source Schema**: `inventory`
* **Source Table**: `orders`
* **Target DB**: `postgres`
* **Target Schema**: `public`
* **Target Table**: `orders`
π *Note: You can skip dropping foreign keys by pre-populating the referenced data. See example datasets below.*
### π New Dynamic Testing Features:
* **Real-time Data Fetching**: No hardcoded IDs or values
* **Environment-aware Configuration**: Supports config files and environment variables
* **Auto-discovery**: Automatically detects Docker containers and database schemas
* **Dynamic Data Generation**: Uses live database data for realistic testing
* **Configurable Performance Testing**: Adaptive batch sizing and concurrent operations
---
## π‘ Why CDC & Streaming?
Synchronizing data across systems in real time is a challenge. Traditional ETL tools introduce latency, and direct queries often overload production databases.
**Debezium** offers a **non-intrusive, log-based mechanism** to stream changes efficiently using Kafka β making it ideal for:
* Real-time backups
* Microservice synchronization
* Streaming data to analytics pipelines
* Data validation and monitoring
---
## π Data Flow Architecture
```
[Postgres Source] β [Debezium Source Connector] β [Kafka Broker] β [JDBC Sink Connector] β [Postgres Target]
```
**Components:**
* **Postgres Source**: Origin DB using WAL (Write-Ahead Log)
* **Debezium**: Captures changes in real time
* **Kafka Broker + Zookeeper**: Streams changes across connectors
* **Kafka Connect (JDBC Sink)**: Pushes data to target
* **Postgres Target**: Receives updates
* **Dynamic Testing Suite**: Real-time performance and validation testing
---
## π Project Structure
```
debezium-cdc-mirroring/
ββ docker-compose-postgres.yaml # Main deployment file
ββ inventory-source.json # Debezium connector config
ββ pg-sink.json # JDBC sink config
ββ requirements.txt # Python dependencies
ββ plugins/
β ββ debezium-connector-postgres/ # Debezium PostgreSQL connector
β ββ confluentinc-kafka-connect-jdbc/ # JDBC sink connector
ββ scripts/
β ββ insert_debezium.ps1 # Insert stress test script
β ββ monitoring_debezium.ps1 # Pipeline monitoring script
ββ docs/
β ββ erd.png # Entity Relationship Diagram
β ββ Scripts-Quick-Reference.md # Usage guide
β ββ Scripts-Documentation.md # Output analysis
ββ testing-results/ # Auto-generated logs
ββ README.md
```
---
## π Quick Start Guide
### β
Step 1: Clone & Spin Up Docker
```bash
git clone https://github.com/Julio-analyst/debezium-cdc-mirroring.git
cd debezium-cdc-mirroring/
docker compose -f docker-compose-postgres.yaml up -d
```
### β
Step 2: Register Connectors & Run Scripts
```bash
curl -X POST -H "Content-Type: application/json" --data "@inventory-source.json" http://localhost:8083/connectors
curl -X POST -H "Content-Type: application/json" --data "@pg-sink.json" http://localhost:8083/connectors
```
## PowerShell Scripts
```powershell
# Set execution policy (jika diperlukan)
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
# Jalankan insert test
.\scripts\insert_debezium.ps1
```

```powershell
# Monitor pipeline performance
.\scripts\monitoring_debezium.ps1
```

**Custom Insert Parameters:**
```powershell
# Test ringan
.\scripts\insert_debezium.ps1 -RecordCount 500 -BatchSize 50
# Test standar dengan progress
.\scripts\insert_debezium.ps1 -RecordCount 2000 -BatchSize 200 -ShowProgress
# Stress test
.\scripts\insert_debezium.ps1 -RecordCount 10000 -BatchSize 1000 -DelayBetweenBatches 0
```
### β
Step 3: Check Connected Connectors
Untuk melihat daftar connector yang sudah aktif di Kafka Connect:
**PowerShell:**
```powershell
Invoke-WebRequest -Uri http://localhost:8083/connectors | Select-Object -ExpandProperty Content
```
**Command Prompt / Git Bash:**
```bash
curl http://localhost:8083/connectors
```
Untuk melihat status detail connector tertentu:
**PowerShell:**
```powershell
Invoke-WebRequest -Uri http://localhost:8083/connectors//status | Select-Object -ExpandProperty Content
```
**Command Prompt / Git Bash:**
```bash
curl http://localhost:8083/connectors//status
```
Ganti `` dengan nama connector yang ingin dicek.
### β
Step 3: Check Source Table Structure
```bash
docker exec -it debezium-cdc-mirroring-postgres-1 psql -U postgres -d inventory
\d inventory.orders
```
### β
Step 4: Check Existing Data
```sql
SELECT * FROM inventory.orders;
```
### β
Step 5: Modify Source Table (Optional for CRUD Testing)
```sql
ALTER TABLE inventory.orders DROP CONSTRAINT IF EXISTS orders_purchaser_fkey;
ALTER TABLE inventory.orders DROP CONSTRAINT IF EXISTS orders_product_id_fkey;
ALTER TABLE inventory.orders ADD COLUMN keterangan TEXT DEFAULT '';
```
### β
Step 6: Perform CRUD Operations
#### πΉ Insert
```sql
INSERT INTO inventory.orders(order_date, purchaser, quantity, product_id, keterangan)
VALUES ('2025-07-08', 1002, 3, 107, 'CDC TEST');
```
#### πΉ Update
```sql
UPDATE inventory.orders
SET keterangan = 'UPDATED FROM SOURCE'
WHERE purchaser = 999;
```
#### πΉ Delete
```sql
DELETE FROM inventory.orders
WHERE purchaser = 999;
```
π *Alternative to dropping FK: Pre-insert into `customers` and `products`:*
```sql
INSERT INTO inventory.customers(id, first_name, last_name, email) VALUES (999, 'Dummy', 'Customer', 'dummy@mail.com');
INSERT INTO inventory.products(id, name, description, weight) VALUES (999, 'Dummy Product', 'test', 1);
```
### β
Step 7: Check Replication Result in Target DB
```bash
docker exec -it debezium-cdc-mirroring-target-postgres-1 psql -U postgres -d postgres
SELECT * FROM public.orders;
```
---
## π Scripts & Monitoring
### π Insert Data Testing
Script `insert_debezium.ps1` melakukan stress test insert data ke database source dengan fitur:
- Real-time data fetching dari customers dan products
- Configurable batch size dan record count
- Resource monitoring per batch
- Automatic logging ke folder `testing-results/`
### Pipeline Monitoring
Script `monitoring_debezium.ps1` memberikan monitoring komprehensif:
- Container health dan resource usage
- Database connection dan performance
- Kafka topics dan consumer groups
- CDC connector status dan replication lag
- WAL monitoring dan slot status
- Performance metrics dan recommendations
### π Output & Logs
Semua hasil test dan monitoring tersimpan di:
- `testing-results/cdc-stress-test-[timestamp].log` - Detail insert test
- `testing-results/cdc-resource-usage-[timestamp].log` - Resource monitoring
**Contoh Performance Results:**
```
Test Duration: 00:00:44
Throughput: 22.51 operations/second
Success Rate: 100%
Average Batch Time: 184.45ms
Sync Status: SYNCHRONIZED
```
### Documentation
- **[Scripts Quick Reference](docs/Scripts-Quick-Reference.md)** - Panduan penggunaan lengkap
- **[Scripts Documentation](docs/Scripts-Documentation.md)** - Analisis output dan hasil
---
## π§ Advanced Configuration
### Parameter Optimization
- **Light Test**: `RecordCount=500, BatchSize=50`
- **Standard Test**: `RecordCount=2000, BatchSize=200`
- **Stress Test**: `RecordCount=10000, BatchSize=1000`
- **Optimal Performance**: `BatchSize=500` (sweet spot)
### Monitoring Strategy
```powershell
# Pre-test baseline
.\scripts\monitoring_debezium.ps1
# Execute test
.\scripts\insert_debezium.ps1 -RecordCount 5000 -BatchSize 500
# Post-test analysis
.\scripts\monitoring_debezium.ps1
```
---
## π‘οΈ View Events and Validate
### π Option A: Via CMD
```bash
docker exec -it kafka-tools kafka-console-consumer --bootstrap-server kafka:9092 --topic dbserver1.inventory.orders --from-beginning
```
### π Option B: Via Web UI (Kafdrop)
Access: [http://localhost:9000](http://localhost:9000)
* Inspect topics, partitions, and payload in browser
### π§ Check Running Connectors
```bash
curl http://localhost:8083/connectors
```
---
## π© Shutdown & Clean Up
After you're done:
```bash
# Option A (with volume deletion)
docker compose -f docker-compose-postgres.yaml down -v
# Option B (keep volume data)
docker compose -f docker-compose-postgres.yaml down
```
**β οΈ Donβt forget to stop containers after use to avoid memory or port issues.**
---
## β
Optional: Connect via DBeaver
> Create a PostgreSQL connection for both source and sink:
### πΉ Source (inventory):
* Host: `localhost`
* Port: `5432`
* Database: `inventory`
* User: `postgres`
* Password: `postgres`
### πΉ Target (postgres):
* Host: `localhost`
* Port: `5433`
* Database: `postgres`
* User: `postgres`
* Password: `postgres`
Click **Test Connection**, then **Finish**.
---
## ποΈ ERD & Sample Data
### π§© ERD
> Below is the simplified ERD of the source `inventory` database.

### π Sample Data Extract
```sql
-- Sample: customers
SELECT * FROM inventory.customers;
-- Sample: products
SELECT * FROM inventory.products;
-- Sample: orders
SELECT * FROM inventory.orders;
```
---
## π οΈ Tech Stack
* **Debezium 2.6** - PostgreSQL CDC connector
* **Apache Kafka & Kafka Connect** (Confluent) - Message streaming
* **PostgreSQL** - Source and target databases
* **Docker Compose** - Container orchestration
* **Kafdrop** - Kafka Web UI for monitoring
* **PowerShell Scripts** - Automated testing and monitoring
---
## π References & Documentation
* **[Scripts Quick Reference](docs/Scripts-Quick-Reference.md)** - Complete usage guide
* **[Scripts Documentation](docs/Scripts-Documentation.md)** - Output analysis and troubleshooting
* [Debezium Documentation](https://debezium.io/documentation/)
* [Kafka Connect JDBC Sink](https://docs.confluent.io/kafka-connect-jdbc/current/index.html)
* [Docker Compose Guide](https://docs.docker.com/compose/)
---