https://github.com/adudko/streaming-data-loader
Tool for efficient loading and processing of streaming data from Kafka to Elasticsearch
asynchronous bulk-api docker-compose-infra elasticsearch kafka-consumer streaming-data-processing
- Host: GitHub
- URL: https://github.com/adudko/streaming-data-loader
- Owner: aDudko
- License: mit
- Created: 2025-03-22T22:13:15.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-03-23T04:43:05.000Z (about 1 year ago)
- Last Synced: 2025-03-23T05:24:17.526Z (about 1 year ago)
- Topics: asynchronous, bulk-api, docker-compose-infra, elasticsearch, kafka-consumer, streaming-data-processing
- Language: Python
- Homepage:
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Streaming Data Loader
[Python](https://www.python.org)
[Kafka](https://kafka.apache.org/)
[Elasticsearch](https://www.elastic.co/elasticsearch/)
[Prometheus](https://prometheus.io/)
[Grafana](https://grafana.com/)
[asyncio](https://docs.python.org/3/library/asyncio.html)
[pytest](https://docs.pytest.org)
[Docker](https://www.docker.com/)
[Kubernetes](https://kubernetes.io/)
[License](LICENSE)
---
## Overview
**Streaming Data Loader** is a high-performance, async-powered microservice for processing data streams from **Kafka**
and bulk-loading them into **Elasticsearch**. It combines modern Python tooling with observability best practices to
provide **reliable, scalable, and debuggable data pipelines**.
It's fast, resilient, and production-ready: ideal for teams that need a lightweight alternative to complex ETL systems.
---
## Key Features
- **Asynchronous processing** with `asyncio`, `aiokafka`, and `aiohttp`
- **Batch insertions** for throughput efficiency
- **Retry & fault-tolerant** logic for Kafka and Elasticsearch
- **Configurable** via `.env` and `pydantic-settings`
- **Docker & Kubernetes ready**
- **Prometheus + Grafana** monitoring included
- **Tested** with `pytest`, including integration scenarios
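The batching idea behind these features can be sketched with stdlib `asyncio` alone; `batch_messages` and its parameters are illustrative names, not the project's actual API:

```python
import asyncio
from typing import AsyncIterator, List

async def batch_messages(
    source: AsyncIterator[dict],
    max_size: int = 500,
    max_wait: float = 1.0,
) -> AsyncIterator[List[dict]]:
    """Group incoming messages into batches, flushing on size or timeout."""
    batch: List[dict] = []
    ait = source.__aiter__()
    while True:
        try:
            batch.append(await asyncio.wait_for(ait.__anext__(), timeout=max_wait))
        except asyncio.TimeoutError:
            if batch:  # flush a partial batch when the stream goes quiet
                yield batch
                batch = []
            continue
        except StopAsyncIteration:
            if batch:  # flush the tail before finishing
                yield batch
            return
        if len(batch) >= max_size:
            yield batch
            batch = []
```

A real consumer would feed this generator from `aiokafka` and hand each yielded batch to the bulk loader.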
---
## Quick Start
### Docker Compose
```bash
docker-compose up --build
```
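For orientation, the compose stack presumably wires the loader to its dependencies roughly like the sketch below; service names and images are assumptions inferred from the URLs above, not the repository's actual file:

```yaml
services:
  loader:
    build: .
    env_file: .env
    depends_on: [kafka, elasticsearch]
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```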
- http://localhost:9090 → Prometheus
- http://localhost:3000 → Grafana (admin / admin)
- http://localhost:8080 → Kafka UI
### Kubernetes
#### Step 1: Deploy
```bash
./k8s-deploy.sh
```
#### Step 2: Cleanup
```bash
./k8s-clean.sh
```
---
## Architecture
This project uses **Hexagonal Architecture** (Ports and Adapters), ensuring modularity, extensibility, and clean separation of concerns.
```text
Kafka --> KafkaConsumerService --> EventService --> ElasticsearchClientService --> Elasticsearch
                    |                   |
                    +-------------------+--> Metrics + Logging (Prometheus + JSON logs)
```
### Layers
- **Input Ports**: Kafka Consumer (aiokafka), deserialization, batching
- **Application Core**: Event transformation, validation, retry logic
- **Output Ports**: Async Elasticsearch client, bulk insert, failure handling
- **Infrastructure**: Docker, Kubernetes, logging, metrics, monitoring
---
## Why Choose This Over Logstash, Flume, etc.?
- ✅ **True async** data pipeline for lower latency and better throughput
- ✅ **No heavyweight config DSL**: just Python code, `pyproject.toml`, and `.env`
- ✅ **Built-in retries & fault handling**: robust out of the box
- ✅ **JSON logging** and metric labeling for full observability
- ✅ **Open-source & customizable**: a good fit for modern data teams
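The retry behavior can be approximated with a small exponential-backoff helper like the one below; this is a sketch of the pattern, not the project's actual implementation:

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retries(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Retry an async operation with exponential backoff; re-raise on exhaustion."""
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")
```

The same wrapper works for both sides of the pipeline: a Kafka poll or an Elasticsearch bulk request.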
---
## Observability
Prometheus scrapes metrics on `/metrics` (port `8000`). Dashboards are automatically provisioned in Grafana.
| Metric | Description |
|-------------------------------------|------------------------------------|
| `messages_processed_total` | Total number of processed messages |
| `errors_total` | Total errors during processing |
| `consume_duration_seconds` | Time spent reading from Kafka |
| `response_duration_seconds` | Time to insert into Elasticsearch |
| `transform_duration_seconds` | Time spent transforming messages |
| `batch_processing_duration_seconds` | Full batch processing time |
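With `prometheus_client` these metrics would be `Counter`s and `Histogram`s observed around each stage; a minimal stdlib-only stand-in of that pattern looks like this (the metric names match the table, the `Metrics` recorder itself is illustrative):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Metrics:
    """Tiny stand-in for a Prometheus registry: counters plus duration samples."""
    def __init__(self) -> None:
        self.counters = defaultdict(int)
        self.durations = defaultdict(list)

    def inc(self, name: str, value: int = 1) -> None:
        self.counters[name] += value

    @contextmanager
    def timer(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name].append(time.perf_counter() - start)

metrics = Metrics()

with metrics.timer("batch_processing_duration_seconds"):
    metrics.inc("messages_processed_total", 500)  # pretend a 500-message batch landed
```

In the real service the registry is served over HTTP on `/metrics` for Prometheus to scrape.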
---
## Testing
```bash
pytest -v
```
Includes:
- ✅ Unit tests
- ✅ Integration tests (Kafka → ES)
- ✅ Metrics verification
- ✅ Config validation
---
## Technologies
- `Python 3.12` + `asyncio`
- `Kafka` + `aiokafka`
- `Elasticsearch` `Bulk API`
- `Pydantic`, `dotenv`, `poetry`
- `Prometheus`, `Grafana`
- `Docker`, `docker-compose`
- Kubernetes-ready
- JSON logging (`python-json-logger`)
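`python-json-logger` plugs a JSON formatter into stdlib `logging`; the effect can be approximated with the stdlib alone (a sketch of the idea, not the project's logger setup):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("streaming-data-loader")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("batch flushed")  # emits a one-line JSON record
```

One-line JSON records are what make the Grafana/Loki-style log pipelines searchable by field rather than by regex.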
---
## Project Structure
```text
streaming-data-loader/
├── configs/              # Prometheus / Grafana
├── src/                  # Main source code
│   ├── domain/
│   ├── ports/
│   ├── services/         # Event processing logic
│   ├── config.py         # Settings & env config
│   ├── logger.py         # JSON logger setup
│   └── metrics.py        # Prometheus metrics
├── tests/                # Unit & integration tests
├── k8s/                  # Kubernetes manifests
├── docker-compose.yml
├── Dockerfile
├── deploy.sh / clean.sh
└── pyproject.toml
```
---
## If you find this useful...
...give it a star, fork it, or mention it in your next data project!
## Author
**Anatoly Dudko**
[GitHub @aDudko](https://github.com/aDudko) β’ [LinkedIn](https://www.linkedin.com/in/dudko-anatol/)