
# Streaming Data Loader

[![Python](https://img.shields.io/badge/Python-3.12-blue.svg)](https://www.python.org)
[![Kafka](https://img.shields.io/badge/Kafka-Streaming-red)](https://kafka.apache.org/)
[![Elasticsearch](https://img.shields.io/badge/Elasticsearch-Search-yellow)](https://www.elastic.co/elasticsearch/)
[![Prometheus](https://img.shields.io/badge/Monitoring-Prometheus-orange)](https://prometheus.io/)
[![Grafana](https://img.shields.io/badge/Dashboards-Grafana-brightgreen)](https://grafana.com/)
[![AsyncIO](https://img.shields.io/badge/Async-Enabled-lightgrey)](https://docs.python.org/3/library/asyncio.html)
[![CI](https://img.shields.io/badge/Tests-Pytest-green)](https://docs.pytest.org)
[![Docker](https://img.shields.io/badge/Docker-Ready-blue)](https://www.docker.com/)
[![Kubernetes](https://img.shields.io/badge/K8s-Supported-blueviolet)](https://kubernetes.io/)
[![License](https://img.shields.io/badge/License-MIT-lightgrey)](LICENSE)

---

## Overview

**Streaming Data Loader** is a high-performance, async-powered microservice for processing data streams from **Kafka**
and bulk-loading them into **Elasticsearch**. It combines modern Python tooling with observability best practices to
provide **reliable, scalable, and debuggable data pipelines**.

It’s fast, resilient, and production-ready: ideal for those who need a lightweight alternative to complex ETL systems.

---

## Key Features

- **Asynchronous processing** with `asyncio`, `aiokafka`, and `aiohttp`
- **Batch insertions** for throughput efficiency
- **Retry & fault-tolerant** logic for Kafka and Elasticsearch
- **Configurable** via `.env` and `pydantic-settings`
- **Docker & Kubernetes ready**
- **Prometheus + Grafana** monitoring included
- **Tested** with `pytest`, including integration scenarios
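The batching pattern behind these features can be sketched with plain `asyncio` (an illustrative sketch, not the project's actual code): messages accumulate until the batch fills or an idle timeout fires, then the whole batch is flushed at once.

```python
import asyncio

async def batch_consume(queue: asyncio.Queue, flush, *,
                        batch_size: int = 3, flush_interval: float = 0.2):
    """Collect messages and flush when the batch fills or the timer expires.

    A `None` item is used as a shutdown sentinel (an assumption for this sketch).
    """
    batch = []
    while True:
        try:
            msg = await asyncio.wait_for(queue.get(), timeout=flush_interval)
        except asyncio.TimeoutError:
            # Idle timeout: flush a partial batch so it is not delayed forever.
            if batch:
                await flush(batch)
                batch = []
            continue
        if msg is None:               # sentinel: final flush, then stop
            if batch:
                await flush(batch)
            return
        batch.append(msg)
        if len(batch) >= batch_size:  # full batch: flush immediately
            await flush(batch)
            batch = []

async def demo():
    flushed = []

    async def flush(batch):
        flushed.append(list(batch))

    queue = asyncio.Queue()
    for i in range(7):
        queue.put_nowait(i)
    queue.put_nowait(None)
    await batch_consume(queue, flush)
    return flushed
```

In the real service the `flush` callback would be an Elasticsearch bulk insert; decoupling it this way keeps the batching loop unit-testable.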

---

## Quick Start

### 🐳 Docker Compose

```bash
docker-compose up --build
```

- http://localhost:9090 → Prometheus
- http://localhost:3000 → Grafana (admin / admin)
- http://localhost:8080 → Kafka UI

### ☸️ Kubernetes

#### Step 1: Deploy

```bash
./k8s-deploy.sh
```

#### Step 2: Cleanup

```bash
./k8s-clean.sh
```

---

## Architecture

This project uses **Hexagonal Architecture** (Ports and Adapters), ensuring modularity, extensibility, and clean separation of concerns.

```text
Kafka → KafkaConsumerService → EventService → ElasticsearchClientService → Elasticsearch
                     │
                     └─→ Metrics + Logging (Prometheus + JSON logs)
```

### Layers

- **Input Ports**: Kafka Consumer (aiokafka), deserialization, batching
- **Application Core**: Event transformation, validation, retry logic
- **Output Ports**: Async Elasticsearch client, bulk insert, failure handling
- **Infrastructure**: Docker, Kubernetes, logging, metrics, monitoring
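As a rough illustration of how the port boundary keeps the core testable (class and method names here are assumptions, not the repository's actual code):

```python
import asyncio
from typing import Iterable, List, Protocol

class EventSink(Protocol):
    """Output port: anything that can bulk-persist events."""
    async def bulk_insert(self, events: Iterable[dict]) -> None: ...

class InMemorySink:
    """Test adapter: satisfies the port without touching Elasticsearch."""
    def __init__(self) -> None:
        self.stored: List[dict] = []

    async def bulk_insert(self, events: Iterable[dict]) -> None:
        self.stored.extend(events)

class EventService:
    """Application core: depends on the port, not on any concrete backend."""
    def __init__(self, sink: EventSink) -> None:
        self.sink = sink

    async def handle(self, raw_batch: Iterable[dict]) -> int:
        valid = [e for e in raw_batch if "id" in e]   # toy validation rule
        await self.sink.bulk_insert(valid)
        return len(valid)
```

A production adapter would implement the same `bulk_insert` port against the Elasticsearch Bulk API, while tests swap in `InMemorySink`.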

---

## 🔍 Why Choose This Over Logstash, Flume, etc.?

- ✅ **True async** data pipeline — lower latency, better throughput
- ✅ **No heavyweight config DSL** — Python code, `pyproject.toml`, `.env`
- ✅ **Built-in retries & fault handling** — robust out of the box
- ✅ **JSON logging** and metric labeling for full observability
- ✅ **Open-source & customizable** — perfect for modern data teams
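A generic shape for the retry logic (exponential backoff; the delays and exception types are assumptions, not the project's exact policy):

```python
import asyncio

async def with_retries(op, *, attempts: int = 5, base_delay: float = 0.01):
    """Retry an async operation with exponential backoff.

    Re-raises the last error once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # 0.01 s, 0.02 s, 0.04 s, ... between attempts
            await asyncio.sleep(base_delay * 2 ** attempt)

async def demo():
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    return await with_retries(flaky), calls["n"]
```

The same wrapper can guard both the Kafka consumer loop and the Elasticsearch bulk insert.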

---

## Observability

Prometheus scrapes metrics on `/metrics` (port `8000`). Dashboards are automatically provisioned in Grafana.

| Metric | Description |
|-------------------------------------|------------------------------------|
| `messages_processed_total` | Total number of processed messages |
| `errors_total` | Total errors during processing |
| `consume_duration_seconds` | Time spent reading from Kafka |
| `response_duration_seconds` | Time to insert into Elasticsearch |
| `transform_duration_seconds` | Time spent transforming messages |
| `batch_processing_duration_seconds` | Full batch processing time |
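With `prometheus_client`, the counters and histograms above would be declared roughly like this (a sketch; the registry setup and any label sets are assumptions):

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

MESSAGES_PROCESSED = Counter(
    "messages_processed_total", "Total number of processed messages",
    registry=registry)
ERRORS = Counter(
    "errors_total", "Total errors during processing", registry=registry)
BATCH_DURATION = Histogram(
    "batch_processing_duration_seconds", "Full batch processing time",
    registry=registry)

# Record one processed batch of 500 messages.
MESSAGES_PROCESSED.inc(500)
with BATCH_DURATION.time():
    pass  # ... process the batch ...

exposition = generate_latest(registry).decode()
```

`generate_latest` produces the text format that Prometheus scrapes from `/metrics`.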

---

## Testing

```bash
pytest -v
```

Includes:

- ✅ Unit tests
- ✅ Integration tests (Kafka → ES)
- ✅ Metrics verification
- ✅ Config validation
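A unit test might look like this (the `transform` function is a made-up stand-in for the project's event transformation, shown only to illustrate the testing style):

```python
def transform(raw: dict) -> dict:
    """Hypothetical transformation: map a raw Kafka message to an ES document."""
    return {"doc_id": raw["id"], "payload": raw.get("value")}

def test_transform_maps_fields():
    doc = transform({"id": "abc", "value": 42})
    assert doc == {"doc_id": "abc", "payload": 42}

def test_transform_missing_value_defaults_to_none():
    assert transform({"id": "abc"})["payload"] is None
```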

---

## Technologies

- `Python 3.12` + `asyncio`
- `Kafka + aiokafka`
- `Elasticsearch` `Bulk API`
- `Pydantic` `dotenv` `poetry`
- `Prometheus` `Grafana`
- `Docker` `docker-compose`
- `Kubernetes-ready`
- JSON logging (`python-json-logger`)
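A stdlib-only sketch of the JSON log format (`python-json-logger` achieves the equivalent via its own formatter class; the field names here are assumptions):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("streaming-data-loader")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("batch flushed")  # one JSON object per line on stderr
```

One JSON object per line keeps the logs machine-parseable for downstream aggregation.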

---

## Project Structure

```text
streaming-data-loader/
├── configs/              # Prometheus / Grafana
├── src/                  # Main source code
│   ├── domain/
│   ├── ports/
│   ├── services/         # Event processing logic
│   ├── config.py         # Settings & env config
│   ├── logger.py         # JSON logger setup
│   └── metrics.py        # Prometheus metrics
├── tests/                # Unit & integration tests
├── k8s/                  # Kubernetes manifests
├── docker-compose.yml
├── Dockerfile
├── deploy.sh / clean.sh
└── pyproject.toml
```

---

## If you find this useful...

...give it a star, fork it, or mention it in your next data project!

## Author

**Anatoly Dudko**
[GitHub @aDudko](https://github.com/aDudko) β€’ [LinkedIn](https://www.linkedin.com/in/dudko-anatol/)