https://github.com/euclidstellar/medplat-demo

The Unified Data Ingestion System is designed to handle both streaming and batch data ingestion with a modular architecture.
batch-processing data-ingestion kafka stream-processing-engine

Host: GitHub
URL: https://github.com/euclidstellar/medplat-demo
Owner: EuclidStellar
Created: 2025-05-05T17:20:16.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-05T18:28:26.000Z (about 1 year ago)
Last Synced: 2025-05-05T19:48:14.601Z (about 1 year ago)
Topics: batch-processing, data-ingestion, kafka, stream-processing-engine
Language: Python
Homepage:
Size: 14.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Medplat | Unified Data Ingestion System

## Overview

The Unified Data Ingestion System is designed to handle both streaming and batch data ingestion with a modular architecture. It centralizes data flow through a dispatcher and validates incoming data against versioned schemas.

## Features

- **Streaming Data Ingestion**: Supports real-time data ingestion from Kafka.

- **Batch Data Ingestion**: Handles data from REST APIs and FTP servers.

- **Centralized Dispatcher**: Routes and validates data before sending it to the output layer.

- **Schema Management**: Utilizes a schema registry for versioned schema validation.
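
As a rough illustration of versioned schema validation, the sketch below keeps schemas in memory, keyed by ID and version. The `SchemaRegistry` class, its method names, and the field-to-type schema format are assumptions made for this example; the project itself stores JSON schemas under `schemas/`, and its actual registry API may differ.

```python
from typing import Any, Optional


class SchemaRegistry:
    """In-memory registry of versioned schemas (hypothetical API)."""

    def __init__(self) -> None:
        # Maps (schema_id, version) -> {field name: expected Python type}.
        self._schemas: dict[tuple[str, int], dict[str, type]] = {}

    def register(self, schema_id: str, version: int, fields: dict[str, type]) -> None:
        self._schemas[(schema_id, version)] = fields

    def latest_version(self, schema_id: str) -> int:
        versions = [v for (sid, v) in self._schemas if sid == schema_id]
        if not versions:
            raise KeyError(f"unknown schema: {schema_id}")
        return max(versions)

    def validate(self, schema_id: str, record: dict[str, Any],
                 version: Optional[int] = None) -> list[str]:
        """Return validation errors; an empty list means the record is valid."""
        if version is None:
            version = self.latest_version(schema_id)
        fields = self._schemas[(schema_id, version)]
        errors = []
        for name, expected in fields.items():
            if name not in record:
                errors.append(f"missing field: {name}")
            elif not isinstance(record[name], expected):
                errors.append(f"{name}: expected {expected.__name__}")
        return errors


registry = SchemaRegistry()
registry.register("patient", 1, {"id": int})
registry.register("patient", 2, {"id": int, "name": str})

print(registry.validate("patient", {"id": 7}, version=1))  # checked against v1
print(registry.validate("patient", {"id": 7}))             # checked against the latest version
```

Pinning a version lets older producers keep sending records in the old shape while newer ones are validated against the latest schema.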

## Project Structure

```
unified-data-ingestion-system
├── src
│   ├── adapters          # Contains ingestion adapters
│   ├── dispatcher        # Centralized dispatcher logic
│   ├── schema_registry   # Manages schema versions
│   ├── config            # Configuration settings
│   ├── utils             # Utility functions
│   └── main.py           # Entry point of the application
├── tests                 # Unit tests for the application
├── schemas               # JSON schemas for validation
├── docker-compose.yml    # Docker Compose configuration
├── Dockerfile            # Docker image build instructions
└── requirements.txt      # Python dependencies
```

## System Architecture

```mermaid
flowchart TD
    subgraph "Streaming Adapters"
        KA[Kafka Adapter]
        RA[RabbitMQ Adapter]
    end

    subgraph "Batch Adapters"
        AA[API Adapter]
        FA[FTP Adapter]
        CA[CSV Adapter]
    end

    subgraph "Core System"
        D[Dispatcher]
        SR[Schema Registry]
        VAL[Validators]
    end

    subgraph "Output Layer"
        PW[Parquet Writer]
        DB[Database Writer]
    end

    subgraph "Support Systems"
        LOG[Logging]
        CONF[Configuration]
    end

    KA -->|Push data| D
    RA -->|Push data| D
    AA -->|Push data| D
    FA -->|Push data| D
    CA -->|Push data| D

    D <-->|Validate| SR
    SR -->|Use| VAL

    D -->|Route valid data| PW
    D -->|Route valid data| DB

    LOG <--- D
    LOG <--- KA
    LOG <--- RA
    LOG <--- AA
    LOG <--- FA
    LOG <--- CA

    CONF --> KA
    CONF --> RA
    CONF --> AA
    CONF --> FA
    CONF --> CA
    CONF --> D
```

1. **Ingestion Adapters**  

   - Abstract connectors for each data source  

   - Implement `connect()`, `ingest()`, `close()`  

2. **Schema Registry**  

   - Versioned JSON/Avro schemas  

   - Centralized validation before dispatch  

3. **Dispatcher Service**  

   - Routes validated events to downstream sinks  

   - Supports fan‑out to PostgreSQL, Elasticsearch, S3, Kafka, etc.

4. **Error & Retry Handler**  

   - Dead‑letter queue for malformed or failed messages  

   - Automatic retries with backoff  

5. **Observability**  

   - Structured logging (via ELK / FluentD)  

   - Metrics exposed for Prometheus (e.g. `messages_processed_total`)
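
The retry and dead-letter behaviour described in item 4 could look roughly like the sketch below. `deliver_with_retry`, its parameters, and the plain list standing in for a dead-letter queue are hypothetical names chosen for illustration; a real deployment would more likely park failed messages in a durable queue or topic.

```python
import time
from typing import Any, Callable


def deliver_with_retry(
    send: Callable[[Any], None],
    message: Any,
    dead_letters: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send a message, backing off exponentially between attempts.

    Returns True on success. After the final failed attempt the message is
    appended to `dead_letters` together with the last error, and False is
    returned.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return True
        except Exception as exc:  # broad on purpose: any sink failure triggers a retry
            last_error = exc
            if attempt < max_attempts:
                # Delay doubles each time: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append({"message": message, "error": repr(last_error)})
    return False


def always_fails(message):
    raise ConnectionError("sink unavailable")


dead_letter_queue = []
delivered = deliver_with_retry(
    always_fails, {"id": 1}, dead_letter_queue, max_attempts=2, base_delay=0.01
)
print(delivered, len(dead_letter_queue))
```

Passing `sleep` as a parameter keeps the function easy to test, because a test can record the requested delays instead of actually waiting.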

  

## Data Processing Flow

```mermaid
sequenceDiagram
    participant DS as Data Source
    participant IA as Ingestion Adapter
    participant DP as Dispatcher
    participant SR as Schema Registry
    participant OC as Output Consumer

    DS->>IA: Send raw data
    activate IA
    IA->>IA: Format data
    IA->>DP: Send structured data
    deactivate IA

    activate DP
    DP->>SR: Validate data against schema
    activate SR
    SR-->>DP: Return validation result
    deactivate SR

    alt Valid Data
        DP->>OC: Route data to consumer
        activate OC
        OC-->>DP: Acknowledge receipt
        deactivate OC
    else Invalid Data
        DP->>DP: Log validation error
        DP->>DP: Store rejected data
    end

    DP-->>IA: Processing complete
    deactivate DP
```
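
The sequence above can be condensed into a small dispatcher sketch. Only the `receive_data(data, source, schema_id)` call shape is taken from the example adapter in this README; the `Dispatcher` constructor, the validator and sink callables, and the `rejected` list are assumptions made for illustration.

```python
import logging
from typing import Callable

logger = logging.getLogger("dispatcher")

# A validator returns a list of error messages (empty means valid).
Validator = Callable[[dict], list]
# A sink consumes one validated record.
Sink = Callable[[dict], None]


class Dispatcher:
    """Validates incoming records and routes valid ones to every sink."""

    def __init__(self, validators: dict[str, Validator], sinks: list[Sink]) -> None:
        self.validators = validators
        self.sinks = sinks
        self.rejected: list[dict] = []  # stand-in for a rejected-data store

    def receive_data(self, data: dict, source: str, schema_id: str) -> bool:
        validator = self.validators.get(schema_id)
        if validator is None:
            errors = [f"unknown schema: {schema_id}"]
        else:
            errors = validator(data)

        if errors:
            # Invalid branch: log the error and keep the record for inspection.
            logger.warning("rejected record from %s: %s", source, errors)
            self.rejected.append(
                {"source": source, "schema_id": schema_id, "data": data, "errors": errors}
            )
            return False

        # Valid branch: fan out to every configured sink.
        for sink in self.sinks:
            sink(data)
        return True


def require_id(record: dict) -> list:
    return [] if "id" in record else ["missing field: id"]


written = []
dispatcher = Dispatcher({"patient_v1": require_id}, [written.append])
dispatcher.receive_data({"id": 1}, "kafka", "patient_v1")      # routed to the sink
dispatcher.receive_data({"name": "x"}, "ftp", "patient_v1")    # logged and stored as rejected
```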

## 🚀 Getting Started

### ✅ Prerequisites

- [Python 3.9+](https://www.python.org/downloads/)

- [Docker](https://www.docker.com/) (optional but recommended for quick setup)

---

### 🛠 Installation

#### 1. Clone the Repository

```bash
git clone https://github.com/euclidstellar/medplat-demo.git
cd medplat-demo
```

#### 2. Install Dependencies

Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## Adding a New Data Source

To integrate a new data source, follow these steps:

### Create a New Adapter Class

Extend the base `IngestionAdapter` class and implement the required methods:

- `connect()`: Setup connection to your source (e.g., open socket, API client)

- `ingest()`: Read data from the source and return it in a structured format

- `close()`: Gracefully shut down the adapter (e.g., close connection, release resources)
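
The three-method contract could be captured by an abstract base class. The repository's actual `IngestionAdapter` is not reproduced in this README, so the sketch below is an assumption about its shape, inferred from the `(name, dispatcher)` constructor arguments and the `receive_data` call used in the example adapter. The `run()` helper, `ListAdapter`, and `RecordingDispatcher` are hypothetical and exist only to make the sketch runnable.

```python
from abc import ABC, abstractmethod


class IngestionAdapter(ABC):
    """Assumed shape of the base adapter class (not the project's actual code)."""

    def __init__(self, name, dispatcher):
        self.name = name
        self.dispatcher = dispatcher

    @abstractmethod
    def connect(self):
        """Set up the connection to the data source."""

    @abstractmethod
    def ingest(self):
        """Read data and hand it to the dispatcher."""

    @abstractmethod
    def close(self):
        """Release connections and other resources."""

    def run(self):
        """Hypothetical helper: connect, ingest once, and always close."""
        self.connect()
        try:
            self.ingest()
        finally:
            self.close()


class ListAdapter(IngestionAdapter):
    """Toy adapter that 'ingests' records from an in-memory list."""

    def __init__(self, dispatcher, records):
        super().__init__("list_adapter", dispatcher)
        self.records = records
        self.connected = False

    def connect(self):
        self.connected = True

    def ingest(self):
        for record in self.records:
            self.dispatcher.receive_data(record, self.name, "my_schema_id")

    def close(self):
        self.connected = False


class RecordingDispatcher:
    """Fake dispatcher that records every call, handy in unit tests."""

    def __init__(self):
        self.received = []

    def receive_data(self, data, source, schema_id):
        self.received.append((data, source, schema_id))


fake_dispatcher = RecordingDispatcher()
adapter = ListAdapter(fake_dispatcher, [{"id": 1}, {"id": 2}])
adapter.run()
print(fake_dispatcher.received)
```

Declaring the methods abstract means a subclass that forgets one of them fails at instantiation time rather than midway through ingestion.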

### Register Your Adapter

Update `config/adapters.yaml` with your adapter’s name and settings.
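
The layout of `config/adapters.yaml` is not shown in this README. Purely as an illustration, the sketch below uses a Python dict as a stand-in for the parsed file and shows one way a loader might turn each entry's name and settings into an adapter instance. The keys (`adapters`, `name`, `enabled`, `settings`), the `register_adapter` decorator, and `build_adapters` are all hypothetical.

```python
# Stand-in for the parsed contents of config/adapters.yaml. The real file's
# keys are not documented here, so this layout is hypothetical.
ADAPTERS_CONFIG = {
    "adapters": [
        {"name": "my_adapter", "enabled": True, "settings": {"poll_interval_seconds": 30}},
        {"name": "legacy_ftp", "enabled": False, "settings": {}},
    ]
}

# Hypothetical lookup from config name to adapter class.
ADAPTER_CLASSES = {}


def register_adapter(name):
    """Class decorator that records an adapter class under its config name."""
    def decorator(cls):
        ADAPTER_CLASSES[name] = cls
        return cls
    return decorator


@register_adapter("my_adapter")
class MyNewAdapter:
    # Simplified stand-in; a real adapter would extend IngestionAdapter.
    def __init__(self, dispatcher, config):
        self.dispatcher = dispatcher
        self.config = config


def build_adapters(config, dispatcher):
    """Instantiate every enabled adapter listed in the config."""
    adapters = []
    for entry in config["adapters"]:
        if not entry.get("enabled", True):
            continue  # disabled entries are skipped without being looked up
        name = entry["name"]
        if name not in ADAPTER_CLASSES:
            raise KeyError(f"no adapter class registered for {name!r}")
        adapters.append(ADAPTER_CLASSES[name](dispatcher, entry.get("settings", {})))
    return adapters


adapters = build_adapters(ADAPTERS_CONFIG, dispatcher=None)
print([type(a).__name__ for a in adapters])
```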

### Example Adapter Code

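```python
# IngestionAdapter is the project's base adapter class; import it from its
# module before defining the subclass.
class MyNewAdapter(IngestionAdapter):
    def __init__(self, dispatcher, config):
        super().__init__("my_adapter", dispatcher)
        self.config = config

    def connect(self):
        # Initialize connection to data source
        pass

    def ingest(self):
        # Get data and send to dispatcher
        data = self._fetch_data()
        self.dispatcher.receive_data(data, self.name, "my_schema_id")

    def close(self):
        # Close connections
        pass

    def _fetch_data(self):
        # Source-specific read logic goes here
        raise NotImplementedError
```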

---

## License

This project is licensed under the **MIT License** – see the [LICENSE](LICENSE) file for details.

---

## Acknowledgments

- Built as a prototype for the **MedPlat** project.

- Inspired by modern data engineering best practices including:

  - Plug-and-play architectures

  - Schema-first data validation

  - Unified real-time and batch data pipelines
