https://github.com/stefen-taime/mako-main

Declarative real-time data pipelines Framework. YAML in, events out.

Host: GitHub
URL: https://github.com/stefen-taime/mako-main
Owner: Stefen-Taime
License: mit
Created: 2026-03-12T18:33:33.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-12T18:41:52.000Z (4 months ago)
Last Synced: 2026-03-13T00:44:49.735Z (4 months ago)
Topics: data, datapipeline, declarative-config, declarative-pipeline, declarative-programming, declarative-workflows, framework, open-source
Language: Go
Homepage: https://mcsedition.org/fr
Size: 1.56 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

# Mako

Declarative real-time data pipelines. YAML in, events out.

Named after the shortfin mako, the fastest shark in the ocean. Your data deserves the same speed.

Catalog · Sources · Sinks · Transforms · Workflows · Observability · Grafana · Contributing

---

```yaml
pipeline:
  name: order-events
  source:
    type: kafka
    topic: events.orders
  transforms:
    - name: pii_mask
      type: hash_fields
      fields: [email, phone, ssn]
    - name: filter_prod
      type: filter
      condition: "environment = production"
  sink:
    type: snowflake
    database: ANALYTICS
    schema: RAW
    table: ORDER_EVENTS
  monitoring:
    freshnessSLA: 5m
    metrics:
      enabled: true
      port: 9090
```

```bash
mako init # Starter template (stdout, zero deps)
mako init --full pipeline-full.yaml # Full reference with all connectors
mako validate pipeline.yaml
mako dry-run pipeline.yaml < events.jsonl
mako run pipeline.yaml
mako workflow workflow.yaml # DAG orchestration
```

---

## Connector Catalog

### Sources

Where your data comes from

| Connector | Type | Highlights | Examples |
|-----------|------|------------|----------|
| HTTP / REST API | `http` | Pagination, OAuth2, Bearer, Basic, API Key, rate limiting | 8 pipelines |
| File | `file` | JSON, CSV, Parquet, gzip -- local or remote URL | 5 pipelines |
| Apache Kafka | `kafka` | Consumer groups, earliest/latest offset (franz-go) | 2 pipelines |
| PostgreSQL CDC | `postgres_cdc` | Snapshot, CDC, snapshot+CDC (pglogrepl) | 1 pipeline |
| DuckDB | `duckdb` | Embedded SQL, native Parquet/CSV, S3/GCS/Azure | 3 pipelines |

Full source documentation →

---

### Sinks

Where your data goes

| Connector | Type | Highlights | Examples |
|-----------|------|------------|----------|
| PostgreSQL | `postgres` | Auto-flatten, COPY protocol, Vault secrets | 2 pipelines |
| Snowflake | `snowflake` | Auto-DDL, flatten mode, batch loading | 2 pipelines |
| DuckDB | `duckdb` | Auto-table, schema evolution, Parquet/CSV export | 2 pipelines |
| Google Cloud Storage | `gcs` | Parquet + CSV, Snappy compression | 3 pipelines |
| Apache Kafka | `kafka` | Schema Registry validation, DLQ | 1 pipeline |
| BigQuery | `bigquery` | Streaming inserter | -- |
| ClickHouse | `clickhouse` | clickhouse-go v2, flatten mode | -- |
| S3 | `s3` | Parquet + CSV, AWS SDK v2 | -- |
| Stdout | `stdout` | Debug output to console | 1 pipeline |

Full sink documentation →

---

### Transforms

How your data is processed

| Transform | Type | Description | Examples |
|-----------|------|-------------|----------|
| SQL Enrichment | `sql` | CASE WHEN, computed fields, DuckDB functions | 2 pipelines |
| WASM Plugins | `plugin` | Custom logic in Go (TinyGo) or Rust | 2 pipelines + source |
| Schema Validation | `schema` | Confluent Schema Registry (log / reject / DLQ) | 3 pipelines |
| Data Quality | `dq_check` | not_null, range, in_set, regex, type checks | 2 pipelines |
| PII Masking | `hash_fields` | SHA-256 hash for emails, phones, cards, SSNs | 2 pipelines |
| Filter | `filter` | Keep/discard events by condition | 1 pipeline |
| Rename Fields | `rename_fields` | Rename columns for target convention | used across examples |
| Drop Fields | `drop_fields` | Remove unnecessary columns | used across examples |
| Cast Fields | `cast_fields` | Type conversion (string, int, float, bool) | used across examples |
| Flatten | `flatten` | Flatten nested JSON objects | used across examples |
| Default Values | `default_values` | Set defaults for missing fields | used across examples |
| Deduplicate | `deduplicate` | Remove duplicates by key | used across examples |

Full transform documentation →
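To make the `hash_fields` row concrete, here is a minimal Go sketch of SHA-256 field masking. It is not Mako's implementation: the `Event` type and the `hashFields` function are hypothetical names, and the sketch assumes unsalted hashing of the value's string form, which the README does not specify.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a hypothetical stand-in for one decoded record.
type Event map[string]any

// hashFields returns a copy of e in which every listed field that is present
// is replaced by the hex-encoded SHA-256 digest of its string form.
// Fields that are absent from the event are left absent.
func hashFields(e Event, fields []string) Event {
	out := make(Event, len(e))
	for k, v := range e {
		out[k] = v
	}
	for _, f := range fields {
		v, ok := out[f]
		if !ok {
			continue
		}
		sum := sha256.Sum256([]byte(fmt.Sprint(v)))
		out[f] = hex.EncodeToString(sum[:])
	}
	return out
}

func main() {
	e := Event{"email": "a@example.com", "amount": 42}
	masked := hashFields(e, []string{"email", "phone"})
	fmt.Println(len(masked["email"].(string))) // a SHA-256 digest is 64 hex characters
	fmt.Println(masked["amount"])              // untouched fields pass through
}
```

Copying the event before masking keeps the input unchanged, which is convenient when the same record fans out to several sinks. Whether Mako copies or mutates in place is not stated in this README.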

---

### Workflows

DAG orchestration for multi-pipeline jobs

| Workflow | Steps | Highlights |
|----------|-------|------------|
| NYC TLC Star Schema | 9 + quality gate | Star schema from 700K+ taxi trips, 6 dimensions, fact table, daily aggregation, SQL assertions |
| Multi-Source Demo | 3 (parallel) | HTTP + CSV sources, DuckDB + PostgreSQL sinks, parallel execution |
| ETL Demo | 3 (sequential) | Simple ingest → transform → load chain |

Full workflow documentation →
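The parallel and sequential layouts above can be pictured as a DAG run in waves. The following Go sketch is one way to do that, not Mako's workflow engine: `Step`, `DependsOn`, and `runDAG` are hypothetical names, and the fail-fast behavior is an assumption (the real engine advertises configurable failure policies).

```go
package main

import (
	"fmt"
	"sync"
)

// Step is a hypothetical workflow step: a name, its dependencies, and work to run.
type Step struct {
	Name      string
	DependsOn []string
	Run       func() error
}

// runDAG executes steps in waves. Every step whose dependencies have finished
// runs concurrently with the rest of its wave. It returns the wave layout, or
// an error on a cycle, an unknown dependency, or a failed step (fail-fast).
func runDAG(steps []Step) ([][]string, error) {
	done := map[string]bool{}
	var waves [][]string
	for len(done) < len(steps) {
		var ready []Step
		for _, s := range steps {
			if done[s.Name] {
				continue
			}
			ok := true
			for _, d := range s.DependsOn {
				ok = ok && done[d]
			}
			if ok {
				ready = append(ready, s)
			}
		}
		if len(ready) == 0 {
			return waves, fmt.Errorf("cycle or unknown dependency")
		}
		errs := make([]error, len(ready))
		var wg sync.WaitGroup
		for i, s := range ready {
			wg.Add(1)
			go func(i int, s Step) {
				defer wg.Done()
				errs[i] = s.Run()
			}(i, s)
		}
		wg.Wait()
		var names []string
		for i, s := range ready {
			if errs[i] != nil {
				return waves, fmt.Errorf("step %s: %w", s.Name, errs[i])
			}
			done[s.Name] = true
			names = append(names, s.Name)
		}
		waves = append(waves, names)
	}
	return waves, nil
}

func main() {
	ok := func() error { return nil }
	waves, err := runDAG([]Step{
		{Name: "load", DependsOn: []string{"ingest_http", "ingest_csv"}, Run: ok},
		{Name: "ingest_http", Run: ok},
		{Name: "ingest_csv", Run: ok},
	})
	fmt.Println(waves, err) // [[ingest_http ingest_csv] [load]] <nil>
}
```

Wave-based scheduling is simpler than starting each step the moment its last dependency finishes, at the cost of some idle time when one step in a wave is much slower than its siblings.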

---

### Cross-Reference: Connectors as Source & Sink

| Connector | As Source | As Sink |
|-----------|-----------|---------|
| PostgreSQL | CDC source | Sink |
| DuckDB | Query source | Sink + export |
| Kafka | Consumer | Producer |
| GCS | via DuckDB httpfs | Sink |

---

## Quick Start

```bash
# Clone and build
git clone https://github.com/Stefen-Taime/mako.git
cd mako
go build -o bin/mako .

# Create your first pipeline (HTTP source → stdout, zero dependencies)
./bin/mako init

# Validate
./bin/mako validate pipeline.yaml

# Run pipeline — fetches 100 commerce records, applies transforms, prints to stdout
./bin/mako run pipeline.yaml

# Run a workflow (DAG of multiple pipelines)
./bin/mako workflow workflow.yaml
```

**Output of `mako run`:**

```
🔌 Preflight checks...
✅ source — ready
✅ stdout — connected
🚀 Pipeline "commerce-ingest" started
📥 Source: http (https://raw.githubusercontent.com/.../json_bank_20240116_1.json)
🔄 Transforms:
└─ pii_mask (hash_fields)
└─ cleanup (drop_fields)
└─ filter_price (filter)
📤 Sinks:
└─ stdout
{"_pii_processed":true,"color":"yellow","department":"Kitchen","id":3592,...}
...
📊 Final stats: 100 in → stdout → 58 out, 0 errors
```

The starter pipeline fetches commerce data from [open-source-data](https://github.com/Stefen-Taime/open-source-data), hashes `user_id` (PII compliance), drops unnecessary fields, and filters items with `price > 50`. Zero infrastructure needed.
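The drop-then-filter behavior described above, and the `100 in → 58 out` style of final stats, can be sketched as a chain of functions. This is an illustration only: `Transform`, `dropFields`, `priceAbove`, and `runChain` are hypothetical names, not Mako's API.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for one decoded record.
type Event map[string]any

// Transform returns the (possibly modified) event and whether to keep it.
type Transform func(Event) (Event, bool)

// dropFields removes the listed keys in place and always keeps the event.
func dropFields(fields ...string) Transform {
	return func(e Event) (Event, bool) {
		for _, f := range fields {
			delete(e, f)
		}
		return e, true
	}
}

// priceAbove keeps only events whose "price" is a float64 greater than min.
func priceAbove(min float64) Transform {
	return func(e Event) (Event, bool) {
		p, ok := e["price"].(float64)
		return e, ok && p > min
	}
}

// runChain applies the transforms in order to each event and stops at the
// first transform that rejects it.
func runChain(events []Event, chain ...Transform) (out []Event) {
	for _, e := range events {
		keep := true
		for _, t := range chain {
			if e, keep = t(e); !keep {
				break
			}
		}
		if keep {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	in := []Event{
		{"id": 1, "price": 80.0, "debug": "x"},
		{"id": 2, "price": 20.0, "debug": "y"},
	}
	out := runChain(in, dropFields("debug"), priceAbove(50))
	fmt.Printf("%d in -> %d out\n", len(in), len(out)) // 2 in -> 1 out
}
```

Returning a keep flag from every transform lets filters and field edits share one signature, so the chain itself needs no special cases.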

### Full template

To see **all** available sources, sinks, transforms, and monitoring options:

```bash
./bin/mako init --full pipeline-full.yaml
```

This generates a reference YAML with every connector and option as commented blocks. Uncomment the sections you need.

---

## Observability

Every pipeline exposes real-time Prometheus metrics, health probes, and a status API on a single HTTP port (default `:9090`).

### Prometheus Metrics

```text
mako_events_in_total{pipeline="order-events"} 15234
mako_events_out_total{pipeline="order-events"} 15230
mako_errors_total{pipeline="order-events"} 4
mako_dlq_total{pipeline="order-events"} 2
mako_schema_failures_total{pipeline="order-events"} 1
mako_sink_latency_microseconds{pipeline="order-events"} 4523
mako_throughput_events_per_second{pipeline="order-events"} 1523.40
mako_uptime_seconds{pipeline="order-events"} 3600.0
mako_pipeline_ready{pipeline="order-events"} 1
```

Metrics are synced from the pipeline engine every **500ms** for live visibility during execution, with a final sync after graceful shutdown.
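As a rough picture of how such lines could be produced, the sketch below keeps hot-path counters in atomics, copies them into a snapshot, and renders the snapshot in the Prometheus text exposition format. The `Counters` and `Snapshot` types are hypothetical, and only three of the metrics listed above are shown. A real engine might call `Snapshot` from a `time.Ticker` firing every 500ms.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// Counters is a hypothetical set of counters updated by the pipeline engine.
type Counters struct {
	In, Out, Errors atomic.Int64
}

// Snapshot is a plain copy of the counters, safe to hand to a scrape handler.
type Snapshot struct {
	In, Out, Errors int64
}

// Snapshot reads each counter atomically.
func (c *Counters) Snapshot() Snapshot {
	return Snapshot{In: c.In.Load(), Out: c.Out.Load(), Errors: c.Errors.Load()}
}

// Render formats the snapshot as Prometheus text lines.
// %q is adequate for simple label values such as pipeline names.
func (s Snapshot) Render(pipeline string) string {
	metrics := []struct {
		name  string
		value int64
	}{
		{"mako_events_in_total", s.In},
		{"mako_events_out_total", s.Out},
		{"mako_errors_total", s.Errors},
	}
	var b strings.Builder
	for _, m := range metrics {
		fmt.Fprintf(&b, "%s{pipeline=%q} %d\n", m.name, pipeline, m.value)
	}
	return b.String()
}

func main() {
	var c Counters
	c.In.Add(100)
	c.Out.Add(58)
	fmt.Print(c.Snapshot().Render("commerce-ingest"))
}
```

Separating the live counters from the rendered snapshot means a scrape never contends with the event loop for more than a few atomic loads.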

**Workflow mode:** All pipelines in a workflow share a single Prometheus endpoint (`:9090`) via a shared [MetricsRegistry](pkg/observability/registry.go) -- no port-per-pipeline overhead.

### Grafana Dashboard

A pre-built Grafana dashboard is included at [`grafana/mako-dashboard.json`](grafana/mako-dashboard.json) with 4 sections:

| Section | Panels |
|---------|--------|
| **Overview** | Events In, Events Out, Errors, DLQ Events, Schema Failures, Uptime |
| **Throughput** | Events/sec rate graph, Instantaneous throughput |
| **Errors & DLQ** | Error rate per minute (bar chart), Error rate % (gauge), Pipeline Ready (UP/DOWN) |
| **Sink Performance** | Sink write latency, Events In vs Out (cumulative) |

The dashboard auto-discovers pipelines via a `$pipeline` template variable and supports multi-select.

### Local Setup (Prometheus + Grafana)

The `docker/` stack includes Prometheus and Grafana pre-configured to scrape Mako pipelines:

```bash
cd docker/
docker compose up -d prometheus grafana

# Prometheus → http://localhost:9091
# Grafana → http://localhost:3000 (admin / mako)
```

Grafana is auto-provisioned with the Prometheus datasource and the Mako dashboard -- no manual import needed. Just run a pipeline with `mako run` or `mako workflow` and open Grafana.

### HTTP Endpoints

| Endpoint | Description | Use |
|----------|-------------|-----|
| `GET /metrics` | Prometheus text format | Scraping by Prometheus/Grafana |
| `GET /health` | Liveness probe (always 200) | Kubernetes `livenessProbe` |
| `GET /ready` | Readiness probe (200 when running) | Kubernetes `readinessProbe` |
| `GET /status` | Pipeline status JSON | Monitoring dashboards |

### Slack Alerting

Send alerts to Slack on errors, SLA breaches, and pipeline completion:

```yaml
monitoring:
  freshnessSLA: 5m
  alertChannel: "#data-alerts"
  slackWebhookURL: ${SLACK_WEBHOOK_URL}
  alerts:
    - name: high_error_rate
      type: error_rate
      threshold: "0.5%"
      severity: critical
    - name: volume_drop
      type: volume
      threshold: "-50%"
      severity: warning
```

Alert rule types: **latency** (stale data), **error_rate** (% threshold), **volume** (throughput change). Each rule has a 5-minute cooldown. See [docs/observability.md](docs/observability.md) for full details.

---

## Project Structure

```text
mako/
├── main.go # CLI entry point (init, run, workflow, validate, generate)
├── api/v1/types.go # Pipeline + Workflow spec (the YAML DSL model)
├── pkg/
│   ├── config/config.go # YAML parser + validator
│   ├── pipeline/engine.go # Runtime: Source -> Transforms -> Sink
│   ├── source/
│   │   ├── file.go # File source (JSONL, CSV, JSON, Parquet + gzip)
│   │   ├── postgres_cdc.go # PostgreSQL CDC (pgx + pglogrepl)
│   │   ├── http.go # HTTP/API source (pagination, OAuth2)
│   │   ├── duckdb.go # DuckDB source (SQL, Parquet/CSV/JSON + S3/GCS)
│   │   └── multi.go # Multi-source with join support
│   ├── sink/
│   │   ├── sink.go # Stdout, File sinks + BuildFromSpec
│   │   ├── postgres.go # PostgreSQL (pgx + COPY)
│   │   ├── snowflake.go # Snowflake (gosnowflake)
│   │   ├── bigquery.go # BigQuery (streaming inserter)
│   │   ├── clickhouse.go # ClickHouse (clickhouse-go v2)
│   │   ├── s3.go # S3 (AWS SDK v2)
│   │   ├── gcs.go # GCS (cloud.google.com/go/storage)
│   │   ├── duckdb.go # DuckDB (embedded, Parquet/CSV export)
│   │   ├── encode.go # Shared Parquet + CSV encoders
│   │   └── resolve.go # Secret resolution (config -> env -> Vault)
│   ├── transform/
│   │   ├── transform.go # All built-in transforms
│   │   └── wasm.go # WASM plugin runtime (wazero)
│   ├── workflow/
│   │   ├── engine.go # DAG engine (parallel steps, failure policies)
│   │   └── quality_gate.go # SQL assertions against DuckDB
│   ├── observability/
│   │   ├── server.go # Prometheus metrics + health + status HTTP
│   │   └── registry.go # Shared metrics registry (workflow mode)
│   ├── kafka/kafka.go # Kafka source + sink (franz-go)
│   ├── schema/registry.go # Schema Registry client + validator
│   ├── join/join.go # Multi-source join engine
│   ├── duckdbext/cloud.go # DuckDB httpfs + cloud credentials
│   ├── alerting/ # Slack alert rules + notifications
│   └── vault/vault.go # HashiCorp Vault client
├── examples/ # Pipeline catalog (see below)
│   ├── sources/ # HTTP, File, Kafka, PostgreSQL CDC, DuckDB
│   ├── sinks/ # PostgreSQL, Snowflake, DuckDB, GCS, Kafka, Stdout
│   ├── transforms/ # SQL, WASM, Schema, DQ Check, PII, Filter
│   └── workflows/ # NYC TLC Star Schema, ETL Demo, Multi-Source
├── docs/ # Detailed documentation
├── docker/ # Local infra (Kafka, PostgreSQL, Prometheus, Grafana)
│   ├── prometheus/prometheus.yml # Pre-configured to scrape Mako on :9090
│   └── grafana/provisioning/ # Auto-provision datasource + dashboard
├── grafana/mako-dashboard.json # Grafana dashboard (Overview, Throughput, Errors, Sink)
├── .github/workflows/ci.yml # CI: unit + integration tests
└── Dockerfile # Production image
```

---

## CI / Testing

GitHub Actions runs on every push/PR:

**Unit tests** (fast, no Docker):

- 70+ tests covering config, validation, transforms, WASM plugins, sources, sinks
- Benchmarks for transform chain performance
- Example validation + dry-run

**Integration tests** (Docker services):

- Kafka (KRaft) + PostgreSQL + Schema Registry
- Full pipeline: produce messages -> consume -> transform -> write to PG
- HTTP endpoint verification (/metrics, /health, /ready, /status)
- File source validation

```bash
# Run locally
go test -v -count=1 ./...
go test -bench=. -benchmem ./...
```

---

## Roadmap

- [x] Kafka consumer/producer (franz-go)
- [x] PostgreSQL sink (pgx + COPY)
- [x] Snowflake sink (gosnowflake) + flatten mode
- [x] BigQuery sink (streaming inserter)
- [x] Schema Registry validation (JSON Schema)
- [x] File source (JSONL, CSV, JSON + transparent gzip)
- [x] Prometheus metrics (/metrics)
- [x] Health/readiness probes (/health, /ready)
- [x] Pipeline status API (/status)
- [x] CI with integration tests (Kafka + PG + Schema Registry)
- [x] S3/GCS object storage sinks
- [x] Grafana dashboard templates
- [x] ClickHouse sink (clickhouse-go v2)
- [x] WASM plugin transforms (wazero)
- [x] Parquet + CSV output formats for S3/GCS
- [x] HashiCorp Vault integration (secret resolution chain)
- [x] PostgreSQL CDC source (snapshot, cdc, snapshot+cdc)
- [x] HTTP/API source (pagination, OAuth2, rate limiting, retries)
- [x] Real-time observability metrics (500ms sync, sink latency)
- [x] Rust WASM plugin example
- [x] DuckDB embedded source + sink
- [x] Parquet file source (native reading via parquet-go)
- [x] DuckDB cloud storage (S3/GCS/Azure via httpfs)
- [x] Workflow engine (DAG orchestration, parallel steps, failure policies)
- [x] Data quality: inline `dq_check` transform
- [x] Data quality: `quality_gate` workflow step (SQL assertions)
- [x] Shared Prometheus metrics registry (single port for workflows)
- [ ] Helm chart for Kubernetes deployment
- [ ] Codegen: `mako generate --k8s` + `--tf` (Kubernetes manifests, Terraform HCL)

---

## License

MIT

---

*Built by [mcsEdition](https://mcsedition.org/fr)*
