{"id":47727018,"url":"https://github.com/posthog/millpond","last_synced_at":"2026-05-29T01:00:55.327Z","repository":{"id":346676814,"uuid":"1188165751","full_name":"PostHog/millpond","owner":"PostHog","description":"Purpose-built Kafka to Ducklake ingestion pipeline","archived":false,"fork":false,"pushed_at":"2026-05-21T06:12:13.000Z","size":792,"stargazers_count":4,"open_issues_count":5,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-21T07:18:52.301Z","etag":null,"topics":["apache-kafka","duckdb","ducklake","etl","ingestion","kafka"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PostHog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-21T17:50:04.000Z","updated_at":"2026-05-21T06:10:35.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/PostHog/millpond","commit_stats":null,"previous_names":["posthog/millpond"],"tags_count":46,"template":false,"template_full_name":null,"purl":"pkg:github/PostHog/millpond","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PostHog%2Fmillpond","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PostHog%2Fmillpond/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PostHog%2Fmillpond/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PostHog%2Fmillpo
nd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PostHog","download_url":"https://codeload.github.com/PostHog/millpond/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PostHog%2Fmillpond/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33632271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-kafka","duckdb","ducklake","etl","ingestion","kafka"],"created_at":"2026-04-02T20:47:53.046Z","updated_at":"2026-05-29T01:00:55.308Z","avatar_url":"https://github.com/PostHog.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Millpond — Kafka to DuckLake or Iceberg\n\nA standalone Python app that consumes from a Kafka topic and writes to a lake table. Single thread, single loop, no Kafka Connect. 
One deployment writes to exactly one destination — either [DuckLake](https://github.com/duckdb/ducklake) or [Apache Iceberg](https://iceberg.apache.org/), selected via `MILLPOND_DESTINATION`.\n\n**Contents:**\n[Naming](#naming) | [Why](#why) | [Architecture](#architecture) | [Destinations](#destinations) | [Record Handling](#record-handling) | [Adaptive Backpressure](#adaptive-backpressure) | [Performance](#performance) | [Resource Footprint](#resource-footprint) | [Setup](#setup) | [Development](#development) | [Configuration](#configuration) | [Releases](#releases) | [Deployment](#deployment) | [Partitioning](#partitioning) | [Object Sizing](#object-sizing) | [Error Handling](#error-handling-and-retries) | [Multiple Pipelines](#multiple-pipelines) | [AWS Credential Isolation](#aws-credential-isolation) | [Operational Notes](#operational-notes) | [Next steps](#next-steps)\n\n## Naming\n\n\u003cimg src=\"imgs/500px-Hagley_mill_race.jpeg\" alt=\"A mill pond\" width=\"300\" align=\"right\"\u003e\n\n\u003e **millpond** (noun): a pond created by damming a stream to produce a head of water for operating a mill.\n\u003e — [Merriam-Webster](https://www.merriam-webster.com/dictionary/millpond)\n\nMillpond accumulates a stream of Kafka records until a threshold is reached, then releases them into a downstream lake. Like a [mill pond](https://en.wikipedia.org/wiki/Mill_pond) feeding a lake.\n\n## Why\n\nKafka Connect imposes ~1100 lines of lock management, scheduled executors, and rebalance handling to work around its lack of backpressure and explicit offset control. Millpond replaces all of that with:\n\n```\nloop:\n  consume() → JSON → Arrow → accumulate\n  when buffer full or time elapsed:\n    write to lake → commit offsets\n```\n\nSingle thread, single loop. Kafka is the buffer. Offset commit is explicit (after successful write only). 
No data loss window.\n\n## Architecture\n\n```\nK8s StatefulSet (N replicas)\n  └─ Pod (ordinal 0..N-1)\n       └─ Single loop: consume → convert → [filter] → accumulate → [sort] → flush → commit\n```\n\n- One topic and one table per deployment\n- Static partition assignment via pod ordinal — no consumer groups\n- If a pod dies, its partitions stop being consumed until K8s restarts it\n- Optional filter and sort stages — see [Record Handling](#record-handling) below\n\n## Destinations\n\nMillpond writes to one of two lake formats, selected at startup by `MILLPOND_DESTINATION` (default `ducklake`). A single deployment writes to exactly one — there is no per-batch routing. Switching destinations requires redeploying with different env vars; the at-rest data is not portable between the two without a separate migration.\n\n|  | DuckLake | Iceberg |\n|---|---|---|\n| Catalog | Postgres (via DuckDB ducklake extension) | REST catalog (e.g. Polaris, Tabular, AWS Glue REST adapter) |\n| Storage | S3 / S3-compatible | S3 / S3-compatible |\n| Reader ecosystem | DuckDB-native; growing third-party support | Broad (Spark, Trino, Athena, Snowflake, DuckDB ≥1.5) |\n| Partitioning | Caller-supplied via `DUCKLAKE_PARTITION_BY`; arbitrary DDL expression | Hardcoded identity transforms on derived `year`/`month`/`day`/`hour` int32 columns; Hive-style layout |\n| Schema evolution | DuckDB DDL (`ADD COLUMN IF NOT EXISTS`, `ALTER COLUMN SET DATA TYPE` with widening enforcement) | PyIceberg `update_schema()` transaction (single commit per flush) |\n| Maintenance tooling | Bundled (`tools/ducklake_maintenance.py` CronJob, `tools/ducklake_metrics.py` daemon) | Not bundled — use your catalog's native compaction/expiry |\n| `_inserted_at` column | Added at INSERT via DuckDB `NOW()` (per-row, microsecond drift possible within a flush) | Added at write time in Python (single timestamp shared by every row in a flush) |\n| Multi-pod concurrent writes | Native; idempotent DDL handles races | 
Native; PyIceberg optimistic concurrency + retry loop handles races |\n\nThe selection is a thin Protocol-based abstraction (`millpond/sink.py`) — `main.py` only sees `Sink.write(batch)`, `reset_caches()`, `close()`. Both implementations are in their own module (`ducklake.py`, `iceberg.py`).\n\n## Record Handling\n\nTwo optional stages sit between Kafka conversion and the sink. Both are disabled when their env vars are unset.\n\n### Allowlist filter\n\nDrops records whose value in a configured field is not in a configured allowlist. Applied immediately after JSON→Arrow conversion, before records enter the pending buffer.\n\n```\nMILLPOND_FILTER_KEEP_FIELD_NAME=team_id\nMILLPOND_FILTER_VALUES=2,4,1956,69\n```\n\nValues auto-detect: tokens that all parse as integers become an int allowlist; otherwise the whole list is treated as strings.\n\nTwo skip reasons are tracked on `millpond_records_skipped_total`:\n\n- `filter_field_missing` — column absent from this batch's schema, null for that row, or column type is not filterable (only integer and string columns are supported; bool, float, timestamp, struct, list, etc. are rejected explicitly to avoid silent surprising matches under PyArrow's `safe=True` cast semantics).\n- `filter_excluded` — column present and non-null but value not in the allowlist. Expected steady-state drop reason.\n\n`MILLPOND_FILTER_DROP_FIELD_NAME` is reserved at the config layer (mutex with keep) and currently rejected at startup. It will become a denylist filter in a future release without env-var churn.\n\n### Pre-write sort\n\nSorts the consolidated batch by one or more columns ascending, right before `sink.write()`. Both DuckLake and Iceberg sinks see pre-sorted data, which improves Parquet compression (especially for low-cardinality keys like `team_id`) and downstream reader predicate pushdown.\n\n```\nMILLPOND_SORT_BY=team_id,timestamp\n```\n\nSort order is left-to-right (`team_id` primary, `timestamp` secondary). 
Direction is ascending only today; if you need descending, file an issue. PyArrow's sort is stable, so equal-key rows preserve their consume order.\n\nIf any sort field is missing from a batch's schema, the sort is skipped (records still flow through, just unsorted), `millpond_sort_skipped_total{reason=\"field_missing\"}` increments by the record count, and a warning logs once per distinct missing-fields pattern (per pod lifetime — prevents log floods under sustained misconfiguration).\n\nPer-flush cost is ~50–200 ms on a 256 MB / 30k-row batch. Peak memory roughly doubles during the sort because `pa.Table.take()` rewrites a fresh copy of every column; budget accordingly relative to the pod's memory limit.\n\n## Adaptive Backpressure\n\nThe consume batch size automatically scales based on how full the pending buffer is relative to the flush threshold. When the buffer is empty, millpond consumes at full speed. As the buffer approaches the flush size, the batch size drops proportionally, smoothing throughput during catchup and traffic spikes. OOM prevention comes from bounding librdkafka's internal fetch buffer via `queued.max.messages.kbytes` (16MB per partition).\n\n```\nfullness = pending_bytes / flush_size\nbatch_size = max(10, int(CONSUME_BATCH_SIZE * (1.0 - fullness)))\n```\n\nMetrics: `millpond_buffer_fullness` and `millpond_consume_batch_size_current`.\n\n## Performance\n\nThe hot path is all C/C++: librdkafka → orjson → PyArrow → DuckDB (zero-copy Arrow scan). Python is glue.\n\n## Resource Footprint\n\n| | Kafka Connect worker | Millpond pod |\n|-|---------------------|-----------|\n| Memory request | 4-8Gi (JVM heap) | 256Mi |\n| Memory limit | 8-16Gi | 512Mi |\n| Steady-state | ~4GB (JVM + framework + GC headroom) | ~250-300MB |\n\nNo JVM, no framework, no GC heap overhead. ~16x less memory per pod. 
The entire runtime is C/C++ libraries with a Python glue layer.\n\n## Setup\n\nRequires [Flox](https://flox.dev):\n\n```bash\nflox activate\njust sync\njust run\n```\n\n## Development\n\n```bash\njust fmt               # format code\njust lint              # lint code\njust test              # run unit tests (includes both backends' suites + cross-backend equivalence)\njust test-integration  # run integration tests (local DuckDB + MinIO/iceberg-rest via testcontainers)\njust test-e2e          # run E2E tests (docker-compose, builds stack automatically)\njust ci                # format check + lint + unit tests\njust up                # start docker-compose stack (DuckLake — plaintext Kafka)\njust up-ssl            # start docker-compose stack (DuckLake — SSL Kafka, closer to prod)\njust down              # stop docker-compose stack\njust down-ssl          # stop SSL docker-compose stack\n```\n\nThe `just up` / `just up-ssl` dev stacks are DuckLake-only. For Iceberg local dev, the integration test fixture in `tests/integration/compose.yaml` brings up MinIO + a tabulario/iceberg-rest catalog; that stack is what the iceberg integration tests use and what to point at for ad-hoc Iceberg work.\n\n### SSL Kafka Testing\n\nThe `just up-ssl` recipe generates self-signed certs and runs Kafka with SSL listeners, matching the production MSK configuration. This exercises the `KAFKA_CONSUMER_*` env var override path that isn't tested with plaintext Kafka.\n\nRequires Docker (uses `keytool` from the Kafka container image for cert generation).\n\n### DuckLake Maintenance\n\n`tools/ducklake_maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction, deletion-queue dedup, catalog-side orphan recovery). 
It is baked into the Docker image at `/app/tools/ducklake_maintenance.py` and designed to run as a K8s CronJob reusing the same image and credentials as the main application.\n\n```bash\npython /app/tools/ducklake_maintenance.py maintain --days 7           # expire snapshots + cleanup files\npython /app/tools/ducklake_maintenance.py maintain --days 7 --dry-run # preview only\npython /app/tools/ducklake_maintenance.py expire --days 3             # expire snapshots only\npython /app/tools/ducklake_maintenance.py cleanup --days 1            # cleanup scheduled files only\npython /app/tools/ducklake_maintenance.py cleanup-all                 # cleanup all scheduled files regardless of age\npython /app/tools/ducklake_maintenance.py dedup-deletions             # drop duplicate rows in the pending-deletion queue\npython /app/tools/ducklake_maintenance.py find-orphans                # list catalog rows whose S3 key no longer exists\npython /app/tools/ducklake_maintenance.py heal-orphans                # delete those catalog rows (gated B1/B3 safety checks)\npython /app/tools/ducklake_maintenance.py cleanup-all-safe            # dedup + heal-orphans + cleanup-all in a loop until clean\npython /app/tools/ducklake_maintenance.py fsck                        # cleanup-all-safe + ducklake_delete_orphaned_files\npython /app/tools/ducklake_maintenance.py checkpoint                  # integrated merge + expire + cleanup\npython /app/tools/ducklake_maintenance.py orphans                     # delete S3-side orphaned files (catalog has no row)\npython /app/tools/ducklake_maintenance.py compact --tier 1            # tiered compaction (see \"When to add a merge job\")\n```\n\nThe script logs `cleanup throughput: files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=A` after every `cleanup` / `cleanup-all` (skipped on `--dry-run`), so you can confirm steady-state throughput without enabling debug logging. 
`files_processed` is the actual count of files the call returned, not a queue-depth delta, so the number is accurate even when other writers enqueue deletions during the run. Pass `--debug` to opt back into DuckDB's HTTP and Postgres-extension query logging — both are off by default because they add per-call overhead that compounds across tens of thousands of S3 deletes.\n\nIf `PUSHGATEWAY_URL` is set, the script pushes `maintenance_start_time` (on start) and `maintenance_duration_seconds` (on completion) to a Prometheus Pushgateway, enabling Grafana annotation queries for maintenance windows.\n\n#### Catalog-side orphan recovery\n\nIf a `cleanup-all` run is interrupted (DuckLake bug: an S3 NoSuchKey on DELETE rolls back the whole transaction, but the S3 deletes already-completed are permanent), the catalog ends up with rows in `ducklake_files_scheduled_for_deletion` that point at S3 keys that no longer exist. Every subsequent `cleanup-all` will crash on those orphans until they're cleaned up. The catalog-recovery subcommands handle this without manual SQL surgery:\n\n| Subcommand | Action |\n|---|---|\n| `find-orphans` | List orphan rows on stdout (read-only). |\n| `heal-orphans` | Delete the orphan rows. Two safety gates: B1 proves `ducklake_data_file` is non-empty AND no orphan path is still live; B3 aborts if any positional-delete vector references an orphan id. `--dry-run` runs the gates but skips the DELETE. |\n| `cleanup-all-safe` | Loop dedup-deletions + heal-orphans + cleanup-all under one advisory lock until cleanup-all exits clean. Caps at `--max-iterations` (default 10). |\n| `fsck` | `cleanup-all-safe` followed by `ducklake_delete_orphaned_files` (S3-side orphan sweep). The end-to-end \"lake catalog is healthy\" recipe. 
|\n\nMutual exclusion comes from `pg_try_advisory_lock(hashtext('millpond-ducklake-maintenance')::bigint)` taken on the `pg` ATTACH; concurrent maintenance invocations bail with a clear error rather than racing each other's DELETEs.\n\n`tools/ducklake_maintenance.sql` is loaded at every session start (both by `ducklake_maintenance.py` and by the `just shell` recipe) and defines small DuckDB macros for ad-hoc inspection — `SELECT count_pending_dups()` for queue dup count, `SELECT * FROM find_catalog_orphans('s3://bucket/lake/data')` for the orphan list. The header documents the conventions (no `LEFT ANTI JOIN`, no duckdb-side `ctid`, advisory-lock key) that any new recipe must follow.\n\n`tools/justfile` wraps the script and is also baked into the image at `/justfile` for interactive use:\n\n```bash\njust --list                  # see available recipes\njust maintain-dry-run 3      # preview: expire \u003e3 day snapshots + cleanup\njust maintain 3              # execute it\njust dedup-deletions-dry-run # preview duplicate rows in the pending-deletion queue\njust dedup-deletions         # drop them\njust find-orphans            # list catalog-side orphan rows\njust heal-orphans-dry-run    # preview heal-orphans (gates only, no DELETE)\njust heal-orphans            # delete catalog-side orphan rows\njust cleanup-all-safe        # dedup + heal + cleanup-all in a loop\njust fsck-dry-run            # preview fsck end-to-end\njust fsck                    # bring catalog to known-good state\njust shell                   # interactive DuckDB shell with lake + pg ATTACHed and macros loaded\njust drop events             # drop a table (data files remain until cleanup)\njust orphans-dry-run         # preview S3-side orphaned files\n```\n\nAll commands use the pod's existing env vars (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`).\n\n### DuckLake state metrics\n\n`tools/ducklake_metrics.py` is a small long-running daemon that runs catalog-side queries against the 
DuckLake on a schedule and exposes results as Prometheus gauges over HTTP. Same Docker image as `ducklake_maintenance.py`; intended to run as a single-replica Deployment so a Prometheus scraper can watch lake shape, compaction backlog, snapshot age, partition skew, and the pending-deletion queue without S3 round trips.\n\n```bash\njust ducklake-metrics                       # built-ins only, listens on :9100\njust ducklake-metrics-with-config queries.yaml   # extend built-ins from user YAML\njust ducklake-metrics-list                  # print resolved query list and exit (no connection needed)\n```\n\nEndpoints: `/metrics` (Prometheus exposition), `/-/healthy` (k8s liveness), `/-/ready` (k8s readiness). The daemon reconnects to the catalog with exponential backoff (1s → 60s cap) on connect failure, and forces a reconnect after 10 consecutive query failures across all queries; transient SQL errors in a single query log + increment `ducklake_metrics_query_errors_total` without killing the process.\n\nBuilt-in queries:\n\n| Metric prefix | Labels | Values | Source |\n|---|---|---|---|\n| `ducklake_pending_deletes` | — | `total`, `unique_paths`, `dup_rows` | `ducklake_files_scheduled_for_deletion` |\n| `ducklake_files_per_band` | `band` (`lt1mib` / `1to5mib` / `5to10mib` / `10to32mib` / `32to64mib` / `64to128mib` / `gt128mib`) | `count`, `bytes` | `ducklake_data_file` |\n| `ducklake_compaction_candidates` | `tier` (`tier1` / `tier2` / `tier3` / `large` / `total`) | `count` | `ducklake_data_file` |\n| `ducklake_snapshots` | — | `count`, `oldest_seconds_ago`, `newest_seconds_ago` | `ducklake_snapshot` |\n| `ducklake_files_per_partition_top20` | `partition` | `count` | `ducklake_data_file` ⨝ `ducklake_file_partition_value` |\n| `ducklake_catalog` | `suffix` | `format_version` | `ducklake_metadata` (key=`version`); numeric `major.minor` lands in the value, any trailing tag (e.g. `-dev1`, `-rc7`) lands in the `suffix` label so dev/pre-release builds stay distinguishable. 
Empty `suffix=\"\"` for clean releases |\n\nPlus self-metrics: `ducklake_metrics_up`, `ducklake_metrics_query_duration_seconds{query}`, `ducklake_metrics_query_last_success_timestamp{query}`, `ducklake_metrics_query_errors_total{query}`.\n\nUser YAML schema (extends or overrides built-ins by name):\n\n```yaml\nqueries:\n  - name: events_files_per_table\n    help: Live data file count by table (custom example)\n    interval_mins: 5            # positive integer; minimum 1\n    labels: [table_name]\n    values: [count]\n    sql: |\n      SELECT t.table_name, COUNT(*) AS count\n      FROM __ducklake_metadata_lake.ducklake_data_file df\n      JOIN __ducklake_metadata_lake.ducklake_table t USING (table_id)\n      WHERE df.end_snapshot IS NULL\n      GROUP BY t.table_name\n```\n\nBuilt-ins are intentionally lake-wide (no `table_name` label); per-table breakdowns belong in user YAML when needed.\n\nConfiguration env vars (in addition to the standard `DUCKLAKE_*` / `DUCKDB_*` set used by `ducklake_maintenance.py`):\n\n| Variable | Default | Description |\n|---|---|---|\n| `DUCKLAKE_METRICS_PORT` | `9100` | HTTP listen port |\n| `DUCKLAKE_METRICS_CONFIG` | unset | Path to user-supplied queries YAML |\n| `DUCKLAKE_METRICS_DISABLE` | unset | Comma-separated query names to skip from built-ins |\n\n## Configuration\n\nAll configuration via environment variables.\n\n### Shared (always required)\n\n| Variable | Required | Default | Description |\n|----------|----------|---------|-------------|\n| `KAFKA_BOOTSTRAP_SERVERS` | yes | | Kafka broker addresses |\n| `KAFKA_TOPIC` | yes | | Topic to consume |\n| `REPLICA_COUNT` | yes | | Number of StatefulSet replicas (must match `spec.replicas`) |\n| `MILLPOND_DESTINATION` | no | `ducklake` | Destination format: `ducklake` or `iceberg`. Case-insensitive; empty/whitespace falls back to `ducklake`. 
|\n| `FLUSH_SIZE` | no | `104857600` | Flush after this many bytes of accumulated Arrow data (default 100MB) |\n| `FLUSH_INTERVAL_MS` | no | `60000` | Flush after this many ms |\n| `GROUP_ID` | no | `millpond-{topic}-{table}` | Kafka group.id — used for offset storage in `__consumer_offsets` only, no consumer group semantics. Changing this loses committed offsets and triggers full replay. For Iceberg the default is `millpond-{topic}-{iceberg_table}` (the namespace prefix only shows up in metrics/client.id, not group.id). |\n| `CONSUME_BATCH_SIZE` | no | `1000` | Max messages per `consume()` call — amortizes Python↔C boundary cost |\n| `FETCH_MIN_BYTES` | no | `1048576` | Broker accumulates at least this many bytes before responding (1MB) |\n| `FETCH_MAX_WAIT_MS` | no | `500` | Max broker wait when `fetch.min.bytes` not yet satisfied |\n| `STATS_INTERVAL_MS` | no | `5000` | librdkafka internal stats emission interval (0 to disable) |\n| `LOG_LEVEL` | no | `INFO` | Python log level (DEBUG, INFO, WARNING, ERROR) |\n\n### DuckLake (required when `MILLPOND_DESTINATION=ducklake`)\n\n| Variable | Required | Default | Description |\n|----------|----------|---------|-------------|\n| `DUCKLAKE_TABLE` | yes | | Target DuckLake table name |\n| `DUCKLAKE_DATA_PATH` | yes | | S3 path for DuckLake data files |\n| `DUCKLAKE_CONNECTION` | yes | | DuckDB connection string |\n| `DUCKLAKE_RDS_HOST` | yes | | Postgres host for DuckLake metadata |\n| `DUCKLAKE_RDS_PORT` | no | `5432` | Postgres port |\n| `DUCKLAKE_RDS_DATABASE` | no | `ducklake` | Postgres database name |\n| `DUCKLAKE_RDS_USERNAME` | no | `ducklake` | Postgres username |\n| `DUCKLAKE_RDS_PASSWORD` | yes | | Postgres password |\n| `DUCKLAKE_PARTITION_BY` | no | | Hive-style partition expression (e.g. `year(_inserted_at),month(_inserted_at),day(_inserted_at),hour(_inserted_at)`). Applied via `ALTER TABLE SET PARTITIONED BY` on first write. 
|\n| `DUCKDB_S3_ACCESS_KEY_ID` | yes | | Static S3 access key for DuckDB |\n| `DUCKDB_S3_SECRET_ACCESS_KEY` | yes | | Static S3 secret for DuckDB |\n| `DUCKDB_S3_REGION` | no | | S3 region |\n| `DUCKDB_S3_ENDPOINT` | no | | S3 endpoint override (MinIO, etc.) |\n| `DUCKDB_S3_USE_SSL` | no | | `true` / `false` |\n| `DUCKDB_S3_URL_STYLE` | no | | `vhost` / `path` |\n\n### Iceberg (required when `MILLPOND_DESTINATION=iceberg`)\n\n| Variable | Required | Default | Description |\n|----------|----------|---------|-------------|\n| `ICEBERG_CATALOG_URI` | yes | | REST catalog endpoint (e.g. `https://catalog.example.com`) |\n| `ICEBERG_WAREHOUSE` | yes | | Warehouse identifier, typically the S3 root (`s3://warehouse/`) |\n| `ICEBERG_NAMESPACE` | yes | | Catalog namespace (validated as a safe identifier) |\n| `ICEBERG_TABLE` | yes | | Target table name within the namespace |\n| `ICEBERG_TABLE_LOCATION` | no | | Explicit `s3://...` table location; if unset, catalog picks |\n| `ICEBERG_CATALOG_TOKEN` | no | | Bearer / OAuth token for the REST catalog |\n| `MILLPOND_S3_ACCESS_KEY_ID` | yes | | Static S3 access key for PyIceberg's PyArrow S3 filesystem |\n| `MILLPOND_S3_SECRET_ACCESS_KEY` | yes | | Static S3 secret |\n| `MILLPOND_S3_REGION` | yes | | S3 region |\n| `MILLPOND_S3_ENDPOINT` | no | | S3 endpoint override (MinIO, etc.) |\n\n`MILLPOND_S3_*` is a separate env var family from `DUCKDB_S3_*` deliberately — they target different client libraries, and a deployment switch from DuckLake to Iceberg should be a clean swap of env vars rather than re-using the DuckDB-specific names.\n\n### Optional record handling\n\nSee [Record Handling](#record-handling) for context. All four variables below are optional; unset means the corresponding stage is disabled.\n\n| Variable | Required | Default | Description |\n|----------|----------|---------|-------------|\n| `MILLPOND_FILTER_KEEP_FIELD_NAME` | no | | Column name to check against the allowlist. 
Must be set with `MILLPOND_FILTER_VALUES`. Validated as a safe identifier. |\n| `MILLPOND_FILTER_DROP_FIELD_NAME` | no | | Reserved for a future denylist filter; setting it today raises at startup. Mutually exclusive with `MILLPOND_FILTER_KEEP_FIELD_NAME`. |\n| `MILLPOND_FILTER_VALUES` | no | | Comma-separated allowed values. Auto-detected as int if every token parses as an integer, string otherwise. Required when either filter field name is set. |\n| `MILLPOND_SORT_BY` | no | | Comma-separated column names; the batch is sorted ascending by these in tuple order before each write. Missing fields cause the sort to be skipped (records still flow). |\n\n## Releases\n\nEvery merge to `main` automatically:\n1. Bumps the patch version (`v0.0.1` → `v0.0.2`)\n2. Builds and pushes a Docker image to `ghcr.io/posthog/millpond:\u003ctag\u003e`\n3. Creates a GitHub release with changelog\n\nImages: `ghcr.io/posthog/millpond:v0.0.X` or `ghcr.io/posthog/millpond:latest`\n\n## Deployment\n\n```bash\nkubectl apply -f k8s/service.yaml\nkubectl apply -f k8s/pdb.yaml\nkubectl apply -f k8s/statefulset.yaml\n```\n\nPartition count is discovered at startup via `consumer.list_topics()`. Each pod computes its partition assignment from its ordinal:\n\n```python\nmy_partitions = [p for p in range(partition_count) if p % replica_count == ordinal]\n```\n\n### Updating\n\nRolling updates are a poor fit — pods with different `REPLICA_COUNT` values cause double-assignment or gaps. Since Kafka is the durable buffer:\n\n1. **Canary**: Deploy one pod with the new version, verify metrics\n2. **Graceful shutdown**: Scale to 0 (pods flush and commit)\n3. **Full redeploy**: Update image/config, scale back up from committed offsets\n\nDowntime = drain time + startup time (~2-3 min). 
Kafka buffers trivially.\n\n**Never `kubectl scale` without updating `REPLICA_COUNT`.** Use Helm to manage both atomically.\n\n## Partitioning\n\nPartitioning is per-destination — DuckLake takes a caller-supplied expression, Iceberg is hardcoded.\n\n### DuckLake\n\nSet `DUCKLAKE_PARTITION_BY` to enable Hive-style partitioning on S3. Files are written into `key=value/` directories (e.g. `year=2026/month=3/day=23/hour=21/*.parquet`), enabling S3 prefix filtering, bulk lifecycle rules, and partition discovery by external tools.\n\n```bash\nDUCKLAKE_PARTITION_BY=\"year(_inserted_at),month(_inserted_at),day(_inserted_at),hour(_inserted_at)\"\n```\n\nPartition on `_inserted_at` (always a real TIMESTAMP), not source `timestamp` fields (typically VARCHAR). Applied via `ALTER TABLE SET PARTITIONED BY` on first write — idempotent, safe for multiple pods and restarts. If added to an existing unpartitioned table, new files get HSP layout while old files remain flat; DuckLake queries both transparently via metadata.\n\n### Iceberg\n\nThe partition spec is hardcoded: identity transforms on four int32 columns (`year`, `month`, `day`, `hour`) derived from `_inserted_at` at write time. This produces the same Hive-style layout as DuckLake — `year=2026/month=3/day=23/hour=21/*.parquet` — for the same S3-prefix-filter and lifecycle reasons. There is no env var; every Iceberg deployment gets the same spec.\n\nTrade-off: Iceberg doesn't know the four columns are derived from `_inserted_at`, so reader queries need to filter on the partition columns explicitly to get pruning. A future spec evolution can layer hidden partitioning on top without rewriting data if reader ergonomics start to matter; not needed today.\n\n## Object Sizing\n\nS3 throughput scales with object size — small objects (\u003c1MB) waste per-request overhead, while larger objects (128MB+) maximize GET/PUT throughput. 
Millpond flushes are triggered by whichever comes first: `FLUSH_SIZE` (Arrow bytes in memory) or `FLUSH_INTERVAL_MS` (wall clock). The resulting Parquet file is typically **3-4x smaller** than the Arrow representation due to columnar encoding and compression.\n\nAt steady state with moderate volume, most flushes are **time-triggered** — the interval expires before the size ceiling is hit. Object size is therefore driven by: `(msgs/s per pod) × (bytes/msg as Parquet) × (flush interval)`.\n\n### Sizing by volume\n\nAssuming ~366 bytes/row in Parquet (7-column event schema), 512 partitions, 8 replicas (64 partitions/pod):\n\n| Per-partition msg/s | Total msg/s | Per-pod msg/s | Parquet/file @60s | Parquet/file @90s | Memory/pod @90s |\n|---|---|---|---|---|---|\n| 500 | 256K | 32K | ~11MB | ~17MB | 512Mi |\n| 1K | 512K | 64K | ~23MB | ~34MB | 512Mi |\n| 2K | 1M | 128K | ~45MB | ~68MB | 512Mi |\n| 4K | 2M | 256K | ~90MB | ~135MB | 640Mi |\n| 9.5K (peak) | 4.9M | 608K | ~213MB | ~320MB | 1Gi |\n\n### Recommended settings for ~128MB target objects\n\nFor a pipeline averaging 4K msg/s per partition with 512 partitions and 8 replicas:\n\n```yaml\nFLUSH_SIZE: \"1073741824\"       # 1GB Arrow ceiling (safety valve for burst/catchup)\nFLUSH_INTERVAL_MS: \"90000\"     # 90s — produces ~135MB Parquet at mean volume\n```\n\nMemory limit: 640Mi (90s × 256K msg/s × ~1KB Arrow/msg ≈ ~230MB Arrow + DuckDB + librdkafka overhead).\n\nAt peak (9.5K/partition), the size trigger fires at ~35s producing ~320MB objects — acceptable, and the pod stays within 1Gi.\n\n### When to add a merge job\n\nIf your volume is low enough that time-triggered flushes produce \u003c10MB objects, run periodic compaction. The `compact` subcommand implements a tiered strategy: small files merge frequently into medium files, medium files merge less often into large files. 
Each tier saves and restores the catalog's `target_file_size` so running one tier doesn't permanently change file sizing for inserts or other compactions.\n\n```bash\njust compact-to-tier-1-dry-run        # preview: files \u003c1 MiB → ~5 MiB\njust compact-to-tier-1                # execute (catalog-wide)\njust compact-to-tier-2 events         # tier 1-\u003e2, scoped to one table\njust compact-probe events 4           # diagnostic: merge up to 4 adjacent files in 'events'\n```\n\nTier ranges (verified semantics: `min_file_size` inclusive, `max_file_size` exclusive):\n\n| Recipe | Input range | Target |\n|---|---|---|\n| `compact-to-tier-1` | `[0, 1 MiB)` | ~5 MiB |\n| `compact-to-tier-2` | `[1 MiB, 10 MiB)` | ~32 MiB |\n| `compact-to-tier-3` | `[10 MiB, 64 MiB)` | ~128 MiB |\n\nThe `compact` subcommand bounds DuckDB resource use during the merge — `--threads` (default 2) and `--memory-limit` (default 4GB) — because `ducklake_merge_adjacent_files` isn't fully streaming today and over-uses memory relative to input size. 
The defaults are conservative; raise them on lakes that fit comfortably in pod memory.\n\nThis is an out-of-band maintenance operation, not part of the hot path.\n\nSee the [sizing calculator](https://posthog.github.io/millpond/sizing-calculator.html) for interactive estimates.\n\n## Error Handling and Retries\n\nThe flush path has two failure points, each with its own retry policy:\n\n| Operation | Attempts | Backoff between failures | On exhaustion |\n|-----------|----------|--------------------------|---------------|\n| Lake write | 3 | 1s, 2s (last attempt raises immediately) | Re-raise → pod crashes, K8s restarts, replays from last committed offset |\n| Offset commit | 3 | 0.5s, 1s (last attempt raises immediately) | Re-raise → pod crashes, replays from last committed offset (duplicates bounded by one flush batch) |\n\nThe lake write and offset commit paths report failures through the `errors_total{type=\"write_retry\"}` and `errors_total{type=\"offset_commit\"}` counters respectively, so transient vs persistent failures are distinguishable in dashboards.\n\nThe write-retry loop catches `Exception` broadly to cover both backends' failure modes — `duckdb.Error` for DuckLake; `pyiceberg.exceptions.CommitFailedException`, `CommitStateUnknownException`, `ServerError`, `ServiceUnavailableError` for Iceberg REST catalog 5xx; `OSError` for S3; `KafkaException` for broker disconnects. Each retry invokes `sink.reset_caches()` to drop cached table/schema state so the next attempt re-checks the catalog (covers the case where another pod evolved the schema or recreated the table between attempts).\n\n**Why crash after exhausting retries?** A persistent write failure means S3 or the catalog is down — continuing would just accumulate pending data in memory until OOM. A persistent commit failure means the Kafka coordinator is unreachable — the write already succeeded, but without committed offsets the next restart will replay the batch (at-least-once duplicates). 
In both cases, crashing lets K8s apply its restart backoff, and Kafka holds the data safely until the dependency recovers.\n\n## Multiple Pipelines\n\nEach topic→table mapping is a separate StatefulSet. The application doesn't change — just the env vars. Template with Helm:\n\n```yaml\n# values.yaml\npipelines:\n  events:\n    topic: clickhouse_events_json\n    table: events\n    partitions: 512\n    replicas: 8\n  sessions:\n    topic: clickhouse_sessions_json\n    table: sessions\n    partitions: 64\n    replicas: 4\n  logs:\n    topic: app_logs\n    table: logs\n    partitions: 128\n    replicas: 8\n```\n\nOne `range` over `pipelines` in the StatefulSet template produces N independent StatefulSets. Adding a pipeline is adding a block to `values.yaml` and running `helm upgrade`.\n\n## AWS Credential Isolation\n\nMillpond uses two separate AWS credential paths that must not interfere with each other:\n\n| Component | Auth | Credential source |\n|---|---|---|\n| Kafka (MSK) | SASL/OAUTHBEARER | IRSA (standard AWS credential chain) |\n| S3 (lake data files) | Static IAM keys | `DUCKDB_S3_*` (DuckLake) or `MILLPOND_S3_*` (Iceberg) |\n\nNeither backend uses the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` env vars — those take precedence in the credential chain and would shadow the IRSA role used for Kafka authentication. DuckDB-specific names for DuckLake, Millpond-specific names for Iceberg. The two families are deliberately separate so a deployment switch between destinations is a clean env-var swap rather than a re-use.\n\nDuckDB's [aws extension does not support IRSA](https://github.com/duckdb/duckdb-aws/issues/31) — it cannot perform the `AssumeRoleWithWebIdentity` token exchange that IRSA requires. PyIceberg's S3 access is similarly handled via static credentials passed through catalog properties (`s3.access-key-id` etc.) to keep the IRSA token out of the S3 client's credential resolution. 
Same isolation pattern, different transport.\n\n## Operational Notes\n\n### Periodic MSK IAM auth errors\n\nWhen using MSK IAM authentication (SASL/OAUTHBEARER), you will see periodic bursts of `connection reset by peer` and `SASL OAUTHBEARER mechanism handshake failed` errors in the logs every ~48 minutes. These are **expected and harmless**.\n\nlibrdkafka does not re-authenticate on existing connections when the OAUTHBEARER token refreshes ([KIP-255](https://cwiki.apache.org/confluence/display/KAFKA/KIP-255%3A+OAuth+Authentication+via+SASL%2FOAUTHBEARER)). Instead, the MSK broker closes the connection when the old token expires (~15 min lifetime), and librdkafka reconnects with the refreshed token. The ~48 minute interval corresponds to the IRSA projected token refresh (80% of the default 1-hour TTL).\n\nThe errors come from librdkafka's internal logger (the `%3|...|FAIL|` lines) and bypass Python's log formatting. They auto-resolve within seconds with no data loss.\n\nRelated issues:\n- [confluent-kafka-python #1485](https://github.com/confluentinc/confluent-kafka-python/issues/1485) — oauth token not refreshing on existing connections\n- [aws-msk-iam-auth #143](https://github.com/aws/aws-msk-iam-auth/issues/143) — re-authentication fails with OAUTHBEARER\n- [aws-msk-iam-auth #176](https://github.com/aws/aws-msk-iam-auth/issues/176) — second re-authentication fails with default credentials\n\n## Next steps\n\n### Iceberg multi-writer commit contention\n\nThe Iceberg sink can't currently sustain two pods committing to the same table at typical flush cadence. PyIceberg's REST commit attaches a branch-snapshot requirement (`expected id != actual id`); when a second writer commits between when we loaded the table and when we send the commit, the catalog rejects with `CommitFailedException: branch main has changed`. 
`_write_with_retry` in `main.py` invalidates caches and makes up to 3 attempts with exponential backoff (1s, 2s), but under sustained dual-writer load with `FLUSH_INTERVAL_MS=5000`, the retries collide with the *next* round of commits and exhaust the budget, so the pod exits.\n\nThis is why `docker-compose.iceberg.yaml` runs a single millpond pod while `docker-compose.yaml` (DuckLake) runs two. The DuckLake path serialises writes through Postgres-backed catalog locks; Iceberg's optimistic concurrency control needs more retry headroom and jittered backoff to avoid retry-storms, or the writers need partition-aware table layouts so they aren't contending on the same snapshot.\n\nThis was surfaced by the e2e suite when it was first wired up; see `tests/e2e/test_e2e_iceberg.py` and the comment in `docker-compose.iceberg.yaml`.\n\n## Note\nThis project should absolutely be called TableFowl, but that would be an [SEO](https://www.confluent.io/product/tableflow/) and linguistic palaver.\n\n---\n\nPhoto: Public Domain, [Wikimedia Commons](https://commons.wikimedia.org/w/index.php?curid=695982)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposthog%2Fmillpond","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fposthog%2Fmillpond","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposthog%2Fmillpond/lists"}