{"id":49907839,"url":"https://github.com/ez-biz/data-builder","last_synced_at":"2026-05-16T11:12:41.132Z","repository":{"id":352344780,"uuid":"1181560951","full_name":"ez-biz/data-builder","owner":"ez-biz","description":"Visual ETL pipeline platform — connect databases, browse catalogs, build pipelines with drag \u0026 drop, CDC to S3, scheduling, monitoring \u0026 log export","archived":false,"fork":false,"pushed_at":"2026-04-19T04:53:13.000Z","size":272,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-19T06:21:58.922Z","etag":null,"topics":["cdc","data-engineering","etl","fastapi","pipeline","react"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ez-biz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-14T10:04:09.000Z","updated_at":"2026-04-19T04:53:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ez-biz/data-builder","commit_stats":null,"previous_names":["ez-biz/data-builder"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ez-biz/data-builder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ez-biz%2Fdata-builder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ez-biz%2Fdata-builder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ez-biz%2Fdata-builder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ez-biz%2Fdata-builder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ez-biz","download_url":"https://codeload.github.com/ez-biz/data-builder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ez-biz%2Fdata-builder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33100383,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdc","data-engineering","etl","fastapi","pipeline","react"],"created_at":"2026-05-16T11:12:40.878Z","updated_at":"2026-05-16T11:12:41.123Z","avatar_url":"https://github.com/ez-biz.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Builder\n\nA visual ETL pipeline platform for building data workflows with drag-and-drop. Connect to databases, browse table catalogs, and create pipelines — no code required.\n\n## Features\n\n- **Database Connectors** — Connect to PostgreSQL and Databricks with Fernet-encrypted credential storage\n- **Catalog Browser** — Browse schemas, tables, and columns; preview rows on demand\n- **Visual Pipeline Builder** — Drag-and-drop canvas with 6 node types:\n  - **Source** — Read from a database table\n  - **Filter** — Apply WHERE conditions\n  - **Transform** — Rename, cast, or compute columns\n  - **Join** — Inner, left, right, full, or cross joins\n  - **Aggregate** — GROUP BY with SUM, COUNT, AVG, MIN, MAX\n  - **Destination** — Write to a target table (`append` / `overwrite`)\n- **Pipeline Validation** — DAG cycle detection, handle validation, connectivity checks\n- **Distributed Execution** — Celery workers pull jobs from Redis; scale horizontally\n- **Run Control** — Trigger runs, cancel in-flight runs (SIGTERM revoke), retry failed runs\n- **Scheduled Runs** — Attach a cron expression to any pipeline; a polling scheduler dispatches due jobs\n- **CDC Streams (poll-based)** — Track a monotonically-increasing column, write new rows to S3 in JSONL or CSV with exponential-backoff auto-retry on transient DB errors\n- **Monitoring** — Run history with status, duration, rows; aggregated stats dashboard; exportable logs (JSON/CSV); webhook notifications with HMAC signing\n- **Auto-save** — Debounced save (3s) preserves canvas state\n- **Workbench UI** — Dark sidebar + light work area + dot-grid canvas; Emerald primary accent; Inter + JetBrains Mono typography; fully keyboard-accessible\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────────────┐\n│  Frontend (React + TypeScript + Vite)               │\n│  ┌──────────┐ ┌──────────────┐ ┌─────────────────┐  │\n│  │ Sidebar  │ │ React Flow   │ │ Config Panel    │  │\n│  │ Catalog  │ │ Canvas       │ │ / Run History   │  │\n│  │ Browser  │ │ (drag/drop)  │ │                 │  │\n│  └──────────┘ └──────────────┘ └─────────────────┘  │\n│   Zustand (canvas)  TanStack Query (server state)   │\n│   shadcn/Radix primitives · Workbench design system │\n└───────────────────────┬─────────────────────────────┘\n                        │ REST API\n┌───────────────────────┴─────────────────────────────┐\n│  Backend (FastAPI) — stateless API server           │\n│  ┌────────────┐ ┌──────────┐ ┌────────────────────┐ │\n│  │ Connectors │ │ Catalog  │ │ Pipeline CRUD +    │ │\n│  │ (PG, DBX)  │ │ Service  │ │ Validation + Runs  │ │\n│  └────────────┘ └──────────┘ └────────────────────┘ │\n│  Scheduler thread (cron poll) · Notification webhook│\n└────────┬──────────────────────┬─────────────────────┘\n         │                      │ dispatch via Redis\n         │                      ▼\n         │          ┌─────────────────────────────┐\n         │          │ Celery worker(s)            │\n         │          │ pipeline.run · cdc.sync     │\n         │          │ (retry + SIGTERM revoke)    │\n         │          └──────────┬──────────────────┘\n         ▼                     ▼\n  ┌──────────────┐    ┌──────────────┐    ┌──────────┐\n  │ PostgreSQL 16│    │  Redis 7     │    │ S3 (CDC) │\n  │ app metadata │    │  broker+back │    │ output   │\n  └──────────────┘    └──────────────┘    └──────────┘\n```\n\n## Tech Stack\n\n| Layer     | Technology                                                            |\n|-----------|-----------------------------------------------------------------------|\n| Frontend  | React 19, TypeScript, Vite, Tailwind 4, React Flow 12                 |\n| State     | Zustand (canvas), TanStack Query (server)                             |\n| UI        | shadcn/Radix primitives, Lucide icons, Inter + JetBrains Mono         |\n| Design    | Workbench tokens (Emerald #059669); custom primitives — `DataTable`, `StatCard`, `EmptyState`, `PageHeader`, `Badge` variants |\n| Backend   | Python, FastAPI, SQLAlchemy 2.0, Alembic, psycopg2                    |\n| Tasking   | Celery 5 + Redis broker; retry policy on transient DB errors          |\n| Database  | PostgreSQL 16 (metadata), Redis 7 (broker + cache)                    |\n| Object    | AWS S3 via boto3 (CDC destination)                                    |\n| Security  | Fernet encryption (credentials), HMAC-SHA256 webhook signing          |\n\n## Quick Start\n\n### Prerequisites\n\n- Python 3.9+\n- Node.js 18+ with pnpm\n- Docker \u0026 Docker Compose (for PostgreSQL and Redis)\n\n### 1. Clone and start databases\n\n```bash\ngit clone \u003crepo-url\u003e data-builder\ncd data-builder\ndocker compose -f docker/docker-compose.yml up postgres redis -d\n```\n\n### 2. Backend setup\n\n```bash\ncd backend\npython3 -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip setuptools\npip install -e \".[dev]\"\n\n# Run migrations\nalembic upgrade head\n\n# Start server\nuvicorn app.main:app --reload --port 8000\n```\n\n### 3. Frontend setup\n\n```bash\ncd frontend\npnpm install\npnpm run dev\n```\n\nOpen [http://localhost:5173](http://localhost:5173)\n\n### 4. Celery worker (for actual pipeline execution)\n\n```bash\ncd backend\nsource .venv/bin/activate\ncelery -A app.celery_app worker --loglevel=info --concurrency=4\n```\n\nWithout the worker, runs dispatch but stay in `pending`. Docker Compose starts one automatically.\n\n### Full Docker (alternative)\n\n```bash\ndocker compose -f docker/docker-compose.yml up --build\n```\n\nBrings up: `postgres`, `redis`, `backend` (FastAPI + scheduler), `worker` (Celery), `frontend`.\n\n## Development\n\n```bash\n# Run backend tests (112 tests)\ncd backend \u0026\u0026 source .venv/bin/activate \u0026\u0026 pytest -v\n\n# Build frontend (tsc + vite)\ncd frontend \u0026\u0026 pnpm run build\n\n# All-in-one dev (requires docker for PG/Redis)\nmake dev\n```\n\n### Design docs\n\nCompleted and upcoming work is captured under `docs/superpowers/`:\n\n- `specs/2026-04-18-ui-ux-revamp-design.md` — Workbench design spec\n- `plans/2026-04-18-ui-ux-revamp.md` — implementation plan (31 tasks)\n\n### Makefile commands\n\n| Command           | Description                          |\n|-------------------|--------------------------------------|\n| `make setup`      | Install all dependencies             |\n| `make dev`        | Start PG/Redis + backend + frontend  |\n| `make docker-up`  | Start everything via Docker          |\n| `make docker-down`| Stop Docker services                 |\n| `make migrate`    | Run Alembic migrations               |\n| `make test`       | Run all tests                        |\n\n## API Endpoints\n\n| Method | Endpoint                                                   | Description                        |\n|--------|------------------------------------------------------------|------------------------------------|\n| GET    | `/api/health`                                              | Health check                       |\n| CRUD   | `/api/connectors`                                          | Manage connectors                  |\n| POST   | `/api/connectors/{id}/test`                                | Test connection                    |\n| GET    | `/api/catalog/{id}/schemas`                                | List schemas                       |\n| GET    | `/api/catalog/{id}/schemas/{s}/tables`                     | List tables                        |\n| GET    | `/api/catalog/{id}/schemas/{s}/tables/{t}/columns`         | List columns                       |\n| GET    | `/api/catalog/{id}/schemas/{s}/tables/{t}/preview`         | Preview data                       |\n| CRUD   | `/api/pipelines`                                           | Manage pipelines                   |\n| POST   | `/api/pipelines/{id}/validate`                             | Validate pipeline                  |\n| POST   | `/api/pipelines/{id}/run`                                  | Dispatch a run via Celery          |\n| GET    | `/api/pipelines/{id}/runs`                                 | List recent runs                   |\n| POST   | `/api/pipelines/{id}/runs/{rid}/retry`                     | Retry a failed/cancelled run       |\n| POST   | `/api/pipelines/{id}/runs/{rid}/cancel`                    | Revoke an in-flight Celery task    |\n| CRUD   | `/api/cdc/jobs`                                            | Manage CDC jobs                    |\n| POST   | `/api/cdc/jobs/{id}/sync`                                  | Trigger an incremental CDC sync    |\n| POST   | `/api/cdc/jobs/{id}/snapshot`                              | Trigger a full snapshot CDC sync   |\n| POST   | `/api/cdc/jobs/{id}/start`                                 | Start the long-running watcher     |\n| POST   | `/api/cdc/jobs/{id}/stop`                                  | Stop (cancel) the watcher          |\n| GET    | `/api/cdc/jobs/{id}/logs`                                  | List CDC sync logs                 |\n| GET    | `/api/monitoring/stats?days=N`                             | Aggregated run + CDC stats         |\n| POST   | `/api/monitoring/export/webhook`                           | Push logs to a webhook endpoint    |\n\nInteractive docs at [http://localhost:8000/docs](http://localhost:8000/docs) (Swagger UI).\n\n## Environment Variables\n\nCopy `.env.example` to `.env` and configure:\n\n| Variable         | Default                         | Description                     |\n|------------------|---------------------------------|---------------------------------|\n| `DATABASE_URL`   | `postgresql://...localhost:5432` | PostgreSQL connection string    |\n| `REDIS_URL`      | `redis://localhost:6379/0`      | Redis connection string         |\n| `SECRET_KEY`     | (weak default)                  | Encryption key — **change in prod** |\n| `CORS_ORIGINS`   | `[\"http://localhost:5173\"]`     | Allowed CORS origins            |\n| `LOG_LEVEL`      | `DEBUG`                         | Python logging level            |\n\n## Project Structure\n\n```\ndata-builder/\n├── backend/\n│   ├── app/\n│   │   ├── connectors/     # Database connector implementations\n│   │   ├── core/           # Encryption, exceptions\n│   │   ├── models/         # SQLAlchemy models (Connector, Pipeline)\n│   │   ├── routers/        # FastAPI route handlers\n│   │   ├── schemas/        # Pydantic request/response models\n│   │   └── services/       # Business logic layer\n│   ├── alembic/            # Database migrations\n│   └── tests/              # pytest test suite\n├── frontend/\n│   └── src/\n│       ├── api/            # React Query hooks\n│       ├── components/\n│       │   ├── layout/     # AppShell, Sidebar, Header\n│       │   ├── pipeline/   # Canvas, nodes, config panel\n│       │   ├── connectors/ # Connector forms\n│       │   └── ui/         # Reusable UI primitives\n│       ├── pages/          # Route page components\n│       ├── stores/         # Zustand stores\n│       └── types/          # TypeScript interfaces\n├── docker/                 # Dockerfiles + compose\n├── Makefile\n└── .env.example\n```\n\n## Roadmap\n\n- [x] **Phase 1** — Foundation: connectors, catalog, visual canvas, validation, auto-save\n- [x] **Phase 2a** — Distributed execution engine via Celery + Redis (cancel + retry); cron scheduling; webhook notifications\n- [x] **Phase 2b** — Workbench UI revamp (Emerald + Balanced density; `DataTable`, `StatCard`, `EmptyState`, `PageHeader`)\n- [x] **Phase 3a** — Poll-based CDC (tracking-column → S3 JSONL/CSV) with transient-error retry\n- [x] **Phase 3a.1** — CDC v2 foundation: `CDCKind` discriminator, event-log JSONL schema, `cdc.watch` long-running watcher pattern, start/stop endpoints; unblocks 3b and 3c\n- [ ] **Phase 2c** — SQL pushdown (execute as SQL instead of in-memory Python; unlocks \u003e100k-row datasets)\n- [ ] **Phase 3b** — WAL-based CDC for PostgreSQL (logical replication; captures deletes, no row-miss window)\n- [ ] **Phase 3c** — MongoDB support: new `MongoConnector` + CDC via Change Streams (native `resume_token`, captures insert/update/replace/delete)\n- [ ] **Phase 4**  — Text2SQL (natural-language → pipeline definition via LLM tool-use)\n- [ ] **UI-follow-ups** — dark mode toggle, command palette (⌘K), Playwright visual-regression suite\n\nDetailed scoping, sequencing rationale, and prerequisites for each open phase live in [`docs/superpowers/roadmap.md`](./docs/superpowers/roadmap.md).\n\n## Security\n\n- Connector credentials are encrypted at rest using Fernet symmetric encryption\n- SQL identifiers are validated against `[a-zA-Z_][a-zA-Z0-9_]*` pattern\n- PostgreSQL queries use parameterized queries via psycopg2\n- CORS is restricted to configured origins\n- Production deployment should use a strong `SECRET_KEY` (warns on weak defaults)\n\n## License\n\nData Builder is **proprietary software** — Copyright © 2026 Anchit Gupta. All rights reserved.\n\n**Any use, copy, modification, redistribution, or monetization requires prior written permission from the author.** The fact that the source is visible in this repository does not, by itself, grant any right to use it.\n\nWhat is permitted without asking:\n\n- Viewing the source on the authorized GitHub repository\n- Quoting short excerpts in technical discussion (with attribution)\n- Forking on GitHub solely to propose a pull request back to this repo\n\nWhat requires explicit permission:\n\n- Running the software for any personal, internal, educational, research, non-profit, or commercial purpose\n- Hosting or serving functionality from the software to any third party\n- Modifying, adapting, translating, or creating derivative works\n- Redistributing, sublicensing, selling, or bundling the software\n- Monetizing the software or any derivative (right expressly reserved to the author)\n- Using any distinctive name, logo, or visual identity associated with the software\n\nAny permitted use must include visible attribution:\n\n\u003e \"Data Builder by Anchit Gupta — used with permission.\n\u003e Source: https://github.com/ez-biz/data-builder\"\n\n**Requesting permission:** email **anchitgupt2012@gmail.com** with the intended use, scope, duration, whether monetary consideration is involved, and the attribution you plan to display. Full terms: [`LICENSE`](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fez-biz%2Fdata-builder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fez-biz%2Fdata-builder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fez-biz%2Fdata-builder/lists"}