{"id":47724206,"url":"https://github.com/vtvito/arrowflow","last_synced_at":"2026-04-02T20:04:29.474Z","repository":{"id":264059604,"uuid":"856534266","full_name":"VTvito/arrowflow","owner":"VTvito","description":"Build ETL pipelines in plain English. A modular microservices platform with AI-assisted pipeline generation, Apache Arrow IPC data transfer, and Airflow orchestration.","archived":false,"fork":false,"pushed_at":"2026-03-01T19:02:44.000Z","size":365,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-01T20:38:54.281Z","etag":null,"topics":["ai-agent","airflow","apache-arrow","data-engineering","data-pipeline","docker","etl","flask","llm","microservices","python","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VTvito.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-09-12T18:31:47.000Z","updated_at":"2026-03-01T19:02:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"a09dd121-bb51-43a6-9780-5649150f8fad","html_url":"https://github.com/VTvito/arrowflow","commit_stats":null,"previous_names":["vtvito/etl_microservices","vtvito/arrowflow"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/VTvito/arrowflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VTvito%2Farrowflow","tags_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/VTvito%2Farrowflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VTvito%2Farrowflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VTvito%2Farrowflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VTvito","download_url":"https://codeload.github.com/VTvito/arrowflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VTvito%2Farrowflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31314849,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","airflow","apache-arrow","data-engineering","data-pipeline","docker","etl","flask","llm","microservices","python","streamlit"],"created_at":"2026-04-02T20:04:26.243Z","updated_at":"2026-04-02T20:04:29.463Z","avatar_url":"https://github.com/VTvito.png","language":"Python","readme":"# ArrowFlow\n\n[![CI](https://github.com/VTvito/arrowflow/actions/workflows/ci.yml/badge.svg)](https://github.com/VTvito/arrowflow/actions/workflows/ci.yml)\n[![Python 
3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n[![Docker Compose](https://img.shields.io/badge/docker-compose-2496ED.svg?logo=docker\u0026logoColor=white)](docker-compose.yml)\n\n**Build ETL pipelines by describing them in plain English.** ArrowFlow is a modular platform where each data operation (extract, transform, load) is an independent microservice. An AI agent translates natural language into executable pipeline definitions, validated and run automatically.\n\n\u003e *\"Load the HR dataset, remove salary outliers, fill missing values with the median, and save as Parquet\"*\n\u003e \u0026mdash; That's all it takes. The AI agent generates a YAML pipeline, validates it, and executes it across the services.\n\n---\n\n## Key Features\n\n- **Natural Language Pipelines** \u0026mdash; Describe what you need in plain text; the AI agent generates and executes a validated YAML pipeline\n- **11 Composable Services** \u0026mdash; Extract (CSV, SQL, API, Excel), Transform (clean, filter, join, quality checks, outlier detection, LLM), Load (CSV, Excel, JSON, Parquet)\n- **High-Performance Data Transfer** \u0026mdash; Apache Arrow IPC binary format between all services (zero-copy, no CSV/JSON parsing overhead)\n- **Visual Pipeline Builder** \u0026mdash; Streamlit wizard: choose data → describe pipeline → execute → download results, all on a single page\n- **Airflow Orchestration** \u0026mdash; Production-ready DAGs with file-based XCom for large datasets\n- **Full Observability** \u0026mdash; Prometheus metrics + Grafana dashboards + structured JSON logging + correlation ID tracing\n- **Extensible** \u0026mdash; Add a new service in minutes using the included scaffold template and step-by-step guide\n\n---\n\n## Quick Start\n\n### Prerequisites\n\n- [Docker Desktop](https://www.docker.com/products/docker-desktop/) (with Docker Compose)\n- Python 
3.9+ (for local development and tests)\n\n### Setup\n\n```bash\ngit clone https://github.com/VTvito/arrowflow.git\ncd arrowflow\nmake quickstart\n```\n\nThis will build all images, start 18 containers, and load the demo datasets.\nThe Airflow admin user (`admin`/`admin`) is created automatically on first boot.\n\n### Open the UIs\n\n| Interface | URL | Credentials |\n|---|---|---|\n| **Streamlit** (Pipeline Builder + Dataset Explorer) | http://localhost:8501 | \u0026mdash; |\n| **Airflow** | http://localhost:8080 | admin / admin |\n| **Grafana** (pre-provisioned dashboard) | http://localhost:3000 | admin / *GF_SECURITY_ADMIN_PASSWORD from .env* |\n| **Prometheus** | http://localhost:9090 | \u0026mdash; |\n| **cAdvisor** (container resources) | http://localhost:8088 | \u0026mdash; |\n\n### Try a Demo Pipeline\n\nTrigger one of the pre-built DAGs from the Airflow UI:\n\n| DAG | What it does |\n|---|---|\n| `hr_analytics_pipeline` | HR data \u0026rarr; quality check \u0026rarr; drop columns \u0026rarr; outlier detection \u0026rarr; clean nulls \u0026rarr; save |\n| `ecommerce_pipeline` | E-commerce orders \u0026rarr; quality \u0026rarr; outlier detection \u0026rarr; fill nulls \u0026rarr; save |\n| `weather_api_pipeline` | Live weather API (Open-Meteo, no key needed) \u0026rarr; quality \u0026rarr; clean \u0026rarr; save as Parquet |\n\nOr paste a YAML from [`examples/pipelines/`](examples/pipelines/) into the Streamlit YAML Editor (inside the \"Edit YAML\" expander in step 3).\n\nAfter execution, results appear inline in step 4 with data preview and download buttons. 
The full **Dataset Explorer**, **Service Catalog**, **Airflow Quick Triggers**, and **Platform Health** are available under the collapsed **Advanced Tools** section.\n\n### New in the Streamlit UX\n\n- **Single-page wizard flow**: Data → Describe → Review \u0026 Execute → Results — no tab navigation needed\n- **Step 1 — Data Source**: upload, pick an existing dataset, or select a bundled demo file\n- **Step 2 — Describe**: write what you want in plain language; the pipeline is generated by the AI agent\n- **Step 3 — Review \u0026 Execute**: visual pipeline summary with one-click execute; YAML editor available as a collapsed expander for power users\n- **Step 4 — Results**: metrics, step-by-step status, data preview, and CSV/JSON/Arrow download buttons\n- **Advanced Tools** (collapsed): Dataset Explorer, Service Catalog, Airflow Quick Triggers, Platform Health\n- **OpenRouter model status light** in sidebar: one-click model reachability check before generation\n\n---\n\n## How It Works\n\n```\nUser: \"Load the HR dataset, check quality, remove salary outliers, and save as Excel\"\n  ↓\nAI Agent → generates YAML pipeline definition\n  ↓\nValidator → checks services, parameters, dependencies\n  ↓\nPipeline Compiler → executes steps in parallel via Preparator SDK\n  ↓\nOutput: cleaned dataset saved in the requested format\n```\n\nThe AI agent supports **OpenAI**, **OpenRouter**, and **local HuggingFace** models. 
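A generated definition is plain YAML over the service catalog. The sketch below is illustrative only: the service names come from the services table in this README, but the field names (steps, depends_on, params) and parameter spellings are assumptions, not the validated schema:

```yaml
steps:
  - id: extract
    service: extract-csv-service
    params:
      dataset: hr_sample.csv
  - id: outliers
    service: outlier-detection-service
    depends_on: [extract]
    params:
      column: salary
      method: zscore
      threshold: 3.0
  - id: save
    service: load-data-service
    depends_on: [outliers]
    params:
      format: xlsx
```

The validator then checks each step against the service registry before the compiler executes it.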
The YAML editor and validator work without any API key.\n\n---\n\n## Architecture\n\n```mermaid\ngraph LR\n    subgraph Sources\n        CSV[CSV Files]\n        SQL[(SQL Database)]\n        API[REST API]\n        XLS[Excel Files]\n    end\n\n    subgraph Extract Services\n        E1[extract-csv :5001]\n        E2[extract-sql :5005]\n        E3[extract-api :5006]\n        E4[extract-excel :5007]\n    end\n\n    subgraph Transform Services\n        T1[clean-nan :5002]\n        T2[delete-columns :5004]\n        T3[join-datasets :5008]\n        T4[data-quality :5010]\n        T5[outlier-detection :5011]\n        T6[text-completion-llm :5012]\n    end\n\n    subgraph Load\n        L1[load-data :5009]\n    end\n\n    subgraph Orchestration\n        AF[Airflow :8080]\n        ST[Streamlit UI :8501]\n        AG[AI Agent]\n    end\n\n    subgraph Monitoring\n        PR[Prometheus :9090]\n        GR[Grafana :3000]\n    end\n\n    CSV --\u003e E1\n    SQL --\u003e E2\n    API --\u003e E3\n    XLS --\u003e E4\n\n    E1 \u0026 E2 \u0026 E3 \u0026 E4 --\u003e|Arrow IPC| T1 \u0026 T2 \u0026 T3 \u0026 T4 \u0026 T5 \u0026 T6\n    T1 \u0026 T2 \u0026 T3 \u0026 T4 \u0026 T5 \u0026 T6 --\u003e|Arrow IPC| L1\n\n    AF --\u003e|Preparator SDK| E1 \u0026 T1 \u0026 L1\n    ST --\u003e AG\n    AG --\u003e|Pipeline YAML| AF\n\n    E1 \u0026 T1 \u0026 L1 -.-\u003e|/metrics| PR\n    PR --\u003e GR\n```\n\nAll data flows between services as **Apache Arrow IPC** \u0026mdash; a columnar binary format that avoids the overhead of CSV/JSON serialization.\n\n### Services\n\n| Category | Service | Port | Description |\n|---|---|---|---|\n| **Extract** | `extract-csv-service` | 5001 | Reads CSV files from the shared volume |\n| | `extract-sql-service` | 5005 | Executes read-only SQL queries via SQLAlchemy |\n| | `extract-api-service` | 5006 | Fetches data from REST APIs (supports auth) |\n| | `extract-excel-service` | 5007 | Reads .xls/.xlsx files |\n| **Transform** | `clean-nan-service` | 5002 | 
Handles nulls (drop, fill mean/median/mode/value, ffill, bfill) |\n| | `delete-columns-service` | 5004 | Removes specified columns |\n| | `join-datasets-service` | 5008 | Joins two datasets (inner/left/right/outer) |\n| | `data-quality-service` | 5010 | Validates data quality rules (null ratio, duplicates, types, ranges, completeness) |\n| | `outlier-detection-service` | 5011 | Z-score based outlier detection and removal |\n| | `text-completion-llm-service` | 5012 | LLM text generation via HuggingFace |\n| **Load** | `load-data-service` | 5009 | Saves data as CSV, Excel, JSON, or Parquet |\n\nEvery service also exposes `GET /health` (health check) and `GET /metrics` (Prometheus counters).\n\n---\n\n## Use Cases\n\n### HR People Analytics\n\nA 6-step pipeline for the IBM HR Attrition dataset (demo data included):\n\n**Extract CSV \u0026rarr; Data Quality \u0026rarr; Drop Columns \u0026rarr; Outlier Detection \u0026rarr; Clean NaN \u0026rarr; Load**\n\nThe DAG supports parameterized dataset name, output format, z-score threshold, and file-based XCom for large datasets.\n\n### E-commerce Order Analytics\n\nPrice validation and cleanup for e-commerce order data (demo data included):\n\n**Extract CSV \u0026rarr; Data Quality + Completeness \u0026rarr; Outlier Detection \u0026rarr; Fill NaN (median) \u0026rarr; Load as Parquet**\n\n### Live Weather Data\n\nDemonstrates the API extraction service with live data (no API key required):\n\n**Extract API (Open-Meteo) \u0026rarr; Data Quality \u0026rarr; Clean NaN (forward fill) \u0026rarr; Load as Parquet**\n\n### Example Pipeline YAMLs\n\nReady-to-use definitions in [`examples/pipelines/`](examples/pipelines/):\n- [`hr_analytics.yaml`](examples/pipelines/hr_analytics.yaml) \u0026mdash; HR analytics (6 steps)\n- [`ecommerce_analytics.yaml`](examples/pipelines/ecommerce_analytics.yaml) \u0026mdash; E-commerce orders (5 steps)\n- [`weather_data.yaml`](examples/pipelines/weather_data.yaml) \u0026mdash; Weather API (4 
steps)\n\n---\n\n## Benchmark\n\nCompare microservices vs monolithic (pure Pandas) performance:\n\n```bash\nmake benchmark-data    # Generate datasets (1k–500k rows)\nmake benchmark-all     # Run both approaches + generate charts\n```\n\nResults including PNG charts and an interactive Plotly report are saved to `benchmark/results/`.\n\n---\n\n## Development\n\n### Testing\n\n```bash\nmake test              # Run all tests (unit + integration)\nmake test-coverage     # With coverage report\nmake lint              # Ruff linter\n```\n\n### Adding a New Service\n\nCopy the scaffold template and follow the guide:\n\n```bash\ncp -r templates/new_service services/my-service\n# Replace placeholders, implement logic, register, build\n```\n\nFull walkthrough: [docs/extending.md](docs/extending.md)\n\n### Documentation\n\n| Doc | Contents |\n|---|---|\n| [docs/demo-guide.md](docs/demo-guide.md) | Step-by-step demo: UI, YAML editor, SDK, Airflow |\n| [docs/architecture.md](docs/architecture.md) | System design, Arrow IPC, parallelism, Gunicorn, security |\n| [docs/access-credentials.md](docs/access-credentials.md) | All service URLs, credentials, env vars |\n\n### Project Structure\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```\n├── docker-compose.yml          # Full stack (18 containers)\n├── Makefile                    # Common commands\n├── data/demo/                  # Bundled demo datasets\n│   ├── hr_sample.csv\n│   └── ecommerce_orders.csv\n├── examples/pipelines/         # Ready-to-use YAML pipelines\n├── templates/new_service/      # Service scaffold template\n├── docs/extending.md           # Extension guide\n├── airflow/dags/               # Airflow DAG definitions\n├── preparator/                 # Client SDK + service registry\n├── services/\n│   ├── common/                 # Shared utilities (Arrow, logging, health, metrics)\n│   └── \u003cservice-name\u003e/         # Each service: Dockerfile, run.py, app/\n├── ai_agent/   
                # LLM provider, pipeline agent, compiler\n├── streamlit_app/              # Streamlit UI\n├── schemas/                    # JSON Schema + service registry\n├── benchmark/                  # Performance comparison tools\n├── tests/                      # 17 unit + 2 integration test files\n└── prometheus/                 # Scrape configuration\n```\n\n\u003c/details\u003e\n\n### Key Conventions\n\n- **Business logic isolation** \u0026mdash; HTTP/Flask code in `routes.py`, pure data logic in separate modules\n- **Arrow IPC everywhere** \u0026mdash; No CSV/JSON for inter-service data transfer\n- **X-Params header** \u0026mdash; JSON-encoded parameters for transform/load services\n- **Correlation ID tracing** \u0026mdash; `X-Correlation-ID` propagated end-to-end across all services\n- **Structured JSON logging** \u0026mdash; Consistent single-line JSON output with service, correlation_id, dataset_name\n\n### Security\n\n- Dataset names validated and constrained to safe characters; file paths resolved under `/app/data` only\n- SQL extraction accepts only read-only queries (`SELECT`/`WITH`), blocks dangerous keywords, redacts credentials\n- API extraction validates URL scheme/host and blocks private network targets by default (SSRF mitigation)\n\n---\n\n## Configuration\n\n| Variable | Default | Description |\n|---|---|---|\n| `LLM_PROVIDER` | `openai` | AI agent provider (`openai`, `openrouter`, or `local`) |\n| `OPENAI_API_KEY` | \u0026mdash; | Required if `LLM_PROVIDER=openai` |\n| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model |\n| `OPENROUTER_API_KEY` | \u0026mdash; | Required if `LLM_PROVIDER=openrouter` |\n| `OPENROUTER_MODEL` | `stepfun/step-3.5-flash:free` | Default OpenRouter model |\n| `OPENROUTER_FALLBACK_MODELS` | `arcee-ai/trinity-large-preview:free,...` | Comma-separated fallback models if selected model is unavailable |\n| `LOCAL_LLM_URL` | `http://localhost:5012` | Local text-completion service URL when running Streamlit on host |\n| 
`ETL_DATA_ROOT` | `/app/data` | Base directory for datasets and metadata |\n| `ALLOW_PRIVATE_API_URLS` | `false` | Allow private/local API targets in extract-api |\n\nSee [`.env.example`](.env.example) for all available variables including database and monitoring credentials.\n\n---\n\n## Technology Stack\n\n| Layer | Technology |\n|---|---|\n| Microservices | Python 3.9, Flask, Gunicorn |\n| Data Format | Apache Arrow IPC (streaming) |\n| Orchestration | Apache Airflow |\n| AI Agent | OpenAI / OpenRouter / HuggingFace Transformers |\n| UI | Streamlit |\n| Containers | Docker, Docker Compose (PostgreSQL 16, Airflow 2.10.4) |\n| Monitoring | Prometheus + Grafana |\n| Testing | pytest, ruff |\n| CI/CD | GitHub Actions |\n\n---\n\n## License\n\nMIT\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvtvito%2Farrowflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvtvito%2Farrowflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvtvito%2Farrowflow/lists"}