{"id":50593868,"url":"https://github.com/jaehyeon-kim/open-dataml-stack","last_synced_at":"2026-06-05T12:30:48.830Z","repository":{"id":360294820,"uuid":"1223119078","full_name":"jaehyeon-kim/open-dataml-stack","owner":"jaehyeon-kim","description":"A curated collection of open source technologies and an accompanying CLI for experimenting with modern data architecture and MLOps.","archived":false,"fork":false,"pushed_at":"2026-05-25T20:17:50.000Z","size":446,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-25T21:27:35.398Z","etag":null,"topics":["apache-airflow","apache-flink","apache-iceberg","apache-kafka","apache-spark","cli","clickhouse","data-engineering","data-infrastructure","data-lakehouse","docker-compose","mlflow","mlops","modern-data-stack","openlineage","openmetadata","prometheus","python","stream-processing","trino"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaehyeon-kim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T02:56:11.000Z","updated_at":"2026-05-25T20:28:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jaehyeon-kim/open-dataml-stack","commit_stats":null,"previous_names":["jaehyeon-kim/open-dataml-stack"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/jaehyeon-kim/open-dataml-s
tack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaehyeon-kim%2Fopen-dataml-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaehyeon-kim%2Fopen-dataml-stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaehyeon-kim%2Fopen-dataml-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaehyeon-kim%2Fopen-dataml-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaehyeon-kim","download_url":"https://codeload.github.com/jaehyeon-kim/open-dataml-stack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaehyeon-kim%2Fopen-dataml-stack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33942426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-05T02:00:06.157Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-flink","apache-iceberg","apache-kafka","apache-spark","cli","clickhouse","data-engineering","data-infrastructure","data-lakehouse","docker-compose","mlflow","mlops","modern-data-stack","openlineage","openmetadata","prometheus","python","stream-processing","trino"],"created_at":"2026-06-05T12:30:48.770Z","updated_at":"2026-06-05T12:30:48.824Z","avatar_url":"https://
github.com/jaehyeon-kim.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Open DataML Stack\n\n![CI Status](https://img.shields.io/github/actions/workflow/status/jaehyeon-kim/open-dataml-stack/ci.yml?branch=main\u0026label=CI)\n![PyPI Version](https://img.shields.io/pypi/v/dml-cli)\n![Python Versions](https://img.shields.io/pypi/pyversions/dml-cli)\n![License](https://img.shields.io/github/license/jaehyeon-kim/open-dataml-stack)\n\n![Architecture Diagram](./image/diagram.png)\n\nA curated collection of open-source technologies and an accompanying CLI (`dml`) for experimenting with modern data architecture and MLOps locally.\n\nProvisioning a local data environment with distributed systems can be highly complex. The Open DataML Stack streamlines this process by resolving dependency conflicts, network routing configurations, and integration challenges across tools like Kafka, Spark, Flink, Iceberg, and Airflow. It provides a cohesive, Docker-based blueprint that operates seamlessly out of the box.\n\n## Bundled Technologies\n\nThe stack is organized into distinct profiles that can be launched independently or together:\n\n- **Event Streaming:** Kafka (KRaft), Schema Registry (Karapace), Kafka Connect\n- **Processing Engines:** Apache Spark, Apache Flink\n- **Storage \u0026 Catalog:** SeaweedFS (S3-compatible), Iceberg REST Catalog, ClickHouse, PostgreSQL (pgvector), Valkey, Apache Fluss\n- **Orchestration \u0026 MLOps:** Apache Airflow, MLflow, Feast Feature Store\n- **Federation \u0026 BI:** Trino, Metabase\n- **Governance \u0026 Observability:** OpenMetadata, Marquez (Lineage), Prometheus, Grafana\n\n## Prerequisites \u0026 Installation\n\n### Requirements\n\n- **Docker:** Docker Engine or Docker Desktop must be running. 
We recommend allocating 8GB to 16GB of RAM to Docker, as data processing engines are resource-heavy.\n- **Python:** Version 3.10 or higher.\n\n### Installation\n\nSince `dml` is a CLI tool, installing it in an isolated environment with `uv tool` or `pipx` is recommended.\n\n**Using uv (Recommended):**\n\n```bash\nuv tool install dml-cli\n```\n\n**Using pipx:**\n\n```bash\npipx install dml-cli\n```\n\n**Using pip:**\n\n```bash\npip install dml-cli\n```\n\n## Quick Start\n\nGet your local cluster up and running in three simple steps.\n\n**1. Initialize your workspace**\nThis command copies the default Docker Compose files and configurations into a hidden `.dml` folder in your current directory.\n\n```bash\ndml init\n```\n\n**2. Explore available profiles**\nSee a full list of technologies you can launch.\n\n```bash\ndml list\n```\n\n**3. Launch the streaming and batch processing engines**\nBring up a robust data engineering environment.\n\n```bash\ndml up kafka flink1 spark\n```\n\n_Note: You do not need to memorize dependencies. The CLI automatically detects that these profiles require foundational infrastructure and launches PostgreSQL, SeaweedFS (S3), and the Iceberg REST Catalog for you before starting the target compute engines._\n\n## CLI Command Reference\n\nThe `dml` CLI orchestrates the Open DataML Stack; its commands are grouped by functionality. 
You can append `--help` to any command for deeper parameter details.\n\n### Global Options\n\n- `--verbose`: Enable debug-level logging across all commands.\n- `-w, --workspace PATH`: Path to the DML workspace directory (default: `./.dml`).\n\n### Inspection \u0026 Info\n\n- `dml list`: List all available profiles and their capabilities.\n- `dml explain \u003cprofile\u003e`: Explain the details, services, images, and dependencies of a profile.\n- `dml ps`: List Docker containers managed by the Open DataML Stack.\n- `dml info`: View package and system-wide Docker daemon health status.\n\n### Workspace\n\n- `dml init`: Initialize a local `.dml` workspace for custom configurations.\n\n### Cluster Lifecycle\n\n- `dml pull`: Pre-fetch Docker images without starting the containers.\n- `dml up`: Launch DataML profiles (automatically resolves upstream dependencies).\n- `dml down`: Stop and remove profile containers and networks.\n\n### Data Operations\n\n- `dml iceberg`: Execute PyIceberg CLI commands natively within the stack.\n\n### Management\n\n- `dml logs`: Fetch the logs of containers managed by specific profiles.\n- `dml restart`: Restart one or more specific profiles.\n\n### Examples\n\n```bash\n# View all profiles and exposed ports\n$ dml list -d\n\n# See exactly what the kafka profile provisions\n$ dml explain kafka\n\n# Launch specific profiles and their dependencies\n$ dml up flink1 kafka spark\n\n# Complete teardown and wipe all data\n$ dml down --all --volumes\n```\n\n## Workspace Customization (.dml)\n\nThe Open DataML Stack is designed to be fully hackable. When you run `dml init`, a local `./.dml/` workspace is generated in your current working directory.\n\nThis folder contains all the underlying configurations that power the stack:\n\n- `compose-*.yml`: The actual Docker Compose definitions. 
You can edit these to change exposed ports, adjust memory limits, or inject new environment variables.\n- `registry.yml`: The internal dependency graph.\n- `.env`: The environment variables used across the stack (e.g., default credentials or timezones).\n\nThe CLI will always prioritize the files in your local `./.dml/` directory. If you make a mistake, you can always revert to the pristine default state by running `dml init --force`.\n\n## Local Development \u0026 Contributing\n\nIf you want to contribute to the CLI itself, we welcome pull requests!\n\n1. Clone the repository.\n2. Install [uv](https://docs.astral.sh/uv/) for dependency management.\n3. Sync the dependencies and install the project in development mode:\n   ```bash\n   uv sync\n   ```\n4. Install the pre-commit hooks to ensure formatting checks pass:\n   ```bash\n   uv run pre-commit install\n   ```\n5. Run the test suite:\n   ```bash\n   uv run pytest tests/\n   ```\n\n## License\n\nThis project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaehyeon-kim%2Fopen-dataml-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaehyeon-kim%2Fopen-dataml-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaehyeon-kim%2Fopen-dataml-stack/lists"}