{"id":42829592,"url":"https://github.com/erwan-simon/aws-data-platform-framework","last_synced_at":"2026-05-23T23:07:38.968Z","repository":{"id":334099810,"uuid":"1136804903","full_name":"erwan-simon/aws-data-platform-framework","owner":"erwan-simon","description":"A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.","archived":false,"fork":false,"pushed_at":"2026-02-07T11:06:54.000Z","size":361,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"prod","last_synced_at":"2026-02-07T20:27:21.644Z","etag":null,"topics":["aws","data","data-framework","datalake","docker","iceberg","python","spark","step-functions","terraform","terraform-module"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erwan-simon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-18T11:47:29.000Z","updated_at":"2026-02-07T11:06:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/erwan-simon/aws-data-platform-framework","commit_stats":null,"previous_names":["erwan-simon/aws-data-platform-framework"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/erwan-simon/aws-data-platform-framework","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erwan-simon","download_url":"https://codeload.github.com/erwan-simon/aws-data-platform-framework/tar.gz/refs/heads/prod","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32291347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T08:29:33.829Z","status":"ssl_error","status_checked_at":"2026-04-26T08:29:18.366Z","response_time":129,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data","data-framework","datalake","docker","iceberg","python","spark","step-functions","terraform","terraform-module"],"created_at":"2026-01-30T11:21:33.410Z","updated_at":"2026-05-23T23:07:38.956Z","avatar_url":"https://github.com/erwan-simon.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Data Platform Framework\n\n![License](https://img.shields.io/badge/license-CC--BY--NC--4.0-blue)\n![Python](https://img.shields.io/badge/python-3.11-blue)\n![Terraform AWS provider](https://img.shields.io/badge/terraform%20aws-%E2%89%A55.60-blueviolet)\n\nA unified framework to industrialize data ingestion, transformation, and pipeline execution on\nAWS using Terraform — from infrastructure provisioning to runtime execution. Reusable, standalone,\nand ready to be dropped into a new AWS account.\n\n```mermaid\nflowchart LR\n    DF[\"\u003cb\u003edomain_factory\u003c/b\u003e\u003cbr/\u003e\u003cbr/\u003eA production-ready\u003cbr/\u003edata domain on AWS,\u003cbr/\u003ein one Terraform call.\u003cbr/\u003e\u003cbr/\u003e\u003ci\u003estorage · permissions · alerting\u003c/i\u003e\"]\n    PF[\"\u003cb\u003epipeline_factory\u003c/b\u003e\u003cbr/\u003e\u003cbr/\u003eYour pipelines,\u003cbr/\u003edeclared as code.\u003cbr/\u003eDeployed as Step Functions.\u003cbr/\u003e\u003cbr/\u003e\u003ci\u003eDocker images · per-job IAM · scheduling\u003c/i\u003e\"]\n    SDK[\"\u003cb\u003edatalake_sdk\u003c/b\u003e\u003cbr/\u003e\u003cbr/\u003eWrite your tasks.\u003cbr/\u003eThe framework handles\u003cbr/\u003ethe lake integration.\u003cbr/\u003e\u003cbr/\u003e\u003ci\u003eNative Python · PySpark · SQL\u003c/i\u003e\"]\n\n    DF --\u003e PF --\u003e SDK\n```\n\n## What you get\n\n- **Domain provisioning in one Terraform call.** S3, Glue DB, Lake Formation, Athena workgroup,\n  IAM, ECR, CodeArtifact, EMR Studio, Bedrock inference profile, ECS/EMR sandbox images,\n  failsafe-shutdown Lambda. All resources tagged for FinOps.\n- **Pipelines as code.** Declare tasks in a Terraform map; you get a Step Functions state\n  machine over ECS Fargate or EMR Serverless tasks, with EventBridge triggers, IAM, logs, and\n  failure alerts.\n- **Two runtimes, one task contract.** Pandas + awswrangler on ECS Fargate for small/medium\n  jobs, PySpark on EMR Serverless for big ones. Switch by changing one Terraform field.\n- **Iceberg tables.** ACID, schema evolution, time travel, partition evolution. Compaction and\n  vacuum run automatically.\n- **Schema enforcement.** Declare column types and constraints (`ge`, `isin`, `str_matches`,\n  `unique`, …) per output table; the SDK builds a [Pandera](https://pandera.readthedocs.io/)\n  schema from the YAML and validates every DataFrame before writing. Same contract for Python\n  and PySpark.\n- **Multi-stage by Terraform workspaces.** `dev`, `uat`, `prod`, … isolated automatically —\n  resource names and database prefixes derived from the workspace.\n- **Local–prod parity.** Run any task locally in the same image used in production, with a\n  Jupyter notebook attached.\n- **Optional AI agent.** *Datalfred* — a Bedrock-backed agent for querying the lake and\n  triggering ingestions in natural language. Off via `enable_llm = false`.\n- **Claude Code integration.** Every scaffolded domain ships a `CLAUDE.md` plus skills to add\n  tasks (`/new-task`), scaffold pipelines (`/new-pipeline`), and upgrade the framework\n  (`/update-framework`) — Claude does the multi-file edits, the human reviews the diff.\n\n## How it works\n\nStep Functions invokes each task with a callback token. The task uses the SDK to ingest data\ninto Iceberg tables on S3, registered in the Glue Data Catalog and governed by Lake Formation.\nAthena provides SQL access on top.\n\n| Concept     | What it is                                                                    | Provisioned by                            |\n|-------------|-------------------------------------------------------------------------------|-------------------------------------------|\n| **Domain**  | S3, Glue DB, IAM, Lake Formation, Athena workgroup, sandbox images.           | [`domain_factory/`](domain_factory)       |\n| **Pipeline**| Step Functions workflow over a set of tasks, with triggers and alerts.        | [`pipeline_factory/`](pipeline_factory)   |\n| **Task**    | Python or SQL unit of work on ECS Fargate or EMR Serverless. Reads/writes Iceberg. | `tasks_configuration` map in the pipeline |\n| **Stage**   | Environment (`dev`, `prod`, …) derived from the Terraform workspace.          | Terraform workspace                       |\n| **Iceberg** | On-disk format for every managed table — ACID, schema evolution, time travel. | Automatic                                 |\n\nResource names follow `{project_name}_{domain_name}_{stage_name}_…`. Non-prod stages prefix\ndatabase names (`dev_my_db`); `prod` uses the unprefixed name.\n\n## Quickstart\n\nScaffold a domain from [`cookiecutter_template/`](cookiecutter_template) — a minimal 2-task\nstarter pipeline (`write_mock_data` → `transform`) you rewrite. For a feature-exhaustive\nexample, see [`integration_tests/`](integration_tests).\n\nPrerequisites:\n* an AWS account\n* [mise](https://mise.jdx.dev/) — installs the terraform/awscli/poetry versions pinned in the scaffold\n* a running Docker daemon (Docker Desktop / OrbStack / colima) — needed at `terraform apply` time to build task images\n* (optional) an existing S3 bucket for Terraform state — leave the cookiecutter prompt empty to use a local backend\n* a VPC tagged `Name = {project_name}_network_platform_prod` — see [`aws-network-stack`](https://github.com/erwan-simon/aws-network-stack) for a ready-made one (NAT gateway optional via `nat_gateways_count`).\n\n1. Install cookiecutter\n```bash\npip install cookiecutter\n```\n\n2. Scaffold a domain (interactive prompts; pre-fill via `key=value` arguments).\n```bash\ncookiecutter https://github.com/erwan-simon/aws-data-platform-framework \\\n  --directory cookiecutter_template \\\n  aws_account_id=$(aws sts get-caller-identity --query Account --output text) \\\n  aws_region=$(aws configure get region) \\\n  dataplatform_version=vX.Y.Z\n```\n\u003e Resolve the latest framework tag with:\n\u003e `git ls-remote --tags https://github.com/erwan-simon/aws-data-platform-framework | awk -F'/' '{print $NF}' | grep -v '\\^{}$' | sort -V | tail -1`\n\n3. Deploy\n```bash\ncd \u003cdomain_name\u003e\nmise install            # installs the terraform/awscli/poetry versions pinned in mise.toml\nmise run deploy dev     # terraform init + workspace select/new + apply --auto-approve\n```\n\nThe pipeline runs on schedule; trigger it manually via the Step Functions console\n(`{PROJECT_NAME}_{DOMAIN_NAME}_dev_{PIPELINE_NAME}`) or `mise run run-pipeline dev \u003cpipeline_name\u003e`.\n\nTo consume `domain_factory` / `pipeline_factory` as remote Terraform modules pinned to a\nrelease tag, see [`docs/deploying.md`](docs/deploying.md). To write tasks, see\n[`docs/pipelines.md`](docs/pipelines.md).\n\n## Documentation\n\n| If you want to…                          | Go to                                       |\n|------------------------------------------|---------------------------------------------|\n| Use the SDK (CLI or Python library)      | [`datalake_sdk/README.md`](datalake_sdk/README.md) |\n| Deploy and operate the platform          | [`docs/deploying.md`](docs/deploying.md)    |\n| Write a pipeline task                    | [`docs/pipelines.md`](docs/pipelines.md)    |\n\n## Repository layout\n\n```\n.\n├── datalake_sdk/         Python SDK and CLI used at runtime by tasks (and by humans)\n├── domain_factory/       Terraform module — per-domain foundation\n├── pipeline_factory/     Terraform module — pipelines from tasks_configuration\n├── cookiecutter_template/ Scaffold for a new domain (minimal 2-task starter pipeline)\n├── integration_tests/    In-tree, feature-exhaustive domain CI deploys end-to-end\n├── scripts/              CI helpers (scaffold generator, integration test driver)\n└── docs/                 In-depth guides (deployment, pipeline authoring)\n```\n\n## License \u0026 Contributing\n\nThis project is licensed under [Creative Commons Attribution-NonCommercial 4.0](LICENSE).\n\nThe source of truth for development is GitLab; this GitHub repository is a read-only mirror\nthat runs `semantic-release` on the `prod` branch. Commits must follow\n[Conventional Commits](https://www.conventionalcommits.org/) — versioning and SDK publication\nare derived from commit messages.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework/lists"}