{"id":42829592,"url":"https://github.com/erwan-simon/aws-data-platform-framework","last_synced_at":"2026-04-26T09:03:46.026Z","repository":{"id":334099810,"uuid":"1136804903","full_name":"erwan-simon/aws-data-platform-framework","owner":"erwan-simon","description":"A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.","archived":false,"fork":false,"pushed_at":"2026-02-07T11:06:54.000Z","size":361,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"prod","last_synced_at":"2026-02-07T20:27:21.644Z","etag":null,"topics":["aws","data","data-framework","datalake","docker","iceberg","python","spark","step-functions","terraform","terraform-module"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erwan-simon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-18T11:47:29.000Z","updated_at":"2026-02-07T11:06:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/erwan-simon/aws-data-platform-framework","commit_stats":null,"previous_names":["erwan-simon/aws-data-platform-framework"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/erwan-simon/aws-data-platform-framework","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erwan-simon","download_url":"https://codeload.github.com/erwan-simon/aws-data-platform-framework/tar.gz/refs/heads/prod","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erwan-simon%2Faws-data-platform-framework/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32291347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T08:29:33.829Z","status":"ssl_error","status_checked_at":"2026-04-26T08:29:18.366Z","response_time":129,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data","data-framework","datalake","docker","iceberg","python","spark","step-functions","terraform","terraform-module"],"created_at":"2026-01-30T11:21:33.410Z","updated_at":"2026-04-26T09:03:46.015Z","avatar_url":"https://github.com/erwan-simon.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Data Platform Framework\n\n* [I. Project Overview](#i-project-overview)\n* [II. Architecture / Design](#ii-architecture--design)\n* [III. Prerequisites](#iii-prerequisites)\n* [IV. Installation / Setup](#iv-installation--setup)\n  * [A. Install datalake\\_sdk from AWS CodeArtifact](#a-install-datalake_sdk-from-aws-codeartifact)\n  * [B. Install datalake\\_sdk from Source](#b-install-datalake_sdk-from-source)\n  * [C. Deploy Infrastructure](#c-deploy-infrastructure)\n* [V. Usage](#v-usage)\n  * [A. CLI - Ingest Data](#a-cli---ingest-data)\n  * [B. Programmatic - Ingest Data with Python](#b-programmatic---ingest-data-with-python)\n  * [C. Delete a Table](#c-delete-a-table)\n  * [D. Migrate Data Across Stages](#d-migrate-data-across-stages)\n  * [E. Query Data with Athena](#e-query-data-with-athena)\n  * [F. AI Agent - Datalfred](#f-ai-agent---datalfred)\n  * [G. Ingestion Modes](#g-ingestion-modes)\n  * [H. Local Task Execution](#h-local-task-execution)\n* [VI. Infrastructure](#vi-infrastructure)\n  * [A. Domain Factory](#a-domain-factory)\n  * [B. Pipeline Factory](#b-pipeline-factory)\n  * [C. Terraform Modules](#c-terraform-modules)\n  * [D. Deployment Workflow](#d-deployment-workflow)\n* [VII. Configuration](#vii-configuration)\n  * [A. Environment Variables](#a-environment-variables)\n  * [B. Task Configuration](#b-task-configuration)\n  * [C. Table Metadata](#c-table-metadata)\n  * [D. Triggers](#d-triggers)\n* [VIII. Project Structure](#viii-project-structure)\n  * [A. datalake\\_sdk](#a-datalake_sdk)\n  * [B. domain\\_factory](#b-domain_factory)\n  * [C. pipeline\\_factory](#c-pipeline_factory)\n  * [D. test](#d-test)\n* [IX. Limitations / Assumptions](#ix-limitations--assumptions)\n\n## I. Project Overview\n\nThis project is an **AWS-based data lake platform** designed to facilitate data ingestion, storage, transformation, and governance at scale. 
It provides:\n\n- A **Python SDK** (`datalake_sdk`) for interacting with the data lake, enabling data ingestion with multiple modes (overwrite, append, upsert)\n- **Terraform infrastructure-as-code modules** for provisioning AWS resources organized into domains and pipelines\n- Support for both **native Python (Pandas)** and **Spark (EMR Serverless)** processing environments\n- **Apache Iceberg** table format for advanced data lake capabilities (ACID transactions, schema evolution, time travel)\n- **AWS Lake Formation** integration for fine-grained access control and data governance\n- An **AI agent** (\"Datalfred\") for natural language interaction with the data lake\n- **Automated orchestration** using AWS Step Functions\n\nThe platform is intended for data engineers, data scientists, and developers who need to build scalable, governed data pipelines on AWS.\n\nFor detailed information about the `datalake_sdk` Python package, refer to the [datalake_sdk README](datalake_sdk/README.md).\n\n## II. Architecture / Design\n\n### High-Level Components\n\nThe architecture is organized around three main layers:\n\n1. **SDK Layer** (`datalake_sdk`):\n   - Python library providing abstractions for data ingestion and processing\n   - CLI tool for manual data operations\n   - Wrappers for Spark and native Python environments\n   - AI agent (Datalfred) for conversational data lake interaction\n\n2. **Infrastructure Layer** (Terraform modules):\n   - **Domain Factory**: Provisions core AWS infrastructure per domain (S3 buckets, Glue databases, Lake Formation, Athena workgroups, IAM roles)\n   - **Pipeline Factory**: Creates data pipelines with orchestrated tasks (ECS/EMR tasks, Step Functions, CloudWatch logs)\n   \n3. **Execution Layer**:\n   - **ECS Fargate tasks**: Lightweight Python data processing\n   - **EMR Serverless**: Spark-based distributed processing\n   - **Step Functions**: Orchestration and workflow management\n\n### Data Flow\n\n1. Data is ingested via the `datalake_sdk` CLI or programmatically through Python code\n2. Tasks run in containerized environments (ECS or EMR) defined by Terraform\n3. Data is written to **S3** in Iceberg format with metadata in **AWS Glue Data Catalog**\n4. **Lake Formation** manages permissions on databases and tables\n5. **Athena** provides SQL query access to the data\n6. **Step Functions** orchestrate multi-step pipelines with dependency management\n\n### Key Design Patterns\n\n- **Domain-Driven Design**: Resources are grouped by business domain\n- **Infrastructure as Code**: All AWS resources defined in Terraform\n- **Schema-on-Read**: Table schemas are inferred from data at ingestion time\n- **Separation of Concerns**: Data storage (S3), metadata (Glue), access control (Lake Formation), and orchestration (Step Functions) are decoupled\n- **Multi-Stage Support**: Terraform workspaces allow dev/uat/prod isolation\n\n### Organizational Conventions\n\nThis platform adheres to organizational technical conventions:\n\n- **CI/CD Platform**: GitLab CI is used for continuous integration and deployment (`.gitlab-ci.yml`). 
GitHub is a read-only mirror.\n- **AWS Naming Convention**: Resources follow the pattern `{project_name}_{domain_name}_{stage_name}_resource_name`\n- **Stage Name Derivation**: \n  - In GitLab CI: derived from Git branch name (`$CI_COMMIT_REF_SLUG`)\n  - Locally: derived from active Terraform workspace\n- **AWS Region**: Default region is `eu-west-1` (Ireland)\n- **Terraform Backend**: Backend configuration is provided at initialization time via runtime parameters:\n  ```bash\n  terraform init \\\n    -backend-config=\"bucket=$TERRAFORM_BACKEND_BUCKET\" \\\n    -backend-config=\"dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB\"\n  ```\n- **Cost Allocation Tags**: All resources are tagged with `project_name`, `domain_name`, and `stage_name` for FinOps tracking\n\n## III. Prerequisites\n\n### Required Tools\n\n- **AWS Account** with administrative access or appropriate IAM permissions\n- **Terraform**, with the AWS provider \u003e= 5.60.0 and \u003c 6.14.0\n- **Python** ~3.13\n- **Poetry** (for local SDK development and installation)\n- **Docker** (for building container images and local task execution)\n- **AWS CLI** configured with credentials\n- **Git** access to the GitLab repository\n\n### AWS Services Used\n\n- **Storage \u0026 Catalog**: S3, Glue Data Catalog\n- **Governance \u0026 Security**: Lake Formation, IAM\n- **Compute**: ECS (Fargate), EMR Serverless\n- **Orchestration**: Step Functions, EventBridge\n- **Querying**: Athena\n- **Monitoring**: CloudWatch\n- **Container Registry**: ECR\n- **AI/ML**: Bedrock (for Datalfred agent)\n- **Package Management**: CodeArtifact\n- **Notifications**: Secrets Manager (for Slack integration)\n\n### Infrastructure Prerequisites\n\n- **Terraform Backend**: S3 bucket and DynamoDB table for state storage (must be created beforehand)\n- **VPC**: A VPC tagged with `Name: {project_name}_network_platform_prod` containing public and/or private subnets\n- **NAT Gateway**: Required if using private subnets (`use_public_subnets=false`)\n\n## IV. Installation / Setup\n\n### A. Install datalake_sdk from AWS CodeArtifact\n\n1. **Configure AWS credentials** with CodeArtifact read access:\n\n```bash\nexport CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token \\\n  --domain $CODEARTIFACT_DOMAIN_NAME \\\n  --domain-owner $AWS_ACCOUNT_ID \\\n  --query authorizationToken \\\n  --output text)\n```\n\n2. **Configure pip** to use CodeArtifact:\n\n```bash\npip config set site.index-url https://aws:$CODEARTIFACT_AUTH_TOKEN@$CODEARTIFACT_DOMAIN_NAME-$AWS_ACCOUNT_ID.d.codeartifact.$AWS_REGION.amazonaws.com/pypi/$CODEARTIFACT_REPOSITORY_NAME/simple/\n\npip config set site.extra-index-url https://pypi.python.org/simple/\n```\n\n3. **Install the SDK**:\n\n```bash\npip install datalake-sdk\ndatalake_sdk --help\n```\n\n4. **(Optional) Install with AI agent support**:\n\n```bash\npip install datalake-sdk[agent]\n```\n\n### B. Install datalake_sdk from Source\n\n1. **Clone the repository**:\n\n```bash\ngit clone ${REPO_URL}\ncd datalake/datalake_sdk\n```\n\n2. **Install dependencies**:\n\n```bash\npoetry install\n```\n\n3. **Option 1 - Install globally**:\n\n```bash\npoetry build\npip install dist/*.whl\ndatalake_sdk --help\n```\n\n4. **Option 2 - Run via Poetry**:\n\n```bash\npoetry run datalake_sdk --help\n```\n\nFor complete SDK documentation, see [datalake_sdk/README.md](datalake_sdk/README.md).\n\n### C. Deploy Infrastructure\n\n#### 1. Initialize Terraform Backend\n\nEnsure you have an S3 bucket and DynamoDB table for Terraform state management.
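\n\nIf the state bucket and DynamoDB lock table do not exist yet, they can be created once per account. A minimal sketch using the AWS CLI (bucket and table names are placeholders; the `LockID` attribute follows the usual Terraform S3 backend locking convention):\n\n```bash\n# Placeholder names -- keep them in sync with $TERRAFORM_BACKEND_BUCKET / $TERRAFORM_BACKEND_DYNAMODB\naws s3api create-bucket --bucket my-terraform-state-bucket --region eu-west-1 \\\n  --create-bucket-configuration LocationConstraint=eu-west-1\naws s3api put-bucket-versioning --bucket my-terraform-state-bucket \\\n  --versioning-configuration Status=Enabled\naws dynamodb create-table --table-name my-terraform-locks \\\n  --attribute-definitions AttributeName=LockID,AttributeType=S \\\n  --key-schema AttributeName=LockID,KeyType=HASH \\\n  --billing-mode PAY_PER_REQUEST\n```\n\n#### 2. 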
Create a Domain\n\nCreate a `main.tf` file using the `domain_factory` module:\n\n```hcl\nmodule \"domain\" {\n  source                        = \"./domain_factory\"\n  project_name                  = \"my_project\"\n  domain_name                   = \"my_domain\"\n  stage_name                    = \"dev\"\n  git_repository                = \"${REPO_URL}\"\n  datalake_admin_principal_arns = [\"arn:aws:iam::123456789012:role/AdminRole\"]\n  failure_notification_receivers = [\"user@example.com\"]\n}\n```\n\n#### 3. Deploy the Domain\n\n```bash\nterraform init \\\n  -backend-config=\"bucket=$TERRAFORM_BACKEND_BUCKET\" \\\n  -backend-config=\"dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB\"\n\nterraform workspace new dev\nterraform apply\n```\n\n#### 4. Create Pipelines\n\nUse the `pipeline_factory` module to create data pipelines (see [Section VI.B](#b-pipeline-factory) for configuration details).\n\n## V. Usage\n\n### A. CLI - Ingest Data\n\nIngest a CSV file into the data lake:\n\n```bash\ndatalake_sdk \\\n  --project-name poc \\\n  --domain-name my_tests \\\n  --stage-name prd \\\n  ingest \\\n  --database-name my_database \\\n  --table-name my_table \\\n  --input-file-path ./file.csv \\\n  --ingestion-mode upsert \\\n  --upsert-keys \"column_1/column_2\" \\\n  --partition-keys \"column_3/column_4\" \\\n  --csv-delimiter \";\"\n```\n\n**Note**: CSV files must include headers.\n\n### B. Programmatic - Ingest Data with Python\n\n```python\nfrom datalake_sdk.native_python_processing_wrapper import NativePythonProcessingWrapper\n\nwrapper = NativePythonProcessingWrapper(\n    project_name=\"poc\",\n    domain_name=\"my_tests\",\n    stage_name=\"prd\",\n    output_tables={\n        \"my_database.my_table\": {\n            \"upsert_keys\": [\"column_1\", \"column_2\"],\n            \"partition_keys\": [\"column_3\"],\n            \"ingestion_mode\": \"upsert\"\n        }\n    }\n)\n\ndataframe = wrapper.read_input_dataset(\"./file.csv\", csv_delimiter=\";\")\nwrapper.ingest(\"my_database.my_table\", dataframe)\n```\n\nFor Spark environments, replace `NativePythonProcessingWrapper` with `SparkProcessingWrapper`.\n\n### C. Delete a Table\n\n```bash\ndatalake_sdk \\\n  --project-name poc \\\n  --domain-name my_tests \\\n  --stage-name prd \\\n  delete_table \\\n  --database-name my_database \\\n  --table-name my_table\n```\n\n### D. Migrate Data Across Stages\n\nCopy the data of one or all tables from a source stage to the current target stage (e.g. `prod` → `dev`):\n\n```bash\ndatalake_sdk \\\n  --project-name poc \\\n  --domain-name newsroom \\\n  --stage-name dev \\\n  migrate_data \\\n  --source-stage-name prod \\\n  --database-name newsroom \\\n  --source-table-name articles \\\n  --owner-job tests/test_native_write\n```\n\nBehavior:\n- Reads the source via Athena in chunks and re-ingests through the SDK in `upsert` mode.\n- If `--source-table-name` is omitted, every table of the source database is replicated to the target database with the same name.\n- If `--upsert-keys` is omitted, falls back to the `datalake_sdk_upsert_keys` Glue table property of the source table.\n- If `--owner-job pipeline_name/task_name` is provided, Lake Formation `ALL` permissions (with grant option) are granted on each target table to the IAM role `{project}_{domain}_{target_stage}_{pipeline}_{task}`. 
If omitted, the SDK falls back to the `datalake_sdk_pipeline_name` / `datalake_sdk_task_name` properties of the source table to derive the role; otherwise a warning is emitted (the migrating principal becomes the LF owner and the original pipeline may lose access).\n\n### E. Query Data with Athena\n\nUse the AWS Athena console or CLI to query Iceberg tables:\n\n```sql\nSELECT * FROM dev_my_database.my_table WHERE column_3 = 'value';\n```\n\n### F. AI Agent - Datalfred\n\nInteract with the data lake using natural language (requires `datalake-sdk[agent]`):\n\n```bash\ndatalake_sdk \\\n  --project-name poc \\\n  --domain-name my_tests \\\n  --stage-name prd \\\n  datalfred \\\n  --model-size large\n```\n\nDatalfred can:\n- Query data using natural language\n- Investigate pipeline failures\n- Analyze code and configurations\n\nFor more information, see [datalake_sdk/README.md - Datalfred Agent](datalake_sdk/README.md#c-datalfred-agent).\n\n### G. Ingestion Modes\n\n- **overwrite**: Replaces all existing table data\n- **append**: Adds new rows without modifying existing data (may create duplicates)\n- **upsert**: Updates existing rows or inserts new ones based on upsert keys\n\nFor detailed explanations and examples, see [datalake_sdk/README.md - Ingestion Modes](datalake_sdk/README.md#viii-ingestion-modes).\n\n### H. Local Task Execution\n\nThe platform allows you to execute task code in a local Dockerized environment that is **identical to the AWS task execution environment**. This is particularly useful for developing new tasks or debugging existing ones.\n\nYou can run either:\n- **ECS tasks** (native Python with Pandas)\n- **EMR Serverless tasks** (PySpark)\n\nThe Docker image can be:\n- A **sandbox image** (intermediate base image)\n- A **task-specific image** (containing the final Python/PySpark code)\n\n#### Prerequisites\n\n- Docker must be running locally\n- The Docker image must be available:\n  - If built locally, it's already available\n  - If from ECR, you must authenticate and pull the image\n\n#### 1. Authenticate to ECR\n\nAssuming AWS credentials are configured:\n\n```bash\naws ecr get-login-password --region ${ECR_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com\n```\n\n#### 2. Run an ECS Task (Native Python)\n\nThis launches a Jupyter Notebook environment for native Python tasks:\n\n```bash\ndocker run \\\n  -e AWS_PROFILE=${AWS_CREDENTIALS_PROFILE} \\\n  --mount type=bind,source=$HOME/.aws/,target=/root/.aws/ \\\n  -p 8888:8888 \\\n  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \\\n  jupyter notebook --ip=\"0.0.0.0\" --no-browser --allow-root\n```\n\nThe command will output the Jupyter Notebook URL. Copy and paste it into your browser.\n\n#### 3. 
Run an EMR Serverless Task (PySpark)\n\nThis launches a Jupyter Notebook with PySpark configured:\n\n```bash\nexport CREDENTIALS=$(aws configure export-credentials)\nmkdir -p logs  # To access generated Spark logs\n\ndocker run -d \\\n  -e AWS_ACCESS_KEY_ID=$(echo $CREDENTIALS | jq -r '.AccessKeyId') \\\n  -e AWS_SECRET_ACCESS_KEY=$(echo $CREDENTIALS | jq -r '.SecretAccessKey') \\\n  -e AWS_SESSION_TOKEN=$(echo $CREDENTIALS | jq -r '.SessionToken // \"\"') \\\n  -e AWS_REGION=${AWS_REGION} \\\n  -e AWS_DEFAULT_REGION=${AWS_REGION} \\\n  --mount type=bind,source=$(pwd)/logs,target=/var/log/spark/user/ \\\n  -p 8888:8888 \\\n  -e PYSPARK_DRIVER_PYTHON=jupyter \\\n  -e PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=\"0.0.0.0\" --no-browser' \\\n  ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \\\n  pyspark --master local \\\n  --conf spark.hadoop.fs.s3a.endpoint=s3.${AWS_REGION}.amazonaws.com \\\n  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\n\ncat logs/stderr\n```\n\nThe Jupyter Notebook URL will be printed in the `logs/stderr` file. Copy and paste it into your browser.\n\n#### Notes\n\n- **AWS Credentials**: The ECS example mounts `~/.aws/` to use your local AWS profile. The EMR example exports credentials as environment variables.\n- **Port Mapping**: Both examples expose port 8888 for Jupyter Notebook access.\n- **Spark Configuration**: The EMR example configures Spark to use S3 and AWS Glue Data Catalog.\n- **Logs Directory**: For EMR tasks, Spark logs are written to the local `logs/` directory for debugging.\n\n## VI. Infrastructure\n\n### A. Domain Factory\n\nThe `domain_factory` Terraform module provisions foundational infrastructure for a data domain.\n\n#### Key Resources\n\n- **S3 Buckets**:\n  - `{project_name}-{domain_name}-{stage_name}-data`: Stores Iceberg table data with versioning and intelligent tiering\n  - `{project_name}-{domain_name}-{stage_name}-technical`: Stores logs, temporary files, and Athena query results\n  \n- **Glue Database**: Domain-scoped catalog for tables (`{stage_prefix}{domain_name}`)\n\n- **Lake Formation**: \n  - Registers S3 data location\n  - Manages database and table permissions\n  - Supports cross-account data sharing\n\n- **Athena Workgroup**: Query execution environment (`{project_name}_{domain_name}_{stage_name}`)\n\n- **IAM Roles**: Task execution roles with least-privilege permissions\n\n- **Security Groups**: Network isolation for processing tasks\n\n- **CodeArtifact Repository**: Private Python package hosting for the SDK\n\n- **ECS/EMR Sandbox**: Pre-built base images for task execution\n\n- **Lambda (Failsafe Shutdown)**: Monitors and terminates long-running tasks\n\n- **Bedrock Inference Profile**: AI model access for Datalfred (model sizes: `small`, `medium`, `large`)\n\n- **EMR Studio**: Interactive development environment for Spark jobs\n\n#### Key Variables\n\n| Variable | Type | Description | Default |\n|----------|------|-------------|---------|\n| `project_name` | string | Project identifier | Required |\n| `domain_name` | string | Domain name | Required |\n| `stage_name` | string | Environment (dev, uat, prod, etc.) | Required |\n| `git_repository` | string | GitLab repository URL | Required |\n| `datalake_admin_principal_arns` | list(string) | IAM principals with full data access | `[]` |\n| `use_public_subnets` | bool | Use public vs. 
private subnets | `true` |\n| `database_description` | string | Description of the domain database | `\"\"` |\n| `skip_emr_serverless_sandbox_creation` | bool | Skip EMR sandbox image creation | `true` |\n| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | Required |\n\n#### Outputs\n\nThe module exports a domain object containing all necessary information for pipeline creation (see `domain_factory/outputs.tf`).\n\n### B. Pipeline Factory\n\nThe `pipeline_factory` Terraform module provisions data pipelines with orchestrated tasks.\n\n#### Key Resources\n\n- **Step Functions State Machine**: Workflow orchestration with task dependencies\n\n- **ECS or EMR Tasks**: Containerized data processing\n  - **ECS**: Fargate tasks for lightweight Python jobs\n  - **EMR**: Serverless Spark for large-scale processing\n\n- **Glue Database** (optional): Pipeline-scoped catalog (`{stage_prefix}{pipeline_name}`)\n\n- **CloudWatch Logs**: Task execution logs with 30-day retention\n\n- **EventBridge Scheduler**: Schedule-based or event-driven triggers\n\n- **IAM Roles**: Task-specific permissions (data access, Lake Formation, S3)\n\n- **ECR Repositories**: Docker image storage per task\n\n- **Failure Notifications**: CloudWatch Events trigger notifications on task failures\n\n#### Key Variables\n\n| Variable | Type | Description | Default |\n|----------|------|-------------|---------|\n| `pipeline_name` | string | Pipeline identifier | Required |\n| `tasks_configuration` | map(object) | Task definitions (see below) | Required |\n| `trigger` | object | Pipeline trigger configuration | `{\"type\": \"none\", \"argument\": \"none\"}` |\n| `orchestration_configuration_template_file_path` | string | Step Functions template path | Required |\n| `domain_object` | object | Output from domain_factory | Required |\n| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | `[]` |\n| `skip_pipeline_database_creation` | bool | Skip pipeline database creation | `false` |\n\n#### Task Configuration Structure\n\n```hcl\ntasks_configuration = {\n  \"task_name\" : {\n    \"type\" : \"python\" | \"sql\"\n    \"path\" : \"./relative/path/to/task/code\"\n    \"infra_type\" : \"ECS\" | \"EMRServerless\"\n    \"infra_config\" : {\n      \"cpu\" : \"512\"        # ECS only: CPU units\n      \"memory\" : \"1024\"    # ECS only: Memory in MB\n    }\n    \"input_tables\" : [\"db.table1\", \"db.table2\"]\n    \"output_tables\" : {\n      \"db.output_table\" : {\n        \"ingestion_mode\" : \"overwrite\" | \"append\" | \"upsert\"\n        \"upsert_keys\" : [\"id\"]\n        \"partition_keys\" : [\"date\"]\n      }\n    }\n    \"additional_parameters\" : {\n      \"param_key\" : \"static_value\"\n      \"dynamic_param.$\" : \"$.trigger_param\"  # Reference trigger input\n    }\n    \"additional_rebuild_trigger\" : {}  # Force image rebuild\n    \"additional_permissions\" : \"\u003cIAM policy JSON\u003e\"  # Extra IAM permissions\n  }\n}\n```\n\n#### Trigger Configuration\n\n**Schedule-based (cron)**:\n```hcl\ntrigger = {\n  \"type\" : \"schedule\"\n  \"argument\" : \"cron(15 1 * * ? *)\"\n  \"parameters\" : jsonencode({\n    \"key\" : \"value\"\n  })\n}\n```\n\n**Manual execution only**:\n```hcl\ntrigger = {\n  \"type\" : \"none\"\n  \"argument\" : \"none\"\n}\n```\n\n### C. Terraform Modules\n\nThe `pipeline_factory/modules` directory contains three submodules:\n\n#### 1. 
`ecs_factory`\n\nProvisions ECS Fargate tasks:\n- Task definition with environment variables\n- IAM roles for task execution and data access\n- ECR repository and Docker image build\n- CloudWatch log groups\n\n#### 2. `emr_factory`\n\nProvisions EMR Serverless applications:\n- EMR application with Spark runtime\n- IAM roles for job execution and data access\n- ECR repository and Docker image build (Spark-compatible)\n- S3 paths for Spark logs\n\n#### 3. `build_and_upload_image_to_ecr`\n\nAutomates Docker image management:\n- Copies task code and dependencies\n- Builds Docker image using sandbox base image\n- Pushes image to ECR\n- Supports rebuild triggers for code changes\n\n### D. Deployment Workflow\n\n1. **Domain Deployment**: Terraform provisions domain infrastructure (S3, Glue, Lake Formation, IAM, etc.)\n\n2. **Pipeline Deployment**: Terraform provisions pipeline infrastructure\n   - Creates Step Functions state machine\n   - Builds Docker images for each task\n   - Pushes images to ECR\n   - Creates ECS task definitions or EMR applications\n\n3. **Task Execution**:\n   - EventBridge scheduler or manual trigger starts Step Functions execution\n   - Step Functions orchestrates task execution based on orchestration template\n   - ECS/EMR tasks run with environment variables set by Terraform\n   - Tasks use `datalake_sdk` to read/write data\n\n4. **Data Ingestion**:\n   - Tasks transform data using Pandas or Spark\n   - SDK ingests data to S3 in Iceberg format\n   - Glue Catalog metadata is updated\n   - Lake Formation permissions are enforced\n\n5. **Monitoring \u0026 Notifications**:\n   - CloudWatch logs capture task execution\n   - Failsafe Lambda monitors task duration\n   - CloudWatch Events trigger email notifications on failures\n\n## VII. Configuration\n\n### A. Environment Variables\n\nThese variables are set automatically by the infrastructure; custom values defined in a task's `additional_parameters` are exposed as `TASK_ADDITIONAL_PARAMETERS_*`:\n\n| Variable | Description | Set By |\n|----------|-------------|--------|\n| `PROJECT_NAME` | Project identifier | Terraform |\n| `DOMAIN_NAME` | Domain name | Terraform |\n| `STAGE_NAME` | Environment name | Terraform |\n| `PIPELINE_NAME` | Pipeline name | Terraform |\n| `TASK_NAME` | Task name | Terraform |\n| `INPUT_TABLES` | JSON-encoded list of input tables | Terraform |\n| `OUTPUT_TABLES` | JSON-encoded dict of output table configs | Terraform |\n| `IS_SQL_JOB` | Whether task executes SQL (`true`/`false`) | Terraform |\n| `TASK_ADDITIONAL_PARAMETERS_*` | Custom parameters from Terraform | Terraform |\n| `step_function_task_token` | Step Functions callback token | Step Functions |\n| `step_function_execution_arn` | Step Functions execution ARN | Step Functions |\n\n### B. Task Configuration\n\nExample task configuration in Terraform:\n\n```hcl\ntasks_configuration = {\n  \"my_task\" : {\n    \"type\" : \"python\",\n    \"path\" : \"./my_task/\",\n    \"infra_type\" : \"ECS\",\n    \"infra_config\" : {\n      \"cpu\" : \"512\",\n      \"memory\" : \"1024\"\n    },\n    \"input_tables\" : [\"db.input_table\"],\n    \"output_tables\" : {\n      \"db.output_table\" : {\n        \"ingestion_mode\" : \"upsert\",\n        \"upsert_keys\" : [\"id\"],\n        \"partition_keys\" : [\"date\"]\n      }\n    },\n    \"additional_parameters\" : {\n      \"my_param.$\" : \"$.trigger_param\",  # Dynamic from trigger\n      \"static_param\" : \"value\"\n    },\n    \"additional_permissions\" : data.aws_iam_policy_document.my_policy.json\n  }\n}\n```
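\n\nAs a bridge between the Terraform configuration above and the variables in section VII.A, here is a minimal sketch of how a Python task entry point might read them. Constructing the wrapper explicitly from the environment is an assumption for illustration; check the SDK documentation for what it resolves automatically.\n\n```python\nimport json\nimport os\n\nfrom datalake_sdk.native_python_processing_wrapper import NativePythonProcessingWrapper\n\n# Variables injected by Terraform (see section VII.A)\noutput_tables = json.loads(os.environ[\"OUTPUT_TABLES\"])  # dict of output table configs\ninput_tables = json.loads(os.environ[\"INPUT_TABLES\"])    # list of input table names\n\n# Assumed construction pattern, mirroring the example in section V.B\nwrapper = NativePythonProcessingWrapper(\n    project_name=os.environ[\"PROJECT_NAME\"],\n    domain_name=os.environ[\"DOMAIN_NAME\"],\n    stage_name=os.environ[\"STAGE_NAME\"],\n    output_tables=output_tables,\n)\n\n# Hypothetical custom parameter declared in additional_parameters\nmy_param = os.environ.get(\"TASK_ADDITIONAL_PARAMETERS_MY_PARAM\")\n```\n\n### C. 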
Table Metadata\n\nPlace YAML files in `code/tables_configuration/` to document tables:\n\n```yaml\n# code/tables_configuration/my_database.my_table.yaml\ndescription: \"Customer dimension table\"\nschema:\n  customer_id:\n    description: \"Unique customer identifier\"\n  customer_name:\n    description: \"Full name of the customer\"\n```\n\nIn addition, the SDK automatically writes a few Glue **table properties** on every successful ingestion:\n\n- `datalake_sdk_upsert_keys` — comma-separated upsert keys used (only for `upsert` mode). Updated at every write; a warning is emitted if the keys differ from the previously stored value.\n- `datalake_sdk_pipeline_name` / `datalake_sdk_task_name` — the pipeline/task that produced the table (skipped for ad-hoc CLI ingestions).\n\nThese properties are consumed by `datalake_sdk migrate_data` to derive default upsert keys and the owner IAM role.\n\n### D. Triggers\n\n**Schedule**: Cron-based execution\n\n```hcl\ntrigger = {\n  \"type\" : \"schedule\"\n  \"argument\" : \"cron(15 1 * * ? *)\"\n  \"parameters\" : jsonencode({\"key\": \"value\"})\n}\n```\n\n**None**: Manual execution only\n\n```hcl\ntrigger = {\n  \"type\" : \"none\"\n  \"argument\" : \"none\"\n}\n```\n\n## VIII. Project Structure\n\n```\ndatalake/\n├── datalake_sdk/              # Python SDK and CLI\n├── domain_factory/            # Terraform module for domain infrastructure\n├── pipeline_factory/          # Terraform module for pipeline infrastructure\n│   └── modules/\n│       ├── ecs_factory/       # ECS task provisioning\n│       ├── emr_factory/       # EMR Serverless provisioning\n│       └── build_and_upload_image_to_ecr/  # Docker build and push\n├── test/                      # Integration tests and examples\n├── doc_resources/             # Documentation resources\n├── .gitlab-ci.yml             # GitLab CI pipeline configuration\n├── .github/workflows/         # GitHub Actions (semantic-release)\n├── LICENSE                    # Creative Commons Attribution-NonCommercial 4.0\n└── README.md                  # This file\n```\n\n### A. datalake_sdk\n\n**Purpose**: Provides a unified interface for data lake operations.\n\nThe `datalake_sdk` is a comprehensive Python package for interacting with the data lake. It includes:\n\n- **CLI**: Command-line interface for ingestion, table deletion, and AI agent interaction\n- **Processing Wrappers**: Abstract base class and implementations for Pandas and Spark\n- **Datalfred Agent**: AI-powered assistant for natural language data lake interaction\n\nFor complete documentation, see [datalake_sdk/README.md](datalake_sdk/README.md).\n\n**Key Files**:\n- `main.py`: CLI entry point with subcommands\n- `base_processing_wrapper.py`: Abstract base class\n- `native_python_processing_wrapper.py`: Pandas implementation\n- `spark_processing_wrapper.py`: Spark implementation\n- `ingestion.py`: CLI ingestion command\n- `delete_table.py`: CLI delete command\n- `migrate_data.py`: CLI command to copy data from one stage to another\n- `update_foreign_linked_databases.py`: CLI command to sync Glue resource links for cross-account databases\n- `datalfred_agent/`: AI agent modules\n\n**Dependencies** (from `pyproject.toml`):\n- Core: `boto3`, `click`, `awswrangler`, `pyyaml`, `tqdm`, `slack-sdk`\n- Optional: `strands-agents`, `strands-agents-tools`, `strands-agents-builder` (for Datalfred)\n\n**Version**: 5.7.11 (automatically detected by domain_factory)\n\n### B. 
domain_factory\n\n**Purpose**: Terraform module to provision AWS resources for a data domain.\n\n**Key Files**:\n- `s3_data.tf`, `s3_technical.tf`: S3 bucket definitions\n- `glue_database.tf`: Glue Data Catalog database\n- `lakeformation.tf`: Lake Formation registration and permissions\n- `athena_workgroup.tf`: Athena workgroup configuration\n- `ecs_cluster_sandbox.tf`: ECS base image and cluster\n- `emr_serverless_application_sandbox.tf`: EMR Serverless base image\n- `codeartifact_repository.tf`: Private package repository\n- `lambda_failsafe_shutdown.tf`: Task timeout enforcement\n- `bedrock_inference_profile.tf`: AI model access\n- `code_datalake_sdk.tf`: Packages and publishes SDK to CodeArtifact\n- `variables.tf`: Input variables\n- `outputs.tf`: Exported domain configuration\n- `locals.tf`: Local variables (environment naming, SDK version extraction)\n\n**Outputs**: Exports domain configuration consumed by pipeline_factory.\n\n### C. pipeline_factory\n\n**Purpose**: Terraform module to create data pipelines with orchestrated tasks.\n\n**Key Files**:\n- `step_function.tf`: AWS Step Functions state machine\n- `ecs_tasks.tf`: ECS task module invocations\n- `emr_tasks.tf`: EMR Serverless application module invocations\n- `event_bridge_scheduler.tf`: Pipeline trigger configuration\n- `cloudwatch_event_task_failed.tf`: Failure notification setup\n- `cloudwatch_event_failsafe_shutdown.tf`: Failsafe Lambda trigger\n- `glue_database.tf`: Pipeline-scoped database (optional)\n- `variables.tf`: Input variables\n- `outputs.tf`: Pipeline outputs\n- `locals.tf`: Local variables (environment naming)\n\n**Modules**:\n- `ecs_factory/`: Provisions ECS Fargate tasks\n- `emr_factory/`: Provisions EMR Serverless applications\n- `build_and_upload_image_to_ecr/`: Builds and uploads Docker images
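\n\nFor reference, a minimal invocation wiring the domain output into a pipeline might look like the sketch below. Names and paths are placeholders, the full `tasks_configuration` structure is documented in [Section VI.B](#b-pipeline-factory), and the exact output attribute of the domain module should be taken from `domain_factory/outputs.tf`:\n\n```hcl\nmodule \"my_pipeline\" {\n  source        = \"./pipeline_factory\"\n  pipeline_name = \"my_pipeline\"\n  domain_object = module.domain.domain_object  # assumed output name, see domain_factory/outputs.tf\n\n  orchestration_configuration_template_file_path = \"./orchestration_configuration.tftpl.json\"\n  failure_notification_receivers                 = [\"user@example.com\"]\n\n  trigger = {\n    \"type\" : \"none\"\n    \"argument\" : \"none\"\n  }\n\n  tasks_configuration = {\n    \"my_task\" : {\n      \"type\" : \"python\",\n      \"path\" : \"./my_task/\",\n      \"infra_type\" : \"ECS\",\n      \"infra_config\" : { \"cpu\" : \"512\", \"memory\" : \"1024\" },\n      \"input_tables\" : [],\n      \"output_tables\" : {\n        \"my_database.my_table\" : {\n          \"ingestion_mode\" : \"overwrite\",\n          \"upsert_keys\" : [],\n          \"partition_keys\" : []\n        }\n      },\n      \"additional_parameters\" : {}\n    }\n  }\n}\n```\n\n### D. 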
test\n\n**Purpose**: Integration tests and example pipeline implementation.\n\n**Key Files**:\n- `domain.tf`: Test domain deployment\n- `pipeline.tf`: Test pipeline with multiple task types\n- `variables.tf`: Test-specific variable definitions\n- `integration_tests_pipeline/`: Test tasks\n  - `test_write/`: Python task for data generation\n  - `test_native_sql_entrypoint/`: Native SQL task\n  - `test_spark_sql_entrypoint/`: Spark SQL task\n  - `check_and_clean/`: Validation and cleanup task\n  - `orchestration_configuration.tftpl.json`: Step Functions orchestration\n- `utils/`: Test utilities\n  - `run_integration_tests.py`: Test execution script\n  - `pipeline_utils/`: Test library for dependency validation\n\n**Variable Handling**:\n\nThe test configuration uses a different variable format for convenience:\n\n| Variable | Type in domain_factory | Type in test | Transformation |\n|----------|----------------------|--------------|----------------|\n| `datalake_admin_principal_arns` | `list(string)` | `string` (comma-separated role names) | Split by comma, lookup ARNs via `data.aws_iam_role`, pass as list |\n| `failure_notification_receivers` | `list(string)` | `string` (comma-separated emails) | Split by comma in module call |\n\nExample test variable usage:\n```hcl\n# test/domain.tf\ndata \"aws_iam_role\" \"datalake_admins\" {\n  for_each = toset(split(\",\", var.datalake_admin_principal_arns))\n  name = each.value\n}\n\nmodule \"domain\" {\n  # ...\n  datalake_admin_principal_arns = values(data.aws_iam_role.datalake_admins)[*].arn\n  failure_notification_receivers = split(\",\", var.failure_notification_receivers)\n}\n```\n\n**CI/CD**: Integration tests run automatically in GitLab CI (`run_integration_tests` stage).\n\n## IX. Limitations / Assumptions\n\n1. **AWS-Only**: This platform is tightly coupled to AWS services and cannot be deployed on other cloud providers without significant refactoring.\n\n2. **Python 3.13**: The SDK and processing tasks require Python ~3.13. Older Python versions are not supported.\n\n3. **Iceberg Format**: All tables are stored in Apache Iceberg format. Direct Parquet or other formats are not supported for managed tables.\n\n4. **Region**: Infrastructure is deployed in a single AWS region (default: `eu-west-1`). Cross-region replication is not implemented.\n\n5. **Terraform State Backend**: Assumes an existing S3 bucket and DynamoDB table for Terraform state management. These must be created manually before deployment.\n\n6. **Naming Conventions**: Resource names follow the pattern `{project_name}_{domain_name}_{stage_name}`. Non-prod stages prefix database names (e.g., `dev_my_database`). Production (`stage_name = \"prod\"`) databases have no prefix.\n\n7. **Lake Formation Permissions**: The platform assumes AWS Lake Formation is the primary access control mechanism. IAM-only setups are not fully supported.\n\n8. **CSV Ingestion**: CSV files must include headers for schema inference.\n\n9. **Upsert Key Uniqueness**: Upsert keys must guarantee row uniqueness in the ingested dataset. Violations will cause ingestion failure.\n\n10. **Concurrency**: Iceberg commit conflicts (e.g., simultaneous writes) are mitigated with retries (up to 30 retries with 2-10 minute waits), but high-concurrency scenarios may require tuning.\n\n11. **Failsafe Shutdown**: The failsafe Lambda function monitors task durations but does not enforce hard limits on EMR Serverless jobs.\n\n12. 
**Datalfred Agent**: The AI agent requires AWS Bedrock inference profiles to be pre-configured in the domain. Model sizes are fixed (`small`, `medium`, `large`).\n\n13. **GitLab Primary**: GitLab is the source of truth for CI/CD. GitHub is a read-only mirror. GitHub Actions are only used for semantic-release on the `prod` branch.\n\n14. **Subnet Configuration**: Tasks run in public subnets by default (`use_public_subnets=true`). Private subnets require a NAT Gateway for internet access (not provisioned by this platform).\n\n15. **Integration Tests**: The `test/` folder contains integration tests that create and delete tables. These tests assume administrative permissions and should not be run in production environments.\n\n16. **ECS Task Limits**: ECS tasks are constrained by Fargate CPU/memory limits (max 4 vCPU, 30 GB RAM). Larger workloads require EMR Serverless.\n\n17. **SQL Tasks**: SQL entry point tasks (`type: \"sql\"`) are limited to single output tables and use a `main.sql` file. Multi-table SQL tasks are not supported.\n\n18. **Workspace Isolation**: Terraform workspaces are used for environment isolation. The stage name is derived from:\n    - **GitLab CI**: Git branch name (`$CI_COMMIT_REF_SLUG`)\n    - **Local execution**: Active Terraform workspace (use `terraform workspace select \u003cstage\u003e`)\n\n19. **Athena Costs**: Query costs are not monitored or capped by the platform. Users should implement AWS Budgets or Cost Anomaly Detection separately.\n\n20. **VPC Dependency**: The domain factory expects a VPC tagged with `Name: {project_name}_network_platform_prod` containing appropriately tagged subnets (`Tier: Public` or `Tier: Private`).\n\n21. **EMR Sandbox Creation**: By default, `skip_emr_serverless_sandbox_creation=true` to reduce deployment time. Set to `false` if large-scale Spark processing is required.\n\n22. **CodeArtifact Publishing**: The domain factory automatically builds and publishes the `datalake_sdk` to CodeArtifact during deployment. The version is extracted from `datalake_sdk/pyproject.toml`.\n\n23. **Semantic Versioning**: Releases are managed via semantic-release on GitHub (`.releaserc.json`). Conventional commit messages are required for automated versioning.\n\n24. **Local AWS Credentials**: Terraform executed locally uses the default AWS credentials configured on the machine. Verify the active AWS account before applying changes.\n\n25. **Local Task Execution**: Docker must be running and the task image must be available locally (either built locally or pulled from ECR after authentication). AWS credentials are required for accessing S3 and Glue.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferwan-simon%2Faws-data-platform-framework/lists"}