https://github.com/erwan-simon/aws-data-platform-framework

A unified framework to industrialize data ingestion, transformation and pipeline execution on AWS using Terraform, from infrastructure provisioning to runtime execution, designed as a reusable and standalone data platform.

# AWS Data Platform Framework

* [I. Project Overview](#i-project-overview)
* [II. Architecture / Design](#ii-architecture--design)
* [III. Prerequisites](#iii-prerequisites)
* [IV. Installation / Setup](#iv-installation--setup)
  * [A. Install datalake\_sdk from AWS CodeArtifact](#a-install-datalake_sdk-from-aws-codeartifact)
  * [B. Install datalake\_sdk from Source](#b-install-datalake_sdk-from-source)
  * [C. Deploy Infrastructure](#c-deploy-infrastructure)
* [V. Usage](#v-usage)
  * [A. CLI - Ingest Data](#a-cli---ingest-data)
  * [B. Programmatic - Ingest Data with Python](#b-programmatic---ingest-data-with-python)
  * [C. Delete a Table](#c-delete-a-table)
  * [D. Migrate Data Across Stages](#d-migrate-data-across-stages)
  * [E. Query Data with Athena](#e-query-data-with-athena)
  * [F. AI Agent - Datalfred](#f-ai-agent---datalfred)
  * [G. Ingestion Modes](#g-ingestion-modes)
  * [H. Local Task Execution](#h-local-task-execution)
* [VI. Infrastructure](#vi-infrastructure)
  * [A. Domain Factory](#a-domain-factory)
  * [B. Pipeline Factory](#b-pipeline-factory)
  * [C. Terraform Modules](#c-terraform-modules)
  * [D. Deployment Workflow](#d-deployment-workflow)
* [VII. Configuration](#vii-configuration)
  * [A. Environment Variables](#a-environment-variables)
  * [B. Task Configuration](#b-task-configuration)
  * [C. Table Metadata](#c-table-metadata)
  * [D. Triggers](#d-triggers)
* [VIII. Project Structure](#viii-project-structure)
  * [A. datalake\_sdk](#a-datalake_sdk)
  * [B. domain\_factory](#b-domain_factory)
  * [C. pipeline\_factory](#c-pipeline_factory)
  * [D. test](#d-test)
* [IX. Limitations / Assumptions](#ix-limitations--assumptions)

## I. Project Overview

This project is an **AWS-based data lake platform** designed to facilitate data ingestion, storage, transformation, and governance at scale. It provides:

- A **Python SDK** (`datalake_sdk`) for interacting with the data lake, enabling data ingestion with multiple modes (overwrite, append, upsert)
- **Terraform infrastructure-as-code modules** for provisioning AWS resources organized into domains and pipelines
- Support for both **native Python (Pandas)** and **Spark (EMR Serverless)** processing environments
- **Apache Iceberg** table format for advanced data lake capabilities (ACID transactions, schema evolution, time travel)
- **AWS Lake Formation** integration for fine-grained access control and data governance
- An **AI agent** ("Datalfred") for natural language interaction with the data lake
- **Automated orchestration** using AWS Step Functions

The platform is intended for data engineers, data scientists, and developers who need to build scalable, governed data pipelines on AWS.

For detailed information about the `datalake_sdk` Python package, refer to the [datalake_sdk README](datalake_sdk/README.md).

## II. Architecture / Design

### High-Level Components

The architecture is organized around three main layers:

1. **SDK Layer** (`datalake_sdk`):
   - Python library providing abstractions for data ingestion and processing
   - CLI tool for manual data operations
   - Wrappers for Spark and native Python environments
   - AI agent (Datalfred) for conversational data lake interaction

2. **Infrastructure Layer** (Terraform modules):
   - **Domain Factory**: Provisions core AWS infrastructure per domain (S3 buckets, Glue databases, Lake Formation, Athena workgroups, IAM roles)
   - **Pipeline Factory**: Creates data pipelines with orchestrated tasks (ECS/EMR tasks, Step Functions, CloudWatch logs)

3. **Execution Layer**:
   - **ECS Fargate tasks**: Lightweight Python data processing
   - **EMR Serverless**: Spark-based distributed processing
   - **Step Functions**: Orchestration and workflow management

### Data Flow

1. Data is ingested via the `datalake_sdk` CLI or programmatically through Python code
2. Tasks run in containerized environments (ECS or EMR) defined by Terraform
3. Data is written to **S3** in Iceberg format with metadata in **AWS Glue Data Catalog**
4. **Lake Formation** manages permissions on databases and tables
5. **Athena** provides SQL query access to the data
6. **Step Functions** orchestrate multi-step pipelines with dependency management

### Key Design Patterns

- **Domain-Driven Design**: Resources are grouped by business domain
- **Infrastructure as Code**: All AWS resources defined in Terraform
- **Schema-on-Read**: Table schemas are inferred from data at ingestion time
- **Separation of Concerns**: Data storage (S3), metadata (Glue), access control (Lake Formation), and orchestration (Step Functions) are decoupled
- **Multi-Stage Support**: Terraform workspaces allow dev/uat/prod isolation

### Organizational Conventions

This platform adheres to organizational technical conventions:

- **CI/CD Platform**: GitLab CI is used for continuous integration and deployment (`.gitlab-ci.yml`). GitHub is a read-only mirror.
- **AWS Naming Convention**: Resources follow the pattern `{project_name}_{domain_name}_{stage_name}_resource_name`
- **Stage Name Derivation**:
  - In GitLab CI: derived from the Git branch name (`$CI_COMMIT_REF_SLUG`)
  - Locally: derived from the active Terraform workspace
- **AWS Region**: Default region is `eu-west-1` (Ireland)
- **Terraform Backend**: Backend configuration is provided at initialization time via runtime parameters:
```bash
terraform init \
-backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
-backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"
```
- **Cost Allocation Tags**: All resources are tagged with `project_name`, `domain_name`, and `stage_name` for FinOps tracking

## III. Prerequisites

### Required Tools

- **AWS Account** with administrative access or appropriate IAM permissions
- **Terraform** with the AWS provider >= 5.60.0, < 6.14.0
- **Python** ~3.13
- **Poetry** (for local SDK development and installation)
- **Docker** (for building container images and local task execution)
- **AWS CLI** configured with credentials
- **Git** access to the GitLab repository

### AWS Services Used

- **Storage & Catalog**: S3, Glue Data Catalog
- **Governance & Security**: Lake Formation, IAM
- **Compute**: ECS (Fargate), EMR Serverless
- **Orchestration**: Step Functions, EventBridge
- **Querying**: Athena
- **Monitoring**: CloudWatch
- **Container Registry**: ECR
- **AI/ML**: Bedrock (for Datalfred agent)
- **Package Management**: CodeArtifact
- **Notifications**: Secrets Manager (for Slack integration)

### Infrastructure Prerequisites

- **Terraform Backend**: S3 bucket and DynamoDB table for state storage (must be created beforehand)
- **VPC**: A VPC tagged with `Name: {project_name}_network_platform_prod` containing public and/or private subnets
- **NAT Gateway**: Required if using private subnets (`use_public_subnets=false`)
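
Before deploying, it can be worth verifying that the expected VPC is present. A minimal check with the AWS CLI (the project name below is illustrative):

```bash
# List VPC IDs tagged with the expected name (replace my_project with your project_name).
aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=my_project_network_platform_prod" \
  --query "Vpcs[].VpcId" \
  --output text
```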

## IV. Installation / Setup

### A. Install datalake_sdk from AWS CodeArtifact

1. **Configure AWS credentials** with CodeArtifact read access:

```bash
export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token \
--domain $CODEARTIFACT_DOMAIN_NAME \
--domain-owner $AWS_ACCOUNT_ID \
--query authorizationToken \
--output text)
```

2. **Configure pip** to use CodeArtifact:

```bash
pip config set global.index-url https://aws:$CODEARTIFACT_AUTH_TOKEN@$CODEARTIFACT_DOMAIN_NAME-$AWS_ACCOUNT_ID.d.codeartifact.$AWS_REGION.amazonaws.com/pypi/$CODEARTIFACT_REPOSITORY_NAME/simple/

pip config set global.extra-index-url https://pypi.python.org/simple/
```

3. **Install the SDK**:

```bash
pip install datalake-sdk
datalake_sdk --help
```

4. **(Optional) Install with AI agent support**:

```bash
pip install "datalake-sdk[agent]"
```

### B. Install datalake_sdk from Source

1. **Clone the repository**:

```bash
git clone ${REPO_URL}
cd datalake/datalake_sdk
```

2. **Install dependencies**:

```bash
poetry install
```

3. **Option 1 - Install globally**:

```bash
poetry build
pip install dist/*.whl
datalake_sdk --help
```

4. **Option 2 - Run via Poetry**:

```bash
poetry run datalake_sdk --help
```

For complete SDK documentation, see [datalake_sdk/README.md](datalake_sdk/README.md).

### C. Deploy Infrastructure

#### 1. Initialize Terraform Backend

Ensure you have an S3 bucket and DynamoDB table for Terraform state management.

#### 2. Create a Domain

Create a `main.tf` file using the `domain_factory` module:

```hcl
module "domain" {
source = "./domain_factory"
project_name = "my_project"
domain_name = "my_domain"
stage_name = "dev"
git_repository = "${REPO_URL}"
datalake_admin_principal_arns = ["arn:aws:iam::123456789012:role/AdminRole"]
failure_notification_receivers = ["user@example.com"]
}
```

#### 3. Deploy the Domain

```bash
terraform init \
-backend-config="bucket=$TERRAFORM_BACKEND_BUCKET" \
-backend-config="dynamodb_table=$TERRAFORM_BACKEND_DYNAMODB"

terraform workspace new dev
terraform apply
```

#### 4. Create Pipelines

Use the `pipeline_factory` module to create data pipelines (see [Section VI.B](#b-pipeline-factory) for configuration details).

## V. Usage

### A. CLI - Ingest Data

Ingest a CSV file into the data lake:

```bash
datalake_sdk \
--project-name poc \
--domain-name my_tests \
--stage-name prd \
ingest \
--database-name my_database \
--table-name my_table \
--input-file-path ./file.csv \
--ingestion-mode upsert \
--upsert-keys "column_1/column_2" \
--partition-keys "column_3/column_4" \
--csv-delimiter ";"
```

**Note**: CSV files must include headers.
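
For reference, a minimal `file.csv` matching the command above (header row required, `;` as delimiter; the values are purely illustrative):

```
column_1;column_2;column_3;column_4
1;a;2024-01-01;FR
2;b;2024-01-02;DE
```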

### B. Programmatic - Ingest Data with Python

```python
from datalake_sdk.native_python_processing_wrapper import NativePythonProcessingWrapper

wrapper = NativePythonProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", dataframe)
```

For Spark environments, replace `NativePythonProcessingWrapper` with `SparkProcessingWrapper`.
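
As an illustration, here is a minimal Spark variant of the snippet above. This is a sketch that assumes `SparkProcessingWrapper` mirrors the native wrapper's constructor arguments and methods, and that its import path follows the file layout listed in [Section VIII.A](#a-datalake_sdk):

```python
# Sketch only: interface assumed to mirror NativePythonProcessingWrapper.
from datalake_sdk.spark_processing_wrapper import SparkProcessingWrapper

wrapper = SparkProcessingWrapper(
    project_name="poc",
    domain_name="my_tests",
    stage_name="prd",
    output_tables={
        "my_database.my_table": {
            "upsert_keys": ["column_1", "column_2"],
            "partition_keys": ["column_3"],
            "ingestion_mode": "upsert"
        }
    }
)

spark_dataframe = wrapper.read_input_dataset("./file.csv", csv_delimiter=";")
wrapper.ingest("my_database.my_table", spark_dataframe)
```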

### C. Delete a Table

```bash
datalake_sdk \
--project-name poc \
--domain-name my_tests \
--stage-name prd \
delete_table \
--database-name my_database \
--table-name my_table
```

### D. Migrate Data Across Stages

Copy the data of one or all tables from a source stage to the current target stage (e.g. `prod` → `dev`):

```bash
datalake_sdk \
--project-name poc \
--domain-name newsroom \
--stage-name dev \
migrate_data \
--source-stage-name prod \
--database-name newsroom \
--source-table-name articles \
--owner-job tests/test_native_write
```

Behavior:
- Reads the source via Athena in chunks and re-ingests through the SDK in `upsert` mode.
- If `--source-table-name` is omitted, every table of the source database is replicated to the target database with the same name (see the example below).
- If `--upsert-keys` is omitted, falls back to the `datalake_sdk_upsert_keys` Glue table property of the source table.
- If `--owner-job pipeline_name/task_name` is provided, Lake Formation `ALL` permissions (with grant option) are granted on each target table to the IAM role `{project}_{domain}_{target_stage}_{pipeline}_{task}`. If omitted, the SDK falls back to the `datalake_sdk_pipeline_name` / `datalake_sdk_task_name` properties of the source table to derive the role; otherwise a warning is emitted (the migrating principal becomes the LF owner and the original pipeline may lose access).
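
For example, replicating every table of the `newsroom` database from `prod` into the current `dev` stage, relying on the stored `datalake_sdk_*` table properties for upsert keys and owner role:

```bash
datalake_sdk \
  --project-name poc \
  --domain-name newsroom \
  --stage-name dev \
  migrate_data \
  --source-stage-name prod \
  --database-name newsroom
```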

### E. Query Data with Athena

Use the AWS Athena console or CLI to query Iceberg tables:

```sql
SELECT * FROM dev_my_database.my_table WHERE column_3 = 'value';
```

### F. AI Agent - Datalfred

Interact with the data lake using natural language (requires `datalake-sdk[agent]`):

```bash
datalake_sdk \
--project-name poc \
--domain-name my_tests \
--stage-name prd \
datalfred \
--model-size large
```

Datalfred can:
- Query data using natural language
- Investigate pipeline failures
- Analyze code and configurations

For more information, see [datalake_sdk/README.md - Datalfred Agent](datalake_sdk/README.md#c-datalfred-agent).

### G. Ingestion Modes

- **overwrite**: Replaces all existing table data
- **append**: Adds new rows without modifying existing data (may create duplicates)
- **upsert**: Updates existing rows or inserts new ones based on upsert keys

For detailed explanations and examples, see [datalake_sdk/README.md - Ingestion Modes](datalake_sdk/README.md#viii-ingestion-modes).
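
To make the difference concrete, here is a commented sketch reusing the wrapper from [Section V.B](#b-programmatic---ingest-data-with-python); the row counts are illustrative, not actual output:

```python
# Existing table my_database.my_table: 100 rows, keyed by (column_1, column_2).
# new_rows: 10 rows, 4 of which share keys with existing rows.
# The mode applied is the ingestion_mode configured for the table in output_tables.

wrapper.ingest("my_database.my_table", new_rows)

# overwrite -> 10 rows  (previous contents replaced)
# append    -> 110 rows (the 4 shared keys now appear twice)
# upsert    -> 106 rows (4 rows updated in place, 6 inserted)
```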

### H. Local Task Execution

The platform allows you to execute task code in a local Dockerized environment that is **identical to the AWS task execution environment**. This is particularly useful for developing new tasks or debugging existing ones.

You can run either:
- **ECS tasks** (native Python with Pandas)
- **EMR Serverless tasks** (PySpark)

The Docker image can be:
- A **sandbox image** (intermediate base image)
- A **task-specific image** (containing the final Python/PySpark code)

#### Prerequisites

- Docker must be running locally
- The Docker image must be available:
- If built locally, it's already available
- If from ECR, you must authenticate and pull the image

#### 1. Authenticate to ECR

Assuming AWS credentials are configured:

```bash
aws ecr get-login-password --region ${ECR_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com
```

#### 2. Run an ECS Task (Native Python)

This launches a Jupyter Notebook environment for native Python tasks:

```bash
docker run \
-e AWS_PROFILE=${AWS_CREDENTIALS_PROFILE} \
--mount type=bind,source=$HOME/.aws/,target=/root/.aws/ \
-p 8888:8888 \
${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
jupyter notebook --ip="0.0.0.0" --no-browser --allow-root
```

The command will output the Jupyter Notebook URL. Copy and paste it into your browser.

#### 3. Run an EMR Serverless Task (PySpark)

This launches a Jupyter Notebook with PySpark configured:

```bash
export CREDENTIALS=$(aws configure export-credentials)
mkdir -p logs # To access generated Spark logs

docker run -d \
-e AWS_ACCESS_KEY_ID=$(echo $CREDENTIALS | jq -r '.AccessKeyId') \
-e AWS_SECRET_ACCESS_KEY=$(echo $CREDENTIALS | jq -r '.SecretAccessKey') \
-e AWS_SESSION_TOKEN=$(echo $CREDENTIALS | jq -r '.SessionToken // ""') \
-e AWS_REGION=${AWS_REGION} \
-e AWS_DEFAULT_REGION=${AWS_REGION} \
--mount type=bind,source=$(pwd)/logs,target=/var/log/spark/user/ \
-p 8888:8888 \
-e PYSPARK_DRIVER_PYTHON=jupyter \
-e PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="0.0.0.0" --no-browser' \
${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_NAME}:${DOCKER_IMAGE_TAG} \
pyspark --master local \
--conf spark.hadoop.fs.s3a.endpoint=s3.${AWS_REGION}.amazonaws.com \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

cat logs/stderr
```

The Jupyter Notebook URL will be printed in the `logs/stderr` file. Copy and paste it into your browser.

#### Notes

- **AWS Credentials**: The ECS example mounts `~/.aws/` to use your local AWS profile. The EMR example exports credentials as environment variables.
- **Port Mapping**: Both examples expose port 8888 for Jupyter Notebook access.
- **Spark Configuration**: The EMR example configures Spark to use S3 and AWS Glue Data Catalog.
- **Logs Directory**: For EMR tasks, Spark logs are written to the local `logs/` directory for debugging.

## VI. Infrastructure

### A. Domain Factory

The `domain_factory` Terraform module provisions foundational infrastructure for a data domain.

#### Key Resources

- **S3 Buckets**:
  - `{project_name}-{domain_name}-{stage_name}-data`: Stores Iceberg table data with versioning and intelligent tiering
  - `{project_name}-{domain_name}-{stage_name}-technical`: Stores logs, temporary files, and Athena query results

- **Glue Database**: Domain-scoped catalog for tables (`{stage_prefix}{domain_name}`)

- **Lake Formation**:
  - Registers the S3 data location
  - Manages database and table permissions
  - Supports cross-account data sharing

- **Athena Workgroup**: Query execution environment (`{project_name}_{domain_name}_{stage_name}`)

- **IAM Roles**: Task execution roles with least-privilege permissions

- **Security Groups**: Network isolation for processing tasks

- **CodeArtifact Repository**: Private Python package hosting for the SDK

- **ECS/EMR Sandbox**: Pre-built base images for task execution

- **Lambda (Failsafe Shutdown)**: Monitors and terminates long-running tasks

- **Bedrock Inference Profile**: AI model access for Datalfred (model sizes: `small`, `medium`, `large`)

- **EMR Studio**: Interactive development environment for Spark jobs

#### Key Variables

| Variable | Type | Description | Default |
|----------|------|-------------|---------|
| `project_name` | string | Project identifier | Required |
| `domain_name` | string | Domain name | Required |
| `stage_name` | string | Environment (dev, uat, prod, etc.) | Required |
| `git_repository` | string | GitLab repository URL | Required |
| `datalake_admin_principal_arns` | list(string) | IAM principals with full data access | `[]` |
| `use_public_subnets` | bool | Use public vs. private subnets | `true` |
| `database_description` | string | Description of the domain database | `""` |
| `skip_emr_serverless_sandbox_creation` | bool | Skip EMR sandbox image creation | `true` |
| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | Required |

#### Outputs

The module exports a domain object containing all necessary information for pipeline creation (see `domain_factory/outputs.tf`).

### B. Pipeline Factory

The `pipeline_factory` Terraform module provisions data pipelines with orchestrated tasks.

#### Key Resources

- **Step Functions State Machine**: Workflow orchestration with task dependencies

- **ECS or EMR Tasks**: Containerized data processing
  - **ECS**: Fargate tasks for lightweight Python jobs
  - **EMR**: Serverless Spark for large-scale processing

- **Glue Database** (optional): Pipeline-scoped catalog (`{stage_prefix}{pipeline_name}`)

- **CloudWatch Logs**: Task execution logs with 30-day retention

- **EventBridge Scheduler**: Schedule-based or event-driven triggers

- **IAM Roles**: Task-specific permissions (data access, Lake Formation, S3)

- **ECR Repositories**: Docker image storage per task

- **Failure Notifications**: CloudWatch Events trigger notifications on task failures

#### Key Variables

| Variable | Type | Description | Default |
|----------|------|-------------|---------|
| `pipeline_name` | string | Pipeline identifier | Required |
| `tasks_configuration` | map(object) | Task definitions (see below) | Required |
| `trigger` | object | Pipeline trigger configuration | `{"type": "none", "argument": "none"}` |
| `orchestration_configuration_template_file_path` | string | Step Functions template path | Required |
| `domain_object` | object | Output from domain_factory | Required |
| `failure_notification_receivers` | list(string) | Email addresses for failure alerts | `[]` |
| `skip_pipeline_database_creation` | bool | Skip pipeline database creation | `false` |

#### Task Configuration Structure

```hcl
tasks_configuration = {
  "task_name" : {
    "type"       : "python" | "sql"
    "path"       : "./relative/path/to/task/code"
    "infra_type" : "ECS" | "EMRServerless"
    "infra_config" : {
      "cpu"    : "512"   # ECS only: CPU units
      "memory" : "1024"  # ECS only: Memory in MB
    }
    "input_tables" : ["db.table1", "db.table2"]
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "overwrite" | "append" | "upsert"
        "upsert_keys"    : ["id"]
        "partition_keys" : ["date"]
      }
    }
    "additional_parameters" : {
      "param_key"       : "static_value"
      "dynamic_param.$" : "$.trigger_param"  # Reference trigger input
    }
    "additional_rebuild_trigger" : {}  # Force image rebuild
    "additional_permissions"     : ""  # Extra IAM permissions
  }
}
```

#### Trigger Configuration

**Schedule-based (cron)**:
```hcl
trigger = {
  "type"     : "schedule"
  "argument" : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({
    "key" : "value"
  })
}
```

**Manual execution only**:
```hcl
trigger = {
  "type"     : "none"
  "argument" : "none"
}
```
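
Putting the variables together, a hypothetical `pipeline_factory` invocation could look like the following; the module source path, task layout, template file name, and the `domain_object` output reference are illustrative and should be adapted to your repository:

```hcl
module "my_pipeline" {
  source = "./pipeline_factory"

  pipeline_name = "my_pipeline"
  domain_object = module.domain.domain_object # assumed output name, see domain_factory/outputs.tf

  orchestration_configuration_template_file_path = "./orchestration_configuration.tftpl.json"

  tasks_configuration = {
    "ingest_articles" : {
      "type"          : "python"
      "path"          : "./ingest_articles/"
      "infra_type"    : "ECS"
      "infra_config"  : { "cpu" : "512", "memory" : "1024" }
      "input_tables"  : []
      "output_tables" : {
        "my_domain.articles" : {
          "ingestion_mode" : "upsert"
          "upsert_keys"    : ["article_id"]
          "partition_keys" : ["publication_date"]
        }
      }
    }
  }

  trigger = {
    "type"     : "schedule"
    "argument" : "cron(15 1 * * ? *)"
  }

  failure_notification_receivers = ["user@example.com"]
}
```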

### C. Terraform Modules

The `pipeline_factory/modules` directory contains three submodules:

#### 1. `ecs_factory`

Provisions ECS Fargate tasks:
- Task definition with environment variables
- IAM roles for task execution and data access
- ECR repository and Docker image build
- CloudWatch log groups

#### 2. `emr_factory`

Provisions EMR Serverless applications:
- EMR application with Spark runtime
- IAM roles for job execution and data access
- ECR repository and Docker image build (Spark-compatible)
- S3 paths for Spark logs

#### 3. `build_and_upload_image_to_ecr`

Automates Docker image management:
- Copies task code and dependencies
- Builds Docker image using sandbox base image
- Pushes image to ECR
- Supports rebuild triggers for code changes

### D. Deployment Workflow

1. **Domain Deployment**: Terraform provisions domain infrastructure (S3, Glue, Lake Formation, IAM, etc.)

2. **Pipeline Deployment**: Terraform provisions pipeline infrastructure
   - Creates the Step Functions state machine
   - Builds Docker images for each task
   - Pushes images to ECR
   - Creates ECS task definitions or EMR applications

3. **Task Execution**:
   - EventBridge scheduler or a manual trigger starts the Step Functions execution
   - Step Functions orchestrates task execution based on the orchestration template
   - ECS/EMR tasks run with environment variables set by Terraform
   - Tasks use `datalake_sdk` to read/write data

4. **Data Ingestion**:
   - Tasks transform data using Pandas or Spark
   - The SDK ingests data to S3 in Iceberg format
   - Glue Catalog metadata is updated
   - Lake Formation permissions are enforced

5. **Monitoring & Notifications**:
   - CloudWatch logs capture task execution
   - The failsafe Lambda monitors task duration
   - CloudWatch Events trigger email notifications on failures

## VII. Configuration

### A. Environment Variables

The following variables are set automatically by the infrastructure; custom values declared in `additional_parameters` are exposed to tasks as `TASK_ADDITIONAL_PARAMETERS_*`:

| Variable | Description | Set By |
|----------|-------------|--------|
| `PROJECT_NAME` | Project identifier | Terraform |
| `DOMAIN_NAME` | Domain name | Terraform |
| `STAGE_NAME` | Environment name | Terraform |
| `PIPELINE_NAME` | Pipeline name | Terraform |
| `TASK_NAME` | Task name | Terraform |
| `INPUT_TABLES` | JSON-encoded list of input tables | Terraform |
| `OUTPUT_TABLES` | JSON-encoded dict of output table configs | Terraform |
| `IS_SQL_JOB` | Whether task executes SQL (`true`/`false`) | Terraform |
| `TASK_ADDITIONAL_PARAMETERS_*` | Custom parameters from Terraform | Terraform |
| `step_function_task_token` | Step Functions callback token | Step Functions |
| `step_function_execution_arn` | Step Functions execution ARN | Step Functions |
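
For illustration only, a task can read these variables directly. The JSON shapes below are assumptions based on the descriptions in the table, not a documented contract:

```python
import json
import os

project_name = os.environ["PROJECT_NAME"]
stage_name = os.environ["STAGE_NAME"]

# Assumed shapes: INPUT_TABLES is a JSON list, OUTPUT_TABLES a JSON object keyed by table name.
input_tables = json.loads(os.environ.get("INPUT_TABLES", "[]"))
output_tables = json.loads(os.environ.get("OUTPUT_TABLES", "{}"))

# Custom values from additional_parameters arrive prefixed with TASK_ADDITIONAL_PARAMETERS_.
custom_params = {
    key.removeprefix("TASK_ADDITIONAL_PARAMETERS_"): value
    for key, value in os.environ.items()
    if key.startswith("TASK_ADDITIONAL_PARAMETERS_")
}
```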

### B. Task Configuration

Example task configuration in Terraform:

```hcl
tasks_configuration = {
  "my_task" : {
    "type"       : "python",
    "path"       : "./my_task/",
    "infra_type" : "ECS",
    "infra_config" : {
      "cpu"    : "512",
      "memory" : "1024"
    },
    "input_tables" : ["db.input_table"],
    "output_tables" : {
      "db.output_table" : {
        "ingestion_mode" : "upsert",
        "upsert_keys"    : ["id"],
        "partition_keys" : ["date"]
      }
    },
    "additional_parameters" : {
      "my_param.$"   : "$.trigger_param", # Dynamic from trigger
      "static_param" : "value"
    },
    "additional_permissions" : data.aws_iam_policy_document.my_policy.json
  }
}
```

### C. Table Metadata

Place YAML files in `code/tables_configuration/` to document tables:

```yaml
# code/tables_configuration/my_database.my_table.yaml
description: "Customer dimension table"
schema:
customer_id:
description: "Unique customer identifier"
customer_name:
description: "Full name of the customer"
```

In addition, the SDK automatically writes a few Glue **table properties** on every successful ingestion:

- `datalake_sdk_upsert_keys` — comma-separated upsert keys used (only for `upsert` mode). Updated at every write; a warning is emitted if the keys differ from the previously stored value.
- `datalake_sdk_pipeline_name` / `datalake_sdk_task_name` — the pipeline/task that produced the table (skipped for ad-hoc CLI ingestions).

These properties are consumed by `datalake_sdk migrate_data` to derive default upsert keys and the owner IAM role.
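
To inspect these properties outside the SDK, the Glue API can be queried directly; a minimal example (database and table names are placeholders):

```python
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
properties = table.get("Parameters", {})

print(properties.get("datalake_sdk_upsert_keys"))
print(properties.get("datalake_sdk_pipeline_name"), properties.get("datalake_sdk_task_name"))
```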

### D. Triggers

**Schedule**: Cron-based execution

```hcl
trigger = {
  "type"       : "schedule"
  "argument"   : "cron(15 1 * * ? *)"
  "parameters" : jsonencode({"key": "value"})
}
```

**None**: Manual execution only

```hcl
trigger = {
  "type"     : "none"
  "argument" : "none"
}
```
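
With a `none` trigger, executions are started manually, for example from the AWS CLI. The state machine name below assumes the naming convention described in Section II and is purely illustrative:

```bash
# Start one execution of the pipeline's Step Functions state machine (ARN and input are examples).
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:eu-west-1:123456789012:stateMachine:poc_my_tests_dev_my_pipeline \
  --input '{"trigger_param": "value"}'
```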

## VIII. Project Structure

```
datalake/
├── datalake_sdk/                           # Python SDK and CLI
├── domain_factory/                         # Terraform module for domain infrastructure
├── pipeline_factory/                       # Terraform module for pipeline infrastructure
│   └── modules/
│       ├── ecs_factory/                    # ECS task provisioning
│       ├── emr_factory/                    # EMR Serverless provisioning
│       └── build_and_upload_image_to_ecr/  # Docker build and push
├── test/                                   # Integration tests and examples
├── doc_resources/                          # Documentation resources
├── .gitlab-ci.yml                          # GitLab CI pipeline configuration
├── .github/workflows/                      # GitHub Actions (semantic-release)
├── LICENSE                                 # Creative Commons Attribution-NonCommercial 4.0
└── README.md                               # This file
```

### A. datalake_sdk

**Purpose**: Provides a unified interface for data lake operations.

The `datalake_sdk` is a comprehensive Python package for interacting with the data lake. It includes:

- **CLI**: Command-line interface for ingestion, table deletion, and AI agent interaction
- **Processing Wrappers**: Abstract base class and implementations for Pandas and Spark
- **Datalfred Agent**: AI-powered assistant for natural language data lake interaction

For complete documentation, see [datalake_sdk/README.md](datalake_sdk/README.md).

**Key Files**:
- `main.py`: CLI entry point with subcommands
- `base_processing_wrapper.py`: Abstract base class
- `native_python_processing_wrapper.py`: Pandas implementation
- `spark_processing_wrapper.py`: Spark implementation
- `ingestion.py`: CLI ingestion command
- `delete_table.py`: CLI delete command
- `migrate_data.py`: CLI command to copy data from one stage to another
- `update_foreign_linked_databases.py`: CLI command to sync Glue resource links for cross-account databases
- `datalfred_agent/`: AI agent modules

**Dependencies** (from `pyproject.toml`):
- Core: `boto3`, `click`, `awswrangler`, `pyyaml`, `tqdm`, `slack-sdk`
- Optional: `strands-agents`, `strands-agents-tools`, `strands-agents-builder` (for Datalfred)

**Version**: 5.7.11 (automatically detected by domain_factory)

### B. domain_factory

**Purpose**: Terraform module to provision AWS resources for a data domain.

**Key Files**:
- `s3_data.tf`, `s3_technical.tf`: S3 bucket definitions
- `glue_database.tf`: Glue Data Catalog database
- `lakeformation.tf`: Lake Formation registration and permissions
- `athena_workgroup.tf`: Athena workgroup configuration
- `ecs_cluster_sandbox.tf`: ECS base image and cluster
- `emr_serverless_application_sandbox.tf`: EMR Serverless base image
- `codeartifact_repository.tf`: Private package repository
- `lambda_failsafe_shutdown.tf`: Task timeout enforcement
- `bedrock_inference_profile.tf`: AI model access
- `code_datalake_sdk.tf`: Packages and publishes SDK to CodeArtifact
- `variables.tf`: Input variables
- `outputs.tf`: Exported domain configuration
- `locals.tf`: Local variables (environment naming, SDK version extraction)

**Outputs**: Exports domain configuration consumed by pipeline_factory.

### C. pipeline_factory

**Purpose**: Terraform module to create data pipelines with orchestrated tasks.

**Key Files**:
- `step_function.tf`: AWS Step Functions state machine
- `ecs_tasks.tf`: ECS task module invocations
- `emr_tasks.tf`: EMR Serverless application module invocations
- `event_bridge_scheduler.tf`: Pipeline trigger configuration
- `cloudwatch_event_task_failed.tf`: Failure notification setup
- `cloudwatch_event_failsafe_shutdown.tf`: Failsafe Lambda trigger
- `glue_database.tf`: Pipeline-scoped database (optional)
- `variables.tf`: Input variables
- `outputs.tf`: Pipeline outputs
- `locals.tf`: Local variables (environment naming)

**Modules**:
- `ecs_factory/`: Provisions ECS Fargate tasks
- `emr_factory/`: Provisions EMR Serverless applications
- `build_and_upload_image_to_ecr/`: Builds and uploads Docker images

### D. test

**Purpose**: Integration tests and example pipeline implementation.

**Key Files**:
- `domain.tf`: Test domain deployment
- `pipeline.tf`: Test pipeline with multiple task types
- `variables.tf`: Test-specific variable definitions
- `integration_tests_pipeline/`: Test tasks
  - `test_write/`: Python task for data generation
  - `test_native_sql_entrypoint/`: Native SQL task
  - `test_spark_sql_entrypoint/`: Spark SQL task
  - `check_and_clean/`: Validation and cleanup task
  - `orchestration_configuration.tftpl.json`: Step Functions orchestration
- `utils/`: Test utilities
  - `run_integration_tests.py`: Test execution script
  - `pipeline_utils/`: Test library for dependency validation

**Variable Handling**:

The test configuration uses a different variable format for convenience:

| Variable | Type in domain_factory | Type in test | Transformation |
|----------|----------------------|--------------|----------------|
| `datalake_admin_principal_arns` | `list(string)` | `string` (comma-separated role names) | Split by comma, lookup ARNs via `data.aws_iam_role`, pass as list |
| `failure_notification_receivers` | `list(string)` | `string` (comma-separated emails) | Split by comma in module call |

Example test variable usage:
```hcl
# test/domain.tf
data "aws_iam_role" "datalake_admins" {
  for_each = toset(split(",", var.datalake_admin_principal_arns))
  name     = each.value
}

module "domain" {
  # ...
  datalake_admin_principal_arns  = values(data.aws_iam_role.datalake_admins)[*].arn
  failure_notification_receivers = split(",", var.failure_notification_receivers)
}
```

**CI/CD**: Integration tests run automatically in GitLab CI (`run_integration_tests` stage).

## IX. Limitations / Assumptions

1. **AWS-Only**: This platform is tightly coupled to AWS services and cannot be deployed on other cloud providers without significant refactoring.

2. **Python 3.13**: The SDK and processing tasks require Python ~3.13. Older Python versions are not supported.

3. **Iceberg Format**: All tables are stored in Apache Iceberg format. Direct Parquet or other formats are not supported for managed tables.

4. **Region**: Infrastructure is deployed in a single AWS region (default: `eu-west-1`). Cross-region replication is not implemented.

5. **Terraform State Backend**: Assumes an existing S3 bucket and DynamoDB table for Terraform state management. These must be created manually before deployment.

6. **Naming Conventions**: Resource names follow the pattern `{project_name}_{domain_name}_{stage_name}`. Non-prod stages prefix database names (e.g., `dev_my_database`). Production (`stage_name = "prod"`) databases have no prefix.

7. **Lake Formation Permissions**: The platform assumes AWS Lake Formation is the primary access control mechanism. IAM-only setups are not fully supported.

8. **CSV Ingestion**: CSV files must include headers for schema inference.

9. **Upsert Key Uniqueness**: Upsert keys must guarantee row uniqueness in the ingested dataset. Violations will cause ingestion failure.

10. **Concurrency**: Iceberg commit conflicts (e.g., simultaneous writes) are mitigated with retries (up to 30 retries with 2-10 minute waits), but high-concurrency scenarios may require tuning.

11. **Failsafe Shutdown**: The failsafe Lambda function monitors task durations but does not enforce hard limits on EMR Serverless jobs.

12. **Datalfred Agent**: The AI agent requires AWS Bedrock inference profiles to be pre-configured in the domain. Model sizes are fixed (`small`, `medium`, `large`).

13. **GitLab Primary**: GitLab is the source of truth for CI/CD. GitHub is a read-only mirror. GitHub Actions are only used for semantic-release on the `prod` branch.

14. **Subnet Configuration**: Tasks run in public subnets by default (`use_public_subnets=true`). Private subnets require a NAT Gateway for internet access (not provisioned by this platform).

15. **Integration Tests**: The `test/` folder contains integration tests that create and delete tables. These tests assume administrative permissions and should not be run in production environments.

16. **ECS Task Limits**: ECS tasks are constrained by Fargate CPU/memory limits (max 4 vCPU, 30 GB RAM). Larger workloads require EMR Serverless.

17. **SQL Tasks**: SQL entry point tasks (`type: "sql"`) are limited to single output tables and use a `main.sql` file. Multi-table SQL tasks are not supported.

18. **Workspace Isolation**: Terraform workspaces are used for environment isolation. The stage name is derived from:
    - **GitLab CI**: the Git branch name (`$CI_COMMIT_REF_SLUG`)
    - **Local execution**: the active Terraform workspace (use `terraform workspace select <workspace_name>`)

19. **Athena Costs**: Query costs are not monitored or capped by the platform. Users should implement AWS Budgets or Cost Anomaly Detection separately.

20. **VPC Dependency**: The domain factory expects a VPC tagged with `Name: {project_name}_network_platform_prod` containing appropriately tagged subnets (`Tier: Public` or `Tier: Private`).

21. **EMR Sandbox Creation**: By default, `skip_emr_serverless_sandbox_creation=true` to reduce deployment time. Set to `false` if large-scale Spark processing is required.

22. **CodeArtifact Publishing**: The domain factory automatically builds and publishes the `datalake_sdk` to CodeArtifact during deployment. The version is extracted from `datalake_sdk/pyproject.toml`.

23. **Semantic Versioning**: Releases are managed via semantic-release on GitHub (`.releaserc.json`). Conventional commit messages are required for automated versioning.

24. **Local AWS Credentials**: Terraform executed locally uses the default AWS credentials configured on the machine. Verify the active AWS account before applying changes.

25. **Local Task Execution**: Docker must be running and the task image must be available locally (either built locally or pulled from ECR after authentication). AWS credentials are required for accessing S3 and Glue.