https://github.com/nanlabs/cloud-data-engineer-challenge

🚀 Cloud Data Engineer Challenge – Build an event-driven pipeline using AWS S3, Lambda, PostgreSQL (PostGIS) and API Gateway. Use IaC to deploy your solution. Bonus points for CI/CD, monitoring, and Docker support. See README for details! 📖



Host: GitHub
URL: https://github.com/nanlabs/cloud-data-engineer-challenge
Owner: nanlabs
License: mit
Created: 2025-03-18T03:49:57.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-03-18T07:09:18.000Z (about 1 year ago)
Last Synced: 2025-06-04T22:47:02.473Z (about 1 year ago)
Topics: aws, challenge, cloud, data-engineering, docker, etl, event-driven, postgis, postgresql, s3, technical-interview, terraform
Homepage:
Size: 7.81 KB
Stars: 0
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# 🚀 Cloud Data Engineer Challenge

Welcome to the **Cloud Data Engineer Challenge!** 🎉 This challenge is designed to evaluate your ability to work with **Infrastructure as Code (IaC), AWS data services, and data engineering workflows**, ensuring efficient data ingestion, storage, and querying.

> [!NOTE]
> You can use **any IaC tool of your choice** (Terraform preferred, but alternatives are allowed). If you choose a different tool or a combination of tools, **justify your decision!**

## ⚡ Challenge Overview

Your task is to deploy the following infrastructure on AWS:

> 🎯 **Key Objectives:**

- **An S3 bucket** that will receive data files as new objects.

- **A Lambda function** that is triggered when an object is `PUT` into the S3 bucket (an `s3:ObjectCreated:Put` event).

- **The Lambda function must:**

  - Process the ingested data and perform a minimal aggregation.

  - Store the processed data in a **PostgreSQL database with PostGIS enabled**.

  - Serve the aggregated data through an API Gateway endpoint (`GET /aggregated-data`).

- **A PostgreSQL database** running in a private subnet with PostGIS enabled.

- **Networking must include:** VPC, public/private subnets, and security groups.

- **The Lambda must be in a private subnet** and use a NAT Gateway in a public subnet for internet access 🌍

- **CloudWatch logs** should capture Lambda execution details and possible errors.

> [!IMPORTANT]
> Ensure that your solution is modular, well-documented, and follows best practices for security and maintainability.
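
To make the "minimal aggregation" objective more concrete, here is a rough sketch of what the processing step might look like. The challenge does not prescribe an input format or an aggregation, so the CSV layout, the `category` and `value` column names, and the `aggregate_csv` helper are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_csv(text):
    """Return per-category row count and mean of `value` for CSV text.

    The `category` and `value` column names are hypothetical; adapt them
    to whatever your example input files actually contain.
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in csv.DictReader(io.StringIO(text)):
        bucket = totals[row["category"]]
        bucket["count"] += 1
        bucket["sum"] += float(row["value"])
    return {
        category: {"count": t["count"], "mean": t["sum"] / t["count"]}
        for category, t in totals.items()
    }


# Hypothetical file contents, as they might be read from the uploaded object.
SAMPLE = "category,value\npark,10\npark,20\nschool,5\n"
result = aggregate_csv(SAMPLE)
```

Keeping the aggregation in a pure function like this makes it easy to unit-test without AWS; the Lambda handler would only need to read the object body and pass the text in.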

## 📌 Requirements

### 🛠 Tech Stack

> ⚡ **Must Include:**

- **IaC:** Any tool of your choice (**Terraform preferred**, but others are allowed if justified).

- **AWS Services:** S3, Lambda, API Gateway, CloudWatch, PostgreSQL with PostGIS (RDS or self-hosted on EC2).

### 📄 Expected Deliverables

> 📥 **Your submission must be a Pull Request that includes:**

- **An IaC module** that deploys the entire architecture.

- **A `README.md`** with deployment instructions and tool selection justification.

- **A working API Gateway endpoint** that returns the aggregated data stored in PostgreSQL.

- **CloudWatch logs** capturing Lambda execution details.

- **Example input files** to trigger the data pipeline (placed in an `examples/` directory).

- **A sample event payload** (JSON format) to simulate the S3 `PUT` event.
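
For the sample event payload, one possible approach is to generate it with a small script. The sketch below builds a trimmed-down S3 notification record (real notifications carry more fields) and shows how a handler might pull the bucket and key back out. The bucket name, object key, and both helper functions are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def make_s3_put_event(bucket, key, size=0):
    """Build a reduced S3 notification payload for a PUT upload."""
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key, "size": size},
                },
            }
        ]
    }


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 notification.

    S3 URL-encodes object keys in notifications, so they are decoded here.
    """
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
        if r.get("eventSource") == "aws:s3"
    ]


# Hypothetical bucket and key; "+" stands for a space in the encoded key.
event = make_s3_put_event("example-ingest-bucket", "examples/sample+data.csv")
print(json.dumps(event, indent=2))
```

The printed JSON could be saved alongside the example input files and used to invoke the function locally or from the Lambda console.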

> [!TIP]
> Use the `docs` folder to store any additional documentation or diagrams that help explain your solution.
>
> Mention any assumptions or constraints in your `README.md`.
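
For the working `GET /aggregated-data` endpoint listed among the deliverables, the handler might be shaped roughly as follows. This sketch assumes an API Gateway Lambda proxy integration, which expects a `statusCode`, optional `headers`, and a string `body`. The `fetch_rows` parameter is a hypothetical stand-in for the PostgreSQL query, injected so the handler can be exercised without a database.

```python
import json


def build_response(status_code, payload):
    """Format a payload the way a Lambda proxy integration expects."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context, fetch_rows=lambda: []):
    """Return aggregated rows as JSON.

    `fetch_rows` is hypothetical: in a real deployment it would query
    the PostgreSQL database and return a list of dicts.
    """
    try:
        rows = fetch_rows()
    except Exception as exc:
        # Anything printed by a Lambda function ends up in CloudWatch Logs.
        print(f"query failed: {exc}")
        return build_response(500, {"error": "internal error"})
    return build_response(200, {"items": rows})
```

Returning a generic message on failure while logging the detail keeps database errors out of the public response but still visible in the logs.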

## 🌟 Nice to Have

> 💡 **Bonus Points For:**

- **Data Quality & Validation**: Implementing **schema validation before storing data in PostgreSQL**.

- **Indexing & Query Optimization**: Using **PostGIS spatial indexing** for efficient geospatial queries.

- **Monitoring & Alerts**: Setting up **AWS CloudWatch Alarms** for S3 event failures or Lambda errors.

- **Automated Data Backups**: Creating periodic **database backups to S3** using AWS Lambda or AWS Backup.

- **GitHub Actions for validation**: Running **`terraform fmt`, `terraform validate`**, or equivalent for the chosen IaC tool.

- **Pre-commit hooks**: Ensuring linting and security checks before committing.

- **Docker for local testing**: Using **Docker Compose to spin up**:

  - A local PostgreSQL database with PostGIS to simulate the cloud environment 🛠

  - A local S3-compatible service (e.g., MinIO) to test file ingestion before deployment 🖥
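
As an illustration of the schema validation bonus item above, a record could be checked before it is written to PostgreSQL along these lines. This is a minimal sketch using only the standard library; the field names (`category`, `lat`, `lon`, `value`) and the `validate_record` helper are assumptions, since the challenge leaves the data schema up to you.

```python
# Hypothetical schema: field name -> type used to coerce the raw value.
REQUIRED = {"category": str, "lat": float, "lon": float, "value": float}


def validate_record(record):
    """Return (clean, errors): coerced fields plus a list of problems.

    An empty `errors` list means the record passed validation.
    """
    clean, errors = {}, []
    for field, cast in REQUIRED.items():
        raw = record.get(field)
        if raw is None or raw == "":
            errors.append(f"missing field: {field}")
            continue
        try:
            clean[field] = cast(raw)
        except (TypeError, ValueError):
            errors.append(f"invalid {field}: {raw!r}")
    # Range checks only run on coordinates that parsed successfully.
    if "lat" in clean and not -90 <= clean["lat"] <= 90:
        errors.append("lat out of range")
    if "lon" in clean and not -180 <= clean["lon"] <= 180:
        errors.append("lon out of range")
    return clean, errors


good, good_errors = validate_record(
    {"category": "park", "lat": "-34.6", "lon": "-58.4", "value": "10"}
)
_, bad_errors = validate_record({"category": "park", "lat": "95", "lon": "x"})
```

Collecting every error instead of stopping at the first one makes rejected rows easier to diagnose from the logs.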

> [!TIP]
> Looking for inspiration or additional ideas to earn extra points? Check out our **[Awesome NaNLABS repository](https://github.com/nanlabs/awesome-nan)** for reference projects and best practices! 🚀

## 📥 Submission Guidelines

> 📌 **Follow these steps to submit your solution:**

1. **Fork this repository.**

2. **Create a feature branch** for your implementation.

3. **Commit your changes** with meaningful commit messages.

4. **Open a Pull Request** following the provided template.

5. **Our team will review** and provide feedback.

## ✅ Evaluation Criteria

> 🔍 **What we'll be looking at:**

- **Correctness and completeness** of the **data pipeline**.

- **Use of best practices for event-driven processing** (S3 triggers, Lambda execution).

- **Data transformation & aggregation logic** implemented in Lambda.

- **Optimization for geospatial queries** using PostGIS.

- **Data backup & integrity strategies** (optional, e.g., automated S3 backups).

- **CI/CD automation using GitHub Actions and pre-commit hooks** (optional).

- **Documentation clarity**: Clear explanation of data flow, transformation logic, and infrastructure choices.
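
For the geospatial optimization criterion above, one common pattern is a GiST index on the spatial column combined with an index-aware predicate such as `ST_DWithin`. The sketch below keeps the SQL as Python constants. The table name, column names, and the `%(name)s` placeholder style (as used by psycopg) are assumptions, not part of the challenge.

```python
# Hypothetical schema: a geography column so distances are in meters.
CREATE_SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE IF NOT EXISTS aggregated_data (
    id BIGSERIAL PRIMARY KEY,
    category TEXT NOT NULL,
    total DOUBLE PRECISION NOT NULL,
    geog geography(Point, 4326) NOT NULL
);

-- The GiST index lets the radius search below avoid a full table scan.
CREATE INDEX IF NOT EXISTS aggregated_data_geog_idx
    ON aggregated_data USING GIST (geog);
"""

# ST_MakePoint takes (x, y), i.e. (longitude, latitude).
NEARBY_SQL = """
SELECT category, total
FROM aggregated_data
WHERE ST_DWithin(
    geog,
    ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
    %(radius_m)s
);
"""


def nearby_params(lat, lon, radius_m):
    """Validate inputs and return the parameter dict for NEARBY_SQL."""
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("coordinates out of range")
    if radius_m <= 0:
        raise ValueError("radius must be positive")
    return {"lat": lat, "lon": lon, "radius_m": radius_m}


params = nearby_params(-34.6, -58.4, 500)
```

Passing the coordinates as query parameters, rather than formatting them into the SQL string, also avoids SQL injection through the API.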

## 🎯 **Good luck and happy coding!** 🚀
