https://github.com/aws-samples/tracing-etl-workloads-otel

etl opentelemetry tracing x-ray

# Tracing ETL Workloads in AWS Glue using AWS X-Ray and AWS Distro for OpenTelemetry

This project demonstrates an end-to-end extract, transform, and load (ETL) pipeline using AWS Glue, orchestrated by AWS Step Functions, and instrumented with AWS Distro for OpenTelemetry (ADOT) for comprehensive tracing. It showcases how to implement distributed tracing and observability in a serverless data processing architecture, providing deep insights into ETL workflows.


Architecture Diagram

## Prerequisites

- AWS Account
- AWS CLI configured with appropriate permissions
- Node.js and npm installed
- Python 3.8 or later
- AWS CDK CLI installed (`npm install -g aws-cdk`)

## Setup

1. Clone the repository:
```
git clone https://github.com/aws-samples/tracing-etl-workloads-otel.git
cd tracing-etl-workloads-otel
```

2. Create and activate a virtual environment:
```
python3 -m venv .venv
source .venv/bin/activate # For Linux and macOS
```
or
```
python -m venv .venv
.venv\Scripts\activate # For Windows
```

3. Install dependencies:
```
pip install -r requirements.txt
```

4. Bootstrap the CDK environment (if not already done):
```
cdk bootstrap
```

5. Deploy the stack:
```
cdk deploy
```

## Usage

1. After deployment, upload a raw Airbnb dataset file (CSV format) to the created S3 bucket under the `input/` prefix.
2. An S3 event notification invokes a Lambda function, which automatically starts the Step Functions workflow.
3. The workflow will clean the data, process it, and store the results back in S3.
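
Step 2 above can be pictured as a small Lambda handler. The sketch below is not the repository's function. It assumes the state machine ARN arrives in a hypothetical `STATE_MACHINE_ARN` environment variable and that the workflow accepts a JSON input with hypothetical `bucket` and `key` fields. The Step Functions client can be passed in so the logic can be exercised without AWS credentials.

```python
import json
import os
from urllib.parse import unquote_plus


def build_execution_inputs(event, prefix="input/"):
    """Extract one workflow input per uploaded object under `prefix`."""
    inputs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix):
            inputs.append({"bucket": bucket, "key": key})
    return inputs


def handler(event, context, sfn_client=None, state_machine_arn=None):
    """Start one Step Functions execution per matching S3 object."""
    if sfn_client is None:
        import boto3  # imported lazily; provided by the Lambda runtime

        sfn_client = boto3.client("stepfunctions")
    # STATE_MACHINE_ARN is a hypothetical variable name for this sketch.
    arn = state_machine_arn or os.environ["STATE_MACHINE_ARN"]
    started = []
    for payload in build_execution_inputs(event):
        response = sfn_client.start_execution(
            stateMachineArn=arn,
            input=json.dumps(payload),
        )
        started.append(response["executionArn"])
    return {"started": started}
```

Filtering on the `input/` prefix inside the handler is a defensive choice: if the workflow writes its results to the same bucket, objects outside that prefix will not start new executions.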

## Data Processing Workflow

1. Data Cleaning Job: Converts data to Parquet format, standardizes column names, drops unnecessary columns, removes duplicates, fills null values, and performs various data transformations.
2. Data Processing Job: Calculates metrics, ranks properties, and groups data for analysis.
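
To make the two jobs concrete, here is a minimal sketch of a few of the listed steps in plain Python over a list of dictionaries. It is not the Glue job code: it swaps in standard-library code for whatever engine the jobs use, skips the Parquet conversion, and the column names (`price`, `neighbourhood`) and fill values are hypothetical.

```python
import re
from collections import defaultdict


def standardize_name(name):
    """Lower-case a column name and replace non-alphanumeric runs with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows, drop_columns=(), fill_values=None):
    """Standardize names, drop columns, fill nulls, and remove duplicates."""
    fill_values = fill_values or {}
    seen = set()
    cleaned = []
    for row in rows:
        record = {}
        for raw_name, value in row.items():
            name = standardize_name(raw_name)
            if name in drop_columns:
                continue
            if value is None or value == "":
                value = fill_values.get(name, value)
            record[name] = value
        # A row is a duplicate if every remaining column matches an earlier row.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(record)
    return cleaned


def rank_within_groups(rows, group_key, metric_key):
    """Group rows by `group_key` and rank each group by `metric_key`, highest first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    ranked = {}
    for group, members in groups.items():
        ordered = sorted(members, key=lambda r: r[metric_key], reverse=True)
        ranked[group] = [dict(r, rank=i) for i, r in enumerate(ordered, start=1)]
    return ranked
```

For example, `rank_within_groups(clean_rows(rows, drop_columns={"notes"}, fill_values={"price": 0}), "neighbourhood", "price")` would return the cleaned listings grouped by neighbourhood, each with a `rank` field.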

## Observability and Tracing

- The solution uses OpenTelemetry for distributed tracing and metrics collection.
- Traces and metrics are sent to the OpenTelemetry Collector running on ECS.
- AWS X-Ray provides end-to-end tracing across AWS services.
- A custom X-Ray helper module correlates traces across the ETL pipeline.
- Individual operations within Glue jobs are wrapped in spans for detailed tracing.
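
Correlating traces across services depends on passing a trace identifier from one component to the next. The helper module's implementation is not reproduced here. The sketch below only illustrates the ID handling such a helper might need, assuming the documented X-Ray trace header format (`Root=...;Parent=...;Sampled=...`) and the convention that an X-Ray trace ID is a 32-hex-digit trace ID split as `1-<8 hex>-<24 hex>`. All function names are hypothetical.

```python
import re

_XRAY_ROOT = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")


def parse_trace_header(header):
    """Split an X-Ray trace header into its key/value fields."""
    fields = {}
    for part in header.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def xray_to_otel_trace_id(root):
    """Convert '1-<8 hex>-<24 hex>' into a 32-character hex trace ID."""
    match = _XRAY_ROOT.match(root.lower())
    if match is None:
        raise ValueError(f"not an X-Ray trace ID: {root!r}")
    return match.group(1) + match.group(2)


def otel_to_xray_trace_id(trace_id):
    """Convert a 32-character hex trace ID into X-Ray's dashed format."""
    trace_id = trace_id.lower()
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError(f"not a 32-character hex trace ID: {trace_id!r}")
    return f"1-{trace_id[:8]}-{trace_id[8:]}"
```

The conversion to X-Ray's format is only meaningful when the first eight hex digits of the trace ID encode the trace start time in epoch seconds, which is what an X-Ray-compatible ID generator is expected to produce.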

## Cleanup

To avoid incurring future charges, delete the resources:

```
cdk destroy
```

Note: CloudWatch logs generated by Lambda functions are not automatically deleted. Use the CloudWatch console to manually remove any logs you don't want to retain.
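
As an alternative to the console, the leftover log groups could be removed with a short script. This is a sketch under assumptions: it relies on Lambda's default log group naming (`/aws/lambda/<function-name>`), the function-name prefix passed in is hypothetical and must match your deployed functions, and it defaults to a dry run that only lists what would be deleted.

```python
def delete_lambda_log_groups(logs_client, function_prefix, dry_run=True):
    """List, and optionally delete, log groups of matching Lambda functions.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    prefix = f"/aws/lambda/{function_prefix}"
    paginator = logs_client.get_paginator("describe_log_groups")
    names = []
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page.get("logGroups", []):
            names.append(group["logGroupName"])
    if not dry_run:
        for name in names:
            logs_client.delete_log_group(logGroupName=name)
    return names
```

Calling `delete_lambda_log_groups(boto3.client("logs"), "<your-function-prefix>")` first and reviewing the returned names before repeating the call with `dry_run=False` keeps the deletion deliberate.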

For more information and detailed explanations, refer to the accompanying blog post.