Data engineering with Azure services
- Host: GitHub
- URL: https://github.com/epomatti/az-e2e-data-eng-proj
- Owner: epomatti
- License: MIT
- Created: 2023-11-01T21:51:49.000Z
- Default Branch: main
- Last Pushed: 2023-11-05T13:39:42.000Z
- Topics: azure, data, data-engineering, databricks, datafactory, datalake, lake, synapse, terraform
- Language: HCL
- Size: 404 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Azure End-To-End Data Engineering Project
Complete data ingestion, transformation, and load using Azure services.
> Implementation reference from [this video][1].
## Azure Infrastructure
Create the `.auto.tfvars` file and set the parameters as you prefer:
```sh
cp azure/config/dev.tfvars azure/.auto.tfvars
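
# Then open the copied file and adjust the values to your environment:
"${EDITOR:-vi}" azure/.auto.tfvars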
```

Check your public IP address so it can be added to the firewall allow rules:
```sh
dig +short myip.opendns.com @resolver1.opendns.com
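
# Alternative lookup over HTTPS, in case dig is not installed (assumes curl):
curl -s https://ifconfig.me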
```

The [dataset][2] is already available in the `./dataset/` directory and will be uploaded to the storage account.
Create the resources on Azure:
```sh
terraform -chdir="azure" init
terraform -chdir="azure" apply -auto-approve
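
# Optional sanity check: list the Terraform outputs once apply completes:
terraform -chdir="azure" output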
```

Trigger the pipeline to get the data into the stage filesystem:
```sh
az datafactory pipeline create-run \
--resource-group rg-olympics \
--name PrepareForDatabricks \
--factory-name adf-olympics-sandbox
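
# create-run prints a runId; poll the run status with it.
# <runId> below is a placeholder for the value returned above.
az datafactory pipeline-run show \
  --resource-group rg-olympics \
  --factory-name adf-olympics-sandbox \
  --run-id "<runId>"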
```

If you're not using Synapse immediately, pause the Synapse SQL pool to avoid costs while setting up the infrastructure:
```sh
az synapse sql pool pause -n pool1 --workspace-name synw-olympics -g rg-olympics
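
# Counterpart command: resume the pool when you're ready to use Synapse.
az synapse sql pool resume -n pool1 --workspace-name synw-olympics -g rg-olympics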
```

## Databricks
The previous Azure run should have created the `databricks/.auto.tfvars` file to configure Databricks.
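A quick way to confirm the file was generated before continuing:

```sh
cat databricks/.auto.tfvars
```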
Apply the Databricks configuration:
> 💡 If you haven't already, log in to Databricks first; this creates the Key Vault policies.
```sh
terraform -chdir="databricks" init
terraform -chdir="databricks" apply -auto-approve
```

Once Databricks is running, execute the notebook to generate the data.
## Synapse
Connect to Synapse Studio.
In the Data blade, create a new `Lake Database` and generate the tables from the `transformed-data` filesystem.
Upload or copy the SQL test script:
```sh
az synapse sql-script create \
  -f scripts/synapse-queries.sql \
  -n Init \
  --workspace-name synw-olympics \
  --sql-pool-name pool1 \
  --sql-database-name pool1
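
# Hedged: list the workspace's SQL scripts to confirm it was created
# (requires a recent azure-cli):
az synapse sql-script list --workspace-name synw-olympics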
```

[1]: https://youtu.be/IaA9YNlg5hM?list=PL_ko60AZHL-pWXeO6YouiE-ZQlM02duKy
[2]: https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo