Data engineering with Azure services
- Host: GitHub
- URL: https://github.com/epomatti/az-e2e-data-eng-proj
- Owner: epomatti
- License: MIT
- Created: 2023-11-01T21:51:49.000Z
- Default Branch: main
- Last Pushed: 2023-11-05T13:39:42.000Z
- Topics: azure, data, data-engineering, databricks, datafactory, datalake, lake, synapse, terraform
- Language: HCL
- Size: 404 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Azure End-To-End Data Engineering Project
Complete data ingestion, transformation, and load using Azure services.
> Implementation reference from [this video][1].
## Azure Infrastructure
Create the `.auto.tfvars` file and set the parameters as you prefer:
```sh
cp azure/config/dev.tfvars azure/.auto.tfvars
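
# Then open the copied file and adjust the values to your environment:
"${EDITOR:-vi}" azure/.auto.tfvars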
```

Check your public IP address so it can be added to the firewall allow rules:
```sh
dig +short myip.opendns.com @resolver1.opendns.com
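
# Alternative lookup over HTTPS, in case dig is not installed (assumes curl):
curl -s https://ifconfig.me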
```

The [dataset][2] is already available in the `./dataset/` directory and will be uploaded to the storage account.
Create the resources on Azure:
```sh
terraform -chdir="azure" init
terraform -chdir="azure" apply -auto-approve
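
# Optional sanity check: list the Terraform outputs once apply completes:
terraform -chdir="azure" output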
```

Trigger the pipeline to get the data into the stage filesystem:
```sh
az datafactory pipeline create-run \
--resource-group rg-olympics \
--name PrepareForDatabricks \
--factory-name adf-olympics-sandbox
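
# create-run prints a runId; poll the run status with it.
# <runId> below is a placeholder for the value returned above.
az datafactory pipeline-run show \
  --resource-group rg-olympics \
  --factory-name adf-olympics-sandbox \
  --run-id "<runId>"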
```

If you're not using Synapse immediately, pause the Synapse SQL pool to avoid costs while setting up the infrastructure:
```sh
az synapse sql pool pause -n pool1 --workspace-name synw-olympics -g rg-olympics
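
# Counterpart command: resume the pool when you're ready to use Synapse.
az synapse sql pool resume -n pool1 --workspace-name synw-olympics -g rg-olympics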
```

## Databricks
The previous Azure run should have created the `databricks/.auto.tfvars` file to configure Databricks.
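A quick way to confirm the file was generated before continuing:

```sh
cat databricks/.auto.tfvars
```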
Apply the Databricks configuration:
> 💡 If you haven't already, log in to Databricks first; this creates the Key Vault policies.
```sh
terraform -chdir="databricks" init
terraform -chdir="databricks" apply -auto-approve
```

Once Databricks is running, execute the notebook to generate the data.
## Synapse
Connect to Synapse Studio.
In the Data blade, create a new `Lake Database` and generate the tables from the `transformed-data` filesystem.
Upload or copy the SQL test script:
```sh
az synapse sql-script create \
  -f scripts/synapse-queries.sql \
  -n Init \
  --workspace-name synw-olympics \
  --sql-pool-name pool1 \
  --sql-database-name pool1
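
# Hedged: list the workspace's SQL scripts to confirm it was created
# (requires a recent azure-cli):
az synapse sql-script list --workspace-name synw-olympics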
```

[1]: https://youtu.be/IaA9YNlg5hM?list=PL_ko60AZHL-pWXeO6YouiE-ZQlM02duKy
[2]: https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo