{"id":20697470,"url":"https://github.com/epomatti/az-e2e-data-eng-proj","last_synced_at":"2026-04-28T17:32:11.139Z","repository":{"id":205616082,"uuid":"713117952","full_name":"epomatti/az-e2e-data-eng-proj","owner":"epomatti","description":"Data engineering with Azure services","archived":false,"fork":false,"pushed_at":"2023-11-05T13:39:42.000Z","size":414,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-05T23:41:13.520Z","etag":null,"topics":["azure","data","data-engineering","databricks","datafactory","datalake","lake","synapse","terraform"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epomatti.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-01T21:51:49.000Z","updated_at":"2023-11-05T19:56:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"93dbcd28-ee35-4fc3-a21d-d226180fb831","html_url":"https://github.com/epomatti/az-e2e-data-eng-proj","commit_stats":null,"previous_names":["epomatti/az-e2e-data-eng-proj"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/epomatti/az-e2e-data-eng-proj","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epomatti%2Faz-e2e-data-eng-proj","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epomatti%2Faz-e2e-data-eng-proj/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epomatti%2Faz-e2e-data-eng-proj/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/epomatti%2Faz-e2e-data-eng-proj/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epomatti","download_url":"https://codeload.github.com/epomatti/az-e2e-data-eng-proj/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epomatti%2Faz-e2e-data-eng-proj/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32392291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T14:34:11.604Z","status":"ssl_error","status_checked_at":"2026-04-28T14:32:37.009Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","data","data-engineering","databricks","datafactory","datalake","lake","synapse","terraform"],"created_at":"2024-11-17T00:18:12.510Z","updated_at":"2026-04-28T17:32:11.113Z","avatar_url":"https://github.com/epomatti.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Azure End-To-End Data Engineering Project\n\nComplete data ingestion, transformation and load using Azure services.\n\n\u003e Implementation reference from [this video][1].\n\n\u003cimg src=\".assets/azure-e2e.png\" /\u003e\n\n## Azure Infrastructure\n\nCreate the `.auto.tfvars` files and set the parameters as you prefer:\n\n```sh\ncp azure/config/dev.tfvars azure/.auto.tfvars\n```\n\nCheck your public IP address to be 
added to the firewall allow rules:\n\n```sh\ndig +short myip.opendns.com @resolver1.opendns.com\n```\n\nThe [dataset][2] is already available in the `./dataset/` directory and will be uploaded to the storage.\n\nCreate the resources on Azure:\n\n```sh\nterraform -chdir=\"azure\" init\nterraform -chdir=\"azure\" apply -auto-approve\n```\n\nTrigger the pipeline to get the data into the stage filesystem:\n\n```sh\naz datafactory pipeline create-run \\\n    --resource-group rg-olympics \\\n    --name PrepareForDatabricks \\\n    --factory-name adf-olympics-sandbox\n```\n\nIf you're not using Synapse immediately, pause the Synapse SQL pool to avoid costs while setting up the infrastructure:\n\n```sh\naz synapse sql pool pause -n pool1 --workspace-name synw-olympics -g rg-olympics\n```\n\n## Databricks\n\nThe previous Azure run should have created the `databricks/.auto.tfvars` file to configure Databricks.\n\nApply the Databricks configuration:\n\n\u003e 💡 If you haven't already, you need to log in to Databricks, which will create Key Vault policies.\n\n```sh\nterraform -chdir=\"databricks\" init\nterraform -chdir=\"databricks\" apply -auto-approve\n```\n\nOnce Databricks is running, execute the notebook to generate the data.\n\n## Synapse\n\nConnect to Synapse Studio.\n\nEnter the Data blade to create a new `Lake Database` using the studio and generate the tables from the `transformed-data` filesystem.\n\nUpload or copy the SQL test script:\n\n```sh\naz synapse sql-script create -f scripts/synapse-queries.sql -n Init --workspace-name synw-olympics --sql-pool-name pool1 --sql-database-name pool1\n```\n\n\n[1]: https://youtu.be/IaA9YNlg5hM?list=PL_ko60AZHL-pWXeO6YouiE-ZQlM02duKy\n[2]: 
https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepomatti%2Faz-e2e-data-eng-proj","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepomatti%2Faz-e2e-data-eng-proj","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepomatti%2Faz-e2e-data-eng-proj/lists"}