{"id":26578041,"url":"https://github.com/dadananjesha/azuredataengine","last_synced_at":"2026-04-15T14:37:31.932Z","repository":{"id":232764615,"uuid":"785138283","full_name":"DadaNanjesha/AzureDataEngine","owner":"DadaNanjesha","description":"AzureDataEngine is a robust, scalable batch processing data architecture built on the Azure platform. It efficiently extracts, transforms, and loads massive datasets for machine learning applications, leveraging Azure Blob Storage, PostgreSQL, Databricks, and Key Vault to ensure reliability and maintainability.","archived":false,"fork":false,"pushed_at":"2025-03-08T21:59:25.000Z","size":1092,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-08T22:25:50.578Z","etag":null,"topics":["azure","batch-processing","blob-storage","databricks","etl","etl-framework","key-vault","postgresql-database","spark","vnet"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DadaNanjesha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-11T09:18:23.000Z","updated_at":"2025-03-08T21:59:28.000Z","dependencies_parsed_at":"2025-03-08T22:22:39.820Z","dependency_job_id":"7d687dda-f30d-46f1-8393-f3900ad1e445","html_url":"https://github.com/DadaNanjesha/AzureDataEngine","commit_stats":null,"previous_names":["dadananjesha/batch-processing"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DadaNanjesha%2FAzureDataEngine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DadaNanjesha%2FAzureDataEngine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DadaNanjesha%2FAzureDataEngine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DadaNanjesha%2FAzureDataEngine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DadaNanjesha","download_url":"https://codeload.github.com/DadaNanjesha/AzureDataEngine/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245053113,"owners_count":20553263,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","batch-processing","blob-storage","databricks","etl","etl-framework","key-vault","postgresql-database","spark","vnet"],"created_at":"2025-03-23T04:19:03.270Z","updated_at":"2026-04-15T14:37:26.905Z","avatar_url":"https://github.com/DadaNanjesha.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Batch Processing Data Architecture 🚀📊\r\n\r\n\u003c!-- Top Tech Stack Badges --\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Azure%20DevOps-0078D4?style=for-the-badge\u0026logo=azuredevops\u0026logoColor=white\" alt=\"Azure DevOps\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Azure%20Repos-0078D4?style=for-the-badge\u0026logo=azuredevops\u0026logoColor=white\" alt=\"Azure Repos\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Azure%20Pipelines-0078D4?style=for-the-badge\u0026logo=azurepipelines\u0026logoColor=white\" alt=\"Azure Pipelines\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://portal.azure.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Azure%20Portal-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Azure Portal\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://www.postgresql.org/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/PostgreSQL-336791?style=for-the-badge\u0026logo=postgresql\u0026logoColor=white\" alt=\"Azure Database for PostgreSQL\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://databricks.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Databricks-F9BF3B?style=for-the-badge\u0026logo=databricks\u0026logoColor=white\" alt=\"Azure Databricks\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/en-us/services/storage/blobs/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Blob%20Storage-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Azure Blob Storage\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/en-us/services/key-vault/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Key%20Vault-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Azure Key Vault\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Network%20Watcher-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Network Watcher\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Network%20Security-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Network Security\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://azure.microsoft.com/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Resource%20Group-0078D4?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white\" alt=\"Resource Group\" /\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://www.python.org/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Python-3.8%2B-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Python\" /\u003e\r\n  \u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\n---\r\n\r\n## 📖 Introduction\r\n\r\n**Batch Processing Data Architecture** is a robust project that builds a scalable, dependable, and maintainable data processing backend on the Azure platform. Designed as the backbone for a machine learning application, it efficiently processes enormous amounts of data, performs necessary preprocessing, and aggregates it for downstream ML tasks.\r\n\r\nThe system leverages canonical software components and data engineering best practices to integrate multiple Azure services for a comprehensive solution.\r\n\r\n![Project Architecture](https://github.com/DadaNanjesha/batch-processing/blob/main/Project%20structure.png)\r\n\r\n---\r\n\r\n## ✨ Key Features\r\n\r\n- **Scalable Batch Processing:** Efficiently processes massive datasets in scheduled batches.\r\n- **ETL Workflows:** Custom Python scripts for data extraction, transformation, and loading.\r\n- **Azure Integration:** Leverages Blob Storage, PostgreSQL, Databricks, Key Vault, and more.\r\n- **Modular Design:** Easy-to-maintain code structure with dedicated ETL and loading scripts.\r\n\r\n---\r\n## 🛠️ Technologies Used\r\n\r\n- **Azure DevOps**  \r\n\r\n- **Azure Repos**  \r\n\r\n- **Azure Pipelines**  \r\n\r\n- **Azure Portal**  \r\n\r\n- **Azure Database for PostgreSQL**  \r\n\r\n- **Azure Databricks**  \r\n\r\n- **Azure Blob Storage**  \r\n\r\n- **Azure Key Vault**  \r\n\r\n- **Network Watcher \u0026 Network Security**  \r\n\r\n- **Resource Group**  \r\n\r\n- **Python**  \r\n\r\n---\r\n\r\n## 🔄 Flow Diagram\r\n\r\n```mermaid\r\nflowchart TD\r\n    A[📄 CSV Data Source] --\u003e B[🔄 ETL_batchdata.py]\r\n    B --\u003e C[🧹 Data Transformation \u0026 Aggregation]\r\n    C --\u003e D[📤 loadtoblobtable.py]\r\n    D --\u003e E[💾 Storage :Azure Blob/PostgreSQL]\r\n    E --\u003e F[📈 Machine Learning Application]\r\n```\r\n\r\n---\r\n\r\n## 🗂️ Project Structure\r\n\r\n```plaintext\r\nbatch-processing/\r\n├── .gitignore                          # Git ignore file\r\n├── ETL_batchdata.py                    # Main ETL script for batch data processing\r\n├── loadtoblobtable.py                  # Script to load processed data into storage\r\n├── GoudaShanbog_DadaNanjesha_10220129_Data Engineering_Phase1.pdf  # Phase 1 design document\r\n├── GoudaShanbog_DadaNanjesha_10220129_Data Engineering_Phase2.pdf  # Phase 2 design document\r\n├── GoudaShanbog_DadaNanjesha_10220129_Data Engineering_Phase3.pdf  # Phase 3 design document\r\n├── Project structure.png               # Visual diagram of project architecture\r\n└── output file.pdf                     # Sample output report from data aggregation\r\n```\r\n\r\n---\r\n\r\n## 💻 Setup Steps\r\n\r\nBefore getting started, ensure you have an active [Azure subscription](https://azure.microsoft.com/).\r\n\r\n1. **Create Your Azure Environment:**\r\n   - Set up your Azure subscription and create a Resource Group.\r\n   - Provision necessary services such as Azure Blob Storage, PostgreSQL, Databricks, Key Vault, etc.\r\n\r\n2. **Prepare Your Data:**\r\n   - Deploy your CSV data into the PostgreSQL database or Blob Storage as needed.\r\n\r\n3. **Run the ETL Process:**\r\n   - Execute the `ETL_batchdata.py` script to extract, transform, and prepare your data.\r\n   - Run `loadtoblobtable.py` to load the processed data into your target storage.\r\n\r\n4. **Integrate with ML Application:**\r\n   - Ensure your machine learning application can access the processed data from the designated storage.\r\n\r\n---\r\n## ⭐️ Support \u0026 Call-to-Action\r\n\r\nIf you find this project useful, please consider:\r\n- **Starring** the repository ⭐️\r\n- **Forking** the project to contribute enhancements\r\n- **Following** for updates on future improvements\r\n\r\nYour engagement helps increase visibility and encourages further collaboration!\r\n\r\n---\r\n## 📜 License\r\n\r\nThis project is licensed under the [MIT License](LICENSE).\r\n\r\n---\r\n\r\n## 🙏 Acknowledgements\r\n\r\n- **Azure Services:** For providing a robust, scalable infrastructure.\r\n- **Data Engineering Principles:** Guiding our modular and reliable architecture.\r\n- **Contributors:** Thank you to everyone who supported and contributed to this project.\r\n\r\n---\r\n\r\n*Happy Data Processing! 🚀📊*\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdadananjesha%2Fazuredataengine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdadananjesha%2Fazuredataengine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdadananjesha%2Fazuredataengine/lists"}