{"id":14982340,"url":"https://github.com/Narius2030/MOLISA-Data-Warehouse","last_synced_at":"2025-10-29T12:31:32.286Z","repository":{"id":253426330,"uuid":"837799948","full_name":"Narius2030/MOLISA-Data-Warehouse-Integration","owner":"Narius2030","description":"Extract data from many databases of Labor, Invalids and Social Affairs sectors and convert to appropriate structure and format, then upload to shared data warehouse and data mart. Thanks to that, people of state agencies can easily retrieve and analyze data based on the compiled data warehouse. ","archived":false,"fork":false,"pushed_at":"2024-09-05T15:33:11.000Z","size":9335,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T01:31:45.146Z","etag":null,"topics":["apache-airflow","apache-spark","api-rest","data-pipeline","data-warehousing","medallion-architecture","postgresql"],"latest_commit_sha":null,"homepage":"","language":"PLpgSQL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Narius2030.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-04T04:27:58.000Z","updated_at":"2024-09-17T18:38:58.000Z","dependencies_parsed_at":"2024-09-29T06:15:23.509Z","dependency_job_id":null,"html_url":"https://github.com/Narius2030/MOLISA-Data-Warehouse-Integration","commit_stats":{"total_commits":49,"total_committers":2,"mean_commits":24.5,"dds":"0.24489795918367352","last_synced_commit":"d979dcd12478384464ae36ab701b5bd85e48ced0"},"previous_names":["narius2030/molisa-data-warehouse-integration"],"tags_count"
:0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FMOLISA-Data-Warehouse-Integration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FMOLISA-Data-Warehouse-Integration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FMOLISA-Data-Warehouse-Integration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Narius2030%2FMOLISA-Data-Warehouse-Integration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Narius2030","download_url":"https://codeload.github.com/Narius2030/MOLISA-Data-Warehouse-Integration/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825708,"owners_count":19537112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-spark","api-rest","data-pipeline","data-warehousing","medallion-architecture","postgresql"],"created_at":"2024-09-24T14:05:13.940Z","updated_at":"2025-10-29T12:31:31.623Z","avatar_url":"https://github.com/Narius2030.png","language":"PLpgSQL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Integration Strategy\n\n![image](https://github.com/user-attachments/assets/4d76bfb0-aff3-4520-9972-8ab6ce76008e)\n\n\n**Description:** This data warehouse follows the `Inmon approach`: all data is integrated into a single warehouse, from which several data marts are created for individual sectors of the government system\n- **Data Source:** 
Multiple databases from different government systems\n- **Medallion Architecture:** Refines data across layers to improve its structure and quality for better insights and analysis: `bronze -\u003e silver -\u003e gold`\n- **Staging Area:** Keeps the source databases and the data warehouse independent while transformations and aggregations run\n\n# Data Pipeline Automation\n\nEvery step in this project is designed as part of a data pipeline that can be automated: it loads raw data from the sources, passes it through the medallion layers to ensure data quality, and finally loads it into the warehouse and data marts.\n- **Scheduler:** Apache Airflow automates the end-to-end integration process\n- **Transformation:** Apache Spark, through the PySpark package in Python, processes and aggregates the data\n- **Environment:** The pipeline is deployed on Docker containers, including the *Database Server* and *Airflow*\n\n\n### Docker setup\n\nDockerfile for Airflow and Spark\n```dockerfile\nFROM apache/airflow:2.9.1-python3.11\n\nUSER root\n\n# Install OpenJDK-17\nRUN apt-get update \u0026\u0026 \\\n    apt-get install -y openjdk-17-jdk \u0026\u0026 \\\n    apt-get install -y ant \u0026\u0026 \\\n    apt-get clean;\n\n# Set JAVA_HOME\nENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/\n\nUSER airflow\n\n# Sync files from local to Docker image\nCOPY ./airflow/dags /opt/airflow/dags\nCOPY requirements.txt .\n\n# Pyspark package\nRUN pip install --no-cache-dir -r requirements.txt\nRUN rm requirements.txt\n```\n\n\nDAGs of data warehouse integration\n\n![image](https://github.com/user-attachments/assets/91cd725b-35f7-49f8-a173-f086a9024a22)\n\n\nDAGs of Resident data mart integration\n\n![image](https://github.com/user-attachments/assets/517eaa0d-5013-4325-9b48-f9a574010f26)\n\n\nDAGs of Time and Location 
integration\n\n![image](https://github.com/user-attachments/assets/3a854528-70d3-4dbf-8961-c6eae03502b4)\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNarius2030%2FMOLISA-Data-Warehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNarius2030%2FMOLISA-Data-Warehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNarius2030%2FMOLISA-Data-Warehouse/lists"}