{"id":25434360,"url":"https://github.com/tom474/data_pipeline_with_databricks","last_synced_at":"2025-05-14T21:12:48.847Z","repository":{"id":269510913,"uuid":"907445577","full_name":"tom474/data_pipeline_with_databricks","owner":"tom474","description":"[RMIT 2024C] EEET2574 - Big Data for Engineering - MongoDB and Spark","archived":false,"fork":false,"pushed_at":"2025-02-16T16:14:51.000Z","size":16710,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-16T16:41:28.319Z","etag":null,"topics":["data-engineering","data-science","data-visualization","databricks","jupyter-notebook","machine-learning","mongodb","python","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tom474.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-23T15:44:04.000Z","updated_at":"2025-02-16T16:14:54.000Z","dependencies_parsed_at":"2025-02-16T16:42:49.514Z","dependency_job_id":null,"html_url":"https://github.com/tom474/data_pipeline_with_databricks","commit_stats":null,"previous_names":["tom474/mongodb_and_spark","tom474/data_pipeline_with_databricks"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tom474%2Fdata_pipeline_with_databricks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tom474%2Fdata_pipeline_with_databricks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/tom474%2Fdata_pipeline_with_databricks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tom474%2Fdata_pipeline_with_databricks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tom474","download_url":"https://codeload.github.com/tom474/data_pipeline_with_databricks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254227631,"owners_count":22035671,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-science","data-visualization","databricks","jupyter-notebook","machine-learning","mongodb","python","spark"],"created_at":"2025-02-17T06:16:23.566Z","updated_at":"2025-05-14T21:12:48.834Z","avatar_url":"https://github.com/tom474.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Pipeline with Databricks  \n\nThis project focuses on building a data pipeline using **Databricks**, **PySpark**, and **MongoDB**, integrating **MLflow** for machine learning tracking. 
The pipeline handles data ingestion, transformation, model training, and visualization.\n\n## Tech Stack  \n\n- Python  \n- Databricks  \n- PySpark  \n- MLflow\n- MongoDB  \n\n## Features  \n\n### Data Preparation on MongoDB  \n- Pre-loaded datasets categorized by year and type (Electricity, Gas).\n- Collections structure: `electricity_2018`, `electricity_2019`, `electricity_2020`, `gas_2018`, `gas_2019`, `gas_2020`.\n\n### Data Ingestion  \n- Loading electricity and gas consumption data from MongoDB into PySpark DataFrames.\n- Merging 2018 and 2019 data as the training dataset while 2020 data is used for testing.\n\n### Data Exploration  \n- Conversion of Spark DataFrames to Pandas for in-depth analysis.\n- Shape, summary statistics, missing values, and outlier detection.\n- Categorical and numerical features analysis.\n\n### Data Cleaning  \n- Removal of columns with excessive missing values (`%Defintieve aansl (NRM)`).\n- Dropping identifier and high-cardinality categorical columns (`_id`, `street`, `zipcode_from`, `zipcode_to`).\n- Eliminating constant columns (`annual_consume_lowtarif_perc` for gas).\n- Handling outliers using log transformation.\n- Duplicate data removal.\n\n### Data Transformation  \n- **Encoding categorical features** (One-Hot Encoding via `StringIndexer` \u0026 `OneHotEncoder`).\n- **Scaling numerical features** using `MinMaxScaler`.\n- **Feature engineering**: Combining transformed columns into a single feature vector.\n\n### Model Training and Tracking  \n- Implemented **Random Forest Regressor** and **Decision Tree Regressor** models.\n- **MLflow** integration for parameter tracking and performance logging.\n- Best models selected based on **MAE, R2, and RMSE** metrics:\n  - **Electricity:** `RandomForestRegressor(numTrees=30, maxDepth=7)`\n  - **Gas:** `DecisionTreeRegressor(maxDepth=7, minInstancesPerNode=2)`\n\n### Data Visualization  \n- **MongoDB Charts Dashboard**: [View 
Charts](https://charts.mongodb.com/charts-bigdataasm2-szigrao/public/dashboards/67724c59-0c78-4054-8e6b-1061df46332b)  \n- Key visualizations:\n  - **Top 10 Cities by Electricity Annual Consumption (2018)**\n  - **Electricity Distribution of Connection Types (2019)**\n  - **Top 10 Cities by Gas Annual Consumption (2018)**\n  - **Gas Distribution of Connection Types (2019)**\n\n## Quick Start\n\n### Prerequisites\n\n- Databricks Community Account: [Sign up](https://community.cloud.databricks.com)\n- MongoDB Account: [Sign up](https://account.mongodb.com/account/login)\n\n### Create a compute cluster\n\n- On the sidebar, select **Compute**.\n- Select **Create compute**.\n- Enter the compute name: **Big Data Assignment 2's Cluster**.\n- Choose the Databricks Runtime version: **9.1 LTS ML (Scala 2.12, Spark 3.1.2)**.\n- Select **Create compute**.\n\n![task0-create-cluster.png](https://github.com/tom474/data_pipeline_with_databricks/blob/main/assets/task0-attach-cluster.png?raw=true)\n\n### Install the MongoDB Spark Connector library\n- On the navigation bar, select **Libraries**.\n- Select **Install new**.\n- For Library Source, select **Maven**.\n- For Coordinates, select **Select Packages**.\n- Select **Spark Packages**.\n- Search for and select **mongo-spark**, version **3.0.1**.\n- Select **Install**.\n\n![task0-install-library.png](https://github.com/tom474/data_pipeline_with_databricks/blob/main/assets/task0-install-library.png?raw=true)\n\n### Attach cluster to notebook\n- Import the notebook into Databricks using the `.dbc` or `.ipynb` file.\n- Select the notebook.\n- For **Connect**, select the created 
cluster.\n\n![task0-attach-cluster.png](https://github.com/tom474/data_pipeline_with_databricks/blob/main/assets/task0-attach-cluster.png?raw=true)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftom474%2Fdata_pipeline_with_databricks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftom474%2Fdata_pipeline_with_databricks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftom474%2Fdata_pipeline_with_databricks/lists"}