https://github.com/tom474/data_pipeline_with_databricks
[RMIT 2024C] EEET2574 - Big Data for Engineering - MongoDB and Spark
data-engineering data-science data-visualization databricks jupyter-notebook machine-learning mongodb python spark
- Host: GitHub
- URL: https://github.com/tom474/data_pipeline_with_databricks
- Owner: tom474
- Created: 2024-12-23T15:44:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-16T16:14:51.000Z (over 1 year ago)
- Last Synced: 2025-02-16T16:41:28.319Z (over 1 year ago)
- Topics: data-engineering, data-science, data-visualization, databricks, jupyter-notebook, machine-learning, mongodb, python, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 15.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Pipeline with Databricks
This project focuses on building a data pipeline using **Databricks**, **PySpark**, and **MongoDB**, integrating **MLflow** for machine learning tracking. The pipeline handles data ingestion, transformation, model training, and visualization.
## Tech Stack
- Python
- Databricks
- PySpark
- MLflow
- MongoDB
## Features
### Data Preparation on MongoDB
- Pre-loaded datasets categorized by year and type (Electricity, Gas).
- Collections structure: `electricity_2018`, `electricity_2019`, `electricity_2020`, `gas_2018`, `gas_2019`, `gas_2020`.
### Data Ingestion
- Loading electricity and gas consumption data from MongoDB into PySpark DataFrames.
- Merging the 2018 and 2019 data into the training dataset and holding out the 2020 data for testing.
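
The ingestion step can be sketched as follows. This is a minimal sketch, assuming the Mongo Spark Connector 3.x data source name `mongo` and its `uri`, `database`, and `collection` read options. The helper names, the connection string, and the database name are placeholders for illustration and are not taken from the notebook.

```python
from functools import reduce

TRAIN_YEARS = (2018, 2019)
TEST_YEAR = 2020


def collection_name(kind, year):
    """Build a collection name such as 'electricity_2018'."""
    return f"{kind}_{year}"


def load_collection(spark, uri, database, collection):
    """Read one MongoDB collection into a Spark DataFrame."""
    return (
        spark.read.format("mongo")
        .option("uri", uri)
        .option("database", database)
        .option("collection", collection)
        .load()
    )


def load_train_test(spark, uri, database, kind):
    """Return (train_df, test_df) for kind 'electricity' or 'gas'."""
    train_parts = [
        load_collection(spark, uri, database, collection_name(kind, year))
        for year in TRAIN_YEARS
    ]
    # unionByName matches columns by name rather than by position.
    train_df = reduce(lambda left, right: left.unionByName(right), train_parts)
    test_df = load_collection(
        spark, uri, database, collection_name(kind, TEST_YEAR)
    )
    return train_df, test_df
```

In a Databricks notebook the `spark` session already exists, so a call would look like `train_df, test_df = load_train_test(spark, "<connection-string>", "<database>", "electricity")`.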
### Data Exploration
- Conversion of Spark DataFrames to Pandas for in-depth analysis.
- Shape, summary statistics, missing values, and outlier detection.
- Analysis of categorical and numerical features.
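
The checks above can be sketched on a pandas DataFrame as shown below. The README does not say which outlier rule the notebook uses, so the 1.5 × IQR rule here is an assumption, and the small hand-made frame only stands in for the result of `spark_df.toPandas()`. The column names are illustrative.

```python
import pandas as pd


def explore(pdf):
    """Summarise a pandas DataFrame: shape, missing values, IQR outliers."""
    numeric = pdf.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values lying more than 1.5 * IQR outside the quartiles.
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "shape": pdf.shape,
        "missing": pdf.isna().sum().to_dict(),
        "outliers": outliers.to_dict(),
        "numeric_columns": list(numeric.columns),
        "categorical_columns": list(
            pdf.select_dtypes(exclude="number").columns
        ),
    }


# Hypothetical sample standing in for `spark_df.toPandas()`.
sample = pd.DataFrame(
    {
        "city": ["A", "B", "C", "D", "E"],
        "annual_consume": [100.0, 110.0, 120.0, 130.0, 5000.0],
        "num_connections": [10, 12, None, 11, 13],
    }
)
report = explore(sample)
```

Summary statistics for the numeric columns are available separately through `sample.describe()`.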
### Data Cleaning
- Removal of columns with excessive missing values (`%Defintieve aansl (NRM)`).
- Dropping identifier and high-cardinality categorical columns (`_id`, `street`, `zipcode_from`, `zipcode_to`).
- Eliminating constant columns (`annual_consume_lowtarif_perc` for gas).
- Handling outliers using log transformation.
- Duplicate data removal.
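
A possible shape for these cleaning steps on a Spark DataFrame is sketched below. The function names, the choice of which columns to log-transform, and the use of `log1p` (log of 1 + x, which stays defined at zero) are assumptions; the README only says that a log transformation is applied. The sketch deduplicates after dropping `_id`, because a unique `_id` would otherwise make every row distinct.

```python
DROP_COLUMNS = [
    "%Defintieve aansl (NRM)",
    "_id",
    "street",
    "zipcode_from",
    "zipcode_to",
]


def log1p_expressions(columns, log_columns):
    """Build Spark SQL select expressions applying log1p to chosen columns."""
    return [
        f"log1p(`{name}`) AS `{name}`" if name in log_columns else f"`{name}`"
        for name in columns
    ]


def clean(df, log_columns, extra_drop=()):
    """Drop unwanted columns, remove duplicates, then log-transform."""
    # Drop `_id` first: while it is present no two rows compare equal.
    trimmed = df.drop(*DROP_COLUMNS, *extra_drop).dropDuplicates()
    return trimmed.selectExpr(
        *log1p_expressions(trimmed.columns, log_columns)
    )
```

For the gas data, the constant column could be passed through `extra_drop`, for example `clean(gas_df, ["annual_consume"], extra_drop=["annual_consume_lowtarif_perc"])`, where `gas_df` and the choice of `annual_consume` are hypothetical.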
### Data Transformation
- **Encoding categorical features** (One-Hot Encoding via `StringIndexer` & `OneHotEncoder`).
- **Scaling numerical features** using `MinMaxScaler`.
- **Feature engineering**: Combining transformed columns into a single feature vector.
### Model Training and Tracking
- Implemented **Random Forest Regressor** and **Decision Tree Regressor** models.
- **MLflow** integration for parameter tracking and performance logging.
- Best models selected based on **MAE, R2, and RMSE** metrics:
  - **Electricity:** `RandomForestRegressor(numTrees=30, maxDepth=7)`
  - **Gas:** `DecisionTreeRegressor(maxDepth=7, minInstancesPerNode=2)`
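
For reference, the three selection metrics can be written out in plain Python as below. Lower MAE and RMSE and a higher R2 indicate a better fit. This is only an illustration of the definitions; the notebook presumably computes them with Spark (for instance `RegressionEvaluator` with `metricName` set to `mae`, `rmse`, or `r2`), which is an assumption here.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R2 for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    total_var = sum((a - mean_actual) ** 2 for a in actual)
    # R2 is undefined when every actual value is identical.
    r2 = 1 - (mse * n) / total_var if total_var else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


# Made-up values, not results from the project.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

A dictionary of this shape can be passed to `mlflow.log_metrics` inside an active MLflow run to record all three values at once.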
### Data Visualization
- **MongoDB Charts Dashboard**: [View Charts](https://charts.mongodb.com/charts-bigdataasm2-szigrao/public/dashboards/67724c59-0c78-4054-8e6b-1061df46332b)
- Key visualizations:
  - **Top 10 Cities by Electricity Annual Consumption (2018)**
  - **Electricity Distribution of Connection Types (2019)**
  - **Top 10 Cities by Gas Annual Consumption (2018)**
  - **Gas Distribution of Connection Types (2019)**
## Quick Start
### Prerequisites
- Databricks Community Account: [Sign up](https://community.cloud.databricks.com)
- MongoDB Account: [Sign up](https://account.mongodb.com/account/login)
### Create a compute cluster
- On the sidebar, select **Compute**.
- Select **Create compute**.
- Enter compute name: **Big Data Assignment 2's Cluster**.
- Choose the Databricks Runtime version: **9.1 LTS ML (Scala 2.12, Spark 3.1.2)**.
- Select **Create compute**.

### Install the MongoDB Spark Connector library
- On the cluster's navigation bar, select **Libraries**.
- Select **Install new**.
- For Library Source, select **Maven**.
- For Coordinates, select **Select Packages**.
- Select **Spark Packages**.
- Search for **mongo-spark** and select version **3.0.1**.
- Select **Install**.

### Attach the cluster to the notebook
- Import the notebook into Databricks from the `.dbc` or `.ipynb` file.
- Open the notebook.
- In the **Connect** menu, select the cluster you created.
