https://github.com/betkh/meterdata-wrangling-azuredatabricks
- Host: GitHub
- URL: https://github.com/betkh/meterdata-wrangling-azuredatabricks
- Owner: BeTKH
- Created: 2024-12-19T21:00:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-19T21:38:04.000Z (over 1 year ago)
- Last Synced: 2024-12-19T22:40:52.874Z (over 1 year ago)
- Language: Python
- Size: 5.68 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
## PySpark Data Wrangling: Electrical Meter Reading Data
This project demonstrates data wrangling and analysis using PySpark in Azure Databricks, focusing on cleaning and transforming a mock dataset from an electrical meter reading system. It also showcases querying the cleaned dataset to answer specific analytical questions.
---
## Project Overview
### Dataset Details
- **Source**: Data from an electrical meter reading system.
- **Structure**: Each row contains:
  - Customer and meter information.
  - Hourly readings (24 columns) with corresponding QC codes.
- **QC Code**: Only readings with QC code `3` are valid.
Additional metadata about customers and meters is provided in `CustMeter.csv`.
---
## Key Steps
### Data Cleaning Tasks
Using PySpark, the raw data is cleaned and transformed to meet the following requirements:
- **Wide to Long Format**: Each hourly reading is converted into its own row, with columns for `IntervalHour` (1-24), `QCCode`, and `IntervalValue`.
- **Filter Criteria**:
  - Only valid data types are retained (`KWH`, `UNITS`, `Signed Net in Watts`, `Fwd Consumption in Watts`).
  - Rows with bad QC codes (anything other than `3`) are removed.
  - Duplicate rows are eliminated.
- **Sorting**: Data is sorted by customer, meter, datatype, date, and interval hour.
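
The cleaning steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `CleanMeterData.py`: the wide column names (`QC1`..`QC24`, `Interval1`..`Interval24`) and the key columns (`CustomerID`, `MeterID`, `DataType`, `ReadDate`) are hypothetical, since the raw schema is not listed in this README. The sketch uses Spark SQL's `stack` generator through `selectExpr` to unpivot the hourly columns, then applies the filters, de-duplication, and sort.

```python
# Hypothetical column names; adjust them to the real schema of the raw file.
KEY_COLS = ["CustomerID", "MeterID", "DataType", "ReadDate"]
VALID_TYPES = ["KWH", "UNITS", "Signed Net in Watts", "Fwd Consumption in Watts"]


def build_stack_expr(hours=24):
    """Build a Spark SQL stack() expression that unpivots the hourly columns."""
    parts = []
    for h in range(1, hours + 1):
        # One (hour, QC code, value) triple becomes one output row.
        parts.append(f"{h}, `QC{h}`, `Interval{h}`")
    return (
        f"stack({hours}, {', '.join(parts)}) "
        "as (IntervalHour, QCCode, IntervalValue)"
    )


def clean_meter_data(raw_df):
    """Clean a wide-format meter DataFrame (assumed to be a pyspark.sql.DataFrame)."""
    type_list = ", ".join(f"'{t}'" for t in VALID_TYPES)
    return (
        raw_df
        .filter(f"DataType IN ({type_list})")            # keep valid data types
        .selectExpr(*KEY_COLS, build_stack_expr())       # wide to long
        .filter("QCCode = 3")                            # keep valid readings only
        .dropDuplicates()                                # remove exact duplicates
        .orderBy(*KEY_COLS, "IntervalHour")              # customer, meter, type, date, hour
    )
```

Filtering on the data type before unpivoting keeps the `stack` step from expanding rows that would be discarded anyway. Note that `stack` expects the 24 QC columns to share one type and the 24 value columns to share another.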


### Saving Cleaned Data to Azure Blob Storage
- The output is saved in two formats: `CSV` and `Parquet`.
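
A minimal sketch of the save step, assuming the cluster is already configured with credentials for the storage account. The container name, account name, and `output/...` folder layout are placeholders, and `blob_path` and `save_cleaned` are hypothetical helpers, not names from this project's notebooks.

```python
def blob_path(container, account, relative_path):
    """Build a wasbs:// URI for a location in Azure Blob Storage."""
    clean = relative_path.lstrip("/")
    return f"wasbs://{container}@{account}.blob.core.windows.net/{clean}"


def save_cleaned(df, container, account):
    """Write the cleaned DataFrame (assumed pyspark.sql.DataFrame) as CSV and Parquet."""
    csv_path = blob_path(container, account, "output/CSV")
    parquet_path = blob_path(container, account, "output/Parquet")

    # CSV keeps a header row so the file is readable outside Spark.
    df.write.mode("overwrite").option("header", True).csv(csv_path)
    # Parquet preserves column types and is the format queried later.
    df.write.mode("overwrite").parquet(parquet_path)
    return csv_path, parquet_path
```

Spark writes each output as a directory of part files rather than a single file, so the paths above name folders.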



### Analysis
The cleaned dataset is analyzed to answer key business questions:
- Questions are addressed by querying the cleaned Parquet dataset.
- Results are standardized into a DataFrame and exported as a CSV file for review.
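
The analysis flow might look something like the sketch below. The actual business questions are not listed in this README, so the two queries here are invented purely for illustration; `to_result_rows`, `analyze`, the view name `meter_readings`, and the `Question`/`Answer` result columns are likewise assumptions. Only `IntervalHour` and `IntervalValue` come from the cleaned schema described above.

```python
def to_result_rows(answers):
    """Turn {question_id: answer} into uniform (Question, Answer) string rows."""
    return [(str(q), str(a)) for q, a in sorted(answers.items())]


def analyze(spark, parquet_path, csv_out_path):
    """Query the cleaned Parquet data; spark is assumed to be an active SparkSession."""
    cleaned = spark.read.parquet(parquet_path)
    cleaned.createOrReplaceTempView("meter_readings")

    # Hypothetical question 1: how many valid readings are there?
    total = spark.sql("SELECT COUNT(*) AS n FROM meter_readings").first()["n"]

    # Hypothetical question 2: which hour has the largest summed reading?
    top = spark.sql(
        "SELECT IntervalHour, SUM(IntervalValue) AS total "
        "FROM meter_readings GROUP BY IntervalHour "
        "ORDER BY total DESC LIMIT 1"
    ).first()

    answers = {"Q1": total, "Q2": top["IntervalHour"] if top else None}

    # Standardize every answer into the same two-column shape.
    results = spark.createDataFrame(
        to_result_rows(answers), "Question string, Answer string"
    )
    # coalesce(1) yields a single CSV part file, which is easier to review.
    results.coalesce(1).write.mode("overwrite").option("header", True).csv(csv_out_path)
    return results
```

Casting every answer to a string is one simple way to fit counts, hours, and other result types into a single results table.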

---
## Deliverables
- **Cleaned Data**: CSV and Parquet files stored in Azure Storage.
- **Analysis Results**: CSV file with answers to analytical queries.
- **Notebooks**: PySpark notebooks for data cleaning (`CleanMeterData.py`) and analysis (`AnalyzeData.py`).
This project highlights proficiency in PySpark, data transformation, and cloud-based data workflows in the context of electrical meter reading data.