{"id":26187406,"url":"https://github.com/betkh/meterdata-wrangling-azuredatabricks","last_synced_at":"2025-03-11T23:49:46.443Z","repository":{"id":268949516,"uuid":"905952123","full_name":"BeTKH/MeterData-Wrangling-AzureDataBricks","owner":"BeTKH","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-19T21:38:04.000Z","size":5951,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-19T22:40:52.874Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BeTKH.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-19T21:00:17.000Z","updated_at":"2024-12-19T21:39:21.000Z","dependencies_parsed_at":"2024-12-19T22:40:59.450Z","dependency_job_id":"ce555cbd-99ba-4f1d-8fcf-6c049bc5818e","html_url":"https://github.com/BeTKH/MeterData-Wrangling-AzureDataBricks","commit_stats":null,"previous_names":["betkh/mtere-datawrangling-azuredatabricks"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BeTKH%2FMeterData-Wrangling-AzureDataBricks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BeTKH%2FMeterData-Wrangling-AzureDataBricks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BeTKH%2FMeterData-Wrangling-AzureDataBricks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BeTKH%2FMeterData-Wrangling-AzureDataBr
icks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BeTKH","download_url":"https://codeload.github.com/BeTKH/MeterData-Wrangling-AzureDataBricks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243131006,"owners_count":20241177,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-11T23:49:45.885Z","updated_at":"2025-03-11T23:49:46.426Z","avatar_url":"https://github.com/BeTKH.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## PySpark Data Wrangling: Electrical Meter Reading Data\n\nThis project demonstrates data wrangling and analysis using PySpark in Azure Databricks, focusing on cleaning and transforming a mock dataset from an electrical meter reading system. 
It also showcases querying the cleaned dataset to answer specific analytical questions.\n\n---\n\n## Project Overview\n\n### Dataset Details\n\n- **Source**: Data from an electrical meter reading system.\n- **Structure**: Each row contains:\n  - Customer and meter information.\n  - Hourly readings (24 columns) with corresponding QC codes.\n- **QC Code**: Only readings with QC code `3` are valid.\n\nAdditional metadata about customers and meters is provided in `CustMeter.csv`.\n\n---\n\n## Key Steps\n\n### Data Cleaning Tasks\n\nUsing PySpark, the raw data is cleaned and transformed to meet the following requirements:\n\n- **Wide to Long Format**: Each hourly reading is converted into its own row, with columns `IntervalHour` (1-24), `QCCode`, and `IntervalValue`.\n- **Filter Criteria**:\n  - Retained valid data types (`KWH`, `UNITS`, `Signed Net in Watts`, `Fwd Consumption in Watts`).\n  - Removed bad QC codes (anything other than `3`).\n  - Eliminated duplicates.\n- **Sorting**: Data is sorted by customer, meter, datatype, date, and interval hour.\n    \u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/RawData.png\" alt=\"Data Cleaning-parquet\" width=\"\"\u003e\u003c/p\u003e\n    \u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/CleanMeterData.png\" alt=\"Data Cleaning-parquet\" width=\"\"\u003e\u003c/p\u003e\n\n### Cleaned Data Saved in Azure Blob Storage\n\n- Output is saved in two formats: `CSV` and `Parquet`.\n\n    \u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/Screenshot_Parquet.png\" alt=\"Data Cleaning-parquet\" width=\"\"\u003e\u003c/p\u003e\n    \u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/Screenshot_CSV.png\" alt=\"Data Cleaning -csv\" width=\"\"\u003e\u003c/p\u003e\n    \u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/Schema.png\" alt=\"Data Model\" width=\"350\"\u003e\u003c/p\u003e\n\n### Analysis\n\nThe cleaned dataset is analyzed to answer key business questions:\n\n- Questions are addressed by 
querying the cleaned Parquet dataset.\n- Results are standardized into a DataFrame and exported as a CSV file for review.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"screenshots/analysisResult.png\" alt=\"Data Analysis\" width=\"\"\u003e\u003c/p\u003e\n\n---\n\n## Deliverables\n\n- **Cleaned Data**: CSV and Parquet files stored in Azure Storage.\n- **Analysis Results**: CSV file with answers to analytical queries.\n- **Notebooks**: PySpark notebooks for data cleaning (`CleanMeterData.py`) and analysis (`AnalyzeData.py`).\n\nThis project highlights proficiency in PySpark, data transformation, and cloud-based data workflows in the context of electrical meter reading data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbetkh%2Fmeterdata-wrangling-azuredatabricks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbetkh%2Fmeterdata-wrangling-azuredatabricks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbetkh%2Fmeterdata-wrangling-azuredatabricks/lists"}