https://github.com/aruadecarvalho/impact-insight
GCP data pipeline predicting power outages in São Paulo based on weather data. Ingests CSV data into BigQuery, pulls real-time weather via Cloud Functions and Pub/Sub, and uses ARIMA and LightGBM models for predictions. Results are visualized in Looker Studio. Infrastructure is deployed with Terraform.
- Host: GitHub
- URL: https://github.com/aruadecarvalho/impact-insight
- Owner: aruadecarvalho
- License: mit
- Created: 2024-09-15T17:45:43.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-10-06T17:05:20.000Z (9 months ago)
- Last Synced: 2025-01-24T14:16:34.496Z (5 months ago)
- Topics: bq, bqml, etl-pipeline, gcp, gcp-cloud-functions, gcp-pubsub, gcp-pubsub-bigquery, python, terraform
- Language: HCL
- Homepage:
- Size: 348 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Power Outage Prediction Pipeline
## Project Goal
This project implements an end-to-end data pipeline on Google Cloud Platform (GCP) that uses historical weather data (precipitation and temperature) to predict power outage complaints across the state of São Paulo. The pipeline is designed to:
1. Ingest historical weather data from a CSV file and populate a BigQuery table.
2. Periodically pull real-time weather data from an external API via a Cloud Function, triggered by Cloud Scheduler.
3. Pass the real-time data through Pub/Sub into BigQuery for continuous analysis.
4. Use BigQuery ML (BQML) to train two models:
   - **ARIMA** for forecasting weather (precipitation and temperature).
   - **LightGBM** for predicting power outages based on weather conditions.
5. Visualize the results and predictions through a Looker Studio dashboard.

## Pipeline Overview
1. **Batch Process**: Initial ingestion of historical weather data from a CSV into BigQuery.
2. **Streaming Process**: Regular ingestion of real-time weather data using Cloud Functions and Pub/Sub.
3. **Machine Learning**: BQML is used to train models that predict power outages from weather data.
4. **Visualization**: Power outage predictions are visualized using Looker Studio.

## Project Structure
```
/bqml
├── arima_*.sql                    # ARIMA model script for weather data forecasting
└── lightgbm_weather_metrics.sql   # LightGBM model script for predicting power outages
/infra
├── main.tf                        # Main Terraform configuration file
├── variables.tf                   # Variable definitions for Terraform
└── example.tfvars                 # Example of the variables needed to run the infrastructure
```

## Data Ingestion Process

1. **Batch Ingestion**: The pipeline starts by ingesting historical data from a CSV file, which contains precipitation, temperature, and power outage complaint records across the state of São Paulo. This data is loaded into BigQuery for model training.
2. **Streaming Ingestion**: A GCP Cloud Function, triggered by Cloud Scheduler, periodically pulls real-time weather data from an external API. This data is sent to a Pub/Sub topic, which in turn inserts the data into BigQuery for continuous model predictions.

## Machine Learning Models
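In BigQuery ML, an ARIMA forecasting model of the kind this section covers is created with a `CREATE MODEL` statement that sets `model_type = 'ARIMA_PLUS'`, and forecasts are read back with `ML.FORECAST`. The Python sketch below only assembles such statements as strings. The helper names, dataset, table, and column names are placeholders chosen for illustration, not the ones used in the `/bqml` scripts, and actually running the statements would require the BigQuery console or client.

```python
def arima_training_sql(model: str, source_table: str,
                       timestamp_col: str, value_col: str) -> str:
    """Build a BQML statement that trains an ARIMA_PLUS model on one series.

    All identifiers are interpolated as-is, so only pass trusted values.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        "OPTIONS(\n"
        "  model_type = 'ARIMA_PLUS',\n"
        f"  time_series_timestamp_col = '{timestamp_col}',\n"
        f"  time_series_data_col = '{value_col}'\n"
        ") AS\n"
        f"SELECT {timestamp_col}, {value_col}\n"
        f"FROM `{source_table}`"
    )


def arima_forecast_sql(model: str, horizon: int) -> str:
    """Build a query that forecasts `horizon` future points from the model."""
    if horizon < 1:
        raise ValueError("horizon must be a positive integer")
    return (
        "SELECT forecast_timestamp, forecast_value\n"
        f"FROM ML.FORECAST(MODEL `{model}`, STRUCT({horizon} AS horizon))"
    )


if __name__ == "__main__":
    # Placeholder names: replace with the project's real dataset and columns.
    print(arima_training_sql("weather.precip_arima", "weather.history",
                             "observed_at", "precipitation_mm"))
    print(arima_forecast_sql("weather.precip_arima", 24))
```

One model per weather variable (precipitation, temperature) keeps each series univariate, which is what `ARIMA_PLUS` expects.
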
- **ARIMA Model**: This model forecasts future weather conditions (precipitation and temperature) based on historical data.
- **LightGBM Model**: This model predicts the number of power outage complaints, based on the weather forecasts (precipitation and temperature), and feeds these predictions back into the pipeline for analysis.

## Infrastructure Deployment
The infrastructure for this project is provisioned using Terraform. It sets up the necessary GCP resources such as BigQuery, Cloud Functions, Pub/Sub, and Cloud Scheduler.
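
To make the runtime role of these resources more concrete, here is a minimal sketch of the logic a scheduled Cloud Function could run when publishing a weather reading to Pub/Sub. It is not the repository's actual function: the topic path, field names, and helper names are assumptions, and the publisher is passed in as a plain callable so the example runs without any GCP client library. In a deployed function that callable would typically wrap `google.cloud.pubsub_v1.PublisherClient().publish(topic_path, data)`.

```python
import json
from datetime import datetime, timezone
from typing import Callable

# Hypothetical topic path; the real one would come from the Terraform setup.
TOPIC_PATH = "projects/my-project/topics/weather-readings"


def build_message(city: str, precipitation_mm: float, temperature_c: float,
                  observed_at: datetime) -> bytes:
    """Serialize one weather reading as UTF-8 JSON bytes for Pub/Sub."""
    payload = {
        "city": city,
        "precipitation_mm": precipitation_mm,
        "temperature_c": temperature_c,
        # Normalize to UTC; pass a timezone-aware datetime to avoid ambiguity.
        "observed_at": observed_at.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def publish_reading(publish: Callable[[str, bytes], object],
                    reading: dict) -> bytes:
    """Encode a reading and hand it to the injected publish callable."""
    data = build_message(**reading)
    publish(TOPIC_PATH, data)
    return data


if __name__ == "__main__":
    sent = []  # stand-in for Pub/Sub: collect what would have been published
    publish_reading(
        lambda topic, data: sent.append((topic, data)),
        {
            "city": "São Paulo",
            "precipitation_mm": 12.5,  # made-up value for illustration
            "temperature_c": 21.0,     # made-up value for illustration
            "observed_at": datetime(2024, 10, 1, 12, 0, tzinfo=timezone.utc),
        },
    )
    print(sent[0][1].decode("utf-8"))
```

Injecting the publisher keeps the serialization logic testable locally, independent of credentials or network access.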
### Terraform Variables
To deploy the infrastructure, you'll need to set the following variables in your `.tfvars` file:
```hcl
gcp_project_id = ""
region = ""
bucket_name = ""
terraform_sa_email = ""
weather_api_key = ""
```

## Visualizing the Data
Once the pipeline is set up and running, data predictions and insights are visualized using Looker Studio. The dashboard provides a real-time view of predicted power outages and allows users to track correlations between weather conditions and service complaints.
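
Outside the dashboard, the same weather-versus-complaints relationship can be checked numerically. The sketch below computes a Pearson correlation coefficient with the standard library only. The function name is illustrative, and the sample numbers in the usage section are invented purely to show the call; they are not results from this project.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented daily values, for illustration only.
    precipitation_mm = [0.0, 5.0, 12.0, 30.0, 2.0]
    complaints = [3, 4, 9, 20, 3]
    print(f"r = {pearson(precipitation_mm, complaints):.3f}")
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 suggests that precipitation alone explains little of the variation in complaints.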
## How to Run the Project
### Prerequisites
- Google Cloud Platform account.
- Terraform installed on your local machine.
- Access to Looker Studio for data visualization.

### Steps
1. **Set up Infrastructure**:
   - Clone the repository.
   - Navigate to the `/infra` folder.
   - Fill in the necessary variables in your `.tfvars` file.
   - Run `terraform init` to initialize the project.
   - Run `terraform apply` to deploy the infrastructure.
2. **Ingest Data**:
   - Place the historical weather data CSV in the configured GCS bucket.
   - Run the initial ingestion script to populate BigQuery with historical data.
3. **Set Up Cloud Functions and Scheduler**:
   - Deploy the Cloud Function to pull real-time data from the external API.
   - Set up Cloud Scheduler to trigger the function periodically.
4. **Train and Use Models**:
   - Use the scripts in `/bqml` to train the ARIMA and LightGBM models in BigQuery.
   - Predictions from the models will be stored in BigQuery tables and can be accessed for further analysis.
5. **Visualize Data**:
   - Connect BigQuery tables to Looker Studio to create a dashboard.
   - Monitor real-time power outage predictions based on weather data.

## Future Enhancements
- Extend the pipeline to support additional regions beyond São Paulo.
- Enhance the machine learning model by incorporating more features or using alternative algorithms.
- Improve visualization with more detailed insights and breakdowns.

## Authors
- [@mateuscazuza](https://github.com/mateuscazuza)
- [@ADIANA2](https://github.com/ADIANA2)
- [@aruadecarvalho](https://github.com/aruadecarvalho)