{"id":51373180,"url":"https://github.com/stephenbaraik/stormwatch-ai","last_synced_at":"2026-07-03T09:08:46.792Z","repository":{"id":367709422,"uuid":"1282007563","full_name":"stephenbaraik/stormwatch-ai","owner":"stephenbaraik","description":"Extreme Weather Early Warning System — Real data pipeline from Open-Meteo to Supabase, three XGBoost models for cyclone/heatwave/extreme rainfall prediction, and a FastAPI serving layer.","archived":false,"fork":false,"pushed_at":"2026-06-27T09:09:31.000Z","size":87,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-27T09:20:01.797Z","etag":null,"topics":["climate","data-pipeline","extreme-weather","fastapi","india-weather","machine-learning","open-meteo","supabase","weather-prediction","xgboost"],"latest_commit_sha":null,"homepage":"https://wayvknnxfhtdbkkozlld.supabase.co","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stephenbaraik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-27T07:38:18.000Z","updated_at":"2026-06-27T09:09:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/stephenbaraik/stormwatch-ai","commit_stats":null,"previous_names":["stephenbaraik/stormwatch-ai"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/stephenbaraik/stormwatch-ai","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stephenbaraik%2Fstormwatch-ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stephenbaraik%2Fstormwatch-ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stephenbaraik%2Fstormwatch-ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stephenbaraik%2Fstormwatch-ai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stephenbaraik","download_url":"https://codeload.github.com/stephenbaraik/stormwatch-ai/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stephenbaraik%2Fstormwatch-ai/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35079496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-03T02:00:05.635Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["climate","data-pipeline","extreme-weather","fastapi","india-weather","machine-learning","open-meteo","supabase","weather-prediction","xgboost"],"created_at":"2026-07-03T09:08:45.959Z","updated_at":"2026-07-03T09:08:46.785Z","avatar_url":"https://github.com/stephenbaraik.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StormWatch AI — Extreme Weather Early Warning System\n\n[![Data Pipeline](https://github.com/stephenbaraik/stormwatch-ai/actions/workflows/data-pipeline.yml/badge.svg)](https://github.com/stephenbaraik/stormwatch-ai/actions/workflows/data-pipeline.yml)\n\nStormWatch AI downloads **real weather data** from the [Open-Meteo Archive API](https://archive-api.open-meteo.com/), ingests it into **Supabase** via an overnight cron pipeline, and trains **three XGBoost models** to predict extreme weather events across 15 Indian cities.\n\n**No synthetic data.** Every prediction is backed by 16+ years of historical meteorological records.\n\n| Model | Task | Accuracy | ROC-AUC |\n|-------|------|----------|---------|\n| **Cyclone Intensity** | Saffir-Simpson category (0–5) | **98.9%** | — |\n| **Heatwave Detection** | Heatwave flag (binary) | **99.4%** | **0.9982** |\n| **Extreme Rainfall** | 95th percentile exceedance (binary) | **97.5%** | **0.9744** |\n\n---\n\n## Table of Contents\n\n- [Architecture](#architecture)\n- [Models](#models)\n- [Cities Covered](#cities-covered)\n- [Data Pipeline](#data-pipeline)\n- [PySpark ETL](#pyspark-etl)\n- [API](#api)\n- [Setup](#setup)\n- [Usage](#usage)\n- [Project Structure](#project-structure)\n- [Configuration](#configuration)\n- [Testing](#testing)\n- [Deployment](#deployment)\n\n---\n\n## Architecture\n\n```\nOpen-Meteo Archive API\n        │\n        ▼\n┌──────────────────┐     ┌──────────────┐     ┌──────────────────┐\n│  Data Pipeline   │────▶│   Supabase   │────▶│  XGBoost Models  │\n│ (GitHub Actions) │     │ (PostgreSQL) │     │  3 classifiers   │\n│  cron @ 2 AM IST │     │  weather_data│     │                  │\n│  yearly chunks   │     │  download_   │     │  cyclone         │\n│  5s/chunk delay  │     │  batches     │     │  heatwave        │\n│  20s/city delay  │     │              │     │  extreme_rainfall│\n└──────────────────┘     └──────────────┘     └──────────────────┘\n        │                                              │\n        ▼                                              ▼\n┌──────────────────┐                          ┌──────────────────┐\n│  PySpark ETL     │                          │  FastAPI Server  │\n│  Window functions│                          │  /predict/*      │\n│  Parquet output  │                          │  /monitor/drift  │\n└──────────────────┘                          │  /health         │\n                                              └──────────────────┘\n```\n\n### Rate Limit Protection\n\nOpen-Meteo enforces a **5,000 API calls per hour** limit. Each variable-day-location combination counts as one call. The pipeline avoids hitting this with three safeguards:\n\n| Measure | Value | Why |\n|---|---|---|\n| **Yearly chunking** | 16 single-year chunks per city | A single 16-year request for 15 variables costs ~430 calls; yearly chunks cost ~39 each. Under the 5,000/hr limit for all 15 cities. |\n| **Chunk delay** | 5 seconds between chunks | Prevents burst throttling within a city's download. |\n| **City delay** | 20 seconds between cities | Spreads load evenly across the full city list. |\n| **Retry backoff** | Exponential: 10s, 20s, 40s | If a chunk fails, waits longer before retrying. |\n| **Total call estimate** | ~240 calls per full run | Well within the 5,000/hour budget. |\n\nAll delays are configurable in [`configs/config.yaml`](#configuration).\n\n---\n\n## Models\n\nThree XGBoost classifiers trained exclusively on real data from Supabase:\n\n### 1. Cyclone Intensity (Multi-class)\n\nPredicts Saffir-Simpson category (0–5) from atmospheric features.\n\n| Feature | Source |\n|---|---|\n| `lat`, `lon` | Coordinates |\n| `wind_max` | Max wind speed |\n| `pressure_min` | Min pressure |\n| `wind_gust` | Wind gust speed |\n| `pressure_trend` | Pressure change |\n| `wind_trend` | Wind speed change |\n| `year`, `month` | Temporal features |\n| `lat_abs` | Absolute latitude |\n\n### 2. Heatwave Prediction (Binary)\n\nFlags whether a day qualifies as a heatwave based on temperature thresholds.\n\n**Labeling rule:** A day is flagged as a heatwave when the mean temperature exceeds the 90th percentile of the trailing 30-day window.\n\n### 3. Extreme Rainfall (Binary)\n\nFlags extreme precipitation events.\n\n**Labeling rule:** A day is flagged as extreme when precipitation exceeds the 95th percentile of the trailing 30-day window.\n\n\u003e **Training requires real data in Supabase.** Run the data pipeline first (see [Data Pipeline](#data-pipeline)).\n\n---\n\n## Cities Covered\n\n15 cities spanning India's climate zones:\n\n| City | State | Zone | Coordinates |\n|---|---|---|---|\n| Mumbai | Maharashtra | Coastal | 19.08°N, 72.88°E |\n| Chennai | Tamil Nadu | Coastal | 13.08°N, 80.27°E |\n| Kolkata | West Bengal | Coastal | 22.57°N, 88.36°E |\n| Delhi | Delhi | Inland | 28.70°N, 77.10°E |\n| Ahmedabad | Gujarat | Arid | 23.02°N, 72.57°E |\n| Hyderabad | Telangana | Inland | 17.39°N, 78.49°E |\n| Bengaluru | Karnataka | Inland | 12.97°N, 77.59°E |\n| Kochi | Kerala | Coastal | 9.93°N, 76.27°E |\n| Bhubaneswar | Odisha | Coastal | 20.30°N, 85.82°E |\n| Jaipur | Rajasthan | Arid | 26.91°N, 75.79°E |\n| Lucknow | Uttar Pradesh | Inland | 26.85°N, 80.95°E |\n| Guwahati | Assam | Humid | 26.14°N, 91.74°E |\n| Pune | Maharashtra | Inland | 18.52°N, 73.86°E |\n| Visakhapatnam | Andhra Pradesh | Coastal | 17.69°N, 83.22°E |\n| Surat | Gujarat | Coastal | 21.17°N, 72.83°E |\n\n---\n\n## Data Pipeline\n\nThe pipeline downloads 15 weather variables from Open-Meteo's archive API for each city from 2010-01-01 to yesterday:\n\n`temperature_2m_max`, `temperature_2m_min`, `temperature_2m_mean`, `precipitation_sum`, `rain_sum`, `snowfall_sum`, `precipitation_hours`, `wind_speed_10m_max`, `wind_gusts_10m_max`, `wind_direction_10m_dominant`, `pressure_msl_mean`, `relative_humidity_2m_mean`, `cloud_cover_mean`, `shortwave_radiation_sum`, `et0_fao_evapotranspiration`\n\n### Flow\n\n1. **Download** — Open-Meteo archive API, yearly chunks with pacing delays\n2. **Preprocess** — Label extreme events (heatwave/rainfall/cyclone thresholds), build feature columns\n3. **Upload** — Upsert to Supabase `weather_data` table with batch tracking\n4. **CSV backup** — Individual city files saved as GitHub Actions artifacts (7-day retention)\n\n### Schedule\n\n| Trigger | Time | Mechanism |\n|---|---|---|\n| **Daily cron** | 2:00 AM IST (20:30 UTC) | GitHub Actions scheduled workflow |\n| **Manual** | Any time | GitHub → Actions → Data Pipeline → Run workflow |\n\n### Supabase Schema\n\nTwo tables in the `public` schema:\n\n**`weather_data`** — One row per city per day with all 15 weather variables, coordinates, extreme event flags, and batch metadata. Unique constraint on `(city, time)` for safe re-runs.\n\n**`download_batches`** — Tracks each pipeline run: start time, completion time, cities processed, rows ingested, status, and error messages.\n\n---\n\n## PySpark ETL\n\nAn **Apache PySpark** ETL layer sits alongside the pandas pipeline, demonstrating distributed data processing. It reads the same city CSVs and produces partitioned Parquet output with identical feature engineering:\n\n```\nCSVs (14 cities, 84K rows)\n        │\n        ▼\n┌───────────────────────────────┐\n│  PySpark ETL (spark_etl.py)  │\n│  1. Read CSVs + rename cols  │\n│  2. Seasonal sin/cos encode  │\n│  3. Heatwave streak (Window) │\n│  4. Percentile thresholds    │\n│  5. Lag features (1, 3, 7)   │\n│  6. Rolling mean/std (3, 7)  │\n│  7. Write partitioned Parquet│\n└───────────────────────────────┘\n        │\n        ▼\nweather_pyspark.parquet/ (partitioned by city)\n```\n\n### Requirements\n\n- **JDK 21** (Spark 4.x is incompatible with JDK 24+)\n- PySpark (included in `requirements.txt`)\n\n### Running\n\n```bash\nexport JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64\npython -m stormwatch.data.spark_etl\n# Output: data/processed/weather_pyspark.parquet/\n```\n\n### Training on PySpark Output\n\n```python\nimport pandas as pd\nfrom stormwatch.models.train import train_heatwave_model, train_rainfall_model\n\ndf = pd.read_parquet(\"data/processed/weather_pyspark.parquet\")\nhw_model = train_heatwave_model(df, use_hyperopt=False)\nrf_model = train_rainfall_model(df, use_hyperopt=False)\n```\n\n---\n\n## API\n\nFastAPI server with interactive docs at `/docs`.\n\n### Endpoints\n\n| Method | Path | Description |\n|---|---|---|\n| GET | `/health` | Health check + loaded models |\n| GET | `/models` | List loaded models |\n| POST | `/predict/cyclone` | Cyclone intensity (category 0–5) |\n| POST | `/predict/heatwave` | Heatwave probability + severity |\n| POST | `/predict/rainfall` | Extreme rainfall probability |\n| POST | `/monitor/drift` | Data drift check |\n\n### Prediction Response Format\n\n```json\n{\n  \"model\": \"cyclone_intensity\",\n  \"prediction\": {\n    \"category\": 3,\n    \"description\": \"Category 3 (111-129 mph)\",\n    \"probabilities\": {\n      \"0\": 0.02, \"1\": 0.08, \"2\": 0.15,\n      \"3\": 0.55, \"4\": 0.18, \"5\": 0.02\n    },\n    \"wind_kts\": 90.0,\n    \"confidence\": 0.55\n  }\n}\n```\n\n---\n\n## Setup\n\n### Prerequisites\n\n- Python 3.13+\n- A [Supabase](https://supabase.com) project (free tier works)\n- JDK 21 (for PySpark ETL — optional)\n- Node.js 18+ (for PDF report generation — optional)\n- GitHub account (for Actions cron)\n\n### Local Setup\n\n```bash\n# Clone and enter\ngit clone https://github.com/stephenbaraik/stormwatch-ai.git\ncd stormwatch-ai\n\n# Create virtualenv and install\nmake setup\n\n# Or manually:\npython3 -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n\n# Configure environment\ncp .env.example .env\n# Edit .env with your Supabase URL and service role key\n```\n\n### Supabase Setup\n\n1. Create a project at [supabase.com](https://supabase.com)\n2. Go to **SQL Editor**, paste and run [`stormwatch/database/schema.sql`](stormwatch/database/schema.sql)\n3. Get your credentials from **Project Settings → API**:\n   - `SUPABASE_URL` — Project URL (e.g. `https://xxxx.supabase.co`)\n   - `SUPABASE_SERVICE_KEY` — `service_role` secret (not the anon key)\n\n### GitHub Secrets\n\nFor the cron pipeline to upload to Supabase, set these in your repo:\n**Settings → Secrets and variables → Actions → New repository secret**\n\n- `SUPABASE_URL` — Your Supabase project URL\n- `SUPABASE_SERVICE_KEY` — Your service role key\n\n---\n\n## Usage\n\n### Run the Data Pipeline\n\nDownloads weather data, preprocesses it, and uploads to Supabase:\n\n```bash\n# Full pipeline (download → preprocess → supabase → CSV)\nmake pipeline\n\n# Or with explicit options\npython -m stormwatch.data.pipeline \\\n  --start-date 2010-01-01 \\\n  --output-dir data/raw\n\n# Skip Supabase upload (CSV only)\npython -m stormwatch.data.pipeline --no-upload\n```\n\nManual trigger from GitHub:\n1. Go to your repo → **Actions** → **Data Pipeline**\n2. Click **Run workflow**\n3. (Optional) Set a custom start date or check \"Full refresh\"\n\n### Train Models\n\nRequires data in Supabase first:\n\n```bash\nmake train\n```\n\nOr step by step:\n\n```bash\n# Train all three models\npython -m stormwatch.models.train\n\n# Train individual models\npython -c \"from stormwatch.models.train import train_all; train_all()\"\n```\n\n### Run the API\n\n```bash\nmake api\n# Or manually:\nuvicorn stormwatch.api.server:app --reload --host 0.0.0.0 --port 8000\n```\n\nOpen [http://localhost:8000/docs](http://localhost:8000/docs) for the interactive Swagger UI.\n\n### Monitor Drift\n\n```bash\nmake monitor\n```\n\n### Generate Report Figures\n\n```bash\npython scripts/generate_figures.py\n# Output: docs/figures/ (11 PNG visualizations)\n```\n\n### Generate PDF Report\n\nConverts the markdown report to a styled PDF with embedded figures:\n\n```bash\nnpm install          # one-time: install marked + puppeteer\nnode scripts/convert_to_pdf.mjs\n# Output: docs/end_to_end_report.pdf\n```\n\n---\n\n## Project Structure\n\n```\nstormwatch-ai/\n├── .github/workflows/       # CI/CD\n│   ├── ci.yml               # Lint → Test → Docker build\n│   └── data-pipeline.yml    # Daily cron @ 2 AM IST\n├── configs/\n│   └── config.yaml          # Runtime configuration\n├── data/\n│   ├── raw/                 # Downloaded CSV files (gitignored)\n│   └── processed/           # PySpark Parquet output (gitignored)\n├── docs/\n│   ├── end_to_end_report.md # Full ML report (1000+ lines)\n│   └── figures/             # 11 generated visualizations\n├── models/                  # Trained model pickles (gitignored)\n├── scripts/\n│   ├── convert_to_pdf.mjs   # Markdown → PDF (Puppeteer)\n│   ├── generate_figures.py  # Report figure generation\n│   └── check_supabase.py    # Supabase connectivity check\n├── stormwatch/              # Main package\n│   ├── api/                 # FastAPI server + request/response schemas\n│   │   ├── schemas.py\n│   │   └── server.py\n│   ├── data/                # Data pipeline\n│   │   ├── download.py      # Open-Meteo + IBTrACS downloaders\n│   │   ├── pipeline.py      # Orchestrator (download → preprocess → upload)\n│   │   ├── preprocess.py    # Extreme event labeling + feature engineering\n│   │   └── spark_etl.py     # PySpark ETL: Window features + Parquet\n│   ├── database/            # Supabase client + schema\n│   │   ├── schema.sql\n│   │   └── supabase_client.py\n│   ├── features/            # Feature engineering\n│   │   └── builder.py\n│   ├── models/              # XGBoost model definitions + training\n│   │   ├── base.py\n│   │   ├── cyclone.py\n│   │   ├── heatwave.py\n│   │   ├── rainfall.py\n│   │   └── train.py\n│   ├── monitor/             # Data drift monitoring\n│   │   └── drift.py\n│   ├── config.py            # Pydantic-typed config loader\n│   └── logger.py            # Logging setup\n├── tests/                   # Test suite (80 tests)\n│   ├── conftest.py\n│   ├── test_api.py\n│   ├── test_config.py\n│   ├── test_models.py\n│   └── test_monitor.py\n├── .env.example             # Environment template\n├── .gitignore\n├── docker-compose.yml\n├── Dockerfile\n├── Makefile                 # Common commands\n├── package.json             # Node deps (marked, puppeteer)\n├── pyproject.toml\n├── README.md\n└── requirements.txt\n```\n\n---\n\n## Configuration\n\nAll tunable parameters in [`configs/config.yaml`](configs/config.yaml):\n\n```yaml\ndata:\n  openmeteo:\n    start_date: \"2010-01-01\"\n    timezone: \"Asia/Kolkata\"\n    retry_attempts: 3\n    retry_delay_seconds: 10\n    city_delay_seconds: 20      # ⬅ Pacing between cities\n    chunk_delay_seconds: 5.0    # ⬅ Pacing between yearly chunks\n\nmodels:\n  cyclone_intensity:\n    type: \"multiclass\"\n    target: \"category\"\n    hyperopt_evals: 50\n  heatwave_prediction:\n    type: \"binary\"\n    target: \"heatwave_occurred\"\n    hyperopt_evals: 50\n  extreme_rainfall:\n    type: \"binary\"\n    target: \"extreme_rainfall\"\n    hyperopt_evals: 50\n\ntraining:\n  mlflow_tracking_uri: \"sqlite:///mlflow/mlflow.db\"\n  experiment_name: \"stormwatch-ai\"\n  cv_folds: 5\n```\n\nEnvironment variables override config values using the pattern:\n```bash\nSTORMWATCH__DATA__OPENMETEO__CHUNK_DELAY_SECONDS=10\n```\n\n---\n\n## Testing\n\n```bash\n# Run all tests\nmake test\n\n# Or manually\npytest tests/ -v\n\n# With coverage\npytest tests/ -v --cov=stormwatch\n```\n\nThe test suite uses mock data fixtures (no external API calls). 80+ tests covering:\n- Data preprocessing and extreme event labeling\n- Model training and prediction interfaces\n- API endpoints (request/response serialization)\n- Config loading and validation\n- Drift monitoring logic\n\n---\n\n## Deployment\n\n### Docker\n\n```bash\n# Build and run the full stack\nmake docker-up\n\n# Or manually\ndocker compose up -d\n```\n\n### GitHub Actions (Data Pipeline — already configured)\n\nThe pipeline runs daily at 2:00 AM IST. Monitor runs at:\n- GitHub → Actions → Data Pipeline\n\n### API Deploy (Render / Railway / Fly)\n\nThe FastAPI server (`stormwatch.api.server:app`) can be deployed to any platform. Ensure:\n- Environment variables `SUPABASE_URL` and `SUPABASE_SERVICE_KEY` are set\n- Trained model pickles are available in the `models/` directory\n\n---\n\n## Monitoring\n\nThe drift detection module uses the **Kolmogorov-Smirnov two-sample test** (`scipy.stats.ks_2samp`) to compare recent predictions against a reference window. Alerts fire when ≥1/3 of features show statistically significant drift (p \u003c 0.05).\n\n```bash\n# Run drift check\npython -m stormwatch.monitor.drift\n\n# Via API\ncurl -X POST \"http://localhost:8000/monitor/drift?model_name=cyclone\"\n```\n\nAll predictions are logged to SQLite (`mlflow/monitor.db`) for drift analysis.\n\n---\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstephenbaraik%2Fstormwatch-ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstephenbaraik%2Fstormwatch-ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstephenbaraik%2Fstormwatch-ai/lists"}