https://github.com/bayoadejare/lightning-containers
Docker powered starter for geospatial analysis of lightning atmospheric data.
- Host: GitHub
- URL: https://github.com/bayoadejare/lightning-containers
- Owner: BayoAdejare
- License: apache-2.0
- Created: 2024-02-12T09:54:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-01T16:24:43.000Z (6 months ago)
- Last Synced: 2025-04-06T04:27:33.968Z (6 months ago)
- Topics: clustering-analysis, csv-files, data-engineer, data-engineering-pipeline, data-warehouse, databases, docker, jupyter, machine-learning-algorithms, noaa-weather, orchestrator, pandas, python3, spatialite, sqlite, streamlit-dashboard
- Language: Python
- Homepage: https://lightning-containers.streamlit.app/
- Size: 160 MB
- Stars: 6
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
## Table of Contents
- [Introduction](#introduction)
- [Project Structure](#project-structure)
- [Requirements](#requirements)
- [Installation](#installation)
- [ETL Flow](#etl-flow)
- [Clustering Flow](#clustering-flow)
- [Dashboard Map](#dashboard-map)
- [Testing](#testing)
- [CI/CD](#cicd)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Introduction
This is a monolithic Docker image to help you get started with geospatial analysis and visualization of lightning atmospheric data. The data comes from the US **National Oceanic and Atmospheric Administration (NOAA)** [Geostationary Lightning Mapper (GLM) - Data Product](https://www.goes-r.gov/products/baseline-lightning-detection.html), sourced from AWS S3 buckets. There are currently two main components:
1. ETL Ingestion - data ingestion and analysis processes.
2. Streamlit dashboard app - frontend GIS visualization dashboard.

Processing is done using Pandas DataFrames, with SQLite (plus the SpatiaLite extension) as local storage and a self-hosted Prefect server instance for orchestration and observability of the processing pipelines.
*Architecture: Docker + Prefect + Pandas + SQLite + Streamlit*

**Brief Data Summary: [Lightning Cluster Filter Algorithm (LCFA)](https://www.star.nesdis.noaa.gov/goesr/documents/ATBDs/Baseline/ATBD_GOES-R_GLM_v3.0_Jul2012.pdf)**
```
The multidimensional data structures stored in the netCDF4 files contain a rich variety of
data, including metadata with descriptors. In general, the main variables (flashes, groups,
events) form a hierarchy: a series of detected radiant events is clustered into groups,
and groups are clustered into flashes using the LCFA.
```
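The event/group/flash hierarchy can be illustrated with a toy roll-up in plain Python. The field names below (`event_id`, `group_id`, `flash_id`) are illustrative parent-pointer fields, not the exact netCDF4 variable names used in the GLM product:

```python
from collections import defaultdict

# Toy records mimicking the GLM hierarchy: each detected event points at its
# parent group, and each group points at its parent flash.
events = [
    {"event_id": 1, "group_id": 10},
    {"event_id": 2, "group_id": 10},
    {"event_id": 3, "group_id": 11},
]
groups = [
    {"group_id": 10, "flash_id": 100},
    {"group_id": 11, "flash_id": 100},
]

# Roll events up into groups, then groups up into flashes.
events_per_group = defaultdict(list)
for ev in events:
    events_per_group[ev["group_id"]].append(ev["event_id"])

groups_per_flash = defaultdict(list)
for gr in groups:
    groups_per_flash[gr["flash_id"]].append(gr["group_id"])

print(dict(events_per_group))   # {10: [1, 2], 11: [3]}
print(dict(groups_per_flash))   # {100: [10, 11]}
```

In the real data, the LCFA performs this clustering spatially and temporally; the sketch only shows the resulting parent-child containment.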
## Project Structure

```
lightning-containers/
├── src/
│   ├── flows.py
│   └── tasks/
│       ├── analytics/
│       └── etl/
├── app/
│   └── dashboard.py
├── notebooks/
│   ├── clustering/
│   ├── mapping/
│   └── streaming/
├── tests/
│   ├── test_clustering.py
│   ├── test_extract.py
│   ├── test_load.py
│   └── test_transform.py
├── docs/
│   └── index.md
├── img/
├── .streamlit/
│   ├── config.toml
│   └── secrets.toml
├── .github/
│   └── workflows/
│       └── docker-image.yml
├── data/
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── docker-compose.yml
└── README.md
```

## Requirements
|Resource|Minimum|Recommended|
|--------|-------|-----------|
|CPU |2 cores|4+ cores |
|RAM |6GB |16GB |
|Storage |8GB    |24GB       |

## Installation
### Quick Start: Docker Container
1. Clone the repository.
```
git clone https://github.com/BayoAdejare/lightning-containers.git
cd lightning-containers
```
2. Run with Docker containers, or install locally.
```
docker-compose up -d # spin up containers
```

### Local install
Make sure you have the virtual environment configured:
```
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
```

Install the requirements from the project directory via pip:
`pip install -r requirements.txt # requires Python <= 3.12`
### Start Flow
Run the command to start the prefect workflow orchestration:
`prefect server start # Start the Prefect engine and UI, served at http://localhost:4200/`
The Prefect orchestration platform is required for scheduling; from the Prefect UI, you can run and monitor the data flows.
Run the following commands to start the data app:
`python src/flows.py # Start backend`
`streamlit run app/dashboard.py # Start frontend, served at http://localhost:8501/`
## ETL Flow
ETL flow data tasks:
+ `Source`: **extracts** NOAA GOES-R GLM file datasets from an AWS S3 bucket; the default is GOES-18.
+ `Transformations`: **transforms** the dataset into time-series CSVs.
+ `Sink`: **loads** the dataset into persistent storage.

#### Data Ingestion
Ingests the data needed for a specified time window, given start and end dates.
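As a sketch of how such a time window might map onto hourly S3 prefixes, here is a minimal generator assuming the public `noaa-goes18` bucket's `<product>/<year>/<day-of-year>/<hour>/` key layout; the repository's actual `extract` task may build its keys differently:

```python
from datetime import datetime, timedelta

def glm_hourly_prefixes(start: datetime, end: datetime, bucket: str = "noaa-goes18"):
    """Yield one S3 prefix per hour in [start, end), following the
    public GOES bucket layout (assumed here for illustration)."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        # %j is the zero-padded day of year, matching the GOES key scheme.
        yield f"s3://{bucket}/GLM-L2-LCFA/{t:%Y}/{t:%j}/{t:%H}/"
        t += timedelta(hours=1)

prefixes = list(glm_hourly_prefixes(datetime(2024, 2, 12, 9),
                                    datetime(2024, 2, 12, 12)))
print(prefixes[0])  # s3://noaa-goes18/GLM-L2-LCFA/2024/043/09/
```

Each prefix can then be listed and downloaded with any S3 client (e.g. anonymous access via boto3 or s3fs, since the NOAA buckets are public).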
##### Data Processes
+ `extract`: downloads NOAA GOES-R GLM netCDF4 files from an AWS S3 bucket.
+ `transform`: converts the GLM netCDF files into time- and geo-series CSVs.
+ `load`: loads the CSVs into a local, persistent backend: SQLite with the SpatiaLite extension.

## Clustering Flow
#### Cluster Analysis
Performs grouping of the ingested data using the K-Means clustering algorithm.
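To make the idea concrete, here is a toy, pure-Python k-means on (lat, lon)-like points; it is a stand-in for the pipeline's `kmeans_cluster` task, which presumably uses a library implementation rather than this sketch:

```python
def kmeans(points, k, iters=20):
    """Plain k-means over 2-D tuples with deterministic initialization
    (initial centers spread evenly across the input, for the sketch)."""
    centers = [points[i * (len(points) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated blobs of "lightning events".
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The silhouette and elbow evaluators described below would then score such a run for each candidate k to pick the best cluster count.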
##### Data Tasks
+ `preprocessor`: prepares the data for the cluster model; cleans and normalizes the data.
+ `kmeans_cluster`: fits the data to an implementation of the k-means clustering algorithm.
+ `silhouette_evaluator`: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in a defined range.
+ `elbow_evaluator`: evaluates the choice of 'k' clusters by calculating the sum of squared distances for each k in a defined range.

## Dashboard Map
*Lightning containers dashboard*

## Testing
Use the following command to run tests:
`pytest`
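In the spirit of the suite under `tests/`, a pytest-style test might assert that a transform emits one CSV row per event. The `events_to_csv` helper and its field names are hypothetical, not the repository's actual API:

```python
import csv
import io

def events_to_csv(events):
    """Hypothetical stand-in for a transform step: serialize event dicts to CSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["event_id", "lat", "lon"])
    writer.writeheader()
    writer.writerows(events)
    return buf.getvalue()

def test_events_to_csv_row_count():
    events = [{"event_id": 1, "lat": 29.7, "lon": -95.4},
              {"event_id": 2, "lat": 30.1, "lon": -95.0}]
    out = events_to_csv(events)
    # Header plus one line per event.
    assert len(out.strip().splitlines()) == 3
    assert out.splitlines()[0] == "event_id,lat,lon"

test_events_to_csv_row_count()
```

Running `pytest` discovers functions named `test_*` automatically, so no registration step is needed.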
## CI/CD
This project uses GitHub Actions for CI/CD. The workflow is defined in the `.github/workflows/docker-image.yml` file. This includes:
- Automated testing on pull requests
- Data quality checks on scheduled intervals
- Deployment of updated ML models and Spark jobs to production

## Contributing
Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our contributing guidelines and the process for submitting pull requests.
## License
This project is licensed under the Apache 2.0 License - see the [Apache 2.0 License](LICENSE) file for details.
## Acknowledgements
This work would not have been possible without amazing open source software and datasets, including but not limited to:
+ [GLM Dataset from NOAA NESDIS](https://www.star.nesdis.noaa.gov/goesr/documents/ATBDs/Baseline/ATBD_GOES-R_GLM_v3.0_Jul2012.pdf)
+ [Prefect from PrefectHQ](https://docs.prefect.io/api-ref/prefect/)
+ [Streamlit](https://docs.streamlit.io/)
+ Built on the codebase of [Lightning Streams](https://github.com/BayoAdejare/lightning-streams).

Thank you to the authors of this software and these datasets for making them available to the community!