https://github.com/bayoadejare/lightning-streams
Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
https://github.com/bayoadejare/lightning-streams
clustering csv data-engineering data-pipeline data-warehousing database etl-pipeline jupyter-notebook k-means-clustering machine-learning noaa-data orchestration parquet pyspark python spark-sql spark-streaming sql streaming
Last synced: 10 months ago
JSON representation
Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
- Host: GitHub
- URL: https://github.com/bayoadejare/lightning-streams
- Owner: BayoAdejare
- License: apache-2.0
- Created: 2023-09-14T21:47:20.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-18T16:58:47.000Z (over 2 years ago)
- Last Synced: 2025-08-11T04:53:24.833Z (10 months ago)
- Topics: clustering, csv, data-engineering, data-pipeline, data-warehousing, database, etl-pipeline, jupyter-notebook, k-means-clustering, machine-learning, noaa-data, orchestration, parquet, pyspark, python, spark-sql, spark-streaming, sql, streaming
- Language: Python
- Homepage:
- Size: 63.4 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
# Lightning Streams
An example of a simple `stream and batch query` made by implementing PySpark, python API of [Apache Spark™](https://spark.apache.org/), queries on a Lightning flash dataset collected from [NOAA's GLM](https://www.goes-r.gov/spacesegment/glm.html).
Uses [Apache Parquet](https://parquet.apache.org/) file format as the storage backend and [Dagster Software-Defined Assets](https://docs.dagster.io/concepts/assets/software-defined-assets) to orchestrate the batch/stream processing pipeline.
Blog post: [Lightning Streams: PySpark Batch & Streaming Queries](https://bayoadejare.medium.com/lightning-streams-pyspark-batch-streaming-queries-f13fc6a68cb6)
|
|:--:|
|Dagster + PySpark + Parquet|
## Installation
First make sure, you have the requirements installed, this can be installed from the project directory via pip's setup command:
`pip install . # =< python3.11 `
## Quick Start
Run the command to start the dagster orchestration framework:
`dagster dev # Start dagster daemon and dagit ui`
The dagster daemon is required to start the scheduling, from the dagit ui, you can run and monitor the data assets.
## ETL Pipeline
ETL pipe data assets:
+ `Source`: **extracts** NOAA GOES-R GLM file datasets from AWS s3 bucket.
+ `Transformations`: **transforms** dataset into time series csv.
+ `Sink`: **loads** dataset to persistant storage.
Sink loading process refactored to use `pyspark` (batch and structured streaming queries) and parquet as the storage backend.
|
|:--:|
|ETL Data asset group|
## Clustering Pipeline
Blog post: [Exploratory Data Analysis with Lightning Streaming Pipeline](https://medium.com/@adebayoadejare/exploratory-data-analysis-with-lightning-clustering-pipeline-6a2bca17d0d3)
|
|:--:|
|Materializing Lightning clustering pipeline|
### Data Ingestion
Ingests the data needed based on specified time window: start and end dates.
#### Data Assets
+ `ingestor`: Composed of `extract`, `transform`, and `load` data assets.
+ `extract`: downloads [NOAA GOES-R GLM](https://www.goes-r.gov/spacesegment/glm.html) netCDF files from AWS s3 bucket
+ `transform`: converts GLM netCDF into time and geo series CSVs
+ `load`: loads CSVs to a local backend, persistant duckdb
### Cluster Analysis
Performs grouping of the ingested data by implementing K-Means clustering algorithm.
|
|:--:|
|Visual of clustering process|
#### Data Assets
+ `preprocessor`: prepares the data for cluster model, clean and normalize the data.
+ `kmeans_cluster`: fits the data to an implementation of k-means cluster algorithm.
+ `silhouette_evaluator`: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in defined range.
+ `elbow_evaluator`: evaluates the choice of 'k' clusters by calculating the sum of the squared distance for each k in defined range.
|
|:--:|
|Displaying Clusering analysis data assets|
|
|:--:|
|Lightning clustering map|
## Testing
Use the following command to run tests:
`pytest`
## License
[Apache 2.0 License](LICENSE)