https://github.com/bayoadejare/lightning-streams

Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
https://github.com/bayoadejare/lightning-streams

clustering csv data-engineering data-pipeline data-warehousing database etl-pipeline jupyter-notebook k-means-clustering machine-learning noaa-data orchestration parquet pyspark python spark-sql spark-streaming sql streaming

Last synced: 11 months ago
JSON representation

Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.

Host: GitHub
URL: https://github.com/bayoadejare/lightning-streams
Owner: BayoAdejare
License: apache-2.0
Created: 2023-09-14T21:47:20.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-09-18T16:58:47.000Z (almost 3 years ago)
Last Synced: 2025-08-11T04:53:24.833Z (11 months ago)
Topics: clustering, csv, data-engineering, data-pipeline, data-warehousing, database, etl-pipeline, jupyter-notebook, k-means-clustering, machine-learning, noaa-data, orchestration, parquet, pyspark, python, spark-sql, spark-streaming, sql, streaming
Language: Python
Homepage:
Size: 63.4 MB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE

Awesome Lists containing this project

README

# Lightning Streams

An example of a simple `stream and batch query` made by implementing PySpark, python API of [Apache Spark™](https://spark.apache.org/), queries on a Lightning flash dataset collected from [NOAA's GLM](https://www.goes-r.gov/spacesegment/glm.html).
Uses [Apache Parquet](https://parquet.apache.org/) file format as the storage backend and [Dagster Software-Defined Assets](https://docs.dagster.io/concepts/assets/software-defined-assets) to orchestrate the batch/stream processing pipeline.

Blog post: [Lightning Streams: PySpark Batch & Streaming Queries](https://bayoadejare.medium.com/lightning-streams-pyspark-batch-streaming-queries-f13fc6a68cb6)

|
|:--:|
|Dagster + PySpark + Parquet|

## Installation

First make sure, you have the requirements installed, this can be installed from the project directory via pip's setup command:

`pip install . # =< python3.11 `

## Quick Start

Run the command to start the dagster orchestration framework:

`dagster dev # Start dagster daemon and dagit ui`

The dagster daemon is required to start the scheduling, from the dagit ui, you can run and monitor the data assets.

## ETL Pipeline

ETL pipe data assets:

+ `Source`: **extracts** NOAA GOES-R GLM file datasets from AWS s3 bucket.
+ `Transformations`: **transforms** dataset into time series csv.
+ `Sink`: **loads** dataset to persistant storage.

Sink loading process refactored to use `pyspark` (batch and structured streaming queries) and parquet as the storage backend.

|
|:--:|
|ETL Data asset group|

## Clustering Pipeline

Blog post: [Exploratory Data Analysis with Lightning Streaming Pipeline](https://medium.com/@adebayoadejare/exploratory-data-analysis-with-lightning-clustering-pipeline-6a2bca17d0d3)

|
|:--:|
|Materializing Lightning clustering pipeline|

### Data Ingestion

Ingests the data needed based on specified time window: start and end dates.

#### Data Assets

+ `ingestor`: Composed of `extract`, `transform`, and `load` data assets.
+ `extract`: downloads [NOAA GOES-R GLM](https://www.goes-r.gov/spacesegment/glm.html) netCDF files from AWS s3 bucket
+ `transform`: converts GLM netCDF into time and geo series CSVs
+ `load`: loads CSVs to a local backend, persistant duckdb

### Cluster Analysis

Performs grouping of the ingested data by implementing K-Means clustering algorithm.

|
|:--:|
|Visual of clustering process|

#### Data Assets

+ `preprocessor`: prepares the data for cluster model, clean and normalize the data.
+ `kmeans_cluster`: fits the data to an implementation of k-means cluster algorithm.
+ `silhouette_evaluator`: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in defined range.
+ `elbow_evaluator`: evaluates the choice of 'k' clusters by calculating the sum of the squared distance for each k in defined range.

|
|:--:|
|Displaying Clusering analysis data assets|

|
|:--:|
|Lightning clustering map|

## Testing

Use the following command to run tests:

`pytest`

## License

[Apache 2.0 License](LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bayoadejare/lightning-streams

Awesome Lists containing this project

README