https://github.com/bayoadejare/lightning-streams

Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
clustering csv data-engineering data-pipeline data-warehousing database etl-pipeline jupyter-notebook k-means-clustering machine-learning noaa-data orchestration parquet pyspark python spark-sql spark-streaming sql streaming


# Lightning Streams

An example of simple batch and streaming queries written with PySpark, the Python API of [Apache Spark™](https://spark.apache.org/), over a lightning flash dataset collected by [NOAA's GLM](https://www.goes-r.gov/spacesegment/glm.html).
It uses the [Apache Parquet](https://parquet.apache.org/) file format as the storage backend and [Dagster Software-Defined Assets](https://docs.dagster.io/concepts/assets/software-defined-assets) to orchestrate the batch/stream processing pipeline.

Blog post: [Lightning Streams: PySpark Batch & Streaming Queries](https://bayoadejare.medium.com/lightning-streams-pyspark-batch-streaming-queries-f13fc6a68cb6)

|Technologies used and respective logos
|:--:|
|Dagster + PySpark + Parquet|

## Installation

First, make sure the requirements are installed. They can be installed from the project directory with pip:

`pip install .  # requires Python <= 3.11`

## Quick Start

Run the following command to start the Dagster orchestration framework:

`dagster dev # Start dagster daemon and dagit ui`

The Dagster daemon is required for scheduling. From the Dagit UI, you can run and monitor the data assets.

## ETL Pipeline

ETL pipeline data assets:

+ `Source`: **extracts** NOAA GOES-R GLM dataset files from an AWS S3 bucket.
+ `Transformations`: **transforms** the dataset into time series CSV.
+ `Sink`: **loads** the dataset to persistent storage.

The sink loading process has been refactored to use `pyspark` (batch and Structured Streaming queries), with Parquet as the storage backend.
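
The batch and streaming loads described above could look roughly like the sketch below. The schema, column names, function names, and directory arguments are assumptions made for illustration and are not taken from the project; only the PySpark reader and writer calls are standard API. PySpark is imported inside `get_spark` so the module itself can be imported without a Spark installation.

```python
# Hypothetical schema for the transformed flash CSVs (an assumption, not the project's).
FLASH_SCHEMA = "ts TIMESTAMP, lat DOUBLE, lon DOUBLE, energy DOUBLE"


def get_spark(app_name="lightning-streams-sketch"):
    """Create or reuse a local SparkSession."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def batch_load(spark, csv_dir, parquet_dir):
    """Batch query: read every CSV currently in csv_dir and append it to Parquet."""
    df = spark.read.csv(csv_dir, header=True, schema=FLASH_SCHEMA)
    df.write.mode("append").parquet(parquet_dir)


def stream_load(spark, csv_dir, parquet_dir, checkpoint_dir):
    """Structured Streaming query: pick up new CSVs as they land in csv_dir."""
    stream = (
        spark.readStream.schema(FLASH_SCHEMA)  # file sources need a schema by default
        .option("header", True)
        .csv(csv_dir)
    )
    return (
        stream.writeStream.format("parquet")
        .option("path", parquet_dir)
        .option("checkpointLocation", checkpoint_dir)  # required by the file sink
        .outputMode("append")
        .start()
    )
```

`stream_load` returns a `StreamingQuery`; calling `awaitTermination()` on it blocks until the query stops.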

|ETL Data assets
|:--:|
|ETL Data asset group|

## Clustering Pipeline

Blog post: [Exploratory Data Analysis with Lightning Clustering Pipeline](https://medium.com/@adebayoadejare/exploratory-data-analysis-with-lightning-clustering-pipeline-6a2bca17d0d3)

|Lightning clustering pipeline Illustration
|:--:|
|Materializing Lightning clustering pipeline|

### Data Ingestion

Ingests the data for a specified time window, defined by start and end dates.
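
As a rough illustration of turning a time window into something downloadable, the sketch below lists one S3 prefix per hour. The bucket name `noaa-goes16`, the product name `GLM-L2-LCFA`, and the `<year>/<day-of-year>/<hour>/` key layout are assumptions about the public NOAA bucket, and `hourly_prefixes` is a hypothetical helper, not part of the project.

```python
from datetime import datetime, timedelta

# Assumed bucket and product names; the README only says "AWS s3 bucket".
BUCKET = "noaa-goes16"
PRODUCT = "GLM-L2-LCFA"


def hourly_prefixes(start, end):
    """Return one S3 prefix for each hour that overlaps the window [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        # Assumed key layout: <product>/<year>/<day-of-year>/<hour>/
        prefixes.append(
            f"s3://{BUCKET}/{PRODUCT}/{current:%Y}/{current:%j}/{current:%H}/"
        )
        current += timedelta(hours=1)
    return prefixes


window = hourly_prefixes(datetime(2023, 1, 1, 22), datetime(2023, 1, 2, 1))
for prefix in window:
    print(prefix)
```

Each prefix could then be listed with an S3 client to find the netCDF files for that hour.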

#### Data Assets

+ `ingestor`: composed of the `extract`, `transform`, and `load` data assets.
+ `extract`: downloads [NOAA GOES-R GLM](https://www.goes-r.gov/spacesegment/glm.html) netCDF files from an AWS S3 bucket.
+ `transform`: converts GLM netCDF files into time and geo series CSVs.
+ `load`: loads the CSVs into a local backend, a persistent DuckDB database.
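
A minimal sketch of the `transform` idea, assuming the flash records have already been read out of a netCDF file into plain tuples. The tuple layout, the column names, and the `flashes_to_csv` helper are hypothetical and only illustrate how per-file time offsets become a time-ordered CSV.

```python
import csv
import io
from datetime import datetime, timedelta


def flashes_to_csv(product_time, flashes):
    """Turn (offset_seconds, lat, lon, energy) tuples into time-ordered CSV text."""
    rows = sorted(
        (product_time + timedelta(seconds=offset), lat, lon, energy)
        for offset, lat, lon, energy in flashes
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ts", "lat", "lon", "energy"])  # assumed column names
    for ts, lat, lon, energy in rows:
        writer.writerow([ts.isoformat(), lat, lon, energy])
    return buffer.getvalue()


csv_text = flashes_to_csv(
    datetime(2023, 1, 1, 22, 0, 0),
    [(12.5, 29.7, -95.4, 1.2e-14), (3.0, 30.1, -94.9, 8.0e-15)],
)
print(csv_text)
```

Sorting by absolute timestamp before writing is what makes the output a time series, regardless of the order of records in the source file.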

### Cluster Analysis

Groups the ingested data using the K-Means clustering algorithm.

|An example clustering of flash data points
|:--:|
|Visual of clustering process|

#### Data Assets

+ `preprocessor`: prepares the data for the clustering model by cleaning and normalizing it.
+ `kmeans_cluster`: fits the data with an implementation of the k-means clustering algorithm.
+ `silhouette_evaluator`: evaluates the choice of `k` clusters by calculating the silhouette coefficient for each `k` in a defined range.
+ `elbow_evaluator`: evaluates the choice of `k` clusters by calculating the sum of squared distances for each `k` in a defined range.
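
The four steps above can be sketched in plain Python as follows. This is not the project's implementation, which may well rely on a library; the function names, the deterministic farthest-point initialisation (used here in place of random or k-means++ seeding so the result is reproducible), and the sample coordinates are all assumptions for illustration.

```python
import math


def min_max_normalize(points):
    """Preprocess: scale each coordinate to [0, 1] so no axis dominates distances."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            new = tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[c]
            updated.append(new)
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def inertia(points, centroids, labels):
    """Elbow metric: sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


flashes = [  # hypothetical (lon, lat) flash locations forming three groups
    (-95.0, 29.0), (-95.2, 29.1), (-94.9, 28.9),
    (-80.0, 26.0), (-80.1, 26.2), (-79.9, 25.9),
    (-105.0, 40.0), (-105.2, 39.9), (-104.9, 40.1),
]
scaled = min_max_normalize(flashes)
results = {}
for k in range(2, 5):
    centroids, labels = kmeans(scaled, k)
    results[k] = (inertia(scaled, centroids, labels), silhouette(scaled, labels))
    print(k, round(results[k][0], 4), round(results[k][1], 3))
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

The elbow method looks for the `k` after which the inertia stops dropping sharply, while the silhouette method picks the `k` with the highest mean coefficient. On clearly separated data the two usually agree.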

|Display of clustering materialized assets
|:--:|
|Displaying clustering analysis data assets|

|An example clustering of flash data points
|:--:|
|Lightning clustering map|

## Testing

Use the following command to run tests:

`pytest`

## License

[Apache 2.0 License](LICENSE)