https://github.com/adeshinao/rustlang-github-insights
Simple batch processing pipeline
- Host: GitHub
- URL: https://github.com/adeshinao/rustlang-github-insights
- Owner: adeshinaO
- License: other
- Created: 2020-12-10T15:26:15.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-01-19T15:12:22.000Z (about 4 years ago)
- Last Synced: 2024-12-17T07:56:11.087Z (about 1 month ago)
- Topics: chartjs, spark, spring-boot
- Language: Java
- Homepage:
- Size: 4.21 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Overview: Rust GitHub Insights
This is a simple batch processing pipeline project. The dashboard shows statistics for various activities on the GitHub repositories belonging to the [Rust Language Project](https://github.com/rust-lang).
The raw data is obtained from [GitHub's Activity API](https://docs.github.com/en/rest/reference/activity).

![](rai.gif)
# Architecture
![](rai-arch.png)
The pipeline starts with [extractor](extractor-function), an AWS Lambda function that fetches event data from GitHub's API and writes newline-delimited JSON to a file on AWS S3. Then [processor](spark-processor), an Apache Spark application, reads the file from S3, aggregates the data, and writes the results to an AWS DynamoDB table. [Dashboard](web-dashboard) is a Spring Boot application that provides a web UI and a REST API endpoint for accessing the processed data stored in DynamoDB.
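As a rough illustration of the extractor's output format, the sketch below serializes a few events to newline-delimited JSON using only the JDK. The `RepoEvent` class, its two fields, and the sample values are assumptions made for this example; the project's actual extractor and the full GitHub event payload are not shown in this README and may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

final class NdjsonSketch {

    // Hypothetical event shape; a real GitHub event carries many more fields.
    static final class RepoEvent {
        final String type;
        final String repo;

        RepoEvent(String type, String repo) {
            this.type = type;
            this.repo = repo;
        }
    }

    // Minimal escaping (backslash and double quote only).
    // A real JSON library would also handle control characters.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One JSON object per event, written on a single line.
    static String toJsonLine(RepoEvent e) {
        return "{\"type\":\"" + escape(e.type) + "\",\"repo\":\"" + escape(e.repo) + "\"}";
    }

    // Newline-delimited JSON: objects separated by '\n', with no enclosing array.
    static String toNdjson(List<RepoEvent> events) {
        return events.stream()
                .map(NdjsonSketch::toJsonLine)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Sample data for illustration only.
        List<RepoEvent> events = List.of(
                new RepoEvent("PushEvent", "rust-lang/rust"),
                new RepoEvent("WatchEvent", "rust-lang/cargo"));
        System.out.println(toNdjson(events));
    }
}
```

Because each line is a complete JSON object, a downstream reader can parse the file one line at a time; Spark's JSON reader, for instance, expects one object per line by default.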
The biggest improvement that could be made to this architecture is automating the data pipeline with a workflow orchestration tool such as Apache Airflow.
# Tech Stack
* Java 11
* Spring Boot - Used to serve the dashboard UI and provide a REST endpoint for data.
* Bulma - CSS library for styling the UI of the dashboard.
* Apache Spark - For distributed batch processing of the raw data from GitHub's API.
* ChartJS - Used to create charts for visualizing the data.

## AWS Services
- Lambda - Runs the extractor program that populates an S3 bucket with raw GitHub data.
- S3 - Storage for the raw GitHub data before the Spark job processes it.
- DynamoDB - Persistence for the final output of the Spark job.
- Elastic Beanstalk - Provides a convenient way to deploy the dashboard on EC2.
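
The README does not say which aggregations the Spark job computes. Purely as an illustration of the aggregate step between S3 and DynamoDB, the sketch below counts events per type with Java streams. It is a plain-Java stand-in, not Spark code, and `EventCountSketch`, `countByType`, and the sample event types are hypothetical names and data chosen for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class EventCountSketch {

    // Group event types and count occurrences, similar in spirit to a
    // group-by/count in a batch job. A TreeMap keeps keys sorted so the
    // output order is deterministic.
    static Map<String, Long> countByType(List<String> eventTypes) {
        return eventTypes.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample input for illustration only.
        List<String> types = List.of("PushEvent", "WatchEvent", "PushEvent", "IssuesEvent");
        System.out.println(countByType(types));
        // prints {IssuesEvent=1, PushEvent=2, WatchEvent=1}
    }
}
```

A map like this (event type to count) is the kind of small, keyed result that fits naturally into a key-value table, which is one plausible reason to persist the job's output in DynamoDB.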