Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/viyadb/viyadb-spark

Data processing and ingestion backend for ViyaDB based on Spark Streaming
- Host: GitHub
- URL: https://github.com/viyadb/viyadb-spark
- Owner: viyadb
- License: apache-2.0
- Created: 2017-07-30T16:51:18.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-08T07:14:11.000Z (9 months ago)
- Last Synced: 2024-04-08T08:30:15.032Z (9 months ago)
- Topics: spark, spark-streaming, spark-streaming-kafka, viyadb
- Language: Scala
- Homepage:
- Size: 182 KB
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
viyadb-spark
=============

Data processing backend (indexer) for ViyaDB based on Spark.
[![Build Status](https://travis-ci.org/viyadb/viyadb-spark.png)](https://travis-ci.org/viyadb/viyadb-spark)
[![Coverage](https://codecov.io/github/viyadb/viyadb-spark/coverage.svg?branch=master)](https://codecov.io/github/viyadb/viyadb-spark?branch=master)
There are two processes defined in this project:
* Streaming process
* Batch process

The streaming process reads events in real time, pre-aggregates them, and writes TSV files loadable into ViyaDB
to a deep storage. The batch process creates a historical view of the data, containing events from the previous batch plus
events produced afterwards by the streaming process. The process can be presented graphically like this:
+----------------+
| |
| Streaming Job |
| |
+----------------+
|
| writes current events
v
+------------------+ +--------+---------+
| Previous Period | | Current Period |
| Real-Time Events |--+ | Real-Time Events |
+------------------+ | +------------------+
|
+------------------+ | +------------------+
| Historical | | | Historical |
| Events | | | Events |
+------------------+ | +------------------+ ...
| | ^
-----------|------------------|-------------------|----------------------------->
| | | Timeline
| v |
| +-------------+ |
| | | | unions previous period events
+------------> | Batch Job |---------+ with all the historical events
| | that existed before
              +-------------+

## Features
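Both jobs revolve around the same pre-aggregation idea shown in the diagram above: bucket events into time windows, collapse rows that share dimension values, and emit TSV rows. The following is a dependency-free sketch of that idea only; the actual implementation runs on Spark Streaming, and every name and type below is illustrative:

```scala
// Illustrative sketch of time-window pre-aggregation (not the project's API).
case class Event(timestamp: Long, dims: Seq[String], metrics: Seq[Long])

object PreAggregator {
  // Truncate a timestamp down to the start of its aggregation window.
  def bucket(ts: Long, windowMillis: Long): Long = ts - (ts % windowMillis)

  // Group events by (window, dimension values) and sum their metrics,
  // mirroring the "aggregate events by configured time window" step.
  def aggregate(events: Seq[Event], windowMillis: Long): Map[(Long, Seq[String]), Seq[Long]] =
    events
      .groupBy(e => (bucket(e.timestamp, windowMillis), e.dims))
      .map { case (key, evs) =>
        key -> evs.map(_.metrics).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }

  // Render one aggregated row as a TSV line: window start, dimensions, metrics.
  def toTsv(key: (Long, Seq[String]), metrics: Seq[Long]): String =
    ((key._1.toString +: key._2) ++ metrics.map(_.toString)).mkString("\t")
}
```

In the real pipeline the same grouping runs on a distributed stream rather than an in-memory `Seq`, but the window/dimension key and additive metrics are the core of the design.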
### Real-time Process
The real-time process is responsible for the following:
* Read data from a source (for now only Kafka support is provided as part of the code, but it can be easily extended), and parse it
* Aggregate events by configured time window
* Generate data loadable by ViyaDB (TSV format)

### Batch Process
The batch process does the following:

* Reads events that were generated by the real-time process
* Optionally, cleans irrelevant events out of the dataset
* Aggregates the dataset
* Partitions the data into parts of equal size (in terms of aggregated row count), and writes these partitions back to historical storage

## Prerequisites
* [Consul](https://www.consul.io)
Consul is used for storing configuration, as well as for synchronizing the different components that a ViyaDB cluster consists of.
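As an illustration, configuration entries can be written with the Consul CLI's KV commands. The `viyadb` prefix matches the `--consul-prefix` flag used in the Running section; the `events` table name, `main` indexer name, and JSON file names are made up for this example:

```bash
# Store a table configuration and an indexer configuration in Consul's KV
# store ("events", "main", and the file names are illustrative, not defaults).
consul kv put viyadb/tables/events/config @table-config.json
consul kv put viyadb/indexers/main/config @indexer-config.json
```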
To run either the real-time or the batch process, the following configurations must be present in Consul:

* `/tables//config` - Table configuration
* `/indexers//config` - Indexer configuration

## Building
```bash
mvn package
```

## Running
```bash
spark-submit --class com.github.viyadb.spark.streaming.Job \
    target/viyadb-spark_2.11-0.1.0-uberjar.jar \
    --consul-host "" --consul-prefix "viyadb" \
    --indexer-id ""
```

To run a streaming job, use `com.github.viyadb.spark.streaming.Job` as the job class; to run a batch job,
use `com.github.viyadb.spark.batch.Job`. To see all available options, run:
```bash
spark-submit --class com.github.viyadb.spark.streaming.Job target/viyadb-spark_2.11-0.1.0-uberjar.jar --help
```
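To make the batch step from the Features section concrete, here is a dependency-free sketch of its shape: union the previous period's real-time events with the existing historical events, re-aggregate, and split the result into partitions of near-equal row count. The actual job runs on Spark; the class, field, and method names below are illustrative:

```scala
// Illustrative sketch of the batch step (not the project's API).
case class Row(dims: Seq[String], metrics: Seq[Long])

object BatchSketch {
  // Union the two inputs and re-aggregate rows sharing the same dimensions.
  def aggregate(previous: Seq[Row], historical: Seq[Row]): Seq[Row] =
    (previous ++ historical)
      .groupBy(_.dims)
      .map { case (dims, rows) =>
        Row(dims, rows.map(_.metrics).reduce((a, b) => a.zip(b).map { case (x, y) => x + y }))
      }
      .toSeq

  // Split aggregated rows into n partitions of (near) equal row count,
  // mirroring the "partition the data to equal parts" step.
  def partition(rows: Seq[Row], n: Int): Seq[Seq[Row]] =
    rows.zipWithIndex.groupBy(_._2 % n).toSeq.sortBy(_._1).map(_._2.map(_._1))
}
```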