https://github.com/renardeinside/wikiflow
Wikipedia updates streaming, transformation and visualisation
akka-http apache-spark kafka spark spark-streaming visualization wikipedia
- Host: GitHub
- URL: https://github.com/renardeinside/wikiflow
- Owner: renardeinside
- Created: 2019-03-25T13:54:01.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-11-29T22:23:55.000Z (over 2 years ago)
- Last Synced: 2025-04-23T03:45:38.568Z (about 1 year ago)
- Topics: akka-http, apache-spark, kafka, spark, spark-streaming, visualization, wikipedia
- Language: Scala
- Homepage:
- Size: 240 KB
- Stars: 5
- Watchers: 2
- Forks: 21
- Open Issues: 4
Metadata Files:
- Readme: README.md
# Spark flow on top of the Wikipedia SSE Stream
## How to run
- Create the Docker network:
```bash
make create-network
```
- Run the streaming appliance:
```bash
make run-appliance
```
- To consume the stream via the legacy API (DStreams), run:
```bash
make run-legacy-consumer
```
- To consume the stream via the Structured Streaming API, run:
```bash
make run-structured-consumer
```
- To consume the stream via the Structured Streaming API and write the results to Delta, run:
```bash
make run-analytics-consumer
```
You can also access the Spark UI for the running job at http://localhost:4040/jobs
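The consumer targets listed above differ only in name, so a small wrapper can pick one by keyword. The following is a minimal sketch: the function name `consumer_target` and the keywords `legacy`, `structured`, and `analytics` are assumptions made for illustration and are not part of the repository. Only the make target names come from the steps above.

```bash
# Map a consumer kind (hypothetical keyword) to the make target from the steps above.
consumer_target() {
  case "$1" in
    legacy)     echo "run-legacy-consumer" ;;      # DStreams API
    structured) echo "run-structured-consumer" ;;  # structured API
    analytics)  echo "run-analytics-consumer" ;;   # structured API with write to Delta
    *)
      echo "unknown consumer kind: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `make "$(consumer_target structured)"` would start the structured consumer, assuming the network and the streaming appliance are already up.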
## Known issues
- You may need to increase Docker's memory limit on your machine (on Mac the default is 2.0 GB).
- To inspect the memory usage and status of the containers, run:
```bash
docker stats
```
- Docker sometimes fails to stop the consumer applications gracefully. If a container hangs, run:
```bash
docker-compose -f .yaml down
```