https://github.com/iht/scio-quickstart
This repository contains a sample pipeline for starting with Scio, the Scala framework to develop Apache Beam pipelines. Fork this repository so you can commit your changes in your own repository.
- Host: GitHub
- URL: https://github.com/iht/scio-quickstart
- Owner: iht
- License: apache-2.0
- Created: 2022-07-19T03:20:47.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-07-20T16:21:39.000Z (almost 4 years ago)
- Last Synced: 2025-01-13T02:37:06.513Z (over 1 year ago)
- Topics: scala, scio
- Language: Scala
- Homepage:
- Size: 822 KB
- Stars: 0
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scio quickstart
This repository contains a sample pipeline for starting with [Scio](https://spotify.github.io/scio/), the Scala
framework to develop Apache Beam pipelines.
Fork this repository so you can commit your changes in your own repository.
# Pipeline
The goal of this example is to count the words in Don Quixote, the famous novel by Miguel de Cervantes. The novel has
several characters: Sancho, the buddy of Don Quixote; Dulcinea, the significant other of Don Quixote; Rocinante, the
fearful horse of Don Quixote, etc.
The pipeline not only counts the words; it also sorts them by number of occurrences and answers
an existential question: who is mentioned more often in the novel, Sancho or Dulcinea?
Let's find out with the help of Scio.
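The counting and ranking steps described above can be sketched with plain Scala collections before writing any Scio code. This is only a local illustration of the logic, not the pipeline from this repository: the object and method names are hypothetical, the two input lines are made up rather than taken from the novel, and the tokenization rule (lowercase, split on anything that is not a letter) is an assumption.

```scala
object WordCountSketch {
  // Split a line into lowercase words; any run of non-letter characters is a separator.
  // (Assumed tokenization rule; the real pipeline may choose a different one.)
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty).toSeq

  // Count the occurrences of every word across all lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Keep the n most frequent words, most frequent first; ties are broken alphabetically.
  def topWords(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (w, c) => (-c, w) }.take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, used only to exercise the functions.
    val lines = Seq(
      "Sancho respondió a don Quijote",
      "Dulcinea no respondió a Sancho"
    )
    val counts = countWords(lines)
    topWords(counts, 3).foreach { case (w, c) => println(s"$w\t$c") }
    println(s"sancho=${counts.getOrElse("sancho", 0)} dulcinea=${counts.getOrElse("dulcinea", 0)}")
  }
}
```

A Scio pipeline would apply the same three steps (tokenize, count, take the top N) to a distributed collection instead of an in-memory `Seq`.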
## Compile
The first step in solving this mysterious question is to compile the code. For that, you will need to have SBT installed:
* https://www.scala-sbt.org/
Once SBT is installed, you can run:
* `sbt compile` to compile the code (for instance, while you are developing the code for the pipeline)
* `sbt stage` to produce a runnable package
## Input data
In the `data` directory you will find two files:
* `sample.txt`, a small extract of the novel. You can use this for tests while you are developing the pipeline
* `el_quijote.txt`, the full novel, to solve the important question about Sancho or Dulcinea
## Running the example
Once you have run `sbt stage`, there will be a script in the directory `target/universal/stage/bin`. You can use that
script to run the pipeline.
For instance, to find the top 10 words in the sample data:
`./target/universal/stage/bin/scio-quickstart --input-file=./data/sample.txt --output-file=tmp --num-words=10`
After that, you should find a file with a name like `part-00000-of-00001.txt` in the `tmp` subdirectory.
To run with the full data and top 100 words:
`./target/universal/stage/bin/scio-quickstart --input-file=./data/el_quijote.txt --output-file=tmp --num-words=100`
Search for `sancho` and `dulcinea` in the output to solve this burning question.
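If you prefer to do that search from Scala, the following is a minimal sketch. It assumes Scala 2.13 (for `scala.jdk.CollectionConverters`), that the output shards live in `tmp` with names starting with `part-`, and that each word appears in lowercase on its own output line. The output format depends on how you write the pipeline, so the helper only filters lines and does not parse them. `SearchOutput` and `matchingLines` are hypothetical names.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object SearchOutput {
  // Return every line from the part-* files under `dir` that contains one of `names`.
  def matchingLines(dir: Path, names: Seq[String]): Seq[String] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala
        .filter(_.getFileName.toString.startsWith("part-"))
        .toList                       // materialize before the directory stream is closed
        .sortBy(_.toString)           // read shards in a stable order
        .flatMap(p => Files.readAllLines(p).asScala)
        .filter(line => names.exists(n => line.toLowerCase.contains(n)))
    } finally stream.close()
  }

  def main(args: Array[String]): Unit =
    matchingLines(Paths.get("tmp"), Seq("sancho", "dulcinea")).foreach(println)
}
```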
# Development
The pipeline is initially empty. Your task, should you accept it, is to create the pipeline that is required to solve
the Sancho vs. Dulcinea question.