https://github.com/xmlking/beam-examples

Apache Beam / Google Dataflow Examples

Host: GitHub
URL: https://github.com/xmlking/beam-examples
Owner: xmlking
Created: 2019-11-21T16:16:41.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-03-09T04:25:01.000Z (over 6 years ago)
Last Synced: 2025-03-25T06:28:16.848Z (over 1 year ago)
Topics: apache-beam, beam, dataflow, gcd, gradle, kotlin, monorepo
Language: Kotlin
Homepage:
Size: 193 KB
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# beam-examples

A set of example **Streaming** and **Batch** job implementations built with **Apache Beam**.

`Dataflow brings life to Datalakes`

![DataLake with Cloud Dataflow](./docs/dataflow.png)

### Features
1. Monorepo (apps, libs) project that showcases a workspace setup with multiple apps and shared libraries
2. **Polyglot** - supports multiple languages (Java, Kotlin)
3. Supports building a `FatJar` for submitting jobs from a CI environment
4. Cloud Native (run locally, run in the cloud, or deploy as a template for GCD)
5. Multiple runtimes (Flink, Spark, Google Cloud Dataflow, Hazelcast Jet)

### Prerequisites
> see [PLAYBOOK](./docs/PLAYBOOK.md)

### Quick Start

Run the WordCount Kotlin example:

    gradle :apps:wordcount:run --args="--runner=DirectRunner --inputFile=./src/test/resources/data/input.txt --output=./build/output.txt"

The WordCount pipeline runs locally and writes its output to the `apps/wordcount/build` directory.
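
To inspect the result of a local run, the output files can be merged and ranked by count. The sketch below is a hypothetical helper (`top_words` is not part of this repo). It assumes the pipeline writes one or more files that share the `output.txt` prefix (Beam file sinks commonly write shards such as `output.txt-00000-of-00001`) and that each line has the form `word: count`, as in Beam's stock WordCount example. Adjust both assumptions if this project differs.

```shell
#!/usr/bin/env bash
# Sketch only: the shard naming and the "word: count" line format are
# assumptions, not confirmed for this repository.

# top_words PREFIX [N]
# Concatenate every file whose name starts with PREFIX and print the
# N lines with the highest counts (N defaults to 10).
top_words() {
  cat "$1"* | sort -t: -k2,2nr | head -n "${2:-10}"
}
```

For example, `top_words apps/wordcount/build/output.txt 5` would print the five most frequent words under those assumptions.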

The WordCount pipeline can also run on Google Cloud Dataflow if you have a Google Cloud project configured in your local environment.

    PROJECT_ID=
    GCS_BUCKET=
    export GOOGLE_APPLICATION_CREDENTIALS=

    gradle :apps:wordcount:run --args="--runner=DataflowRunner --project=$PROJECT_ID --gcpTempLocation=gs://$GCS_BUCKET/dataflow/wordcount/temp/ --stagingLocation=gs://$GCS_BUCKET/dataflow/wordcount/staging/ --inputFile=gs://$GCS_BUCKET/dataflow/wordcount/input/shakespeare.txt --output=gs://$GCS_BUCKET/dataflow/wordcount/output/output.txt"
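
Because the Dataflow command interpolates `PROJECT_ID` and `GCS_BUCKET` into several `gs://` paths, an empty variable silently produces invalid locations. A minimal sketch of a guard is shown here; `dataflow_args` is a hypothetical helper name, not something the repo provides, and it only rebuilds the same flags used in the command above.

```shell
#!/usr/bin/env bash
# Sketch: build the --args string for the Dataflow run and refuse to
# continue when PROJECT_ID or GCS_BUCKET is empty.
# `dataflow_args` is a hypothetical helper, not part of this repo.
dataflow_args() {
  if [ -z "${PROJECT_ID:-}" ] || [ -z "${GCS_BUCKET:-}" ]; then
    echo "PROJECT_ID and GCS_BUCKET must both be set" >&2
    return 1
  fi
  # Reuse the function's positional parameters as a flag list.
  set -- \
    "--runner=DataflowRunner" \
    "--project=${PROJECT_ID}" \
    "--gcpTempLocation=gs://${GCS_BUCKET}/dataflow/wordcount/temp/" \
    "--stagingLocation=gs://${GCS_BUCKET}/dataflow/wordcount/staging/" \
    "--inputFile=gs://${GCS_BUCKET}/dataflow/wordcount/input/shakespeare.txt" \
    "--output=gs://${GCS_BUCKET}/dataflow/wordcount/output/output.txt"
  # "$*" joins the flags with single spaces.
  printf '%s\n' "$*"
}
```

It could then be used as `args="$(dataflow_args)" && gradle :apps:wordcount:run --args="$args"`, so the job is only submitted when both variables are populated.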

The `inputFile` option has a default value in the WordCount options, so the pipeline runs with that default input file and produces output files in

### Reference

1. [Apache Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/)
2. https://github.com/xmlking/micro-apps
3. https://github.com/sfeir-open-source/kbeam
4. https://github.com/thinhha/gcp-data-project-template
5. https://google.github.io/flogger/best_practice
6. https://github.com/apache/beam/tree/master/examples/kotlin
