https://github.com/1995parham-learning/beam
Learn how to use Apache Beam
- Host: GitHub
- URL: https://github.com/1995parham-learning/beam
- Owner: 1995parham-learning
- License: gpl-3.0
- Created: 2022-07-14T12:06:04.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-09-18T17:37:12.000Z (over 2 years ago)
- Last Synced: 2025-08-18T22:11:11.687Z (10 months ago)
- Topics: dataflow, pipeline, pipelines, stream-processing
- Language: Java
- Homepage:
- Size: 494 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# Apache Beam
Beam SDKs are used to create data processing pipelines.
## Overview
First, create a driver program. The driver program defines your pipeline,
including all of its inputs, transforms, and outputs.
It also sets the pipeline's execution options, including the Pipeline Runner,
which determines which back-end the pipeline runs on.
The Beam abstractions work with both batch and streaming data sources. The main abstractions are:
### Pipeline
All Beam driver programs must create a **Pipeline**. When you create it,
you must also specify the execution options
that tell the **Pipeline** where and how to run.
### PCollection
A PCollection represents a distributed data set that your
Beam pipeline operates on.
### PTransform
A **PTransform** represents a data processing operation, or a step, in your pipeline.
Every **PTransform** takes one or
more **PCollection** objects as input, performs a processing function that
you provide on the elements of that
**PCollection**, and produces zero or more output **PCollection** objects.
### Scope
The Go SDK has an explicit scope variable used to build a **Pipeline**.
A **Pipeline** can return its root scope with
the **Root()** method. The scope variable is passed to **PTransform**
functions to place them in the **Pipeline** that
owns the **Scope**.
### I/O transforms
## Typical Beam Driver Workflow
### Create a Pipeline
### Create an initial PCollection
Either read from external storage using the IOs, or use a **Create**
transform to build a **PCollection** from in-memory data.
### Apply PTransforms to each PCollection
A transform creates a new output **PCollection** without modifying the input collection.
Think of **PCollection**s as variables and **PTransform**s as functions applied to these variables:
the shape of the pipeline can be an arbitrarily complex processing graph.
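The variables-and-functions analogy can be sketched in plain Go. This is only a model of the idea, not the Beam API: a slice stands in for a **PCollection**, and the hypothetical helper `mapEach` stands in for an element-wise **PTransform**. It shows that each step returns a new collection, so the same input can feed several branches of the graph.

```go
package main

import (
	"fmt"
	"strings"
)

// mapEach models an element-wise transform: it reads the input collection
// and returns a new one, leaving the input untouched.
// (Hypothetical helper for illustration; not part of the Beam SDK.)
func mapEach[T, U any](in []T, fn func(T) U) []U {
	out := make([]U, 0, len(in))
	for _, v := range in {
		out = append(out, fn(v))
	}
	return out
}

func main() {
	// A slice stands in for the initial PCollection.
	words := []string{"beam", "pipeline", "runner"}

	// Two branches consume the same input, so the graph forks here.
	upper := mapEach(words, strings.ToUpper)
	lengths := mapEach(words, func(w string) int { return len(w) })

	// A further step chained onto one of the branches.
	doubled := mapEach(lengths, func(n int) int { return n * 2 })

	// words is unchanged after all three steps.
	fmt.Println(words, upper, lengths, doubled)
}
```

In a real pipeline the runner decides where and when each step executes; here everything runs eagerly in one process.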
### Use IOs to write final PCollection to an external source
### Run using the designated Pipeline Runner
The Pipeline Runner that you designate constructs a **workflow graph**.
That graph is then executed using the appropriate
distributed processing back-end,
becoming an asynchronous "job" (or equivalent) on that back-end.
### Configuring pipeline options
### Setting PipelineOptions from command-line arguments
Use the standard Go `flag` package. Flags must be parsed before `beam.Init()` is called.
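A minimal sketch of that ordering, using only the standard `flag` package. The `options` struct, the `parseOptions` helper, and the `--input` and `--runner` flags defined here are hypothetical names for illustration. The `beam.Init()` call appears only in a comment so the snippet runs without the Beam module.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the values this sketch reads from the command line.
// Both fields are hypothetical and only illustrate the pattern.
type options struct {
	input  string
	runner string
}

// parseOptions defines the flags on a fresh FlagSet and parses args.
// A real driver would more typically use the package-level flag.String
// and flag.Parse; a FlagSet just makes this sketch easy to call repeatedly.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("driver", flag.ContinueOnError)
	fs.StringVar(&o.input, "input", "", "hypothetical input location")
	fs.StringVar(&o.runner, "runner", "direct", "hypothetical runner name")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Step 1: parse the flags first.
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}

	// Step 2: only now would a Beam driver call beam.Init()
	// and start building the pipeline.
	fmt.Printf("input=%q runner=%q\n", opts.input, opts.runner)
}
```

Because custom options are ordinary flags in this pattern, adding one is a matter of declaring another flag before parsing.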
### Creating custom options
### Reading from an external source
Each data source adapter has a **Read** transform;
to read, you must apply that transform to the Pipeline object itself.
#### PCollection characteristics
A PCollection is owned by the specific Pipeline object for
which it is created; multiple pipelines cannot share a
PCollection.
> SKIPPED FOR NOW
## Core Beam transforms
### ParDo
**ParDo** is a Beam transform for generic parallel processing.
It considers each element in the input **PCollection**, performs some processing
function (your code) on that element,
and emits zero, one, or multiple elements to an output **PCollection**.
ParDo is useful for:
1. Filtering a data set
2. Formatting or type-converting each element in a data set
3. Extracting parts of each element in a data set
4. Performing computations on each element in a data set
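The per-element contract described above (zero, one, or multiple outputs per input) can be modelled in plain Go. This is a sketch of the semantics only, not the Beam API: `parDo`, `splitWords`, and `keepLong` are hypothetical names, and the loop runs sequentially in one process, so it does not show the parallel execution a runner provides. Passing an `emit` callback loosely mirrors how an element-processing function can hand results back.

```go
package main

import (
	"fmt"
	"strings"
)

// parDo models ParDo's per-element contract: fn is called once per input
// element and may call emit zero, one, or many times.
// (Hypothetical helper for illustration; not part of the Beam SDK.)
func parDo[T, U any](in []T, fn func(elem T, emit func(U))) []U {
	var out []U
	emit := func(v U) { out = append(out, v) }
	for _, elem := range in {
		fn(elem, emit)
	}
	return out
}

// splitWords extracts parts of each element: it emits one output per
// whitespace-separated word, so an empty line emits nothing.
func splitWords(line string, emit func(string)) {
	for _, w := range strings.Fields(line) {
		emit(w)
	}
}

// keepLong is a filter: it emits the word only if it is longer than three bytes.
func keepLong(word string, emit func(string)) {
	if len(word) > 3 {
		emit(word)
	}
}

func main() {
	lines := []string{"learn apache beam", "", "run it"}

	words := parDo(lines, splitWords) // one line in, many words out
	long := parDo(words, keepLong)    // one word in, zero or one word out

	fmt.Println(words, long)
}
```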
When you apply a ParDo transform, you'll need to provide user
code in the form of a DoFn object. DoFn is a Beam SDK
class that defines a distributed processing function.
All DoFns should be registered using a generic `register.DoFnXxY[...]` function.
Registration lets the Go SDK infer an encoding from any inputs/outputs,
registers the DoFn for execution on remote runners, and optimizes the runtime
execution of the DoFns via reflection.
> SKIPPED FOR NOW (Also the code of ParDo)
## Creating cross-language transforms
To make transforms written in one language available to pipelines written
in another language,
Beam uses an expansion service, which creates and
injects the appropriate language-specific pipeline fragments into the pipeline.

At runtime, the Beam runner will execute both Python and
Java transforms to run the pipeline.
> SKIPPED FOR NOW
## Development
Before developing, generate the Gradle wrapper so that language servers can work with the project:
```bash
gradle wrapper
```
## How to run?
To run with `openjdk-17`, pass `--add-exports java.base/sun.nio.ch=ALL-UNNAMED` as a JVM option.
To connect to Kafka, set the bootstrap servers with the `--bootstrapServers=172.21.88.8:9094` flag.
```bash
gradle shadowJar
# run on the Spark runner
java -jar \
--add-exports java.base/sun.nio.ch=ALL-UNNAMED \
kafka-consumer-spark/build/libs/kafka-consumer-spark.jar \
--runner=SparkRunner --bootstrapServers=172.21.88.8:9094
# run on the direct runner
java -jar \
--add-exports java.base/sun.nio.ch=ALL-UNNAMED \
kafka-consumer-direct/build/libs/kafka-consumer-direct-all.jar \
--runner=DirectRunner --bootstrapServers=172.21.88.8:9094
```