{"id":20768830,"url":"https://github.com/1995parham-learning/beam","last_synced_at":"2025-10-04T09:13:17.715Z","repository":{"id":52189160,"uuid":"513885714","full_name":"1995parham-learning/beam","owner":"1995parham-learning","description":"Learn how to use Apache Beam","archived":false,"fork":false,"pushed_at":"2023-09-18T17:37:12.000Z","size":506,"stargazers_count":2,"open_issues_count":10,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-18T22:11:11.687Z","etag":null,"topics":["dataflow","pipeline","pipelines","stream-processing"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1995parham-learning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-14T12:06:04.000Z","updated_at":"2023-03-23T07:39:24.000Z","dependencies_parsed_at":"2024-11-17T23:48:39.830Z","dependency_job_id":null,"html_url":"https://github.com/1995parham-learning/beam","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/1995parham-learning/beam","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1995parham-learning%2Fbeam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1995parham-learning%2Fbeam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1995parham-learning%2Fbeam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1995parham-learning%2Fbeam/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/1995parham-learning","download_url":"https://codeload.github.com/1995parham-learning/beam/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1995parham-learning%2Fbeam/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278290742,"owners_count":25962611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataflow","pipeline","pipelines","stream-processing"],"created_at":"2024-11-17T11:41:08.754Z","updated_at":"2025-10-04T09:13:17.678Z","avatar_url":"https://github.com/1995parham-learning.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003ch1 align=\"center\"\u003e\n  \u003ca href=\"https://github.com/apache/beam/\"\u003eApache beam\u003c/a\u003e\n\u003c/h1\u003e\n\u003ch6 align=\"center\"\u003eBeam SDKs are used to create data processing pipelines\u003c/h6\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"GitHub Workflow Status\" src=\"https://img.shields.io/github/actions/workflow/status/1995parham-learning/beam/test.yaml?style=for-the-badge\u0026logo=github\u0026branch=main\"\u003e\n  \u003cimg alt=\"GitHub\" 
src=\"https://img.shields.io/github/license/1995parham-learning/beam?logo=gnu\u0026style=for-the-badge\"\u003e\n\u003c/p\u003e\n\n## Overview 👀\n\nYou first need to create a driver program. Your driver program defines your pipeline,\nincluding all of the inputs,\ntransforms, and outputs; it also sets execution options for your pipeline.\nThese include the Pipeline Runner, which\ndetermines what back-end your pipeline will run on.\n\nThe Beam abstractions work with both batch and streaming data sources. Abstractions:\n\n### Pipeline\n\nAll Beam driver programs must create a **Pipeline**. When you create it,\nyou must also specify the execution options\nthat tell the **Pipeline** where and how to run.\n\n### PCollection\n\nA PCollection represents a distributed data set that your\nBeam pipeline operates on.\n\n### PTransform\n\nA **PTransform** represents a data processing operation, or a step, in your pipeline.\nEvery **PTransform** takes one or\nmore **PCollection** objects as input, performs a processing function that\nyou provide on the elements of that\n**PCollection**, and produces zero or more output **PCollection** objects.\n\n### Scope\n\nThe Go SDK has an explicit scope variable used to build a **Pipeline**.\nA **Pipeline** can return its root scope with\nthe **Root()** method. 
The scope variable is passed to **PTransform**\nfunctions to place them in the **Pipeline** that\nowns the **Scope**.\n\n### I/O transforms\n\n## Typical Beam Driver Work Flow 🪠\n\n### Create a Pipeline\n\n### Create an initial PCollection\n\nEither use the IOs (external storage) or use a **Create**\ntransform to build a **PCollection** from in-memory data.\n\n### Apply PTransforms to each PCollection\n\nA transform creates a new output **PCollection** without modifying the input collection.\nThink of **PCollection**s as\nvariables and **PTransform**s as functions applied to these variables:\nthe shape of the pipeline can be an arbitrarily\ncomplex processing graph.\n\n### Use IOs to write final PCollection to an external source\n\n### Run using the designated Pipeline Runner\n\nThe Pipeline Runner that you designate constructs a **workflow graph**.\nThat graph is then executed using the appropriate\ndistributed processing back-end,\nbecoming an asynchronous \"job\" (or equivalent) on that back-end.\n\n### Configuring pipeline options\n\n### Setting PipelineOptions from command-line arguments\n\nUse Go flags. Flags must be parsed before beam.Init() is called.\n\n### Creating custom options\n\n### Reading from an external source\n\nEach data source adapter has a **Read** transform;\nto read, you must apply that transform to the Pipeline object itself.\n\n#### PCollection characteristics\n\nA PCollection is owned by the specific Pipeline object for\nwhich it is created; multiple pipelines cannot share a\nPCollection.\n\n\u003e SKIPPED FOR NOW\n\n## Core Beam transforms\n\n### ParDo\n\nParDo is for generic parallel processing.\nIt considers each element in the input **PCollection**, performs some processing\nfunction (your code) on that element,\nand emits zero, one, or multiple elements to an output **PCollection**.\n\nParDo is useful for:\n\n1. Filtering a data set\n2. Formatting or type-converting each element in a data set\n3. 
Extracting parts of each element in a data set\n4. Performing computations on each element in a data set\n\nWhen you apply a ParDo transform, you'll need to provide user\ncode in the form of a DoFn object. DoFn is a Beam SDK\nclass that defines a distributed processing function.\n\nAll DoFns should be registered using a generic register.DoFnXxY[...]\nfunction. This allows\nthe Go SDK to infer an\nencoding from any inputs/outputs,\nregister the DoFn for execution on remote runners, and optimize the runtime\nexecution of the DoFns via reflection.\n\n\u003e SKIPPED FOR NOW (Also the code of ParDo)\n\n## Creating cross-language transforms\n\nTo make transforms written in one language available to pipelines written\nin another language,\nBeam uses an expansion service, which creates and\ninjects the appropriate language-specific pipeline fragments into the pipeline.\n\n![multi-language-pipelines-diagram](./multi-language-pipelines-diagram.svg)\n\nAt runtime, the Beam runner will execute both Python and\nJava transforms to run the pipeline.\n\n\u003e SKIPPED FOR NOW\n\n## Development ⛏️\n\nFor development, first create the Gradle wrapper so language servers can help you:\n\n```bash\ngradle wrapper\n```\n\n## How to run? 
🏎️\n\nTo run with `openjdk-17`, pass `--add-exports java.base/sun.nio.ch=ALL-UNNAMED` as a JVM option.\nTo connect to Kafka, set the bootstrap servers with the `--bootstrapServers=172.21.88.8:9094` flag.\n\n```bash\ngradle shadowJar\n\n# run on the Spark runner\n\njava -jar  \\\n  --add-exports java.base/sun.nio.ch=ALL-UNNAMED \\\n  kafka-consumer-spark/build/libs/kafka-consumer-spark.jar \\\n  --runner=SparkRunner --bootstrapServers=172.21.88.8:9094\n\n# run on the direct runner\n\njava -jar  \\\n  --add-exports java.base/sun.nio.ch=ALL-UNNAMED \\\n  kafka-consumer-direct/build/libs/kafka-consumer-direct-all.jar \\\n  --runner=DirectRunner --bootstrapServers=172.21.88.8:9094\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1995parham-learning%2Fbeam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1995parham-learning%2Fbeam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1995parham-learning%2Fbeam/lists"}