{"id":28964006,"url":"https://github.com/pompierninja/beam-amazon-batch-example","last_synced_at":"2025-12-14T22:47:36.873Z","repository":{"id":177655430,"uuid":"207026336","full_name":"pompierninja/beam-amazon-batch-example","owner":"pompierninja","description":"A practical example of batch processing on Google Cloud Dataflow using the Go SDK for Apache Beam :fire:","archived":false,"fork":false,"pushed_at":"2019-09-26T15:54:09.000Z","size":466,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-14T14:12:44.739Z","etag":null,"topics":["amazon","apache-beam","batch-processing","big-data","golang","google-cloud-dataflow"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pompierninja.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-09-07T21:11:21.000Z","updated_at":"2024-06-18T16:28:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"83ec56e2-faea-4a00-9ae3-a08bab61d8c0","html_url":"https://github.com/pompierninja/beam-amazon-batch-example","commit_stats":null,"previous_names":["angulartist/beam-amazon-batch-example","angulartist/gobeam-amazon-reviews","pompierninja/beam-amazon-batch-example"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pompierninja/beam-amazon-batch-example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pompierninja%2Fbeam-amazon-batch-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pompierninja%2Fbeam-amazon-batch-e
xample/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pompierninja%2Fbeam-amazon-batch-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pompierninja%2Fbeam-amazon-batch-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pompierninja","download_url":"https://codeload.github.com/pompierninja/beam-amazon-batch-example/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pompierninja%2Fbeam-amazon-batch-example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27738353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-14T02:00:11.348Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon","apache-beam","batch-processing","big-data","golang","google-cloud-dataflow"],"created_at":"2025-06-24T04:13:09.331Z","updated_at":"2025-12-14T22:47:36.826Z","avatar_url":"https://github.com/pompierninja.png","language":"Go","readme":"# :eyes: gobeam-amazon-reviews :eyes:\nA practical example of batch processing 50 * 100 Amazon reviews .csv chunks using the Go SDK for Apache Beam. 
:fire:\n\n\u003e Be aware that the Go SDK is still in the experimental phase and may not be fully safe for production.\n\n#### Dataset\n\nThe data sample was retrieved from [Kaggle](https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/version/5#) and chunked into several .csv files.\n\n#### Output\n\nThe pipeline applies a set of transformation steps over 5,000 Amazon reviews and extracts useful stats such as the most helpful reviews, an overview of the ratings, the recommendation ratio, etc.\n\n### Get started\n\n1/ Make sure Go is properly installed and added to your $PATH.\n\n```sh\ngo version\n```\n\n2/ Install the project dependencies\n\n```sh\ngo get -u -v\n```\n\n3/ Run the pipeline locally (uses the Direct Runner, intended for testing/debugging)\n\n```sh\ngo run main.go\n```\n\n4/ (Optional) Deploy to the Google Cloud Dataflow runner (requires a worker harness container image)\n\n\u003e Open the `deploy_job_to_dataflow.sh` file and replace the placeholders with your GCP project ID, your Cloud Storage bucket name and your worker harness container image (if you don't have one, see below).\n\n```sh\nchmod +x ./deploy_job_to_dataflow.sh\n./deploy_job_to_dataflow.sh\n```\n\n4 bis/ (Optional) Run the pipeline on a local Spark cluster\n\n```sh\n# cd to the beam source folder\ncd ~/\u003cGO_PATH\u003e/github.com/apache/beam/\n# Build a Docker image for the Go SDK\n./gradlew :sdks:go:container:docker\n# Run the Spark job server\n./gradlew :runners:spark:job-server:runShadow\n# When the server is running, execute the pipeline:\n./run_with_apache_spark.sh\n```\n\nYou can monitor the running job via the web interface: `http://[driverHostname]:4040`.\n\n### How to build and push a worker harness container image\n\n\u003e This is the Docker image that Dataflow will use to host the binary that was uploaded to the staging location on GCS.\n\n1/ Go to the apache-beam source folder\n\n```sh\ncd go/src/github.com/apache/beam\n```\n\n2/ Run 
Gradle with the Docker target for Go\n\n```sh\n./gradlew -p sdks/go/container docker\n```\n\n3/ Tag your image and push it to the repository\n\n**Bintray**\n\n```sh\ndocker tag \u003cyour_image\u003e \u003cyour_repo\u003e.bintray.io/beam/go:latest\n\ndocker push \u003cyour_repo\u003e.bintray.io/beam/go:latest\n```\n\n**Container Registry**\n\n```sh\ndocker tag \u003cyour_image\u003e gcr.io/\u003cyour_project_id\u003e/beam/go:latest\n\ndocker push gcr.io/\u003cyour_project_id\u003e/beam/go:latest\n```\n\n4/ Update the `./deploy_job_to_dataflow.sh` file with the new Docker image and run it\n\n```sh\n./deploy_job_to_dataflow.sh\n```\n\n[Click here](https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md) to view a detailed guide.\n\n## Understand Beam basics in a minute\n\nBeam is a **unified programming model** that allows you to write data processing pipelines that can run in batch or streaming mode on **different runners** such as Cloud Dataflow, Apache Spark, Apache Flink, Apache Storm, Apache Samza, etc.\n\nThis is a great alternative to the **Lambda Architecture**, as you only have to write and maintain a single codebase (which may differ slightly) to work with bounded or unbounded datasets.\n\nYou define a **Pipeline**, which contains a series of transformation steps. Each step is called a **PTransform**. A PTransform is a function that takes a **PCollection** as its main input (and possibly side-input PCollections), processes the data, and outputs zero, one or multiple PCollections.\nA PTransform can also be a **composite transform**, which is a combination of multiple transformation steps. 
That's useful for writing high-level transformation steps and for structuring your code to improve reusability and readability.\n\n![A pipeline](http://streamingsystems.net/static/images/figures/stsy_1009.png)\n\nThe SDK provides built-in element-wise/count/aggregate primitive transformations such as **ParDo**, **Combine**, **Count** and **GroupByKey**, which may themselves be composite transforms under the hood; Beam hides that complexity from developers.\n\n![PTransforms](http://streamingsystems.net/static/images/figures/stsy_0202.png)\n\nA PCollection is like a box that contains all the data passing through your pipeline. It's immutable (it cannot be modified) and can be arbitrarily large or even unbounded. The nature of a PCollection depends on which **source** created it. Most of the time, a bounded PCollection is created from a text file or a database table, and an unbounded PCollection is created by a streaming source such as **Cloud Pub/Sub** or **Apache Kafka**.\n\nWhen you deploy the pipeline to a runner, it will generate an optimised Directed Acyclic Graph (DAG), which is basically a combination of nodes and edges (the nodes being PCollections, the edges being PTransforms).\n\n![Optimisations](http://streamingsystems.net/static/images/figures/stsy_0503.png)\n\nThe target runner will then set up a cluster of workers and execute some of the transformation steps in parallel, in a Map-Shuffle-Reduce style.\n\n![MapReduce](http://streamingsystems.net/static/images/figures/stsy_1003.png)\n\nLearn more about Apache Beam [here](https://beam.apache.org/documentation/programming-guide/).\n\nThe Go SDK is still **experimental** and doesn't provide the features that make streaming mode possible, such as advanced windowing strategies, watermarks, triggers and the other tools needed to handle late data.\n\n\u003e Image credits: [Streaming Systems](http://streamingsystems.net/)\n\n## Learn more about Apache Beam\n\n* [The GoDoc for the Beam Go 
SDK](https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam)\n* [The official Apache Beam programming guide](https://beam.apache.org/documentation/programming-guide/)\n* [Andrew Brampton's article](https://blog.bramp.net/post/2019/01/05/apache-beam-and-google-dataflow-in-go/)\n* [Beam execution model](https://beam.apache.org/documentation/execution-model/)\n* [Dataflow whitepaper](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf)\n* [RFC: Beam Go SDK](https://docs.google.com/document/d/1yj0_hxq2J1iestjFUUrm_BVQLsFxQiiqtcFhgodzIgM)\n* [Streaming systems](http://streamingsystems.net/)\n* [Martin Gorner - Dataflow Explained](https://www.youtube.com/watch?v=AZht1rkHIxk)\n* [Google MapReduce paper](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)\n* [FlumeJava paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf)\n* [Computerphile - MapReduce](https://www.youtube.com/watch?v=cvhKoniK5Uo)\n* [GCP Podcasts](https://www.gcppodcast.com/search/?s=beam)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpompierninja%2Fbeam-amazon-batch-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpompierninja%2Fbeam-amazon-batch-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpompierninja%2Fbeam-amazon-batch-example/lists"}