{"id":19094036,"url":"https://github.com/datastacktv/apache-beam-batch-processing","last_synced_at":"2025-08-18T08:06:54.885Z","repository":{"id":107497670,"uuid":"295737091","full_name":"datastacktv/apache-beam-batch-processing","owner":"datastacktv","description":"Public source code for the Batch Processing with Apache Beam (Python) online course","archived":false,"fork":false,"pushed_at":"2020-09-29T14:15:36.000Z","size":83,"stargazers_count":18,"open_issues_count":0,"forks_count":9,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-30T12:59:28.294Z","etag":null,"topics":["apache-beam","cloud-dataflow"],"latest_commit_sha":null,"homepage":"https://datastack.tv/apache-beam-course.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datastacktv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-15T13:28:50.000Z","updated_at":"2024-12-19T17:36:45.000Z","dependencies_parsed_at":"2023-05-17T15:00:15.929Z","dependency_job_id":null,"html_url":"https://github.com/datastacktv/apache-beam-batch-processing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/datastacktv/apache-beam-batch-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastacktv%2Fapache-beam-batch-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastacktv%2Fapache-beam-batch-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastacktv%2Fapac
he-beam-batch-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastacktv%2Fapache-beam-batch-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datastacktv","download_url":"https://codeload.github.com/datastacktv/apache-beam-batch-processing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastacktv%2Fapache-beam-batch-processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270962391,"owners_count":24675965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","cloud-dataflow"],"created_at":"2024-11-09T03:27:14.865Z","updated_at":"2025-08-18T08:06:54.860Z","avatar_url":"https://github.com/datastacktv.png","language":"Python","readme":"# Batch Processing with Apache Beam in Python\n\n[![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv)\n[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](https://www.youtube.com/channel/UCQSbqkMlvf_J949HDWxOt7Q)\n[![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/)\n\nThis repository holds the source code for the [Batch Processing with Apache Beam](https://datastack.tv/apache-beam-course.html) online 
mini-course by [@alexandraabbas](https://github.com/alexandraabbas).\n\nIn this course we use Apache Beam in Python to build the following batch data processing pipeline.\n\n![apache beam pipeline infographic](img/pipeline.png)\n\nSubscribe to [datastack.tv](https://datastack.tv/pricing.html) in order to take this course. [Browse our courses here!](https://datastack.tv/courses.html)\n\n## Set up your local environment\n\nBefore installing Apache Beam, create and activate a virtual environment. The Beam Python SDK supports Python 2.7, 3.5, 3.6, and 3.7. I recommend you create a virtual environment with Python 3+.\n\n```bash\n# create a virtual environment using conda or virtualenv\nconda create -n apache-beam-tutorial python=3.7\n\n# activate your virtual environment\nconda activate apache-beam-tutorial\n```\n\nNow, install Beam using pip. Include the Google Cloud extra dependency, which is required for the Google Cloud Dataflow runner.\n\n```bash\npip install apache-beam[gcp]\n```\n\n## Run pipeline locally\n\n```bash\npython pipeline.py \\\n--input data.csv \\\n--output output \\\n--runner DirectRunner\n```\n\n## Deploy pipeline to Google Cloud Dataflow\n\n### Set up your Google Cloud environment\n\nFollow these steps to set up all necessary resources in the [Google Cloud Console](https://console.cloud.google.com/).\n\n1. Create a Google Cloud project\n2. Enable the Dataflow API (in APIs \u0026 Services)\n3. 
Create a Storage bucket in `us-central1` region\n\nTake note of the project ID and the bucket name and use these when configuring your pipeline below.\n\n### Run pipeline with Google Cloud Dataflow\n\n```bash\npython pipeline.py \\\n--input gs://\u003cBUCKET\u003e/data.csv \\\n--output gs://\u003cBUCKET\u003e/output \\\n--runner DataflowRunner \\\n--project \u003cPROJECT\u003e \\\n--staging_location gs://\u003cBUCKET\u003e/staging \\\n--temp_location gs://\u003cBUCKET\u003e/temp \\\n--region us-central1 \\\n--save_main_session\n```\n\nNow, open the [Dataflow Jobs dashboard in Google Cloud Console](https://console.cloud.google.com/dataflow/jobs) and wait for your job to finish. It will take around 5 minutes.\n\nWhen finished, you should find a new file called `output-00000-of-00001.csv` in the storage bucket you've created. This is the output file that our pipeline has produced.\n\n### Clean up Google Cloud\n\nI recommend you delete the Google Cloud project you've created. When deleting a project all resources in that project are deleted as well.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatastacktv%2Fapache-beam-batch-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatastacktv%2Fapache-beam-batch-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatastacktv%2Fapache-beam-batch-processing/lists"}