https://github.com/datastacktv/apache-beam-batch-processing
Public source code for the Batch Processing with Apache Beam (Python) online course
- Host: GitHub
- URL: https://github.com/datastacktv/apache-beam-batch-processing
- Owner: datastacktv
- License: MIT
- Created: 2020-09-15T13:28:50.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-09-29T14:15:36.000Z (over 5 years ago)
- Last Synced: 2025-04-30T12:59:28.294Z (8 months ago)
- Topics: apache-beam, cloud-dataflow
- Language: Python
- Homepage: https://datastack.tv/apache-beam-course.html
- Size: 81.1 KB
- Stars: 18
- Watchers: 1
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Batch Processing with Apache Beam in Python
[Twitter](https://twitter.com/datastacktv) | [YouTube](https://www.youtube.com/channel/UCQSbqkMlvf_J949HDWxOt7Q) | [datastack.tv](https://datastack.tv/)
This repository holds the source code for the [Batch Processing with Apache Beam](https://datastack.tv/apache-beam-course.html) online mini-course by [@alexandraabbas](https://github.com/alexandraabbas).
In this course, we use Apache Beam in Python to build a batch data processing pipeline.
Subscribe to [datastack.tv](https://datastack.tv/pricing.html) in order to take this course. [Browse our courses here!](https://datastack.tv/courses.html)
## Set up your local environment
Before installing Apache Beam, create and activate a virtual environment. The Beam Python SDK supports Python 2.7, 3.5, 3.6, and 3.7; I recommend creating your virtual environment with Python 3.
```bash
# create a virtual environment using conda or virtualenv
conda create -n apache-beam-tutorial python=3.7
# activate your virtual environment
conda activate apache-beam-tutorial
```
Now, install Beam using pip, including the Google Cloud extra dependencies required for the Google Cloud Dataflow runner.
```bash
# quotes keep shells like zsh from interpreting the square brackets
pip install "apache-beam[gcp]"
```
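To confirm the installation worked, you can import Beam from the same environment and print its version (a quick sanity check, not part of the course code):
```python
# Verify that the Beam SDK is importable in the activated environment.
import apache_beam as beam

print(beam.__version__)
```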
## Run pipeline locally
```bash
python pipeline.py \
  --input data.csv \
  --output output \
  --runner DirectRunner
```
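The course walks through the actual `pipeline.py`. As a rough, illustrative sketch of the shape such a script takes with the same `--input`, `--output`, and `--runner` flags (the CSV pass-through transform below is a placeholder, not the course's logic), a minimal version might look like this:
```python
# minimal_pipeline.py -- illustrative sketch only, not the course pipeline.
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Input CSV file or gs:// path")
    parser.add_argument("--output", required=True, help="Output path prefix")
    known_args, pipeline_args = parser.parse_known_args()

    # Flags not consumed above (e.g. --runner) are forwarded to Beam.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
            | "SplitColumns" >> beam.Map(lambda line: line.split(","))
            | "BackToCsv" >> beam.Map(lambda cols: ",".join(cols))
            | "Write" >> beam.io.WriteToText(known_args.output, file_name_suffix=".csv")
        )


if __name__ == "__main__":
    run()
```
Because unrecognised flags are passed through to `PipelineOptions`, the same script can run on both the DirectRunner and Dataflow without code changes.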
## Deploy pipeline to Google Cloud Dataflow
### Set up your Google Cloud environment
Follow these steps to set up all necessary resources in [Google Cloud Console](https://console.cloud.google.com/).
1. Create a Google Cloud project
2. Enable Dataflow API (in APIs & Services)
3. Create a Storage bucket in the `us-central1` region
Take note of the project ID and the bucket name; the command below uses `<your-project-id>` and `<your-bucket>` as placeholders for these values.
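If you want to check from your local machine that Beam can see the bucket (this assumes you have authenticated locally, e.g. with `gcloud auth application-default login`, and have uploaded `data.csv` to the bucket), Beam's `FileSystems` API can probe the path:
```python
from apache_beam.io.filesystems import FileSystems

# <your-bucket> is a placeholder for the bucket created above.
print(FileSystems.exists("gs://<your-bucket>/data.csv"))  # True once data.csv is uploaded
```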
### Run pipeline with Google Cloud Dataflow
```bash
python pipeline.py \
  --input gs://<your-bucket>/data.csv \
  --output gs://<your-bucket>/output \
  --runner DataflowRunner \
  --project <your-project-id> \
  --staging_location gs://<your-bucket>/staging \
  --temp_location gs://<your-bucket>/temp \
  --region us-central1 \
  --save_main_session
```
Now, open the [Dataflow Jobs dashboard in Google Cloud Console](https://console.cloud.google.com/dataflow/jobs) and wait for your job to finish. It will take around 5 minutes.
When finished, you should find a new file called `output-00000-of-00001.csv` in the storage bucket you've created. This is the output file that our pipeline has produced.
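If you prefer configuring the Dataflow job in code rather than with command-line flags, the same settings can be expressed through Beam's pipeline options. This is only a sketch with placeholder project and bucket names, not how the course's `pipeline.py` is written:
```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    SetupOptions,
    StandardOptions,
)

# Placeholder project and bucket names -- replace with your own values.
options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"
gcp = options.view_as(GoogleCloudOptions)
gcp.project = "<your-project-id>"
gcp.region = "us-central1"
gcp.staging_location = "gs://<your-bucket>/staging"
gcp.temp_location = "gs://<your-bucket>/temp"
# Pickle the __main__ session so module-level globals are available on workers.
options.view_as(SetupOptions).save_main_session = True
```
Passing `options` to `beam.Pipeline(options=options)` then has the same effect as the flags above.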
### Clean up Google Cloud
I recommend deleting the Google Cloud project you created; deleting a project also deletes all of the resources in it.