# Batch Processing with Apache Beam in Python

[![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv)
[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](https://www.youtube.com/channel/UCQSbqkMlvf_J949HDWxOt7Q)
[![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/)

This repository holds the source code for the [Batch Processing with Apache Beam](https://datastack.tv/apache-beam-course.html) online mini-course by [@alexandraabbas](https://github.com/alexandraabbas).

In this course we use Apache Beam in Python to build the following batch data processing pipeline.

![apache beam pipeline infographic](img/pipeline.png)

Subscribe to [datastack.tv](https://datastack.tv/pricing.html) in order to take this course. [Browse our courses here!](https://datastack.tv/courses.html)

## Set up your local environment

Before installing Apache Beam, create and activate a virtual environment. The Beam Python SDK supports Python 2.7, 3.5, 3.6, and 3.7; I recommend creating your virtual environment with a Python 3 interpreter.

```bash
# create a virtual environment using conda or virtualenv
conda create -n apache-beam-tutorial python=3.7

# activate your virtual environment
conda activate apache-beam-tutorial
```
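
If you'd rather not use conda, Python's built-in `venv` module works just as well. A minimal equivalent of the commands above:

```bash
# alternatively, create and activate a virtual environment with venv
python3 -m venv apache-beam-tutorial
source apache-beam-tutorial/bin/activate
```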

Now, install Beam using pip. Include the `gcp` extra, which is required for the Google Cloud Dataflow runner.

```bash
# quotes keep shells such as zsh from interpreting the square brackets
pip install "apache-beam[gcp]"
```
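
To confirm the installation worked inside your virtual environment, you can import the SDK and print its version:

```bash
# verify the Beam SDK is importable and show the installed version
python -c "import apache_beam; print(apache_beam.__version__)"
```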

## Run pipeline locally

```bash
python pipeline.py \
--input data.csv \
--output output \
--runner DirectRunner
```
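
The flags above are parsed by `pipeline.py` and turned into Beam pipeline options. The actual course pipeline lives in this repository; the snippet below is only a rough, hypothetical sketch of how a script like it typically wires up `--input`, `--output`, and `--runner` (the transforms and names here are illustrative, not the course pipeline).

```python
# sketch only: an illustrative Beam batch pipeline, not the course's pipeline.py
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="input CSV file or gs:// path")
    parser.add_argument("--output", required=True, help="output path prefix")
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Any remaining flags (--runner, --project, --region, ...) become Beam options.
    options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCSV" >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
            | "SplitColumns" >> beam.Map(lambda line: line.split(","))
            | "FormatOutput" >> beam.Map(lambda cols: ",".join(cols))
            | "WriteResults" >> beam.io.WriteToText(known_args.output, file_name_suffix=".csv")
        )


if __name__ == "__main__":
    run()
```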

## Deploy pipeline to Google Cloud Dataflow

### Set up your Google Cloud environment

Follow these steps to set up all necessary resources in [Google Cloud Console](https://console.cloud.google.com/).

1. Create a Google Cloud project
2. Enable Dataflow API (in APIs & Services)
3. Create a Storage bucket in `us-central1` region

Take note of the project ID and the bucket name; you'll use them to replace the placeholders when configuring your pipeline below.
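
If you prefer the command line to the console, the steps above can also be sketched with the `gcloud` and `gsutil` CLIs; replace the placeholders with your own project ID and bucket name.

```bash
# create (or select) the project and enable the Dataflow API
gcloud projects create <your-project-id>
gcloud config set project <your-project-id>
gcloud services enable dataflow.googleapis.com

# create a Storage bucket in the us-central1 region
gsutil mb -l us-central1 gs://<your-bucket-name>
```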

### Run pipeline with Google Cloud Dataflow

```bash
python pipeline.py \
--input gs://<your-bucket-name>/data.csv \
--output gs://<your-bucket-name>/output \
--runner DataflowRunner \
--project <your-project-id> \
--staging_location gs://<your-bucket-name>/staging \
--temp_location gs://<your-bucket-name>/temp \
--region us-central1 \
--save_main_session
```

Now, open the [Dataflow Jobs dashboard in Google Cloud Console](https://console.cloud.google.com/dataflow/jobs) and wait for your job to finish. It will take around 5 minutes.

When finished, you should find a new file called `output-00000-of-00001.csv` in the storage bucket you've created. This is the output file that our pipeline has produced.
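
You can also inspect the output from the command line with `gsutil`, using the same bucket name placeholder as above:

```bash
# list the bucket contents and print the pipeline output
gsutil ls gs://<your-bucket-name>/
gsutil cat gs://<your-bucket-name>/output-00000-of-00001.csv
```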

### Clean up Google Cloud

I recommend deleting the Google Cloud project you've created. When you delete a project, all resources in that project are deleted with it.
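
If you created a dedicated project for this course, it can also be deleted from the command line (substitute your own project ID):

```bash
# schedules the project and all of its resources for deletion
gcloud projects delete <your-project-id>
```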