# Go Beam Pipeline

## Introduction

This project contains a pipeline with a number of IO transforms developed with the Apache Beam Go SDK. The pipeline
reads from a source and writes to a sink. The source and sink are configured in a templated YAML file, which is
passed to the program as an argument. Example configurations are in the [examples/config](examples/config) folder.

Supported sources:

- BigQuery
- Cloud Storage (avro, csv, json, parquet)
- Cloud SQL (MySQL, PostgreSQL)
- Elasticsearch
- Firestore
- Memorystore (Redis)
- MongoDB

Supported sinks:

- BigQuery
- Cloud Storage (avro, csv, json, parquet)
- Cloud SQL (MySQL, PostgreSQL)
- Elasticsearch
- Firestore
- Memorystore (Redis)
- MongoDB

## Prerequisites

- Go version 1.19
- gcloud CLI
- Docker

## Development

### Setup

Install dependencies

```bash
go mod download
```

### Testing

Run unit tests

```bash
go test ./... -short
```

Run unit tests and long-running integration tests

```bash
go test ./...
```
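
The two commands rely on Go's standard `-short` convention: long-running integration tests presumably skip themselves when the flag is set, along the lines of this sketch (the test name and package are illustrative, not taken from the repository):

```go
package pipeline_test

import "testing"

// Hypothetical integration test: it skips itself when `go test -short` is
// used, so the short run only exercises the fast unit tests.
func TestWriteToSinkIntegration(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping long-running integration test in short mode")
	}
	// ...exercise a real or emulated backend here...
}
```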

### Running with DirectRunner

Set variables

| Variable | Description |
|-------------|----------------------------------------------------|
| CONFIG_PATH | Path to configuration file (local or GCS path) |
| PROJECT | GCP project |
| BUCKET | Bucket for data storage (if source or sink is GCS) |

Run pipeline

```bash
go run main.go --configPath=${CONFIG_PATH} --project=${PROJECT} --bucket=${BUCKET}
```
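
The flags follow the usual structure of a Beam Go entry point. As a rough sketch (not the repository's actual `main.go`), the program parses its own flags, initializes Beam, builds the configured source-to-sink transforms, and hands the pipeline to the chosen runner:

```go
package main

import (
	"context"
	"flag"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

// Hypothetical application flags mirroring the command above; --project,
// --runner, --region and the other Dataflow flags are registered by the
// Beam SDK packages that beamx pulls in.
var (
	configPath = flag.String("configPath", "", "path to the YAML configuration (local or GCS)")
	bucket     = flag.String("bucket", "", "bucket for data storage")
)

func main() {
	flag.Parse()
	beam.Init() // must run after flag parsing and before pipeline construction

	p := beam.NewPipeline()
	s := p.Root()
	_ = s // the real pipeline builds the configured source -> sink transforms here, driven by *configPath and *bucket

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("pipeline failed: %v", err)
	}
}
```

If `--runner` is not set, `beamx.Run` defaults to the direct runner, which is why the local command above needs no runner flag.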

## Deployment

Set variables

| Variable        | Description                                                                                                                                                                                             |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CONFIG_PATH     | Path to configuration file (local or GCS path)                                                                                                                                                          |
| PROJECT         | GCP project                                                                                                                                                                                             |
| BUCKET          | Bucket for data storage (if source or sink is GCS)                                                                                                                                                      |
| REGION          | Compute region                                                                                                                                                                                          |
| SUBNETWORK      | Subnetwork                                                                                                                                                                                              |
| SA_EMAIL        | Email of the service account used for Dataflow. Needs the roles `roles/dataflow.worker`, `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`, `roles/datastore.user` and `roles/storage.objectAdmin` |
| DATAFLOW_BUCKET | Bucket for Dataflow staging data                                                                                                                                                                        |
| JOB_NAME        | Name of the Dataflow job                                                                                                                                                                                |

### Running with DataflowRunner

```bash
go run main.go \
--configPath=${CONFIG_PATH} \
--project=${PROJECT} \
--bucket=${BUCKET} \
--runner=dataflow \
--region=${REGION} \
--subnetwork=${SUBNETWORK} \
--service_account_email=${SA_EMAIL} \
--staging_location=gs://${DATAFLOW_BUCKET}/staging \
--job_name=${JOB_NAME}-$(date +%s)
```
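
The `$(date +%s)` suffix appends a Unix timestamp so that each submission gets a unique job name. With `--runner=dataflow`, the Beam Go SDK builds the worker binary and stages it, together with the job artifacts, under the given staging location before submitting the job.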