Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/johannaojeling/go-beam-pipeline
Data pipeline built with the Apache Beam Go SDK
apache-beam batch-processing bigquery cloud-sql cloud-storage dataflow elasticsearch firestore go google-cloud memorystore mongodb mysql postgresql redis
- Host: GitHub
- URL: https://github.com/johannaojeling/go-beam-pipeline
- Owner: johannaojeling
- License: mit
- Created: 2022-05-06T18:44:44.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-07T06:40:02.000Z (almost 2 years ago)
- Last Synced: 2024-06-20T22:37:31.709Z (7 months ago)
- Topics: apache-beam, batch-processing, bigquery, cloud-sql, cloud-storage, dataflow, elasticsearch, firestore, go, google-cloud, memorystore, mongodb, mysql, postgresql, redis
- Language: Go
- Homepage:
- Size: 887 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Go Beam Pipeline
## Introduction
This project contains a pipeline with a number of IO transforms developed with the Apache Beam Go SDK. The pipeline reads from a source and writes to a sink. The source and sink to use are configured in a templated YAML file that is passed to the program as an argument. Example configurations are in the [examples/config](examples/config) folder. A minimal sketch of the underlying read/write pattern follows the lists below.

Supported sources:
- BigQuery
- Cloud Storage (avro, csv, json, parquet)
- Cloud SQL (MySQL, PostgreSQL)
- Elasticsearch
- Firestore
- Memorystore (Redis)
- MongoDB

Supported sinks:
- BigQuery
- Cloud Storage (avro, csv, json, parquet)
- Cloud SQL (MySQL, PostgreSQL)
- Elasticsearch
- Firestore
- Memorystore (Redis)
- MongoDB
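
As a point of reference (not code from this repository), a minimal, self-contained Beam Go pipeline that reads from one location and writes to another might look as follows; the `gs://` paths are placeholders:

```go
package main

import (
	"context"
	"flag"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

func main() {
	flag.Parse()
	beam.Init()

	// Build a pipeline with a single root scope.
	p, s := beam.NewPipelineWithRoot()

	// Source: read every line matching the (hypothetical) input glob.
	lines := textio.Read(s, "gs://example-input-bucket/input/*.json")

	// Sink: write the lines back out unchanged.
	textio.Write(s, "gs://example-output-bucket/output/records.json", lines)

	// Execute on the runner selected via the --runner flag.
	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("pipeline failed: %v", err)
	}
}
```

In this project the concrete source and sink are chosen from the YAML configuration rather than hard-coded as above.
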
## Prerequisites

- Go version 1.19
- gcloud CLI
- Docker

## Development
### Setup
Install dependencies
```bash
go mod download
```

### Testing
Run unit tests
```bash
go test ./... -short
```

Run unit tests and long-running integration tests
```bash
go test ./...
```
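
The `-short` flag works because long-running tests conventionally skip themselves via `testing.Short()`. A generic sketch of that convention (assumed, not taken from this repository's tests):

```go
package example

import "testing"

// Long-running integration tests typically guard themselves with testing.Short,
// so that `go test ./... -short` runs only the fast unit tests.
func TestWriteToBigQueryIntegration(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping long-running integration test in -short mode")
	}
	// ... exercise the real sink here (hypothetical) ...
}
```
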
### Running with DirectRunner

Set variables
| Variable | Description |
|-------------|----------------------------------------------------|
| CONFIG_PATH | Path to configuration file (local or GCS path) |
| PROJECT | GCP project |
| BUCKET      | Bucket for data storage (if source or sink is GCS) |

Run pipeline
```bash
go run main.go --configPath=${CONFIG_PATH} --project=${PROJECT} --bucket=${BUCKET}
```
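
For illustration only, the custom flags used above might be declared in `main.go` with the standard `flag` package roughly as follows (hypothetical sketch; the repository's actual `main.go` may differ):

```go
package main

import (
	"flag"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// Hypothetical flag declarations mirroring the variables in the table above.
var (
	configPath = flag.String("configPath", "", "Path to the YAML pipeline configuration (local or gs:// path)")
	project    = flag.String("project", "", "GCP project ID")
	bucket     = flag.String("bucket", "", "Bucket for data storage when the source or sink is Cloud Storage")
)

func main() {
	flag.Parse() // parses the custom flags plus those registered by the Beam SDK (e.g. --runner)
	beam.Init()

	log.Printf("config=%q project=%q bucket=%q", *configPath, *project, *bucket)
	// The configured source and sink would then be wired into a pipeline and
	// executed, as in the read/write sketch in the Introduction.
}
```
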
## Deployment

Set variables
| Variable        | Description                                        |
|-----------------|----------------------------------------------------|
| CONFIG_PATH     | Path to configuration file (local or GCS path)     |
| PROJECT         | GCP project                                        |
| BUCKET          | Bucket for data storage (if source or sink is GCS) |
| REGION          | Compute region                                     |
| SUBNETWORK      | Subnetwork                                         |
| SA_EMAIL        | Email of the service account used for Dataflow. Needs the roles `roles/dataflow.worker`, `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`, `roles/datastore.user`, `roles/storage.objectAdmin` |
| DATAFLOW_BUCKET | Bucket for Dataflow staging data                   |
| JOB_NAME        | Name of the Dataflow job (a timestamp suffix is appended in the command below) |
### Running with DataflowRunner
```bash
go run main.go \
--configPath=${CONFIG_PATH} \
--project=${PROJECT} \
--bucket=${BUCKET} \
--runner=dataflow \
--region=${REGION} \
--subnetwork=${SUBNETWORK} \
--service_account_email=${SA_EMAIL} \
--staging_location=gs://${DATAFLOW_BUCKET}/staging \
--job_name=${JOB_NAME}-$(date +%s)
```