An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with apache-beam

A curated list of projects in awesome lists tagged with apache-beam.

https://github.com/tensorflow/tfx

TFX is an end-to-end platform for deploying production ML pipelines

apache-beam machine-learning tensorflow

Last synced: 15 May 2025

https://tensorflow.github.io/tfx/

TFX is an end-to-end platform for deploying production ML pipelines

apache-beam machine-learning tensorflow

Last synced: 23 Mar 2025

https://github.com/googlecloudplatform/flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

apache-beam apache-flink flink-operator google-cloud-dataproc kubernetes kubernetes-operator operator

Last synced: 03 Oct 2025

https://github.com/GoogleCloudPlatform/flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

apache-beam apache-flink flink-operator google-cloud-dataproc kubernetes kubernetes-operator operator

Last synced: 23 Mar 2025

https://github.com/blockchain-etl/bitcoin-etl

ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ

apache-beam bitcoin bitcoincash blockchain-analytics crypto cryptocurrency dash data-analytics data-engineering dogecoin etl gcp google-dataflow google-pubsub litecoin on-chain-analysis web3 zcash

Last synced: 10 Apr 2025

https://github.com/ohs-foundation/fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.

analytics apache-beam etl fhir fhir-store parquet

Last synced: 08 Jun 2026

https://github.com/google/weather-tools

Tools to make weather data accessible and useful.

apache-beam python weather

Last synced: 05 Apr 2025

https://github.com/spotify/flink-on-k8s-operator

Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

apache-beam apache-flink flink flink-operator kubernetes kubernetes-operator

Last synced: 15 May 2025

https://github.com/ngrunwald/datasplash

Clojure API for a more dynamic Google Dataflow

apache-beam clojure distributed-computing google-cloud google-dataflow

Last synced: 23 Feb 2026

https://github.com/tosun-si/asgarde

Asgarde simplifies error handling in Apache Beam Java, with less code that is more concise and expressive.

apache-beam cloud-dataflow error-handling google-cloud-platform java kotlin

Last synced: 01 Feb 2026

https://github.com/mercari/dataflowtemplate

Mercari Dataflow Template

apache-beam cloud-dataflow google-cloud

Last synced: 06 Apr 2025

https://github.com/doitintl/banias

Opinionated serverless event analytics pipeline

analytics apache-beam bigdata dataflow golang

Last synced: 06 Mar 2026

https://github.com/tosun-si/pasgarde

Asgarde simplifies error handling in Apache Beam Python, with less code that is more concise and expressive.

apache-beam cloud-dataflow error-handling google-cloud-platform python

Last synced: 01 Feb 2026

https://github.com/sayakpaul/count-tokens-hf-datasets

This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.

apache-beam dataflow hf-datasets tokenizers transformers unigram-tokenization

Last synced: 05 Sep 2025

https://github.com/google-parfait/dataset_grouper

Libraries for efficient and scalable group-structured dataset pipelines.

apache-beam datasets federated-learning jax pytorch tensorflow tensorflow-datasets

Last synced: 13 Aug 2025

https://github.com/mozilla-services/foxsec-pipeline

Log analysis pipeline utilizing Apache Beam

apache-beam dataflow log-analysis security

Last synced: 11 Apr 2025

https://github.com/mercari/dataflowtemplates

Convenient Dataflow pipelines for transforming data between cloud data sources

apache-beam bigquery dataflow dataflow-templates spanner

Last synced: 25 Oct 2025

https://github.com/janaom/gcp-data-engineering-etl-with-composer-dataflow

This project leverages GCS, Composer, Dataflow, BigQuery, and Looker on Google Cloud Platform (GCP) to build a robust data engineering solution for processing, storing, and reporting daily transaction data in the online food delivery industry.

airflow apache-beam cloud-composer cloud-storage data-engineering dataflow de-project gcp gcs looker

Last synced: 12 Apr 2025

https://github.com/datastacktv/apache-beam-explained

Source code for the YouTube video, Apache Beam Explained in 12 Minutes

apache-beam

Last synced: 27 Feb 2026

https://github.com/esakik/beam-mysql-connector

An Apache Beam I/O connector for seamless integration with MySQL database 🔗 https://beam.apache.org/documentation/io/connectors/#other-io-connectors-for-apache-beam

apache-beam mysql pypi python

Last synced: 14 Jan 2026

https://github.com/datastacktv/apache-beam-batch-processing

Public source code for the Batch Processing with Apache Beam (Python) online course

apache-beam cloud-dataflow

Last synced: 18 Aug 2025

https://github.com/ganeshsivakumar/langchain-beam

Integrates LLMs as PTransform in Apache Beam pipelines using LangChain

apache-beam data-engineering dataflow etl langchain langchian-beam rag

Last synced: 20 Oct 2025

https://github.com/medzin/beam-postgres

Light IO transforms for Postgres read/write in Apache Beam pipelines.

apache-beam python

Last synced: 24 Jun 2025

https://github.com/japila-books/apache-beam-internals

The Internals of Apache Beam

apache-beam book internals

Last synced: 06 Apr 2026

https://github.com/janaom/gcp-de-project-streaming-pubsub-beam-dataflow

This project demonstrates an end-to-end solution for processing and analyzing real-time conversations data from a JSON file using GCP services and infrastructure automation, showcasing data storage, streaming, processing, and analysis at scale.

apache-beam bigquery dataflow de-project gcp pubsub streaming-data

Last synced: 18 Oct 2025

https://github.com/mkuthan/example-beam

Playground for Apache Beam and Scio experiments, driven by real-world use cases.

apache-beam gcp-dataflow scala scio

Last synced: 07 Apr 2025

https://github.com/mkuthan/stream-processing

Learn how to develop and test stateful streaming and batch data pipelines

apache-beam scio stream-processing

Last synced: 07 Apr 2025

https://github.com/ksalama/data2cooc2emb2ann

Learning embeddings from item co-occurrence statistics, and building an approx. nearest neighbour index

apache-beam bigquery dataflow embeddings machine-learning python3 tensorflow

Last synced: 13 Jun 2025

https://github.com/mkuthan/gcp-dataflow-tampermonkey

Tampermonkey script for GCP Dataflow console with enhanced view for finding job bottlenecks

apache-beam dataflow gcp tampermonkey-userscript

Last synced: 07 Apr 2025

https://github.com/regadas/scio-cats

Leverage Cats type classes and data types in Scio pipelines.

apache-beam cats functional-programming scala scio

Last synced: 07 Oct 2025

https://github.com/solaceproducts/solace-apache-beam

Solace connector for Apache Beam / Google Cloud Dataflow

apache-beam beam google-dataflow java solace

Last synced: 18 Aug 2025

https://github.com/gjbae1212/go-apachebeam-gzipio

Transforms for reading and writing gzip files in Apache Beam using Go.

apache-beam apache-beam-io go golang gzip

Last synced: 14 May 2025

https://github.com/tosun-si/world-cup-qatar-team-stats-kotlin-midgard

This application shows a full Apache Beam pipeline built with Kotlin and the Midgard library. The use case takes data from the 2022 FIFA World Cup in Qatar and calculates player statistics per team. This application will be presented at Beam Summit 2023 in New York.

apache-beam beam-summit data kotlin midgard world-cup-2022

Last synced: 01 Feb 2026

https://github.com/ryanmcdowell/dataflow-pubsub-event-router

An example pipeline that republishes events to different topics based on a message attribute.

apache-beam google-cloud-dataflow google-cloud-platform google-cloud-pubsub

Last synced: 18 Jul 2025

https://github.com/pompierninja/beam-amazon-batch-example

A practical example of batch processing on Google Cloud Dataflow using the Go SDK for Apache Beam 🔥

amazon apache-beam batch-processing big-data golang google-cloud-dataflow

Last synced: 28 May 2026

https://github.com/eliias/gleam

Fun DSL for Apache Beam and Kotlin.

apache-beam data-engineering stream-processing

Last synced: 18 Oct 2025

https://github.com/goatcheesesaladwithpeanutoildressing/beam-amazon-batch-example

A practical example of batch processing on Google Cloud Dataflow using the Go SDK for Apache Beam 🔥

amazon apache-beam batch-processing big-data golang google-cloud-dataflow

Last synced: 25 Feb 2025

https://github.com/marceloneppel/apache-beam-golang-udf

Run UDFs (User Defined Functions) on Apache Beam Golang SDK.

apache-beam big-data cloud dataflow flink golang udf

Last synced: 25 Mar 2025

https://github.com/davidkhala/etl

Collection of data Extract, Transform, Load

apache-beam dbt elt etl fivetran

Last synced: 17 Feb 2026

https://github.com/alxmrs/beam-cli-example

How to structure Apache Beam pipelines as pip-installable CLIs.

apache-beam cli

Last synced: 17 Jun 2026

https://github.com/mbari-org/aipipeline

Library for running detection, clustering, or classification AI pipelines, plus performance monitoring, using Apache Beam

apache-beam foundation-models image-classification object-detection object-tracking video-processing-pipeline

Last synced: 13 Apr 2025

https://github.com/ryanmcdowell/dataflow-bigquery-dynamic-destinations

An example pipeline for dynamically routing events from Pub/Sub to different BigQuery tables based on a message attribute.

apache-beam bigquery google-cloud-dataflow google-cloud-platform

Last synced: 09 Sep 2025

https://github.com/davidgasquez/apache-beam-jupyter-notebook

☄️ A simple Apache Beam pipeline running in a Jupyter Notebook

apache-beam docker hacktoberfest jupyter

Last synced: 12 Apr 2025

https://github.com/beam-pyio/firehose_pyio

Apache Beam Python I/O connector for Amazon Data Firehose

apache-beam aws data-engineering data-streaming firehose python

Last synced: 05 May 2025

https://github.com/googlecloudplatform/dataflow-metrics-exporter

CLI tool that collects Dataflow resource and execution metrics and exports them to either BigQuery or Google Cloud Storage. The tool is useful for comparing and visualizing metrics while benchmarking Dataflow pipelines across various data formats, resource configurations, etc.

apache-beam google-cloud-dataflow

Last synced: 08 Oct 2025

https://github.com/olahsymbo/mini-etl-apache-beam

ETL Pipeline (apache-beam, python)

apache-beam data-pipeline etl python

Last synced: 26 Mar 2025

https://github.com/rm3l/apache-beam-java-firestore-batch-dataflow

Companion repo for the blog post: https://rm3l.org/batch-writes-to-google-cloud-firestore-using-the-apache-beam-java-sdk-on-google-cloud-dataflow/

apache-beam beam dataflow firestore google-cloud-dataflow google-cloud-firestore

Last synced: 26 Mar 2025

https://github.com/seahrh/fraud-detection-dataflow

Working example of a real-time inference pipeline on GCP Cloud Dataflow

apache-beam cloud-dataflow data-engineering dataflow fraud-detection gcp machine-learning

Last synced: 29 Mar 2025

https://github.com/arquivei/arqbeam-app

An Apache Beam application wrapper using go-app.

apache-beam dataflow go golang hacktoberfest

Last synced: 12 Jan 2026

https://github.com/landerox/cloud-landerox-data

Reference architecture baseline for GCP data platforms (Apache Beam, BigQuery, Cloud Functions, Pub/Sub). Hybrid warehouse/lakehouse with batch + streaming, Medallion layering. Consumed by private runtime repos.

apache-beam batch-processing bigquery cloud-functions cloud-storage data-engineering data-platform dataform gcp google-cloud-dataflow iceberg lakehouse medallion-architecture opentelemetry pubsub python reference-architecture slsa streaming supply-chain-security

Last synced: 21 May 2026

https://github.com/viveknaskar/cloud-dataflow-template-poc

Creating a Cloud Dataflow template in Java that counts the number of words in a document.

apache-beam cloud-dataflow gcp google-cloud-platform java

Last synced: 24 May 2026

https://github.com/beam-pyio/dynamodb_pyio

Apache Beam Python I/O connector for Amazon DynamoDB

apache-beam aws data-engineering data-streaming dynamodb python

Last synced: 04 Jan 2026

https://github.com/beam-pyio/sqs_pyio

Apache Beam Python I/O connector for Amazon SQS

apache-beam aws data-engineering data-streaming python sqs

Last synced: 05 Jan 2026

https://github.com/camilajaviera91/apache-beam-pipeline-first-approach

This code demonstrates how to integrate Apache Beam with scikit-learn datasets and perform simple data transformations. It loads the Linnerud dataset from scikit-learn and converts it into a Pandas DataFrame for easier manipulation.

apache-beam dataframes glob kmeans-clustering matplotlib-pyplot mean-absolute-error mean-square-error numpy os pandas pipelines scipy-stats seaborn silhouette-score sklearn sklearn-datasets standardscaler

Last synced: 28 Apr 2026

https://github.com/tansudasli/beam-sandbox

Apache Beam sandbox with Dataflow for 10+ use cases

apache-beam gcp-dataflow python

Last synced: 09 Jun 2026

https://github.com/data-mission/dota2-cast-assist

A real-time Dota 2 broadcaster's assistant that integrates the live Steam API with Dota GSI to provide game metrics such as GPM, XPM, kills, deaths, damage, buybacks, and more, enhancing commentary with insights on player performance and the in-game economy.

apache-beam broadcast dota dota2 game-state-integration gamestate-integration gsi k8s kubernetes poetry python steam-api valve-games

Last synced: 07 May 2026

https://github.com/neo4j-field/dataflow-flex-pyarrow-to-gds

Google Dataflow Flex Templates (in Python) for large scale Graph Loading with GDS and Apache Arrow

apache-arrow apache-beam bigquery dataflow neo4j python

Last synced: 09 May 2026

https://github.com/janaom/apache-beam-practice

It's time to learn Beam! This repository contains a collection of tasks and exercises focused on Apache Beam.

apache-beam

Last synced: 04 Jul 2025

https://github.com/datafabricrus/rya-beam-pipelines

Apache Beam Pipelines for Apache Rya

apache-beam apache-rya google-dataflow

Last synced: 16 May 2025

https://github.com/pedrodeoliveira/unbabel-bec

A streaming version of Unbabel's BEC using GCP Pub/Sub and Apache Beam.

apache-beam dataflow docker gitlab-ci gke python streaming

Last synced: 17 Apr 2026

https://github.com/iht/bigquery-dataflow-cdc-example

A Dataflow streaming pipeline written in Java, reading data from Pubsub and recovering the sessions from potentially unordered data, and upserting the session data into BigQuery with no duplicates

apache-beam bigquery cdc dataflow google-cloud pubsub

Last synced: 04 Jan 2026

https://github.com/eduardogr/playing-apache-beam-tour

Playing with Apache Beam Tour: https://tour.beam.apache.org

apache-beam data-pipelines go

Last synced: 12 Aug 2025

https://github.com/fabiothiroki/java-apachebeam-tour

Apache Beam Java use cases written as JUnit tests

apache-beam java

Last synced: 30 May 2026

https://github.com/sanderploegsma/beam-di

Dependency Injection in Apache Beam

apache-beam dependency-injection

Last synced: 11 Aug 2025

https://github.com/goatcheesesaladwithpeanutoildressing/hands-on-apache-beam

Work In Progress - A simple explanation of what batch processing and stream processing are, with Apache Beam and Cloud Dataflow.

apache-beam google-cloud-dataflow

Last synced: 25 Feb 2025

https://github.com/ivanildobarauna-dev/data-pipeline-async-ingest

Pipeline for processing and consuming streaming data from Pub/Sub, integrating with Dataflow for real-time data processing

apache-beam data-pipeline dataflow portfolio-display python

Last synced: 15 Oct 2025

https://github.com/vladimirrotariu/parallel-monte-carlo-simulations

A package to orchestrate parallel (Monte Carlo) simulations via Apache Beam for an arbitrary number of models, with low-level parameter granularity, and flexible random number generator choice.

apache-beam apache-spark data-engineering monte-carlo-simulation parallel-computing parallel-processing python quantitative-finance random-number-generators

Last synced: 07 Mar 2026

https://github.com/bsrikanth24/gcp-data-engineering-etl-with-composer-dataflow

This project leverages GCS, Composer, Dataflow, BigQuery, and Looker on Google Cloud Platform (GCP) to build a robust data engineering solution for processing, storing, and reporting daily transaction data in the online food delivery industry.

apache-beam cloud-storage cloudcomposer data-engineering dataflow gcp

Last synced: 14 Jul 2025

https://github.com/wozz/beam

Apache Beam utility packages for Go

apache-beam golang

Last synced: 30 Jan 2026

https://github.com/miozilla/dataflowbeam

dataflowbeam

apache-beam dataflow iam

Last synced: 31 Jan 2026

https://github.com/jey-37/nginx-pipeline

An Apache Beam program that reads nginx access logs from Google Cloud Pub/Sub, parses them, and saves them into BigQuery.

apache-beam bigquery dataflow gcp-pubsub

Last synced: 16 May 2026

https://github.com/janaom/gcp-de-project-connect-four-with-python-dataflow

Connect Four Data Engineering Project: leveraging GCS for scalable and durable storage, Dataflow for data extraction and transformation, BigQuery as the data repository, Slack Integration for real-time sharing, Looker for insightful reports and visualizations, and Email Scheduler for automated report delivery.

apache-beam data-engineering dataflow etl gcp python slack-integration

Last synced: 12 May 2026

https://github.com/iht/beam-cloud-build-terraform

The scripts in this repo will build the Apache Beam Java SDK packages, using Cloud Build and Artifact Registry, for a personal Beam fork.

apache-beam artifact-registry cloud-build google-cloud

Last synced: 06 Mar 2026

https://github.com/samuelmarks/workflow-schemata

An exploration of various popular workflow tools from a schema level (in TOML & serde)

apache-airflow apache-beam argo-workflows github-actions kubeflow-pipelines mlflow

Last synced: 15 May 2026

https://github.com/hieuung/streaming-kafka

Using various data processing tools to build a real-time data pipeline with Kafka

apache-beam apache-flink apache-spark kafka kafka-consumer kafka-producer spark-streaming spark-streaming-kafka

Last synced: 27 Feb 2026

https://github.com/beam-pyio/pyio-cookiecutter

Cookiecutter template for creating a package for the Apache Beam Python I/O Connectors project

apache-beam apache-beam-io cookiecutter-template python

Last synced: 08 Jun 2026

https://github.com/pompierninja/hands-on-apache-beam

Work In Progress - A simple explanation of what batch processing and stream processing are, with Apache Beam and Cloud Dataflow.

apache-beam google-cloud-dataflow

Last synced: 03 Mar 2026