Projects in Awesome Lists tagged with apache-beam
A curated list of projects in awesome lists tagged with apache-beam.
https://github.com/tensorflow/tfx
TFX is an end-to-end platform for deploying production ML pipelines
apache-beam machine-learning tensorflow
Last synced: 15 May 2025
https://tensorflow.github.io/tfx/
TFX is an end-to-end platform for deploying production ML pipelines
apache-beam machine-learning tensorflow
Last synced: 23 Mar 2025
https://github.com/googlecloudplatform/dataflowtemplates
Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
apache-beam bigquery bigtable dataflow-templates google-cloud-dataflow google-cloud-spanner google-cloud-storage
Last synced: 08 Apr 2026
https://github.com/GoogleCloudPlatform/DataflowTemplates
Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
apache-beam bigquery bigtable dataflow-templates google-cloud-dataflow google-cloud-spanner google-cloud-storage
Last synced: 06 Apr 2025
https://github.com/nielsbasjes/yauaa
Yet Another UserAgent Analyzer
analyzer apache-beam apache-flink apache-hive client-hints flink hive java nifi-processor nifi-processors parse snowflake snowplow snowplowanalytics trino-plugin user-agent user-agent-analysis user-agent-parser useragent-parser useragentparser
Last synced: 16 Jan 2026
https://github.com/googlecloudplatform/flink-on-k8s-operator
[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
apache-beam apache-flink flink-operator google-cloud-dataproc kubernetes kubernetes-operator operator
Last synced: 03 Oct 2025
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
apache-beam apache-flink flink-operator google-cloud-dataproc kubernetes kubernetes-operator operator
Last synced: 23 Mar 2025
https://github.com/blockchain-etl/bitcoin-etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
apache-beam bitcoin bitcoincash blockchain-analytics crypto cryptocurrency dash data-analytics data-engineering dogecoin etl gcp google-dataflow google-pubsub litecoin on-chain-analysis web3 zcash
Last synced: 10 Apr 2025
https://github.com/ohs-foundation/fhir-data-pipes
A collection of tools for extracting FHIR resources and analytics services on top of that data.
analytics apache-beam etl fhir fhir-store parquet
Last synced: 08 Jun 2026
https://github.com/google/weather-tools
Tools to make weather data accessible and useful.
Last synced: 05 Apr 2025
https://github.com/spotify/flink-on-k8s-operator
Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
apache-beam apache-flink flink flink-operator kubernetes kubernetes-operator
Last synced: 15 May 2025
https://github.com/ngrunwald/datasplash
Clojure API for a more dynamic Google Dataflow
apache-beam clojure distributed-computing google-cloud google-dataflow
Last synced: 23 Feb 2026
https://github.com/tosun-si/asgarde
Asgarde simplifies error handling with Apache Beam Java, using less code that is more concise and expressive.
apache-beam cloud-dataflow error-handling google-cloud-platform java kotlin
Last synced: 01 Feb 2026
https://github.com/blockchain-etl/blockchain-etl-streaming
Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
apache-beam bitcoin blockchain blockchain-analytics crypto cryptocurrency data-analytics data-engineering ethereum etl gcp google-bigquery google-cloud-platform google-dataflow google-pubsub on-chain-analysis real-time real-time-analytics stream-processing web3
Last synced: 25 Jun 2025
https://github.com/mercari/dataflowtemplate
Mercari Dataflow Template
apache-beam cloud-dataflow google-cloud
Last synced: 06 Apr 2025
https://github.com/xmlking/micro-apps
Microservices in Post-Kubernetes Era. A polyglot monorepo
apache-beam conventional-changelog conventional-commits gitflow gitflow-workflow gitops gradle-kotlin-dsl jenkins kotlin micronaut microservice monorepo quarkusio semantic-release semantic-versioning sprintboot
Last synced: 27 Jul 2025
https://github.com/blockchain-etl/blockchain-etl-architecture
Blockchain ETL Architecture
apache-beam blockchain blockchain-analytics crypto cryptocurrency data-analytics data-engineering ethereum etl gcp gke google-bigquery google-cloud google-cloud-platform google-container-engine google-dataflow google-pubsub kubernetes on-chain-analysis real-time-analytics
Last synced: 03 Apr 2025
https://github.com/doitintl/banias
Opinionated serverless event analytics pipeline
analytics apache-beam bigdata dataflow golang
Last synced: 06 Mar 2026
https://github.com/tosun-si/pasgarde
Asgarde simplifies error handling with Apache Beam Python, using less code that is more concise and expressive.
apache-beam cloud-dataflow error-handling google-cloud-platform python
Last synced: 01 Feb 2026
https://github.com/sayakpaul/count-tokens-hf-datasets
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
apache-beam dataflow hf-datasets tokenizers transformers unigram-tokenization
Last synced: 05 Sep 2025
https://github.com/google-parfait/dataset_grouper
Libraries for efficient and scalable group-structured dataset pipelines.
apache-beam datasets federated-learning jax pytorch tensorflow tensorflow-datasets
Last synced: 13 Aug 2025
https://github.com/mozilla-services/foxsec-pipeline
Log analysis pipeline utilizing Apache Beam
apache-beam dataflow log-analysis security
Last synced: 11 Apr 2025
https://github.com/mercari/dataflowtemplates
Convenient Dataflow pipelines for transforming data between cloud data sources
apache-beam bigquery dataflow dataflow-templates spanner
Last synced: 25 Oct 2025
https://github.com/janaom/gcp-data-engineering-etl-with-composer-dataflow
This project leverages GCS, Composer, Dataflow, BigQuery, and Looker on Google Cloud Platform (GCP) to build a robust data engineering solution for processing, storing, and reporting daily transaction data in the online food delivery industry.
airflow apache-beam cloud-composer cloud-storage data-engineering dataflow de-project gcp gcs looker
Last synced: 12 Apr 2025
https://github.com/datastacktv/apache-beam-explained
Source code for the YouTube video, Apache Beam Explained in 12 Minutes
Last synced: 27 Feb 2026
https://github.com/esakik/beam-mysql-connector
An Apache Beam I/O connector for seamless integration with MySQL database 🔗 https://beam.apache.org/documentation/io/connectors/#other-io-connectors-for-apache-beam
Last synced: 14 Jan 2026
https://github.com/o2-czech-republic/proxima-platform
The Proxima platform.
analytical-platform apache-beam apache-flink apache-spark batch-processing data-mesh iot-platform stream-processing unified-data-processing
Last synced: 17 Jan 2026
https://github.com/datastacktv/apache-beam-batch-processing
Public source code for the Batch Processing with Apache Beam (Python) online course
Last synced: 18 Aug 2025
https://github.com/ganeshsivakumar/langchain-beam
Integrates LLMs as PTransform in Apache Beam pipelines using LangChain
apache-beam data-engineering dataflow etl langchain langchian-beam rag
Last synced: 20 Oct 2025
https://github.com/medzin/beam-postgres
Light IO transforms for Postgres read/write in Apache Beam pipelines.
Last synced: 24 Jun 2025
https://github.com/chermenin/kio
Kotlin extensions for Apache Beam
apache apache-beam batch beam big-data cep dataflow dataflow-programming google-cloud-platform kotlin kotlin-extensions sql streaming
Last synced: 15 Apr 2025
https://github.com/japila-books/apache-beam-internals
The Internals of Apache Beam
Last synced: 06 Apr 2026
https://github.com/blockchain-etl/hedera-etl
ETL scripts for Hedera Hashgraph
apache-beam blockchain-analytics crypto cryptocurrency data-analytics data-engineering etl gcp google-bigquery google-cloud google-cloud-platform google-dataflow google-pubsub hedera hedera-hashgraph on-chain-analysis web3
Last synced: 25 Jun 2025
https://github.com/janaom/gcp-de-project-streaming-pubsub-beam-dataflow
This project demonstrates an end-to-end solution for processing and analyzing real-time conversations data from a JSON file using GCP services and infrastructure automation, showcasing data storage, streaming, processing, and analysis at scale.
apache-beam bigquery dataflow de-project gcp pubsub streaming-data
Last synced: 18 Oct 2025
https://github.com/blockchain-etl/eos-etl
ETL scripts for EOS.
apache-beam blockchain-analytics crypto cryptocurrency data-analytics data-engineering eos eosio etl gcp google-bigquery google-cloud google-cloud-platform google-dataflow google-pubsub on-chain-analysis web3
Last synced: 14 Jul 2025
https://github.com/mkuthan/example-beam
Playground for Apache Beam and Scio experiments, driven by real-world use cases.
apache-beam gcp-dataflow scala scio
Last synced: 07 Apr 2025
https://github.com/sanderploegsma/beam-scheduling-kubernetes
Scheduled Dataflow pipelines using Kubernetes Cronjobs
apache-beam cronjob dataflow google-cloud google-cloud-dataflow google-cloud-platform kotlin kubernetes
Last synced: 29 Apr 2025
https://github.com/mkuthan/stream-processing
Learn how to develop and test stateful streaming and batch data pipelines
apache-beam scio stream-processing
Last synced: 07 Apr 2025
https://github.com/ksalama/data2cooc2emb2ann
Learning embeddings from item co-occurrence statistics, and building an approx. nearest neighbour index
apache-beam bigquery dataflow embeddings machine-learning python3 tensorflow
Last synced: 13 Jun 2025
https://github.com/mkuthan/gcp-dataflow-tampermonkey
Tampermonkey script for GCP Dataflow console with enhanced view for finding job bottlenecks
apache-beam dataflow gcp tampermonkey-userscript
Last synced: 07 Apr 2025
https://github.com/arkady-emelyanov/toy-data-platform
Toy data platform for a company that provides web analytics
apache-beam apache-druid apache-kafka fasthttp flink-stream-processing golang helm message-bus redash spark-stream-kafka terraform web-analytics
Last synced: 02 Apr 2025
https://github.com/regadas/scio-cats
leverage cats type classes and data types in scio pipelines
apache-beam cats functional-programming scala scio
Last synced: 07 Oct 2025
https://github.com/blockchain-etl/anomalous-transactions-detector-dataflow
Dataflow pipeline for detecting anomalous transactions on the Ethereum and Bitcoin blockchains
anomaly-detection apache-beam bitcoin blockchain-analytics crypto cryptocurrency data-analytics data-engineering data-science ethereum gcp google-cloud google-cloud-platform google-dataflow google-pubsub on-chain-analysis real-time real-time-analytics stream-processing web3
Last synced: 15 Apr 2026
https://github.com/johannaojeling/go-beam-pipeline
Data pipeline built with the Apache Beam Go SDK
apache-beam batch-processing bigquery cloud-sql cloud-storage dataflow elasticsearch firestore go google-cloud memorystore mongodb mysql postgresql redis
Last synced: 16 Jun 2025
https://github.com/solaceproducts/solace-apache-beam
Solace connector for Apache Beam / Google Cloud Dataflow
apache-beam beam google-dataflow java solace
Last synced: 18 Aug 2025
https://github.com/gjbae1212/go-apachebeam-gzipio
Transforms for reading and writing gzip files in Apache Beam using Golang.
apache-beam apache-beam-io go golang gzip
Last synced: 14 May 2025
https://github.com/tosun-si/world-cup-qatar-team-stats-kotlin-midgard
This application shows a full Apache Beam pipeline built with Kotlin and the Midgard library. The use case works on data from the last FIFA World Cup in Qatar and calculates player statistics per team. This application will be presented at Beam Summit 2023 in New York.
apache-beam beam-summit data kotlin midgard world-cup-2022
Last synced: 01 Feb 2026
https://github.com/ryanmcdowell/dataflow-pubsub-event-router
An example pipeline which re-publishes events to different topics based on a message attribute.
apache-beam google-cloud-dataflow google-cloud-platform google-cloud-pubsub
Last synced: 18 Jul 2025
https://github.com/pompierninja/beam-amazon-batch-example
A practical example of batch processing on Google Cloud Dataflow using the Go SDK for Apache Beam :fire:
amazon apache-beam batch-processing big-data golang google-cloud-dataflow
Last synced: 28 May 2026
https://github.com/eliias/gleam
Fun DSL for Apache Beam and Kotlin.
apache-beam data-engineering stream-processing
Last synced: 18 Oct 2025
https://github.com/goatcheesesaladwithpeanutoildressing/beam-amazon-batch-example
A practical example of batch processing on Google Cloud Dataflow using the Go SDK for Apache Beam :fire:
amazon apache-beam batch-processing big-data golang google-cloud-dataflow
Last synced: 25 Feb 2025
https://github.com/marceloneppel/apache-beam-golang-udf
Run UDFs (User Defined Functions) on Apache Beam Golang SDK.
apache-beam big-data cloud dataflow flink golang udf
Last synced: 25 Mar 2025
https://github.com/davidkhala/etl
Collection of data Extract, Transform, Load
apache-beam dbt elt etl fivetran
Last synced: 17 Feb 2026
https://github.com/alxmrs/beam-cli-example
How to structure Apache Beam pipelines as pip-installable CLIs.
Last synced: 17 Jun 2026
https://github.com/vikramtiwari/dataflow-samples
samples for dataflow
apache-beam dataflow google-cloud python
Last synced: 30 Jul 2025
https://github.com/mbari-org/aipipeline
Library for running detection, clustering, or classification AI pipelines, plus performance monitoring, using Apache Beam
apache-beam foundation-models image-classification object-detection object-tracking video-processing-pipeline
Last synced: 13 Apr 2025
https://github.com/ryanmcdowell/dataflow-bigquery-dynamic-destinations
An example pipeline for dynamically routing events from Pub/Sub to different BigQuery tables based on a message attribute.
apache-beam bigquery google-cloud-dataflow google-cloud-platform
Last synced: 09 Sep 2025
https://github.com/davidgasquez/apache-beam-jupyter-notebook
☄️ A simple Apache Beam pipeline running in a Jupyter Notebook
apache-beam docker hacktoberfest jupyter
Last synced: 12 Apr 2025
https://github.com/beam-pyio/firehose_pyio
Apache Beam Python I/O connector for Amazon Data Firehose
apache-beam aws data-engineering data-streaming firehose python
Last synced: 05 May 2025
https://github.com/googlecloudplatform/dataflow-metrics-exporter
CLI tool to collect Dataflow resource and execution metrics and export them to either BigQuery or Google Cloud Storage. The tool is useful for comparing and visualizing the metrics while benchmarking Dataflow pipelines across various data formats, resource configurations, etc.
apache-beam google-cloud-dataflow
Last synced: 08 Oct 2025
https://github.com/goatcheesesaladwithpeanutoildressing/ip-cameras-monitoring
distributed computer vision
apache-airflow apache-beam deep-learning opencv tensorflow
Last synced: 25 Feb 2025
https://github.com/olahsymbo/mini-etl-apache-beam
ETL Pipeline (apache-beam, python)
apache-beam data-pipeline etl python
Last synced: 26 Mar 2025
https://github.com/rm3l/apache-beam-java-firestore-batch-dataflow
Companion repo for the blog post: https://rm3l.org/batch-writes-to-google-cloud-firestore-using-the-apache-beam-java-sdk-on-google-cloud-dataflow/
apache-beam beam dataflow firestore google-cloud-dataflow google-cloud-firestore
Last synced: 26 Mar 2025
https://github.com/seahrh/fraud-detection-dataflow
Working example of a real-time inference pipeline on GCP Cloud Dataflow
apache-beam cloud-dataflow data-engineering dataflow fraud-detection gcp machine-learning
Last synced: 29 Mar 2025
https://github.com/arquivei/arqbeam-app
An Apache Beam application wrapper using go-app.
apache-beam dataflow go golang hacktoberfest
Last synced: 12 Jan 2026
https://github.com/landerox/cloud-landerox-data
Reference architecture baseline for GCP data platforms (Apache Beam, BigQuery, Cloud Functions, Pub/Sub). Hybrid warehouse/lakehouse with batch + streaming, Medallion layering. Consumed by private runtime repos.
apache-beam batch-processing bigquery cloud-functions cloud-storage data-engineering data-platform dataform gcp google-cloud-dataflow iceberg lakehouse medallion-architecture opentelemetry pubsub python reference-architecture slsa streaming supply-chain-security
Last synced: 21 May 2026
https://github.com/viveknaskar/cloud-dataflow-template-poc
Creating a Cloud Dataflow template in Java that counts the number of words in a document.
apache-beam cloud-dataflow gcp google-cloud-platform java
Last synced: 24 May 2026
https://github.com/beam-pyio/dynamodb_pyio
Apache Beam Python I/O connector for Amazon DynamoDB
apache-beam aws data-engineering data-streaming dynamodb python
Last synced: 04 Jan 2026
https://github.com/goatcheesesaladwithpeanutoildressing/IP-Cameras-Monitoring
distributed computer vision
apache-airflow apache-beam deep-learning opencv tensorflow
Last synced: 25 Apr 2025
https://github.com/beam-pyio/sqs_pyio
Apache Beam Python I/O connector for Amazon SQS
apache-beam aws data-engineering data-streaming python sqs
Last synced: 05 Jan 2026
https://github.com/camilajaviera91/apache-beam-pipeline-first-approach
This code demonstrates how to integrate Apache Beam with scikit-learn datasets and perform simple data transformations. It loads the Linnerud dataset from scikit-learn and converts it into a pandas DataFrame for easier manipulation.
apache-beam dataframes glob kmeans-clustering matplotlib-pyplot mean-absolute-error mean-square-error numpy os pandas pipelines scipy-stats seaborn silhouette-score sklearn sklearn-datasets standardscaler
Last synced: 28 Apr 2026
https://github.com/tansudasli/beam-sandbox
Apache Beam sandbox with Dataflow for 10+ use cases
apache-beam gcp-dataflow python
Last synced: 09 Jun 2026
https://github.com/pompierninja/ip-cameras-monitoring
distributed computer vision
apache-airflow apache-beam deep-learning opencv tensorflow
Last synced: 06 May 2026
https://github.com/data-mission/dota2-cast-assist
Real-time Dota2 broadcaster’s assistant integrates the live Steam API with Dota GSI to provide game metrics like GPM, XPM, kills, deaths, damage, buybacks, and more, enhancing commentary with insights on player performance and the in-game economy
apache-beam broadcast dota dota2 game-state-integration gamestate-integration gsi k8s kubernetes poetry python steam-api valve-games
Last synced: 07 May 2026
https://github.com/neo4j-field/dataflow-flex-pyarrow-to-gds
Google Dataflow Flex Templates (in Python) for large scale Graph Loading with GDS and Apache Arrow
apache-arrow apache-beam bigquery dataflow neo4j python
Last synced: 09 May 2026
https://github.com/janaom/apache-beam-practice
It's time to learn Beam! This repository contains a collection of tasks and exercises focused on Apache Beam.
Last synced: 04 Jul 2025
https://github.com/datafabricrus/rya-beam-pipelines
Apache Beam Pipelines for Apache Rya
apache-beam apache-rya google-dataflow
Last synced: 16 May 2025
https://github.com/pedrodeoliveira/unbabel-bec
A streaming version of Unbabel's BEC using GCP Pub/Sub and Apache Beam.
apache-beam dataflow docker gitlab-ci gke python streaming
Last synced: 17 Apr 2026
https://github.com/iht/bigquery-dataflow-cdc-example
A Dataflow streaming pipeline written in Java, reading data from Pubsub and recovering the sessions from potentially unordered data, and upserting the session data into BigQuery with no duplicates
apache-beam bigquery cdc dataflow google-cloud pubsub
Last synced: 04 Jan 2026
https://github.com/eduardogr/playing-apache-beam-tour
Playing with Apache Beam Tour: https://tour.beam.apache.org
Last synced: 12 Aug 2025
https://github.com/fabiothiroki/java-apachebeam-tour
Apache Beam using Java use cases written as jUnit tests
Last synced: 30 May 2026
https://github.com/sanderploegsma/beam-di
Dependency Injection in Apache Beam
apache-beam dependency-injection
Last synced: 11 Aug 2025
https://github.com/goatcheesesaladwithpeanutoildressing/hands-on-apache-beam
Work In Progress - A simple explanation of what batch processing and stream processing are, with Apache Beam and Cloud Dataflow.
apache-beam google-cloud-dataflow
Last synced: 25 Feb 2025
https://github.com/goatcheesesaladwithpeanutoildressing/scio-demo
Playing w/ Scio
Last synced: 25 Feb 2025
https://github.com/goatcheesesaladwithpeanutoildressing/parallelism-test
beam bam boom
Last synced: 25 Feb 2025
https://github.com/ivanildobarauna-dev/data-pipeline-async-ingest
Pipeline for processing and consuming streaming data from Pub/Sub, integrating with Dataflow for real-time data processing
apache-beam data-pipeline dataflow portfolio-display python
Last synced: 15 Oct 2025
https://github.com/ngyewch/beam-sdks-java-io-s3-file-system
Apache Beam S3 Filesystem.
apache-beam aws beam filesystem java s3
Last synced: 29 Apr 2026
https://github.com/vladimirrotariu/parallel-monte-carlo-simulations
A package to orchestrate parallel (Monte Carlo) simulations via Apache Beam for an arbitrary number of models, with low-level parameter granularity, and flexible random number generator choice.
apache-beam apache-spark data-engineering monte-carlo-simulation parallel-computing parallel-processing python quantitative-finance random-number-generators
Last synced: 07 Mar 2026
https://github.com/bsrikanth24/gcp-data-engineering-etl-with-composer-dataflow
This project leverages GCS, Composer, Dataflow, BigQuery, and Looker on Google Cloud Platform (GCP) to build a robust data engineering solution for processing, storing, and reporting daily transaction data in the online food delivery industry.
apache-beam cloud-storage cloudcomposer data-engineering dataflow gcp
Last synced: 14 Jul 2025
https://github.com/jey-37/nginx-pipeline
An Apache Beam program that reads nginx access logs from Google Cloud Pub/Sub, parses them, and saves them into BigQuery.
apache-beam bigquery dataflow gcp-pubsub
Last synced: 16 May 2026
https://github.com/janaom/gcp-de-project-connect-four-with-python-dataflow
Connect Four Data Engineering Project: leveraging GCS for scalable and durable storage, Dataflow for data extraction and transformation, BigQuery as the data repository, Slack Integration for real-time sharing, Looker for insightful reports and visualizations, and Email Scheduler for automated report delivery.
apache-beam data-engineering dataflow etl gcp python slack-integration
Last synced: 12 May 2026
https://github.com/iht/beam-cloud-build-terraform
The scripts in this repo will build the Apache Beam Java SDK packages, using Cloud Build and Artifact Registry, for a personal Beam fork.
apache-beam artifact-registry cloud-build google-cloud
Last synced: 06 Mar 2026
https://github.com/samuelmarks/workflow-schemata
An exploration of various popular workflow tools from a schema level (in TOML & serde)
apache-airflow apache-beam argo-workflows github-actions kubeflow-pipelines mlflow
Last synced: 15 May 2026
https://github.com/hieuung/streaming-kafka
Using various data processing tools for a real-time data pipeline with Kafka
apache-beam apache-flink apache-spark kafka kafka-consumer kafka-producer spark-streaming spark-streaming-kafka
Last synced: 27 Feb 2026
https://github.com/beam-pyio/pyio-cookiecutter
Cookiecutter template for creating a package for the Apache Beam Python I/O Connectors project
apache-beam apache-beam-io cookiecutter-template python
Last synced: 08 Jun 2026
https://github.com/pompierninja/hands-on-apache-beam
Work In Progress - A simple explanation of what batch processing and stream processing are, with Apache Beam and Cloud Dataflow.
apache-beam google-cloud-dataflow
Last synced: 03 Mar 2026