Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/apache/flink-cdc
Flink CDC is a streaming data integration tool
batch cdc change-data-capture data-integration data-pipeline distributed elt etl flink kafka mysql paimon postgresql real-time schema-evolution
Last synced: 26 Jun 2024
![](https://github.com/apache.png)
https://github.com/AgnostiqHQ/covalent
Pythonic tool for orchestrating machine-learning, high-performance, and quantum-computing workflows in heterogeneous compute environments.
covalent data-pipeline data-science deep-learning hacktoberfest hpc hpc-applications machine-learning machinelearning machinelearning-python orchestration parallelization pipelines python quantum quantum-computing quantum-machine-learning workflow workflow-automation workflow-management
Last synced: 23 Jun 2024
![](https://github.com/AgnostiqHQ.png)
https://github.com/ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines
big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse
Last synced: 17 Jun 2024
![](https://github.com/ubisoft.png)
https://github.com/pipeline-tools/gusty
Making DAG construction easier
airflow data-etl data-pipeline
Last synced: 15 Jun 2024
![](https://github.com/pipeline-tools.png)
https://github.com/bruin-data/ingestr
ingestr is a CLI tool to seamlessly copy data between any databases with a single command.
bigquery copy-database data-ingestion data-integration data-pipeline duckdb ingestion-pipeline mssql postgresql snowflake
Last synced: 15 Jun 2024
![](https://github.com/bruin-data.png)
https://github.com/adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
cloud-providers data-engineering data-pipeline distributed-systems scala
Last synced: 12 Jun 2024
![](https://github.com/adilkhash.png)
https://github.com/bytedance/bitsail
BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Last synced: 07 Jun 2024
![](https://github.com/bytedance.png)
https://github.com/Multiwoven/multiwoven
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack. Leading Reverse ETL and Customer Data Platform (CDP) for Data Teams.
bigquery cdp customer-data-platform data-activation data-engineering data-pipeline data-warehouse databricks dbt etl hacktoberfest open-source postresql react redshift reverse-etl ruby self-hosted snowflake typescript
Last synced: 05 Jun 2024
![](https://github.com/Multiwoven.png)
https://github.com/FAIRDataPipeline/rDataPipeline
R implementation of the FAIR Data Pipeline API
Last synced: 04 Jun 2024
![](https://github.com/FAIRDataPipeline.png)
https://github.com/ominibyte/richflow
A synchronous data pipeline processing, data sharing, and stream processing library for Node.js and JavaScript. Actionable and transformable pipeline data processing.
data-flow data-pipeline data-processor data-stream data-transformation flow javascript nodejs pipe-data pipeline-framework streaming-data synchronous
Last synced: 02 Jun 2024
![](https://github.com/ominibyte.png)
https://github.com/shipyardapp/postgresql-blueprints
Simplified blueprints for building data pipelines with PostgreSQL.
cli data-analysis data-engineering data-pipeline data-science database elt etl postgres postgresql
Last synced: 27 May 2024
![](https://github.com/shipyardapp.png)
https://github.com/ooni/pipeline
OONI data processing pipeline
big-data data-pipeline open-data
Last synced: 26 May 2024
![](https://github.com/ooni.png)
https://github.com/shirosaidev/saisoku
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.
data-pipeline data-synchronization data-transfer directory-transfer file-transfer luigi luigi-pipeline orchestration-framework pipeline python rclone s3 scheduling sync sync-directories tornado transfer-files transfer-server
Last synced: 26 May 2024
![](https://github.com/shirosaidev.png)
https://github.com/rudderlabs/rudder-server
Privacy- and security-focused Segment alternative, in Golang and React
bigquery customer-data customer-data-lake customer-data-pipeline customer-data-platform data-integration data-pipeline data-synchronization data-warehouse etl golang hybrid-cloud privacy redshift rudderstack security segment-alternative snowflake warehouse-first warehouse-management
Last synced: 16 May 2024
![](https://github.com/rudderlabs.png)
https://github.com/teckkean/GTFS-Data-Pipeline-TfNSW-Bus
GTFS Data Pipeline for TfNSW Bus Datasets
data-pipeline datapipeline gtfs gtfs-realtime gtfs-static open-data opendata python tfnsw
Last synced: 15 May 2024
![](https://github.com/teckkean.png)
https://github.com/vincentclaes/datajob
Build and deploy a serverless data pipeline on AWS with no effort.
aws aws-cdk data-pipeline glue glue-job machine-learning pipeline sagemaker serverless stepfunctions
Last synced: 14 May 2024
![](https://github.com/vincentclaes.png)
https://github.com/aeksco/aws-pdf-textract-pipeline
🔍 Data pipeline for crawling PDFs from the web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript
aws aws-cdk aws-textract cdk cloudformation data-pipeline dynamodb jest lambda pdf puppeteer s3 serverless sns textract typescript webscraping
Last synced: 14 May 2024
![](https://github.com/aeksco.png)
https://github.com/whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties
Last synced: 14 May 2024
![](https://github.com/whylabs.png)
https://github.com/kestra-io/kestra
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
data data-engineering data-integration data-orchestration data-orchestrator data-pipeline data-quality elt etl low-code orchestration pipeline reverse-etl scheduler workflow workflow-engine
Last synced: 14 May 2024
![](https://github.com/kestra-io.png)
https://github.com/adilkhash/luigi-telegram
Luigi task status notifications to Telegram
data-pipeline data-processing etl luigi notification-plugin
Last synced: 13 May 2024
![](https://github.com/adilkhash.png)
https://github.com/patterns-app/patterns-devkit
Data pipelines from re-usable components
data-analysis data-engineering data-pipeline data-pipelines data-science etl etl-framework etl-pipeline etl-pipelines functional-reactive-programming immutability pipelines sql
Last synced: 13 May 2024
![](https://github.com/patterns-app.png)
https://github.com/InfuseAI/awesome-public-dbt-projects
A curated list of awesome public dbt projects
data-pipeline dbt transformation
Last synced: 13 May 2024
![](https://github.com/InfuseAI.png)
https://github.com/elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
analytics-engineer bigquery data-analysis data-governance data-lineage data-observability data-pipeline data-pipelines data-reliability data-warehouse dataops dbt dbt-artifacts dbt-packages lineage redshift snowflake
Last synced: 13 May 2024
![](https://github.com/elementary-data.png)
https://github.com/InfuseAI/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 13 May 2024
![](https://github.com/InfuseAI.png)
https://github.com/fremantle-industries/slurpee
A GUI frontend to manage blockchain ingestion with slurp
Last synced: 13 May 2024
![](https://github.com/fremantle-industries.png)
https://github.com/superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
data data-engineering data-pipeline data-stream-processing data-streaming enrichment golang kubernetes message-broker message-bus message-queue messaging-queue microservices schema-registry
Last synced: 11 May 2024
![](https://github.com/superstreamlabs.png)
https://github.com/spotify/klio
Smarter data pipelines for audio.
audio-processing data-pipeline media-processing signal-processing
Last synced: 10 May 2024
![](https://github.com/spotify.png)
https://github.com/airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake
Last synced: 09 May 2024
![](https://github.com/airbytehq.png)
https://github.com/infoslack/awesome-kafka
A list about Apache Kafka
apache-kafka apache-spark data-pipeline data-processing infrastructure kafka kafka-streams stream-processing streaming-data
Last synced: 07 May 2024
![](https://github.com/infoslack.png)
https://github.com/scicloj/scicloj.ml
A Clojure machine learning library
classification clojure clustering data-pipeline data-science experiment-tracking hyperparameter-optimization machine-learning nlp regression scicloj
Last synced: 01 May 2024
![](https://github.com/scicloj.png)
https://github.com/reugn/go-streams
A lightweight stream processing library for Go
aerospike data-pipeline data-stream etl kafka kafka-streams low-code nats-streaming pipeline pulsar redis stream-processing stream-processor streaming-api streaming-data streams throttling websocket windowing workflow
Last synced: 29 Apr 2024
![](https://github.com/reugn.png)
https://github.com/tejzpr/ordered-concurrently
Ordered-concurrently is a library for concurrent processing with ordered output in Go. It processes work concurrently and returns output on a channel in the order of input. It is useful for concurrently processing items in a queue and getting output in the order provided by the queue.
concurrent concurrent-data-structure data-pipeline data-science golang golang-library ordered parallel parallel-computing
Last synced: 29 Apr 2024
![](https://github.com/tejzpr.png)
https://github.com/electronick1/stairs
A framework that helps you run parallel/distributed calculations using data pipelines
data-engineering data-pipeline data-science distributed-computing python
Last synced: 27 Apr 2024
![](https://github.com/electronick1.png)
https://github.com/snowplow/snowplow
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
analytics data data-collection data-pipeline marketing-analytics product-analytics snowplow snowplow-events snowplow-pipeline
Last synced: 25 Apr 2024
![](https://github.com/snowplow.png)
https://github.com/msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch
Last synced: 19 Apr 2024
![](https://github.com/msamogh.png)
https://github.com/unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
data-engineer data-pipeline data-science luigi machine-learning rest-api spark
Last synced: 17 Apr 2024
![](https://github.com/unnati-xyz.png)
https://github.com/GoogleCloudPlatform/data-science-on-gcp
Source code accompanying the book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning
Last synced: 17 Apr 2024
![](https://github.com/GoogleCloudPlatform.png)
https://github.com/apache/seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
apache data-integration data-pipeline etl-framework high-performance offline real-time seatunnel sql-engine
Last synced: 16 Apr 2024
![](https://github.com/apache.png)
https://github.com/yahwang/Awesome-Data-Engineering
📒 (GitBook) A curated list of awesome Data Engineering resources
data-engineering data-lake data-pipeline
Last synced: 10 Apr 2024
![](https://github.com/yahwang.png)
https://github.com/awsdocs/aws-data-pipeline-developer-guide
The open source version of the AWS Data Pipeline documentation. To provide feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.
aws data-pipeline documentation
Last synced: 10 Apr 2024
![](https://github.com/awsdocs.png)
https://github.com/feldera/feldera
Feldera Continuous Analytics Platform
analytics continous data-analysis data-pipeline database etl kafka materialized-view realtime rust sql streaming
Last synced: 09 Apr 2024
![](https://github.com/feldera.png)
https://github.com/openbridge/ob_bulkstash
Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files between remote storage services, such as from Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.
amazon-web-services data-pipeline docker docker-image docker-rclone docker-service google-cloud google-cloud-storage oracle-cloud rclone s3 sftp-synchronisation storage-service sync
Last synced: 09 Apr 2024
![](https://github.com/openbridge.png)
https://github.com/zazuko/barnard59
An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.
data-integration data-pipeline data-processing etl json-ld linked-data pipeline rdf semantic-web
Last synced: 01 Apr 2024
![](https://github.com/zazuko.png)
https://github.com/streamlet-dev/tributary
Streaming reactive and dataflow graphs in Python
asynchronous data-pipeline kafka lazy-evaluation python python-data-streams python3 reactive-data-streams stream streaming websockets
Last synced: 27 Mar 2024
![](https://github.com/streamlet-dev.png)
https://github.com/olirice/flupy
Fluent data pipelines for python and your shell
collections data-pipeline fluent python
Last synced: 26 Mar 2024
![](https://github.com/olirice.png)
https://github.com/pydoit/doit
Task management & automation tool
build-automation build-system build-tool data-pipeline data-science hacktoberfest python task-runner workflow workflow-automation workflow-management
Last synced: 14 Mar 2024
![](https://github.com/pydoit.png)