Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse

Last synced: 17 Jun 2024

https://github.com/pipeline-tools/gusty

Making DAG construction easier

airflow data-etl data-pipeline

Last synced: 15 Jun 2024

https://github.com/bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

bigquery copy-database data-ingestion data-integration data-pipeline duckdb ingestion-pipeline mssql postgresql snowflake

Last synced: 15 Jun 2024

https://github.com/adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

cloud-providers data-engineering data-pipeline distributed-systems scala

Last synced: 12 Jun 2024

https://github.com/bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time

Last synced: 07 Jun 2024

https://github.com/Multiwoven/multiwoven

🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack. Leading Reverse ETL and Customer Data Platform (CDP) for Data Teams.

bigquery cdp customer-data-platform data-activation data-engineering data-pipeline data-warehouse databricks dbt etl hacktoberfest open-source postresql react redshift reverse-etl ruby self-hosted snowflake typescript

Last synced: 05 Jun 2024

https://github.com/FAIRDataPipeline/rDataPipeline

R implementation of the FAIR Data Pipeline API

data-pipeline fair r

Last synced: 04 Jun 2024

https://github.com/ominibyte/richflow

A synchronous data pipeline processing, data sharing, and stream processing library for Node.js and JavaScript. Actionable and transformable pipeline data processing.

data-flow data-pipeline data-processor data-stream data-transformation flow javascript nodejs pipe-data pipeline-framework streaming-data synchronous

Last synced: 02 Jun 2024

https://github.com/shipyardapp/postgresql-blueprints

Simplified blueprints for building data pipelines with PostgreSQL.

cli data-analysis data-engineering data-pipeline data-science database elt etl postgres postgresql

Last synced: 27 May 2024

https://github.com/ooni/pipeline

OONI data processing pipeline

big-data data-pipeline open-data

Last synced: 26 May 2024

https://github.com/vincentclaes/datajob

Build and deploy a serverless data pipeline on AWS with no effort.

aws aws-cdk data-pipeline glue glue-job machine-learning pipeline sagemaker serverless stepfunctions

Last synced: 14 May 2024

https://github.com/aeksco/aws-pdf-textract-pipeline

🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

aws aws-cdk aws-textract cdk cloudformation data-pipeline dynamodb jest lambda pdf puppeteer s3 serverless sns textract typescript webscraping

Last synced: 14 May 2024

https://github.com/whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties

Last synced: 14 May 2024

https://github.com/kestra-io/kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

data data-engineering data-integration data-orchestration data-orchestrator data-pipeline data-quality elt etl low-code orchestration pipeline reverse-etl scheduler workflow workflow-engine

Last synced: 14 May 2024

https://github.com/adilkhash/luigi-telegram

Luigi Tasks status notifications to Telegram

data-pipeline data-processing etl luigi notification-plugin

Last synced: 13 May 2024

https://github.com/InfuseAI/awesome-public-dbt-projects

A curated list of awesome public dbt projects

data-pipeline dbt transformation

Last synced: 13 May 2024

https://github.com/elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

analytics-engineer bigquery data-analysis data-governance data-lineage data-observability data-pipeline data-pipelines data-reliability data-warehouse dataops dbt dbt-artifacts dbt-packages lineage redshift snowflake

Last synced: 13 May 2024

https://github.com/fremantle-industries/slurpee

A GUI frontend to manage blockchain ingestion with slurp

blockchain data-pipeline evm

Last synced: 13 May 2024

https://github.com/spotify/klio

Smarter data pipelines for audio.

audio-processing data-pipeline media-processing signal-processing

Last synced: 10 May 2024

https://github.com/airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake

Last synced: 09 May 2024

https://github.com/tejzpr/ordered-concurrently

Ordered-concurrently is a library for concurrent processing with ordered output in Go. It processes work concurrently and returns output on a channel in the order of input. It is useful for concurrently processing items in a queue and getting output in the order provided by the queue.

concurrent concurrent-data-structure data-pipeline data-science golang golang-library ordered parallel parallel-computing

Last synced: 29 Apr 2024

https://github.com/electronick1/stairs

A framework that helps you run parallel/distributed calculations using data pipelines

data-engineering data-pipeline data-science distributed-computing python

Last synced: 27 Apr 2024

https://github.com/snowplow/snowplow

The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

analytics data data-collection data-pipeline marketing-analytics product-analytics snowplow snowplow-events snowplow-pipeline

Last synced: 25 Apr 2024

https://github.com/msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch

Last synced: 19 Apr 2024

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 17 Apr 2024

https://github.com/GoogleCloudPlatform/data-science-on-gcp

Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 17 Apr 2024

https://github.com/apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

apache data-integration data-pipeline etl-framework high-performance offline real-time seatunnel sql-engine

Last synced: 16 Apr 2024

https://github.com/yahwang/Awesome-Data-Engineering

📒(GitBook) A curated list of awesome Data Engineering resources

data-engineering data-lake data-pipeline

Last synced: 10 Apr 2024

https://github.com/awsdocs/aws-data-pipeline-developer-guide

The open source version of the AWS Data Pipeline documentation. To provide feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.

aws data-pipeline documentation

Last synced: 10 Apr 2024

https://github.com/openbridge/ob_bulkstash

Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files to or from a remote storage service, such as from Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.

amazon-web-services data-pipeline docker docker-image docker-rclone docker-service google-cloud google-cloud-storage oracle-cloud rclone s3 sftp-synchronisation storage-service sync

Last synced: 09 Apr 2024

https://github.com/zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.

data-integration data-pipeline data-processing etl json-ld linked-data pipeline rdf semantic-web

Last synced: 01 Apr 2024

https://github.com/olirice/flupy

Fluent data pipelines for Python and your shell

collections data-pipeline fluent python

Last synced: 26 Mar 2024