An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-pipelines

A curated list of projects in awesome lists tagged with data-pipelines.

https://github.com/apache/dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high-performance workflows with low-code.

airflow azkaban cloud-native data-pipelines job-scheduler orchestration powerful-data-pipelines task-scheduler workflow workflow-orchestration workflow-schedule

Last synced: 15 Jan 2026

https://github.com/StructuredLabs/preswald

Preswald is a WASM packager for Python-based interactive data apps: bundle full complex data workflows, particularly visualizations, into single files, runnable completely in-browser, using Pyodide, DuckDB, Pandas, Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.

ai analytics analytics-engineering copilot data data-applications data-infrastructure data-pipelines data-sdk data-visualization gpt llm open-source python schema-management vscode

Last synced: 11 May 2025

https://github.com/structuredlabs/preswald

Preswald is a framework for building and deploying interactive data apps, internal tools, and dashboards with Python. With one command, you can launch, share, and deploy locally or in the cloud, turning Python scripts into powerful shareable apps.

ai analytics analytics-engineering copilot data data-applications data-infrastructure data-pipelines data-sdk data-visualization gpt llm open-source python schema-management vscode

Last synced: 13 May 2025

https://github.com/elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

analytics-engineer bigquery data-analysis data-governance data-lineage data-observability data-pipeline data-pipelines data-reliability data-warehouse dataops dbt dbt-artifacts dbt-packages lineage redshift snowflake

Last synced: 19 May 2026

https://github.com/meltano/meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets

Last synced: 03 Feb 2026

https://github.com/data-engineering-community/data-engineering-wiki

The best place to learn data engineering. Built and maintained by the data engineering community.

data data-engineer data-engineering data-modeling data-pipelines database etl sql

Last synced: 14 May 2025

https://github.com/bruin-data/bruin

Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.

analytics bigquery data-analysis data-ingestion data-modeling data-pipelines data-platform data-transformation python snowflake sql

Last synced: 06 Jun 2026

https://github.com/ucbepic/docetl

A system for agentic LLM-powered data processing and ETL

agents data data-pipelines elt etl llm python workflow

Last synced: 12 Oct 2025

https://github.com/combust/mleap

MLeap: Deploy ML Pipelines to Production

data-pipelines python scala scikit-learn spark tensorflow transformers

Last synced: 16 Jan 2026

https://github.com/fmind/mlops-python-package

Kickstart your MLOps initiative with a flexible, robust, and productive Python package.

automation data-pipelines data-science machine-learning mlflow mlops pandera pydantic python

Last synced: 14 May 2025

https://github.com/dataform-co/dataform

Dataform is a framework for managing SQL based data operations in BigQuery

analytics business-intelligence data-engineering data-pipelines elt etl hacktoberfest

Last synced: 04 Feb 2026

https://github.com/artie-labs/transfer

Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift, Databricks) in real-time.

apache-kafka bigquery cdc change-data-capture data-integration data-pipelines database debezium elt golang kafka redshift snowflake

Last synced: 30 Apr 2026

https://github.com/raystack/optimus

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

airflow analytics analytics-engineering automation bigquery business-intelligence data-modelling data-pipelines data-transformation data-warehouse dataops elt etl golang workflows

Last synced: 16 May 2025

https://github.com/elementary-data/dbt-data-reliability

dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

analytics analytics-engineering data data-lineage data-observability data-pipeline-monitoring data-pipelines data-reliability dbt dbt-artifacts dbt-packages dbt-tests

Last synced: 16 May 2025

https://github.com/gabledata/recap

Work with your web service, database, and streaming schemas in a single format.

data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap

Last synced: 11 Mar 2026

https://github.com/dataplane-app/dataplane

Dataplane is an Airflow inspired unified data platform with additional data mesh and RPA capability to automate, schedule and design data pipelines and workflows. Dataplane is written in Golang with a React front end.

airflow data data-analysis data-engineering data-integration data-pipelines data-science dataplane datawarehouse etl finance golang kubernetes pipelines robotics-process-automation rpa scheduler workflow workflow-automation workflows

Last synced: 27 Dec 2025

https://github.com/kevin-hanselman/dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.

data-engineering data-pipelines data-science dataset dvcs machine-learning mlops

Last synced: 29 Dec 2025

https://github.com/koolreport/core

An Open Source PHP Reporting Framework that helps you to write perfect data reports or to construct awesome dashboards in PHP. Works great with all PHP versions from 5.6 to the latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony.

data-analysis data-pipelines data-pivot data-summarization data-visualization data-viz framework mysql-reporting-tools php php-reporting-tools php-reports report-generator reporting reporting-engine reporting-tool

Last synced: 22 Jan 2026

https://github.com/linkedin/hoptimator

Multi-hop declarative data pipelines

brooklin cdc data-pipelines flink kafka kafka-connect

Last synced: 28 May 2026

https://github.com/smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data

Last synced: 13 Apr 2025

https://github.com/mycelial/mycelial

Move your data with ease.

data-pipelines edge-computing etl etl-pipeline rust

Last synced: 11 Apr 2025

https://github.com/iesahin/xvc

A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)

command-line-tool data data-engineering data-pipelines data-science devops machine-learning machine-learning-engineering mlops rust

Last synced: 28 Jun 2025

https://github.com/eschizoid/kpipe

Composable Kafka consumer library for building modular, testable JVM data pipelines.

apache-kafka data-pipelines event-driven functional-programming java kafka stream-processing

Last synced: 20 May 2026

https://github.com/flipkart-incubator/spark-transformers

Spark-Transformers: Library for exporting Apache Spark MLLIB models to use them in any Java application with no other dependencies.

apache-spark data-pipelines export java machine-learning machine-learning-algorithms machine-learning-library mllib scala spark transformers

Last synced: 29 Oct 2025

https://github.com/tabsdata/tabsdata

A Pub/Sub for Tables based data integration platform, to discover, publish, modify and consume data effortlessly.

data-engineering data-integration data-pipelines elt-pipeline etl-pipeline python rust tables tabsdata

Last synced: 03 Feb 2026

https://github.com/mdh266/airflowdatapipeline

Example of an ETL Pipeline using Airflow

airflow data-engineering data-pipelines etl postgresql python

Last synced: 30 Jul 2025

https://github.com/montara-io/dbt-command-center

Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.

analytics-engineering bigquery data-analysis data-catalog data-engineering data-lineage data-observability data-pipeline data-pipelines data-validation data-warehouse dataops dbt dbt-packages elt etl orchestration python redshift

Last synced: 05 May 2025

https://github.com/arakat-community/arakat

ARAKAT - Big Data Analysis and Business Intelligence Application Development Platform

big-data-analytics business-intelligence cloud-native-applications data-pipelines distributed-systems docker docker-swarm predictive-maintenance

Last synced: 07 May 2025

https://github.com/kestra-io/examples

Best practices for data workflows, integrations with the Modern Data Stack (MDS), Infrastructure as Code (IaC), Cloud Provider Services

analytics-engineering automation data-engineering data-orchestration data-pipelines data-workflows orchestration

Last synced: 09 Oct 2025

https://github.com/larribas/dagger

Define sophisticated data pipelines with Python and run them on different distributed systems (such as Argo Workflows).

argo-workflows data-engineering data-pipelines data-science distributed-systems pipelines-as-code workflows

Last synced: 28 Jul 2025

https://github.com/marcio-azevedo/fsharp-data-processing-pipeline

Provides an extensible solution for creating Data Processing Pipelines in F#.

data-pipelines filter filter-pattern fsharp infrastructure pipe pipes-and-filters

Last synced: 18 Jul 2025

https://github.com/anna-geller/kestra-ci-cd

CI/CD repository template to automate deployments of your production flows

automation data-engineering data-orchestration data-pipelines data-workflows orchestration

Last synced: 04 Mar 2026

https://github.com/tuva-health/provider

A dbt project that transforms messy public provider datasets into usable data for the Tuva Project.

analytics-engineering data-analytics data-governance data-lineage data-pipelines data-warehouse dbt healthcare healthcare-analysis healthcare-data open-source providers snowflake sql

Last synced: 18 Mar 2026

https://github.com/pr1m8/haive-dataflow

Data processing pipelines and ETL workflows for Haive agents

data-pipelines etl fastapi postgres registry serialization supabase

Last synced: 02 May 2026

https://github.com/aredier/chariots

versioned machine learning pipelines

data-pipelines flask machine-learning project-template python

Last synced: 04 Oct 2025

https://github.com/glassflow/cli

GlassFlow CLI to create and manage data pipelines

cli data-pipelines data-transformation real-time stream-processing

Last synced: 13 Nov 2025

https://github.com/unicef/magasin

Cloud native open-source end-to-end data / AI / ML platform

cloud dagster data data-pipelines data-science data-visualization helm-charts kubernetes magasin

Last synced: 21 Apr 2025

https://github.com/snehil-shah/seismic-alerts-streamer

A Realtime Seismic Logging & Alerts Service with Live Monitoring & Email Alerts made using Kafka Data Pipelines, all Dockerized & Deployment Ready!

containerized-build data-pipelines docker flask kafka websocket

Last synced: 18 Aug 2025

https://github.com/lynxkite/lynxkite-2000

GPU-accelerated graph analytics and data science with a friendly face

data-pipelines data-science graph

Last synced: 27 Mar 2026

https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL

Implemented a Data Warehouse and Data Lake on AWS and Data modeling with Postgres and Apache Cassandra. Also used Apache Airflow to create a data pipeline.

airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql

Last synced: 20 Jul 2025

https://github.com/rcorrero/light-pipe

A high-level syntax for data pipelines, designed to make pipeline development quick and painless.

data data-pipelines data-processing geospatial-analysis geospatial-processing pipeline

Last synced: 14 Dec 2025

https://github.com/zkan/introduction-to-data-pipelines-and-apache-airflow

Introduction to Data Pipelines and Apache Airflow

apache-airflow data-pipelines

Last synced: 21 Sep 2025

https://github.com/estuary/examples

Examples on using Estuary: tutorials, demo pipelines, and data transformations

data-pipelines data-transformation estuary examples

Last synced: 13 Mar 2026

https://github.com/todofixthis/filters

🤔 What if we took the UNIX philosophy and applied it to input validation?

data-pipelines input-validation

Last synced: 29 Jun 2025

https://github.com/allanchua101/ipynta

Rapidly build image processing pipelines

ai data-pipelines image image-processing python

Last synced: 14 Dec 2025

https://github.com/the-swarm-corporation/custom-swarms-spec-template

Build your dream AI agent swarm with enterprise-grade reliability and scalability. This repository contains our official specification template for custom swarm development using the powerful Swarms Framework.

agents ai data-pipelines enterprise enterprise-grade fintech healthcare insurance ml multi-agent multi-agent-collaboration quant radiology security security-tools soc2 soc3 swarms swarms-agents swarms-of-agents

Last synced: 16 Feb 2026

https://github.com/zkan/building-data-pipelines-with-apache-airflow

Building Data Pipelines with Apache Airflow

apache-airflow data-pipelines docker

Last synced: 19 Aug 2025

https://github.com/santiagortiiz/snowflake-data-pipelines

EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.

business-intelligence data-lake data-pipelines data-warehouse etl snowflake streams

Last synced: 26 Jun 2025

https://github.com/vanderschaarlab/temporai-mivdp

TemporAI-MIVDP: Adaptation of MIMIC-IV-Data-Pipeline for TemporAI

data-pipelines mimic-iv

Last synced: 26 Feb 2025

https://github.com/nbigot/ministream

Ministream is a small, stand-alone, real-time event messaging streaming server

cloud-native data-pipelines event-streaming-database eventing go golang json messaging ministream nosql real-time-processing server streaming-data webapi

Last synced: 22 Jan 2026

https://github.com/welovejeff/tamper-evident-verification

Tamper Signal: signed receipts for vibe-coded data pipelines. Proves nobody changed your data, and shows the exact link if they did.

analytics data-integrity data-pipelines ed25519 hash-chain provenance python signed-receipts tamper-evident tamper-signal verification vibe-coding

Last synced: 11 Jun 2026

https://github.com/willie-conway/relational-database-administration-capstone-project

🧱 Relational Database Administration Capstone Project focused on designing, securing, optimizing, and automating OLTP & Data Warehouse systems using MySQL, PostgreSQL, Apache Airflow, and shell scripting. 💾🔐📊⚙️

airflow backup data-pipelines data-warehousing database-admin database-security encryption etl mysql oltp optimization phpmyadmin phppgadmin postgresql restore shell-scripting sql

Last synced: 16 Apr 2026

https://github.com/jmoussa/go-sentitweet

CLI Application holding a sentiment analysis data (Twitter tweets) pipeline with its own Web API to query results in the database. Written entirely in Go.

api channels cli cli-app cobra data-pipeline data-pipelines gin gin-framework gin-gonic go go-twitter golang gorilla-mux mongodb nlp sentiment-analysis twitter-api

Last synced: 04 May 2026

https://github.com/cuonghoangit/geomineralinsight

This project uses machine learning to analyze geological, geochemical, aeromagnetic, and remote sensing data over 39,000 sq. km in southern India. It identifies high-probability zones for concealed Au, Cu, and PGE deposits using XGBoost, SHAP, and GeoPandas. Key features include automated pipelines, explainable AI, and GIS-ready maps.

data-pipelines explainable-ai feature-engineering geopandas geoscience geospatial-analysis gis hackathon-project machine-learning mineral-exploration python rasterio remote-sensing shap

Last synced: 04 Oct 2025

https://github.com/sbdk-dev/sbdk.dev

A complete reference implementation of a local-first ecosystem for AI-powered analytics. This repository contains the source code for the SBDK.dev website, the central hub for the SBDK suite of open-source tools.

ai-powered-analytics data data-engineering data-engineeringlocal-first data-pipeline-automation data-pipelines dbt dlt duckdb elt etl-pipeline llm local-first machine-learning pipeline sbdk semantic-layer

Last synced: 27 May 2026

https://github.com/datatweets/airflow-pyspark-k8s

Run Apache Airflow with KubernetesExecutor and PySpark on Kubernetes using Helm charts and Kind for local development

airflow airflow-dags apache-spark data-engineering data-pipelines kubernetes-deployment python

Last synced: 20 May 2026

https://github.com/dataforgeopenaihub/mlops-credit-card-fraud-detection-end-to-end

End to End Machine Learning MLOps Project for Credit Card Fraud Detection using Ensemble Models, Data and Model Versioning through DVC, Github Actions, and Deployment

aws-lambda credit-risk data-pipelines dvc-pipeline fastapi github-actions google-drive-api machine-learning mlops-project mlops-workflow python

Last synced: 14 Feb 2026

https://github.com/stevehoober254/dataengineer-portfolio

📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing

airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark

Last synced: 18 Apr 2026

https://github.com/anuj7411/nifty-sensex-data-pipeline

A resilient Python data pipeline for collecting, cleaning, and exporting historical Nifty 50 and BSE Sensex market data.

data-pipelines dataengineering financial-data india nifty50 pandas python sensex stock-market yfinance

Last synced: 17 May 2026

https://github.com/armahdavi/analytics_statistics_ml_plotting_dust_extraction_hvac_filters_ph2

PhD Technical Paper 1 - Phase 2 - Mahdavi & Siegel (2020) (Aerosol Science & Technology; AS&T) - Sharing all the data pipelines, processing codes, descriptive statistics, statistical modellings, and plotting/visualizations - Project Milestone: 2017 - 2020 - Full-length article is available

data-pipelines data-science data-visualization machine-learning matplotlib-pyplot numpy pandas-dataframe python scipy-stats sklearn statistics

Last synced: 14 Apr 2026

https://github.com/matz1979/airflow

My apache airflow project

airflow aws-s3 data-pipelines pipelines python s3-bucket

Last synced: 13 May 2026

https://github.com/nabilshadman/spark-essential-training-data-engineering

Exercise files of the (Apache Spark Essential Training: Big Data Engineering) course

apache-spark big-data data-engineering data-pipelines data-science kafka mariadb pyspark redis

Last synced: 15 Apr 2026