Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with data-pipelines
A curated list of projects in awesome lists tagged with data-pipelines.
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 16 Dec 2024
https://github.com/apache/incubator-airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 23 Nov 2024
https://github.com/infiniflow/ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
agent agents ai-search chatbot chatgpt data-pipelines deep-learning document-parser document-understanding genai graph graphrag llm nlp pdf-to-text preprocessing rag retrieval-augmented-generation table-structure-recognition text2sql
Last synced: 16 Dec 2024
https://github.com/apache/dolphinscheduler
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high-performance workflows with low code.
airflow azkaban cloud-native data-pipelines job-scheduler orchestration powerful-data-pipelines task-scheduler workflow workflow-orchestration workflow-schedule
Last synced: 17 Dec 2024
https://github.com/dagster-io/dagster
An orchestration platform for the development, production, and observation of data assets.
analytics dagster data-engineering data-integration data-orchestrator data-pipelines data-science etl metadata mlops orchestration python scheduler workflow workflow-automation
Last synced: 16 Dec 2024
https://github.com/unstructured-io/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Last synced: 16 Dec 2024
https://github.com/Unstructured-IO/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Last synced: 30 Oct 2024
https://github.com/mage-ai/mage-ai
🧙 Build, run, and manage data pipelines for integrating and transforming data.
artificial-intelligence data data-engineering data-integration data-pipelines data-science dbt elt etl machine-learning orchestration pipeline pipelines python reverse-etl spark sql transformation
Last synced: 16 Dec 2024
https://github.com/pathwaycom/pathway
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
batch-processing data-analytics data-pipelines data-processing dataflow etl etl-framework iot-analytics kafka machine-learning-algorithms pathway python real-time rust stream-processing streaming time-series-analysis
Last synced: 18 Dec 2024
https://github.com/orchest/orchest
Build data pipelines, the easy way 🛠️
airflow cloud dag data-pipelines data-science deployment docker etl etl-pipeline ide jupyter jupyterlab kubernetes machine-learning notebooks orchest pipelines python self-hosted
Last synced: 18 Dec 2024
https://github.com/infinyon/fluvio
Lean and mean distributed stream processing system written in Rust and WebAssembly.
cloud-native data-flow data-integration data-pipelines distributed-systems event-driven-architecture real-time rust serverless stateful stream-processing stream-processing-engine streaming streaming-data streaming-data-pipelines streaming-data-processing webassembly
Last synced: 31 Oct 2024
https://github.com/elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
analytics-engineer bigquery data-analysis data-governance data-lineage data-observability data-pipeline data-pipelines data-reliability data-warehouse dataops dbt dbt-artifacts dbt-packages lineage redshift snowflake
Last synced: 17 Dec 2024
https://github.com/meltano/meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets
Last synced: 17 Dec 2024
https://github.com/combust/mleap
MLeap: Deploy ML Pipelines to Production
data-pipelines python scala scikit-learn spark tensorflow transformers
Last synced: 17 Dec 2024
https://github.com/data-engineering-community/data-engineering-wiki
The best place to learn data engineering. Built and maintained by the data engineering community.
data data-engineer data-engineering data-modeling data-pipelines database etl sql
Last synced: 19 Dec 2024
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 19 Dec 2024
https://github.com/SciPhi-AI/R2R
The framework for fast development and deployment of RAG backends.
artificial-intelligence chatbot data-pipelines deep-learning langchain large-language-models llama-index llm machine-learning ocr pdf question-answering retrieval retrieval-augmented-generation retrieval-systems search
Last synced: 28 Oct 2024
https://github.com/dataform-co/dataform
Dataform is a framework for managing SQL-based data operations in BigQuery.
analytics business-intelligence data-engineering data-pipelines elt etl hacktoberfest
Last synced: 17 Dec 2024
https://github.com/fmind/mlops-python-package
Kickstart your MLOps initiative with a flexible, robust, and productive Python package.
automation data-pipelines data-science machine-learning mlflow mlops pandera pydantic python
Last synced: 20 Dec 2024
https://github.com/raystack/optimus
Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.
airflow analytics analytics-engineering automation bigquery business-intelligence data-modelling data-pipelines data-transformation data-warehouse dataops elt etl golang workflows
Last synced: 20 Dec 2024
https://github.com/feldera/feldera
The Feldera Incremental Computation Engine
data-analytics data-pipelines database incremental-computation incremental-view-maintenance ivm materialized-views real-time rust sql streaming
Last synced: 08 Nov 2024
https://github.com/artie-labs/transfer
Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift, Databricks) in real-time.
apache-kafka bigquery cdc change-data-capture data-integration data-pipelines database debezium elt golang kafka redshift snowflake
Last synced: 20 Dec 2024
https://github.com/vmware/versatile-data-kit
One framework to develop, deploy and operate data workflows with Python and SQL.
analytics data data-engineer data-engineering data-engineering-pipeline data-lineage data-pipelines data-science data-structures data-warehouse database dataops elt etl pipeline python snowflake sql trino warehouse
Last synced: 21 Dec 2024
https://github.com/elementary-data/dbt-data-reliability
dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
analytics analytics-engineering data data-lineage data-observability data-pipeline-monitoring data-pipelines data-reliability dbt dbt-artifacts dbt-packages dbt-tests
Last synced: 21 Dec 2024
https://github.com/recap-build/recap
Work with your web service, database, and streaming schemas in a single format.
data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap
Last synced: 13 Dec 2024
https://github.com/gabledata/recap
Work with your web service, database, and streaming schemas in a single format.
data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap
Last synced: 11 Nov 2024
https://github.com/dataplane-app/dataplane
Dataplane is an Airflow-inspired unified data platform with additional data mesh and RPA capabilities to automate, schedule, and design data pipelines and workflows. Dataplane is written in Golang with a React front end.
airflow data data-analysis data-engineering data-integration data-pipelines data-science dataplane datawarehouse etl finance golang kubernetes pipelines robotics-process-automation rpa scheduler workflow workflow-automation workflows
Last synced: 12 Nov 2024
https://github.com/dataflint/spark
Performance Observability for Apache Spark
apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator
Last synced: 20 Dec 2024
https://github.com/tuva-health/tuva
Main repo including core data model, data marts, reference data, terminology, and the clinical concept library
analytics-engineering bigquery data-analytics data-governance data-lineage data-pipelines data-warehouse dbt dbt-packages healthcare healthcare-analysis healthcare-data open-source redshift snowflake sql terminology
Last synced: 17 Dec 2024
https://github.com/kevin-hanselman/dud
A lightweight CLI tool for versioning data alongside source code and building data pipelines.
data-engineering data-pipelines data-science dataset dvcs machine-learning mlops
Last synced: 26 Oct 2024
https://github.com/datajoint/datajoint-python
Relational data pipelines for the science lab
cloud-computing data-analysis data-pipelines databases datajoint mysql pipeline-framework python relational-algebra relational-databases relational-model s3 scientific-computing workflow-management
Last synced: 15 Dec 2024
https://github.com/koolreport/core
An open source PHP reporting framework that helps you write perfect data reports or construct awesome dashboards in PHP. Works great with all PHP versions from 5.6 to the latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony.
data-analysis data-pipelines data-pivot data-summarization data-visualization data-viz framework mysql-reporting-tools php php-reporting-tools php-reports report-generator reporting reporting-engine reporting-tool
Last synced: 20 Dec 2024
https://github.com/googlecloudplatform/public-datasets-pipelines
Cloud-native, data onboarding architecture for Google Cloud Datasets
airflow bigquery cloud-composer cloud-native cloud-storage data-architecture data-engineering data-pipelines datasets google-cloud open-data
Last synced: 21 Dec 2024
https://github.com/GoogleCloudPlatform/public-datasets-pipelines
Cloud-native, data onboarding architecture for Google Cloud Datasets
airflow bigquery cloud-composer cloud-native cloud-storage data-architecture data-engineering data-pipelines datasets google-cloud open-data
Last synced: 10 Nov 2024
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 20 Dec 2024
https://github.com/basis-os/basis-devkit
Data pipelines from re-usable components
data-analysis data-engineering data-pipeline data-pipelines data-science etl etl-framework etl-pipeline etl-pipelines functional-reactive-programming immutability pipelines sql
Last synced: 23 Nov 2024
https://github.com/patterns-app/patterns-devkit
Data pipelines from re-usable components
data-analysis data-engineering data-pipeline data-pipelines data-science etl etl-framework etl-pipeline etl-pipelines functional-reactive-programming immutability pipelines sql
Last synced: 22 Nov 2024
https://github.com/beneath-hq/beneath
Beneath is a serverless real-time data platform ⚡️
analytics beneath data-engineering data-pipelines data-science data-warehouse dataops developer-tools etl go kubernetes mlops python sql streaming
Last synced: 04 Nov 2024
https://github.com/mycelial/mycelial
Move your data with ease.
data-pipelines edge-computing etl etl-pipeline rust
Last synced: 14 Nov 2024
https://github.com/bruin-data/bruin
Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.
analytics bigquery data-analysis data-modeling data-pipelines data-transformation python snowflake sql
Last synced: 18 Nov 2024
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 12 Oct 2024
https://github.com/flipkart-incubator/spark-transformers
Spark-Transformers: Library for exporting Apache Spark MLLIB models to use them in any Java application with no other dependencies.
apache-spark data-pipelines export java machine-learning machine-learning-algorithms machine-learning-library mllib scala spark transformers
Last synced: 11 Oct 2024
https://github.com/galileo-galilei/kedro-pandera
A kedro plugin to use pandera in your kedro projects
data-contracts data-pipelines data-schemas kedro kedro-plugin pandera pipelines-testing
Last synced: 18 Nov 2024
https://github.com/mdh266/airflowdatapipeline
Example of an ETL Pipeline using Airflow
airflow data-engineering data-pipelines etl postgresql python
Last synced: 04 Dec 2024
https://github.com/iesahin/xvc
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
command-line-tool data data-engineering data-pipelines data-science devops machine-learning machine-learning-engineering mlops rust
Last synced: 11 Nov 2024
https://github.com/arakat-community/arakat
ARAKAT - Big Data Analysis and Business Intelligence Application Development Platform
big-data-analytics business-intelligence cloud-native-applications data-pipelines distributed-systems docker docker-swarm predictive-maintenance
Last synced: 14 Nov 2024
https://github.com/giacbrd/smartpipeline
A framework for rapid development of robust data pipelines following a simple design pattern
data-analysis data-analytics data-mining data-pipelines data-processing data-science dataops design-patterns etl machine-learning mlops pipeline pipeline-framework pipelines reproducibility task-queue workflow
Last synced: 28 Oct 2024
https://github.com/kestra-io/examples
Best practices for data workflows, integrations with the Modern Data Stack (MDS), Infrastructure as Code (IaC), Cloud Provider Services
analytics-engineering automation data-engineering data-orchestration data-pipelines data-workflows orchestration
Last synced: 09 Nov 2024
https://github.com/riveryio/rivery_cli
Rivery CLI
data-pipeline data-pipelines data-science database database-management dataops dataops-platform dwh dwh-team elt etl rivery
Last synced: 21 Nov 2024
https://github.com/larribas/dagger
Define sophisticated data pipelines with Python and run them on different distributed systems (such as Argo Workflows).
argo-workflows data-engineering data-pipelines data-science distributed-systems pipelines-as-code workflows
Last synced: 03 Dec 2024
https://github.com/brunocampos01/data-engineering
algorithms-techniques big-data big-o-notation bigdata cookbook data-engineering data-pipelines data-processing data-sctructures database-fundamentals dataops design-patterns design-systems java mysql paradigms python spark sql storage
Last synced: 16 Nov 2024
https://github.com/anna-geller/kestra-ci-cd
CI/CD repository template to automate deployments of your production flows
automation data-engineering data-orchestration data-pipelines data-workflows orchestration
Last synced: 16 Dec 2024
https://github.com/unicef/magasin
Cloud native open-source end-to-end data / AI / ML platform
cloud dagster data data-pipelines data-science data-visualization helm-charts kubernetes magasin
Last synced: 09 Nov 2024
https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL
Implemented a data warehouse and data lake on AWS and data modeling with Postgres and Apache Cassandra; also used Apache Airflow to create a data pipeline.
airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql
Last synced: 27 Nov 2024
https://github.com/zkan/introduction-to-data-pipelines-and-apache-airflow
Introduction to Data Pipelines and Apache Airflow
Last synced: 19 Dec 2024
https://github.com/federicoserini/dend-project-5-data-pipelines
Project 5 - Data Engineering Nanodegree
apache-airflow aws aws-redshift aws-s3 data-engineering data-pipelines udacity-nanodegree
Last synced: 10 Nov 2024
https://github.com/zkan/building-data-pipelines-with-apache-airflow
Building Data Pipelines with Apache Airflow
apache-airflow data-pipelines docker
Last synced: 19 Dec 2024
https://github.com/snehil-shah/seismic-alerts-streamer
A Realtime Seismic Logging & Alerts Service with Live Monitoring & Email Alerts made using Kafka Data Pipelines, all Dockerized & Deployment Ready!
containerized-build data-pipelines docker flask kafka websocket
Last synced: 12 Oct 2024
https://github.com/jmoussa/go-sentitweet
CLI application containing a sentiment analysis data pipeline (for Twitter tweets) with its own web API to query results in the database. Written entirely in Go.
api channels cli cli-app cobra data-pipeline data-pipelines gin gin-framework gin-gonic go go-twitter golang gorilla-mux mongodb nlp sentiment-analysis twitter-api
Last synced: 10 Nov 2024
https://github.com/dataforgeopenaihub/mlops-credit-card-fraud-detection-end-to-end
End to End Machine Learning MLOps Project for Credit Card Fraud Detection using Ensemble Models, Data and Model Versioning through DVC, Github Actions, and Deployment
credit-risk data-pipelines dvc-pipeline github-actions google-drive-api machine-learning mlops-project mlops-workflow python
Last synced: 06 Dec 2024
https://github.com/vanderschaarlab/temporai-mivdp
TemporAI-MIVDP: Adaptation of MIMIC-IV-Data-Pipeline for TemporAI
Last synced: 11 Nov 2024
https://github.com/the-swarm-corporation/custom-swarms-spec-template
Build your dream AI agent swarm with enterprise-grade reliability and scalability. This repository contains our official specification template for custom swarm development using the powerful Swarms Framework.
agents ai data-pipelines enterprise enterprise-grade fintech healthcare insurance ml multi-agent multi-agent-collaboration quant radiology security security-tools soc2 soc3 swarms swarms-agents swarms-of-agents
Last synced: 02 Dec 2024
https://github.com/aquemy/dolap_2019_supplementary_material
Supplementary material for DOLAP 2019 submission
data-pipelines data-preprocessing hyperparameter-optimization hyperparameters-optimization
Last synced: 02 Dec 2024
https://github.com/siddharth-nandagopal/billionaires-rag-query
Billionaires RAG Query uses LLMs and a RAG framework to analyze the world's billionaires list. Extracts tabular data from PDFs, converts to multiple formats, and enables precise queries about net worth, age, and more. Integrates with Poetry and asdf for easy setup and management.
asdf billionaires-list camelot csv data-conversion data-ingestion data-pipelines financial-analysis json llm machine-learning natural-language-processing openai pdf-extraction poetry python rag structured-data tabular-data wealth-data
Last synced: 20 Dec 2024
https://github.com/mxagar/data_engineering_guide
Personal notes on the IBM Data Engineering Certificate as well as other sources focusing on AWS.
airflow aws data-lake data-modeling data-pipelines data-science no-sql spark sql warehouse
Last synced: 05 Nov 2024
https://github.com/joe-heffer-shef/airflow
Data engineering project template
data-engineering data-pipelines etl
Last synced: 24 Nov 2024
https://github.com/mpolinowski/apache-airflow-intro
Introduction to Apache Airflow
apache-airflow dag data-pipelines machine-learning
Last synced: 30 Nov 2024
https://github.com/dina-hosny/sparkify---data-lake-with-aws
Sparkify - Data Lake with AWS - Udacity Data Engineering Expert Track.
analytics aws data-engineering data-lake data-pipelines dataset etl fwd udacity
Last synced: 14 Nov 2024
https://github.com/anna-geller/prefect-cloud-automations
Examples of Prefect Automations (triggers & actions)
automation cloud data-engineering data-pipelines dataflow event-driven prefect python workflow-automation
Last synced: 16 Dec 2024
https://github.com/dina-hosny/sparkify---data-modeling-with-postgres
Sparkify - Data Modeling with Postgres - Udacity Data Engineering Expert Track.
data-engineering data-modeling data-pipelines database dataset fwd postgresql python sql udacity
Last synced: 14 Nov 2024
https://github.com/dina-hosny/data-engineering-capstone-project
Data Engineering Capstone Project - Udacity Data Engineering Expert Track.
analytics cassandra data-engineering data-pipelines data-science etl fwd spark udacity
Last synced: 14 Nov 2024
https://github.com/santiagortiiz/snowflake-data-pipelines
EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.
business-intelligence data-lake data-pipelines data-warehouse etl snowflake streams
Last synced: 10 Nov 2024
https://github.com/armahdavi/analytics-data-pipelines-statistics-ml-plotting---dust-extraction-hvac-filters---phase-2
PhD Technical Paper 1 - Phase 2 - Mahdavi & Siegel (2020) (Aerosol Science & Technology; AS&T) - Sharing all the data pipelines, processing code, descriptive statistics, statistical modeling, and plotting/visualizations - Project Milestone: 2017 - 2020 - Full-length article is available
data-pipelines data-science data-visualization machine-learning matplotlib-pyplot numpy pandas-dataframe python scipy-stats sklearn statistics
Last synced: 12 Nov 2024
https://github.com/farukalamai/tomato-leaf-diseases-ditection
Tomato leaf disease detection using YOLOv8 and YOLOv5
computer-vision data-pipelines deep-learning disease-detection image-processing image-recognition leaf-diseases-detection object-tracking python yolov5 yolov8
Last synced: 07 Nov 2024
https://github.com/matz1979/airflow
My apache airflow project
airflow aws-s3 data-pipelines pipelines python s3-bucket
Last synced: 12 Nov 2024
https://github.com/cloudformations/training.dataintegration
Training content for course delegates.
data-factory data-pipelines microsoft microsoft-fabric
Last synced: 18 Dec 2024
https://github.com/dr4ks/airflow_cheatsheet
The Airflow CheatSheet repository is a comprehensive reference guide for Apache Airflow users, whether you're a beginner or an experienced practitioner. This repository aims to provide a quick and easy-to-use resource that covers the key concepts, commands, best practices, and tips related to Apache Airflow.
airflow-commands apache-airflow best-practices cheat-sheet dags data data-pipelines etl etl-pipelines python reference-guide task-dependencies task-operators task-scheduling workflow workflow-automation workflow-design workflow-management
Last synced: 06 Nov 2024
https://github.com/jbossdemocentral/edge-to-cloud-data-pipelines-demo
Solution Pattern: Edge to Core Data Pipelines for AI/ML
ai-ml data-acquisition data-pipelines data-science demo edge-computing soluton-pattern
Last synced: 22 Nov 2024
https://github.com/blacksujit/problems-i-have-faced-in-my-journey-of-programming
This repository contains the issues and errors I have faced in my programming, machine learning, and deep learning journey.
algorithms data-pipelines deep-learning errors etl-pipeline grade-applications machine-learning pipeline-processor problem-solving problems production-code production-errors
Last synced: 01 Dec 2024