Projects in Awesome Lists tagged with data-pipelines
A curated list of projects in awesome lists tagged with data-pipelines.
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 07 Apr 2026
https://github.com/pathwaycom/pathway
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
batch-processing data-analytics data-pipelines data-processing dataflow etl etl-framework iot-analytics kafka machine-learning-algorithms pathway python real-time rust stream-processing streaming time-series-analysis
Last synced: 08 Jan 2026
https://github.com/apache/dolphinscheduler
Apache DolphinScheduler is the modern data orchestration platform. Agile, low-code creation of high-performance workflows.
airflow azkaban cloud-native data-pipelines job-scheduler orchestration powerful-data-pipelines task-scheduler workflow workflow-orchestration workflow-schedule
Last synced: 15 Jan 2026
https://github.com/dagster-io/dagster
An orchestration platform for the development, production, and observation of data assets.
analytics dagster data-engineering data-integration data-orchestrator data-pipelines data-science etl metadata mlops orchestration python scheduler workflow workflow-automation
Last synced: 09 Apr 2026
https://github.com/unstructured-io/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Last synced: 24 Apr 2026
https://github.com/mage-ai/mage-ai
Build, run, and manage data pipelines for integrating and transforming data.
artificial-intelligence data data-engineering data-integration data-pipelines data-science dbt elt etl machine-learning orchestration pipeline pipelines python reverse-etl spark sql transformation
Last synced: 21 Jan 2026
https://github.com/fluvio-community/fluvio
Event stream processing for developers to collect and transform data in motion to power responsive data intensive applications.
cloud-native data-analytics data-flow data-integration data-pipelines distributed-systems event-driven-architecture real-time rust serverless stateful stream-processing stream-processing-engine streaming streaming-analytics streaming-data streaming-data-pipelines streaming-data-processing webassembly
Last synced: 09 Mar 2026
https://github.com/infinyon/fluvio
Event stream processing for developers to stream and process data in motion to power responsive data intensive applications.
cloud-native data-analytics data-flow data-integration data-pipelines distributed-systems event-driven-architecture real-time rust serverless stateful stream-processing stream-processing-engine streaming streaming-analytics streaming-data streaming-data-pipelines streaming-data-processing webassembly
Last synced: 13 May 2025
https://github.com/orchest/orchest
Build data pipelines, the easy way
airflow cloud dag data-pipelines data-science deployment docker etl etl-pipeline ide jupyter jupyterlab kubernetes machine-learning notebooks orchest pipelines python self-hosted
Last synced: 14 May 2025
https://github.com/StructuredLabs/preswald
Preswald is a WASM packager for Python-based interactive data apps: bundle full complex data workflows, particularly visualizations, into single files, runnable completely in-browser, using Pyodide, DuckDB, Pandas, Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.
ai analytics analytics-engineering copilot data data-applications data-infrastructure data-pipelines data-sdk data-visualization gpt llm open-source python schema-management vscode
Last synced: 11 May 2025
https://github.com/structuredlabs/preswald
Preswald is a framework for building and deploying interactive data apps, internal tools, and dashboards with Python. With one command, you can launch, share, and deploy locally or in the cloud, turning Python scripts into powerful shareable apps.
ai analytics analytics-engineering copilot data data-applications data-infrastructure data-pipelines data-sdk data-visualization gpt llm open-source python schema-management vscode
Last synced: 13 May 2025
https://github.com/elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
analytics-engineer bigquery data-analysis data-governance data-lineage data-observability data-pipeline data-pipelines data-reliability data-warehouse dataops dbt dbt-artifacts dbt-packages lineage redshift snowflake
Last synced: 19 May 2026
https://github.com/meltano/meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets
Last synced: 03 Feb 2026
https://github.com/feldera/feldera
The Feldera Incremental Computation Engine
data-analytics data-pipelines database incremental-computation incremental-view-maintenance ivm materialized-views real-time rust sql streaming
Last synced: 11 May 2026
https://github.com/data-engineering-community/data-engineering-wiki
The best place to learn data engineering. Built and maintained by the data engineering community.
data data-engineer data-engineering data-modeling data-pipelines database etl sql
Last synced: 14 May 2025
https://github.com/bruin-data/bruin
Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.
analytics bigquery data-analysis data-ingestion data-modeling data-pipelines data-platform data-transformation python snowflake sql
Last synced: 06 Jun 2026
https://github.com/ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
agents data data-pipelines elt etl llm python workflow
Last synced: 12 Oct 2025
https://github.com/combust/mleap
MLeap: Deploy ML Pipelines to Production
data-pipelines python scala scikit-learn spark tensorflow transformers
Last synced: 16 Jan 2026
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 02 Apr 2026
https://github.com/fmind/mlops-python-package
Kickstart your MLOps initiative with a flexible, robust, and productive Python package.
automation data-pipelines data-science machine-learning mlflow mlops pandera pydantic python
Last synced: 14 May 2025
https://github.com/dataform-co/dataform
Dataform is a framework for managing SQL based data operations in BigQuery
analytics business-intelligence data-engineering data-pipelines elt etl hacktoberfest
Last synced: 04 Feb 2026
https://github.com/artie-labs/transfer
Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift, Databricks) in real-time.
apache-kafka bigquery cdc change-data-capture data-integration data-pipelines database debezium elt golang kafka redshift snowflake
Last synced: 30 Apr 2026
https://github.com/raystack/optimus
Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.
airflow analytics analytics-engineering automation bigquery business-intelligence data-modelling data-pipelines data-transformation data-warehouse dataops elt etl golang workflows
Last synced: 16 May 2025
https://github.com/vmware/versatile-data-kit
One framework to develop, deploy and operate data workflows with Python and SQL.
analytics data data-engineer data-engineering data-engineering-pipeline data-lineage data-pipelines data-science data-structures data-warehouse database dataops elt etl pipeline python snowflake sql trino warehouse
Last synced: 15 May 2025
https://github.com/elementary-data/dbt-data-reliability
dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
analytics analytics-engineering data data-lineage data-observability data-pipeline-monitoring data-pipelines data-reliability dbt dbt-artifacts dbt-packages dbt-tests
Last synced: 16 May 2025
https://github.com/gabledata/recap
Work with your web service, database, and streaming schemas in a single format.
data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap
Last synced: 11 Mar 2026
https://github.com/tuva-health/tuva
Main repo including core data model, data marts, data quality tests, and terminology sets.
analytics-engineering bigquery data-analytics data-governance data-lineage data-pipelines data-warehouse dbt dbt-packages healthcare healthcare-analysis healthcare-data open-source redshift snowflake sql terminology
Last synced: 06 Feb 2026
https://github.com/dataflint/spark
Performance Observability for Apache Spark
apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator
Last synced: 10 May 2026
https://github.com/dataplane-app/dataplane
Dataplane is an Airflow inspired unified data platform with additional data mesh and RPA capability to automate, schedule and design data pipelines and workflows. Dataplane is written in Golang with a React front end.
airflow data data-analysis data-engineering data-integration data-pipelines data-science dataplane datawarehouse etl finance golang kubernetes pipelines robotics-process-automation rpa scheduler workflow workflow-automation workflows
Last synced: 27 Dec 2025
https://github.com/kevin-hanselman/dud
A lightweight CLI tool for versioning data alongside source code and building data pipelines.
data-engineering data-pipelines data-science dataset dvcs machine-learning mlops
Last synced: 29 Dec 2025
https://github.com/mitdbg/palimpzest
A System for Optimized Semantic Computation
agentic-workflow agents data-pipelines llm optimization semantic-computation semantic-operators unstructured-data
Last synced: 11 Mar 2026
https://github.com/datajoint/datajoint-python
Relational data pipelines for the science lab
data-engineering data-integrity data-lineage data-pipelines datajoint declarative metadata-management mysql neuroscience object-storage postgresql python relational-model reproducibility research-software schema-management scientific-computing workflow-management
Last synced: 17 Feb 2026
https://github.com/koolreport/core
An Open Source PHP Reporting Framework that helps you write perfect data reports or construct awesome dashboards in PHP. Works well with all PHP versions from 5.6 to the latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony.
data-analysis data-pipelines data-pivot data-summarization data-visualization data-viz framework mysql-reporting-tools php php-reporting-tools php-reports report-generator reporting reporting-engine reporting-tool
Last synced: 22 Jan 2026
https://github.com/googlecloudplatform/public-datasets-pipelines
Cloud-native, data onboarding architecture for Google Cloud Datasets
airflow bigquery cloud-composer cloud-native cloud-storage data-architecture data-engineering data-pipelines datasets google-cloud open-data
Last synced: 12 Apr 2025
https://github.com/linkedin/hoptimator
Multi-hop declarative data pipelines
brooklin cdc data-pipelines flink kafka kafka-connect
Last synced: 28 May 2026
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 13 Apr 2025
https://github.com/patterns-app/patterns-devkit
Data pipelines from re-usable components
data-analysis data-engineering data-pipeline data-pipelines data-science etl etl-framework etl-pipeline etl-pipelines functional-reactive-programming immutability pipelines sql
Last synced: 13 Jul 2025
https://github.com/conductor-oss/python-sdk
Conductor OSS SDK for Python programming language
conductor data-pipelines durable-computing durable-execution etl-pipeline python workflow
Last synced: 01 Apr 2026
https://github.com/beneath-hq/beneath
Beneath is a serverless real-time data platform
analytics beneath data-engineering data-pipelines data-science data-warehouse dataops developer-tools etl go kubernetes mlops python sql streaming
Last synced: 03 Apr 2025
https://github.com/gorango/flowcraft
A lightweight workflow engine
agentic-workflows ai-agent background-jobs dag data-pipelines declarative-workflows distributed-systems etl llm-orchestration orchestration rag state-machine workflow-engine
Last synced: 30 Oct 2025
https://github.com/mycelial/mycelial
Move your data with ease.
data-pipelines edge-computing etl etl-pipeline rust
Last synced: 11 Apr 2025
https://github.com/iesahin/xvc
A robust and fast MLOps tool for managing data and pipelines in Rust
command-line-tool data data-engineering data-pipelines data-science devops machine-learning machine-learning-engineering mlops rust
Last synced: 28 Jun 2025
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 10 Apr 2025
https://github.com/eschizoid/kpipe
Composable Kafka consumer library for building modular, testable JVM data pipelines.
apache-kafka data-pipelines event-driven functional-programming java kafka stream-processing
Last synced: 20 May 2026
https://github.com/bakdata/streams-explorer
Explore Apache Kafka data pipelines in Kubernetes.
apache-kafka data-pipelines data-stream hacktoberfest kafka-connect kafka-streams kubernetes python react
Last synced: 10 Apr 2025
https://github.com/flipkart-incubator/spark-transformers
Spark-Transformers: Library for exporting Apache Spark MLLIB models to use them in any Java application with no other dependencies.
apache-spark data-pipelines export java machine-learning machine-learning-algorithms machine-learning-library mllib scala spark transformers
Last synced: 29 Oct 2025
https://github.com/tabsdata/tabsdata
A Pub/Sub for Tables based data integration platform, to discover, publish, modify and consume data effortlessly.
data-engineering data-integration data-pipelines elt-pipeline etl-pipeline python rust tables tabsdata
Last synced: 03 Feb 2026
https://github.com/galileo-galilei/kedro-pandera
A kedro plugin to use pandera in your kedro projects
data-contracts data-pipelines data-schemas kedro kedro-plugin pandera pipelines-testing
Last synced: 29 Jun 2025
https://github.com/mdh266/airflowdatapipeline
Example of an ETL Pipeline using Airflow
airflow data-engineering data-pipelines etl postgresql python
Last synced: 30 Jul 2025
https://github.com/montara-io/dbt-command-center
Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.
analytics-engineering bigquery data-analysis data-catalog data-engineering data-lineage data-observability data-pipeline data-pipelines data-validation data-warehouse dataops dbt dbt-packages elt etl orchestration python redshift
Last synced: 05 May 2025
https://github.com/arakat-community/arakat
ARAKAT - Big Data Analysis and Business Intelligence Application Development Platform
big-data-analytics business-intelligence cloud-native-applications data-pipelines distributed-systems docker docker-swarm predictive-maintenance
Last synced: 07 May 2025
https://github.com/kestra-io/examples
Best practices for data workflows, integrations with the Modern Data Stack (MDS), Infrastructure as Code (IaC), Cloud Provider Services
analytics-engineering automation data-engineering data-orchestration data-pipelines data-workflows orchestration
Last synced: 09 Oct 2025
https://github.com/giacbrd/smartpipeline
A framework for rapid development of robust data pipelines following a simple design pattern
data-analysis data-analytics data-mining data-pipelines data-processing data-science dataops design-patterns etl machine-learning mlops pipeline pipeline-framework pipelines reproducibility task-queue workflow
Last synced: 21 Mar 2025
https://github.com/pachyderm/neon-workshop
A Pachyderm deep learning tutorial for conference workshops
containers data-engineering data-pipelines data-science deep-learning docker kubernetes machine-learning python
Last synced: 02 Mar 2026
https://github.com/riveryio/rivery_cli
Rivery CLI
data-pipeline data-pipelines data-science database database-management dataops dataops-platform dwh dwh-team elt etl rivery
Last synced: 11 Jul 2025
https://github.com/larribas/dagger
Define sophisticated data pipelines with Python and run them on different distributed systems (such as Argo Workflows).
argo-workflows data-engineering data-pipelines data-science distributed-systems pipelines-as-code workflows
Last synced: 28 Jul 2025
https://github.com/marcio-azevedo/fsharp-data-processing-pipeline
Provides an extensible solution for creating Data Processing Pipelines in F#.
data-pipelines filter filter-pattern fsharp infrastructure pipe pipes-and-filters
Last synced: 18 Jul 2025
https://github.com/ketgo/marshmallow-pyspark
Marshmallow serializer integration with pyspark
data-cleaning data-engineering data-engineering-pipeline data-pipelines data-schemas marshmallow pyspark schema spark
Last synced: 02 Feb 2026
https://github.com/anna-geller/kestra-ci-cd
CI/CD repository template to automate deployments of your production flows
automation data-engineering data-orchestration data-pipelines data-workflows orchestration
Last synced: 04 Mar 2026
https://github.com/brunocampos01/data-engineering
algorithms-techniques big-data big-o-notation bigdata cookbook data-engineering data-pipelines data-processing data-sctructures database-fundamentals dataops design-patterns design-systems java mysql paradigms python spark sql storage
Last synced: 15 Apr 2025
https://github.com/matttriano/analytics_data_where_house
An analytics engineering sandbox focusing on real estate prices in Cook County, IL
airflow data-catalog data-discovery data-engineering data-pipelines data-platform data-warehousing dbt docker elt mkdocs-material open-source python superset
Last synced: 18 Jan 2026
https://github.com/tuva-health/provider
A dbt project that transforms messy public provider datasets into usable data for the Tuva Project.
analytics-engineering data-analytics data-governance data-lineage data-pipelines data-warehouse dbt healthcare healthcare-analysis healthcare-data open-source providers snowflake sql
Last synced: 18 Mar 2026
https://github.com/pr1m8/haive-dataflow
Data processing pipelines and ETL workflows for Haive agents
data-pipelines etl fastapi postgres registry serialization supabase
Last synced: 02 May 2026
https://github.com/aredier/chariots
versioned machine learning pipelines
data-pipelines flask machine-learning project-template python
Last synced: 04 Oct 2025
https://github.com/glassflow/cli
GlassFlow CLI to create and manage data pipelines
cli data-pipelines data-transformation real-time stream-processing
Last synced: 13 Nov 2025
https://github.com/unicef/magasin
Cloud native open-source end-to-end data / AI / ML platform
cloud dagster data data-pipelines data-science data-visualization helm-charts kubernetes magasin
Last synced: 21 Apr 2025
https://github.com/snehil-shah/seismic-alerts-streamer
A Realtime Seismic Logging & Alerts Service with Live Monitoring & Email Alerts made using Kafka Data Pipelines, all Dockerized & Deployment Ready!
containerized-build data-pipelines docker flask kafka websocket
Last synced: 18 Aug 2025
https://github.com/dwhitena/pach-neon
An example Pachyderm ML pipeline using Nervana Neon
artificial-intelligence data-pipelines data-science deep-learning docker machine-learning pachyderm
Last synced: 19 May 2026
https://github.com/lynxkite/lynxkite-2000
GPU-accelerated graph analytics and data science with a friendly face
data-pipelines data-science graph
Last synced: 27 Mar 2026
https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL
Implemented a data warehouse and data lake on AWS, with data modeling in Postgres and Apache Cassandra. Also used Apache Airflow to create a data pipeline.
airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql
Last synced: 20 Jul 2025
https://github.com/rcorrero/light-pipe
A high-level syntax for data pipelines, designed to make pipeline development quick and painless.
data data-pipelines data-processing geospatial-analysis geospatial-processing pipeline
Last synced: 14 Dec 2025
https://github.com/zkan/introduction-to-data-pipelines-and-apache-airflow
Introduction to Data Pipelines and Apache Airflow
Last synced: 21 Sep 2025
https://github.com/federicoserini/dend-project-5-data-pipelines
Project 5 - Data Engineering Nanodegree
apache-airflow aws aws-redshift aws-s3 data-engineering data-pipelines udacity-nanodegree
Last synced: 22 Apr 2025
https://github.com/abeltavares/versioned-data-lakehouse
Git-like Version Control for Data with Nessie, Iceberg, and Spark
apache-iceberg apache-nessie apache-spark atomic-etl block-storage branch-based-development data-engineering data-lakehouse data-pipelines data-versioning dataops distributed-systems etl etl-pipeline git-for-data minio s3 spark-etl table-format time-travel
Last synced: 20 May 2026
https://github.com/estuary/examples
Examples on using Estuary: tutorials, demo pipelines, and data transformations
data-pipelines data-transformation estuary examples
Last synced: 13 Mar 2026
https://github.com/todofixthis/filters
What if we took the UNIX philosophy and applied it to input validation?
data-pipelines input-validation
Last synced: 29 Jun 2025
https://github.com/allanchua101/ipynta
Rapidly build image processing pipelines
ai data-pipelines image image-processing python
Last synced: 14 Dec 2025
https://github.com/the-swarm-corporation/custom-swarms-spec-template
Build your dream AI agent swarm with enterprise-grade reliability and scalability. This repository contains our official specification template for custom swarm development using the powerful Swarms Framework.
agents ai data-pipelines enterprise enterprise-grade fintech healthcare insurance ml multi-agent multi-agent-collaboration quant radiology security security-tools soc2 soc3 swarms swarms-agents swarms-of-agents
Last synced: 16 Feb 2026
https://github.com/zkan/building-data-pipelines-with-apache-airflow
Building Data Pipelines with Apache Airflow
apache-airflow data-pipelines docker
Last synced: 19 Aug 2025
https://github.com/santiagortiiz/snowflake-data-pipelines
EPAM's Snowflake hands-on lab. We built a pipeline to read and load data from S3 into Snowflake, developed an ETL workflow to clean the data and stored it in a data warehouse with the 3NF and Star schemas for data mart analysis.
business-intelligence data-lake data-pipelines data-warehouse etl snowflake streams
Last synced: 26 Jun 2025
https://github.com/vanderschaarlab/temporai-mivdp
TemporAI-MIVDP: Adaptation of MIMIC-IV-Data-Pipeline for TemporAI
Last synced: 26 Feb 2025
https://github.com/aquemy/dolap_2019_supplementary_material
Supplementary material for DOLAP 2019 submission
data-pipelines data-preprocessing hyperparameter-optimization hyperparameters-optimization
Last synced: 04 Jan 2026
https://github.com/nbigot/ministream
Ministream is a small, stand-alone, real-time event messaging streaming server
cloud-native data-pipelines event-streaming-database eventing go golang json messaging ministream nosql real-time-processing server streaming-data webapi
Last synced: 22 Jan 2026
https://github.com/welovejeff/tamper-evident-verification
Tamper Signal: signed receipts for vibe-coded data pipelines. Proves nobody changed your data, and shows the exact link if they did.
analytics data-integrity data-pipelines ed25519 hash-chain provenance python signed-receipts tamper-evident tamper-signal verification vibe-coding
Last synced: 11 Jun 2026
https://github.com/willie-conway/relational-database-administration-capstone-project
Relational Database Administration Capstone Project focused on designing, securing, optimizing, and automating OLTP & Data Warehouse systems using MySQL, PostgreSQL, Apache Airflow, and shell scripting.
airflow backup data-pipelines data-warehousing database-admin database-security encryption etl mysql oltp optimization phpmyadmin phppgadmin postgresql restore shell-scripting sql
Last synced: 16 Apr 2026
https://github.com/jmoussa/go-sentitweet
CLI Application holding a sentiment analysis data (Twitter tweets) pipeline with its own Web API to query results in the database. Written entirely in Go.
api channels cli cli-app cobra data-pipeline data-pipelines gin gin-framework gin-gonic go go-twitter golang gorilla-mux mongodb nlp sentiment-analysis twitter-api
Last synced: 04 May 2026
https://github.com/cuonghoangit/geomineralinsight
This project uses machine learning to analyze geological, geochemical, aeromagnetic, and remote sensing data over 39,000 sq. km in southern India. It identifies high-probability zones for concealed Au, Cu, and PGE deposits using XGBoost, SHAP, and GeoPandas. Key features include automated pipelines, explainable AI, and GIS-ready maps.
data-pipelines explainable-ai feature-engineering geopandas geoscience geospatial-analysis gis hackathon-project machine-learning mineral-exploration python rasterio remote-sensing shap
Last synced: 04 Oct 2025
https://github.com/sbdk-dev/sbdk.dev
A complete reference implementation of a local-first ecosystem for AI-powered analytics. This repository contains the source code for the SBDK.dev website, the central hub for the SBDK suite of open-source tools.
ai-powered-analytics data data-engineering data-engineeringlocal-first data-pipeline-automation data-pipelines dbt dlt duckdb elt etl-pipeline llm local-first machine-learning pipeline sbdk semantic-layer
Last synced: 27 May 2026
https://github.com/datatweets/airflow-pyspark-k8s
Run Apache Airflow with KubernetesExecutor and PySpark on Kubernetes using Helm charts and Kind for local development
airflow airflow-dags apache-spark data-engineering data-pipelines kubernetes-deployment python
Last synced: 20 May 2026
https://github.com/dataforgeopenaihub/mlops-credit-card-fraud-detection-end-to-end
End to End Machine Learning MLOps Project for Credit Card Fraud Detection using Ensemble Models, Data and Model Versioning through DVC, Github Actions, and Deployment
aws-lambda credit-risk data-pipelines dvc-pipeline fastapi github-actions google-drive-api machine-learning mlops-project mlops-workflow python
Last synced: 14 Feb 2026
https://github.com/stevehoober254/dataengineer-portfolio
End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark
Last synced: 18 Apr 2026
https://github.com/anuj7411/nifty-sensex-data-pipeline
A resilient Python data pipeline for collecting, cleaning, and exporting historical Nifty 50 and BSE Sensex market data.
data-pipelines dataengineering financial-data india nifty50 pandas python sensex stock-market yfinance
Last synced: 17 May 2026
https://github.com/armahdavi/analytics_statistics_ml_plotting_dust_extraction_hvac_filters_ph2
PhD Technical Paper 1 - Phase 2 - Mahdavi & Siegel (2020) (Aerosol Science & Technology; AS&T) - Sharing all the data pipelines, processing code, descriptive statistics, statistical modeling, and plotting/visualizations - Project Milestone: 2017 - 2020 - Full-length article is available
data-pipelines data-science data-visualization machine-learning matplotlib-pyplot numpy pandas-dataframe python scipy-stats sklearn statistics
Last synced: 14 Apr 2026
https://github.com/matz1979/airflow
My apache airflow project
airflow aws-s3 data-pipelines pipelines python s3-bucket
Last synced: 13 May 2026
https://github.com/mavaji/free-monad
data-pipelines free-monads monad scala
Last synced: 26 Oct 2025
https://github.com/nabilshadman/spark-essential-training-data-engineering
Exercise files of the (Apache Spark Essential Training: Big Data Engineering) course
apache-spark big-data data-engineering data-pipelines data-science kafka mariadb pyspark redis
Last synced: 15 Apr 2026
https://github.com/theoddysey/scikit-pipeline
Bank Customer Churn Prediction Project
data-pipelines jupyter-notebooks scikitlearn-machine-learning visualization
Last synced: 11 Jul 2025