An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-engineering

A curated list of projects in awesome lists tagged with data-engineering.

https://github.com/datatalksclub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

data-engineering dbt docker kafka kestra spark

Last synced: 13 May 2025

https://github.com/DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

data-engineering dbt docker kafka kestra spark

Last synced: 14 Mar 2025

https://github.com/prefecthq/prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine

Last synced: 12 May 2025

https://github.com/airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake

Last synced: 12 May 2025

https://github.com/PrefectHQ/prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine

Last synced: 24 Mar 2025

https://github.com/datastacktv/data-engineer-roadmap

Roadmap to becoming a data engineer in 2021

cloud data-engineer-roadmap data-engineering roadmap

Last synced: 23 Mar 2025

https://github.com/RisingWaveLabs/risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

analytics big-data cloud-native data-engineering database distributed-database etl flink kafka ksqldb materialized-view postgres postgresql real-time real-time-analytics rust serverless spark-streaming sql stream-processing

Last synced: 29 Mar 2025

https://github.com/evidence-dev/evidence

Business intelligence as code: build fast, interactive data visualizations in SQL and markdown

analytics business-intelligence dashboard data-engineering data-science data-visualization dbt duckdb exploratory-data-analysis self-hosted sql svelte tailwindcss webassembly

Last synced: 13 May 2025

https://github.com/whoiskatrin/sql-translator

SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.

data-analysis data-engineering dataquery datascience dataset openai postgresql query sql

Last synced: 14 May 2025

https://github.com/aws/aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

amazon-athena amazon-sagemaker-notebook apache-arrow apache-parquet athena aws aws-glue aws-lambda data-engineering data-science emr etl glue-catalog lambda modin mysql pandas python ray redshift

Last synced: 12 May 2025

https://github.com/adilkhash/data-engineering-howto

A list of useful resources to learn Data Engineering from scratch

cloud-providers data-engineering data-pipeline distributed-systems scala

Last synced: 14 May 2025

https://github.com/adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

cloud-providers data-engineering data-pipeline distributed-systems scala

Last synced: 28 Mar 2025

https://github.com/ploomber/ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

data-engineering data-science jupyter jupyter-notebooks machine-learning mlops notebooks papermill pipelines pycharm vscode workflow

Last synced: 29 Apr 2025

https://github.com/dlt-hub/dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

data data-engineering data-lake data-loading data-warehouse elt extract load python transform

Last synced: 26 Mar 2025

https://github.com/gokumohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 15 May 2025

https://github.com/GokuMohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 27 Mar 2025

https://github.com/eventual-inc/daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 08 May 2025

https://github.com/apache/incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

dashboard-friendly data data-analysis data-engineering data-integration data-transfers devops domain-layer dora etl golang hacktoberfest integration jira open-source user-friendly

Last synced: 14 May 2025

https://github.com/Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 09 Apr 2025

https://github.com/dagworks-inc/hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering hacktoberfest lineage llmops machine-learning mlops orchestration pandas python rag software-engineering

Last synced: 13 May 2025

https://github.com/metarank/metarank

A low code Machine Learning personalized ranking service for articles, listings, search results, recommendations that boosts user engagement. A friendly Learn-to-Rank engine

automl data-engineering data-science deep-learning feature-engineering feature-extraction kubernetes machine-learning neural-networks personalization ranking scala search

Last synced: 14 May 2025

https://github.com/meltano/meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets

Last synced: 12 May 2025

https://github.com/DAGWorks-Inc/hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering hacktoberfest lineage llmops machine-learning mlops orchestration pandas python rag software-engineering

Last synced: 26 Mar 2025

https://github.com/alexioannides/pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

data-engineering data-science etl etl-job etl-pipeline pyspark python spark

Last synced: 14 Apr 2025

https://github.com/data-engineering-community/data-engineering-wiki

The best place to learn data engineering. Built and maintained by the data engineering community.

data data-engineer data-engineering data-modeling data-pipelines database etl sql

Last synced: 14 May 2025

https://github.com/mlrun/mlrun

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

data-engineering data-science experiment-tracking kubernetes machine-learning mlops mlops-workflow model-serving python workflow

Last synced: 13 May 2025

https://github.com/pyjanitor-devs/pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor

cleaning-data data data-engineering dataframe hacktoberfest pandas pydata

Last synced: 13 May 2025

https://github.com/googlecloudplatform/data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 14 Apr 2025

https://github.com/ericmjl/pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor

cleaning-data data data-engineering dataframe hacktoberfest pandas pydata

Last synced: 07 Jan 2025

https://github.com/quiltdata/quilt

Quilt is a data mesh for connecting people with actionable data

data data-engineering data-version-control data-versioning parquet python serialization

Last synced: 13 May 2025

https://github.com/GoogleCloudPlatform/data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 27 Nov 2024

https://github.com/cocoindex-io/cocoindex

ETL framework to make your data AI-ready, with real-time incremental updates and support for custom logic that snaps together like Lego.

ai change-data-capture data data-engineering data-indexing data-infrastructure data-processing dataflow etl help-wanted indexing knowledge-graph llm pipeline python rag real-time rust semantic-search streaming

Last synced: 14 May 2025

https://github.com/dataform-co/dataform

Dataform is a framework for managing SQL based data operations in BigQuery

analytics business-intelligence data-engineering data-pipelines elt etl hacktoberfest

Last synced: 13 May 2025

https://github.com/yobulkdev/yobulkdev

🔥 🔥 🔥 Open Source & AI driven Data Onboarding Platform: Free flatfile.com alternative

csv-import csv-parser csv-reader data-engineering datacleaning embeddable javascript languagemodel mongodb nextjs nodejs open-source react stream streaming

Last synced: 21 Apr 2025

https://github.com/stitchfix/hamilton

A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

dag data-engineering data-platform data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hamilton hamiltonian machine-learning numpy pandas python software-engineering stitch-fix

Last synced: 18 Jan 2025

https://github.com/neumtry/neumai

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

ai chatgpt data data-engineering database embeddings etl llm llmops mlops ops pipeline python rag retrieval vector-database vectors

Last synced: 21 Apr 2025

https://github.com/oleg-agapov/data-engineering-book

Accumulated knowledge and experience in the field of Data Engineering

data data-engineering engineering

Last synced: 15 Apr 2025

https://github.com/NeumTry/NeumAI

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

ai chatgpt data data-engineering database embeddings etl llm llmops mlops ops pipeline python rag retrieval vector-database vectors

Last synced: 11 Apr 2025