Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/patterns-app/patterns-devkit
Data pipelines from re-usable components
data-analysis data-engineering data-pipeline data-pipelines data-science etl etl-framework etl-pipeline etl-pipelines functional-reactive-programming immutability pipelines sql
Last synced: 13 May 2024
![](https://github.com/patterns-app.png)
https://github.com/GokuMohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 13 May 2024
![](https://github.com/GokuMohandas.png)
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 13 May 2024
![](https://github.com/datafold.png)
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 13 May 2024
![](https://github.com/treeverse.png)
https://github.com/DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 12 May 2024
![](https://github.com/DataKitchen.png)
https://github.com/morph-kgc/morph-kgc
Powerful RDF Knowledge Graph Generation with RML Mappings
data-engineering data-integration database etl knowledge-graph python r2rml rdf rdf-star rml
Last synced: 12 May 2024
![](https://github.com/morph-kgc.png)
https://github.com/zero-one-group/geni
A Clojure dataframe library that runs on Spark
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark
Last synced: 11 May 2024
![](https://github.com/zero-one-group.png)
https://github.com/superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
data data-engineering data-pipeline data-stream-processing data-streaming enrichment golang kubernetes message-broker message-bus message-queue messaging-queue microservices schema-registry
Last synced: 11 May 2024
![](https://github.com/superstreamlabs.png)
https://github.com/datastacktv/data-engineer-roadmap
Roadmap to becoming a data engineer in 2021
cloud data-engineer-roadmap data-engineering roadmap
Last synced: 11 May 2024
![](https://github.com/datastacktv.png)
https://github.com/xonsh/xonsh
:shell: Python-powered, cross-platform, Unix-gazing shell.
bash cli command-line console data-engineering data-science devops fish hacktoberfest iterm2 prompt python python-shell script security-automation shell terminal windows-terminal xonsh zsh
Last synced: 10 May 2024
![](https://github.com/xonsh.png)
https://github.com/DataTalksClub/data-engineering-zoomcamp
Free Data Engineering course!
data-engineering dbt docker kafka prefect spark
Last synced: 10 May 2024
![](https://github.com/DataTalksClub.png)
https://github.com/airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake
Last synced: 09 May 2024
![](https://github.com/airbytehq.png)
https://github.com/growthbook/growthbook
Open Source Feature Flagging and A/B Testing Platform
ab-testing abtest abtesting analytics bigquery clickhouse continuous-delivery data-analysis data-engineering data-science experimentation feature-flagging feature-flags mixpanel redshift remote-config snowflake split-testing statistics
Last synced: 09 May 2024
![](https://github.com/growthbook.png)
https://github.com/cloudquery/cloudquery
The open source high performance ELT framework powered by Apache Arrow
airbyte attack-surface-management aws azure bigquery cspm data data-analysis data-collection data-engineering data-integration elt etl etl-framework gcp github-api go google kubernetes sql
Last synced: 08 May 2024
![](https://github.com/cloudquery.png)
https://github.com/ploomber/ploomber
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
data-engineering data-science jupyter jupyter-notebooks machine-learning mlops notebooks papermill pipelines pycharm vscode workflow
Last synced: 06 May 2024
![](https://github.com/ploomber.png)
https://github.com/insitro/redun
Yet another redundant workflow engine
aws bioinformatics data-engineering data-science docker etl gcp ml python workflow-engine
Last synced: 05 May 2024
![](https://github.com/insitro.png)
https://github.com/rdagumampan/yuniql
Free and open source schema versioning and database migration made natively with .NET/6. NEW THIS MAY 2022! v1.3.15 released!
amazon-rds azure-sql-database data-engineering database-migrations datawarehouse dotnet-core dotnet-tool mariadb mysql oracle postgresql redshift snowflake sql sqlserver yuniql
Last synced: 05 May 2024
![](https://github.com/rdagumampan.png)
https://github.com/kdeldycke/awesome-billing
💰 Billing & Payments knowledge for cloud platforms
accounting awesome awesome-list billing business-intelligence cloud cost-forecast cost-management credit-card data-engineering finance fraud invoice marketplace metering payments pricing product-catalog tax telemetry
Last synced: 05 May 2024
![](https://github.com/kdeldycke.png)
https://github.com/ocademy-ai/machine-learning
Learn AI together, for free. AI learning and teaching resources for everyone.
ai data-engineering data-science deep-learning jupyter jupyter-notebook machine-learning ml mlops python scikit-learn visualization
Last synced: 04 May 2024
![](https://github.com/ocademy-ai.png)
https://ddotta.github.io/cookbook-rpolars/
Cookbook to provide solutions to common tasks and problems in using Polars with R
benchmark cookbook data-engineering data-science datatable dplyr polars r tidyr
Last synced: 02 May 2024
![](https://github.com/ddotta.png)
https://github.com/automaticmode/active_workflow
Polyglot workflows without leaving the comfort of your technology stack.
activeworkflow agents data-engineering data-ops event-driven ifttt orchestration-framework scheduler scheduling self-hosted services-platform workflow
Last synced: 30 Apr 2024
![](https://github.com/automaticmode.png)
https://github.com/galliaproject/gallia-core
A schema-aware Scala library for data transformation
data-engineering data-manipulation data-science data-transformation etl feature-engineering json nesting scala spark
Last synced: 30 Apr 2024
![](https://github.com/galliaproject.png)
https://github.com/benthosdev/benthos
Fancy stream processing made operationally mundane
amqp cqrs data-engineering data-ops etl event-sourcing go golang kafka logs message-bus message-queue nats rabbitmq stream-processing stream-processor streaming-data
Last synced: 29 Apr 2024
![](https://github.com/benthosdev.png)
https://github.com/jqnatividad/qsv
CSVs sliced, diced & analyzed.
ckan cli csv data-engineering data-wrangling datapackage excel geocode luau opendata parquet polars postgresql rust snappy sql sqlite timeseries tsv
Last synced: 29 Apr 2024
![](https://github.com/jqnatividad.png)
https://github.com/RisingWaveLabs/risingwave
Cloud-native SQL stream processing, analytics, and management. ksqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.
analytics big-data cloud-native data-engineering database distributed-database flink kafka ksqldb materialized-view postgres postgresql postgresql-database real-time rust serverless spark spark-streaming sql stream-processing
Last synced: 29 Apr 2024
![](https://github.com/risingwavelabs.png)
https://github.com/e-alizadeh/sample_dbt_project
Companion template repo for the blog post "dbt for Data Transformation - A Hands-on Tutorial" (https://ealizadeh.com/blog/dbt-tutorial)
data-engineering data-transformation database dbt dbt-packages dbtcloud etl sql
Last synced: 28 Apr 2024
![](https://github.com/e-alizadeh.png)
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 28 Apr 2024
![](https://github.com/great-expectations.png)
https://github.com/DAGWorks-Inc/hamilton
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. Runs and scales everywhere Python does.
dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hacktoberfest lineage llmops machine-learning mlops numpy orchestration pandas python software-engineering
Last synced: 28 Apr 2024
![](https://github.com/DAGWorks-Inc.png)
https://github.com/pyjanitor-devs/pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
cleaning-data data data-engineering dataframe hacktoberfest pandas pydata
Last synced: 28 Apr 2024
![](https://github.com/pyjanitor-devs.png)
https://github.com/PrefectHQ/prefect
Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines
automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine
Last synced: 28 Apr 2024
![](https://github.com/PrefectHQ.png)
https://github.com/aws/aws-sdk-pandas
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
amazon-athena amazon-sagemaker-notebook apache-arrow apache-parquet athena aws aws-glue aws-lambda data-engineering data-science emr etl glue-catalog lambda modin mysql pandas python ray redshift
Last synced: 28 Apr 2024
![](https://github.com/aws.png)
https://github.com/electronick1/stairs
A framework that helps you run parallel and distributed calculations using data pipelines
data-engineering data-pipeline data-science distributed-computing python
Last synced: 27 Apr 2024
![](https://github.com/electronick1.png)
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 26 Apr 2024
![](https://github.com/apache.png)
https://github.com/argoproj/argo-workflows
Workflow Engine for Kubernetes
airflow argo argo-workflows batch-processing cloud-native cncf dag data-engineering gitops hacktoberfest k8s knative kubernetes machine-learning mlops pipelines workflow workflow-engine
Last synced: 26 Apr 2024
![](https://github.com/argoproj.png)
https://github.com/feast-dev/feast
Feature Store for Machine Learning
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Last synced: 23 Apr 2024
![](https://github.com/feast-dev.png)
https://github.com/bytehub-ai/bytehub
ByteHub: making feature stores simple
bytehub-cloud dask data-engineering data-science feature-engineering feature-store featurestore forecasting machine-learning machinelearning machinelearning-python pandas timeseries
Last synced: 23 Apr 2024
![](https://github.com/bytehub-ai.png)
https://github.com/ankurchavda/streamify
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
airflow data-engineering dbt gcp kafka python spark
Last synced: 22 Apr 2024
![](https://github.com/ankurchavda.png)
https://github.com/Desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 21 Apr 2024
![](https://github.com/Desbordante.png)
https://github.com/stitchfix/hamilton
A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
dag data-engineering data-platform data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hamilton hamiltonian machine-learning numpy pandas python software-engineering stitch-fix
Last synced: 20 Apr 2024
![](https://github.com/stitchfix.png)
https://github.com/rupurt/odbc-scanner-duckdb-extension
A DuckDB extension to read data directly from databases supporting the ODBC interface
analytics bigquery columnar-database cpp data-engineering db2 duckdb mariadb mssql mysql nix odbc olap oracle postgres snowflake vector-engine
Last synced: 20 Apr 2024
![](https://github.com/rupurt.png)
https://github.com/mage-ai/mage-ai
🧙 Build, run, and manage data pipelines for integrating and transforming data.
artificial-intelligence data data-engineering data-integration data-pipelines data-science dbt elt etl machine-learning orchestration pipeline pipelines python reverse-etl spark sql transformation
Last synced: 20 Apr 2024
![](https://github.com/mage-ai.png)
https://github.com/GoogleCloudPlatform/data-science-on-gcp
Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning
Last synced: 17 Apr 2024
![](https://github.com/GoogleCloudPlatform.png)
https://github.com/raptor-ml/raptor
Transform your pythonic research into an artifact that engineers can deploy easily.
ai-infra data-engineering data-science dataops feature-engineering feature-extraction feature-platform featurestore kubeflow kubernetes machine-learning ml mlops model-deployment production raptor raptor-ml reactive-ml
Last synced: 16 Apr 2024
![](https://github.com/raptor-ml.png)
https://github.com/atfortes/DataGenLM
Collection of synthetic data generation code for Language Models
chain-of-thought chatgpt coin concatenation data-engineering dataset dataset-generation flip gpt-3 json language-model large-language-models last letter machine-learning natural-language-processing random-generation reasoning symbolic symbolic-reasoning
Last synced: 15 Apr 2024
![](https://github.com/atfortes.png)
https://github.com/blockchain-etl/eos-etl
ETL scripts for EOS.
apache-beam blockchain-analytics crypto cryptocurrency data-analytics data-engineering eos eosio etl gcp google-bigquery google-cloud google-cloud-platform google-dataflow google-pubsub on-chain-analysis web3
Last synced: 15 Apr 2024
![](https://github.com/blockchain-etl.png)
https://github.com/bitol-io/open-data-contract-standard
Home of the Open Data Contract Standard (ODCS).
data data-contract data-contracts data-engineering data-mesh data-quality
Last synced: 13 Apr 2024
![](https://github.com/bitol-io.png)
https://github.com/Dineshkarthik/awesome-data-science-and-engineering
A curated list of Data Science and Engineering frameworks, tools, and libraries, plus a related list of tutorials.
beginner-friendly data-engineering data-science tutorials
Last synced: 11 Apr 2024
![](https://github.com/Dineshkarthik.png)
https://github.com/yahwang/Awesome-Data-Engineering
📒(GitBook) A curated list of awesome Data Engineering resources
data-engineering data-lake data-pipeline
Last synced: 10 Apr 2024
![](https://github.com/yahwang.png)
https://github.com/yarncraft/awesome-edge
A qualitative compilation of production-ready frameworks, services and repositories with a focus on Edge Computing & IoT
awesome awesome-list cloud cloud-native cloudcomputing computing data-engineering database edge edge-computing iot iot-platform kubernetes
Last synced: 09 Apr 2024
![](https://github.com/yarncraft.png)
https://github.com/vmware/versatile-data-kit
One framework to develop, deploy and operate data workflows with Python and SQL.
analytics data data-engineer data-engineering data-engineering-pipeline data-lineage data-pipelines data-science data-structures data-warehouse database dataops elt etl pipeline python snowflake sql trino warehouse
Last synced: 05 Apr 2024
![](https://github.com/vmware.png)
https://github.com/saucam/airflow-runner
airflow automation automation-testing data-engineering workflow
Last synced: 05 Apr 2024
![](https://github.com/saucam.png)
https://github.com/InsightDataScience/ansible-playbook
Ansible playbook to deploy distributed technologies
ansible ansible-playbooks aws data-engineering devops ec2-instance infrastructure-management kafka zookeeper
Last synced: 02 Apr 2024
![](https://github.com/InsightDataScience.png)
https://github.com/iesahin/xvc
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
command-line-tool data data-engineering data-pipelines data-science devops machine-learning machine-learning-engineering mlops rust
Last synced: 01 Apr 2024
![](https://github.com/iesahin.png)
https://github.com/recap-build/recap
Work with your web service, database, and streaming schemas in a single format.
data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap
Last synced: 01 Apr 2024
![](https://github.com/recap-build.png)
https://github.com/swoop-inc/spark-alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
data-engineering data-science scala spark
Last synced: 31 Mar 2024
![](https://github.com/swoop-inc.png)
https://github.com/odpi/egeria
Egeria core
data-engineering data-governance egeria governance hacktoberfest java linux-foundation metadata-management odpi odpi-egeria
Last synced: 31 Mar 2024
![](https://github.com/odpi.png)
https://pyjanitor-devs.github.io/pyjanitor/
Clean APIs for data cleaning. Python implementation of R package Janitor
cleaning-data data data-engineering dataframe hacktoberfest pandas pydata
Last synced: 29 Mar 2024
![](https://github.com/pyjanitor-devs.png)
https://github.com/benthecoder/yt-channels-DS-AI-ML-CS
A comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep Learning, Computer Science, Programming, Software Engineering, etc.
ai artificial-intelligence awesome awesome-list coding data data-analysis data-engineering data-science deep-learning machine-learning math ml programming python resources software-engineering statistics web-development youtube
Last synced: 29 Mar 2024
![](https://github.com/benthecoder.png)
https://github.com/bytewax/bytewax
Python Stream Processing
data-engineering data-processing data-science dataflow machine-learning python rust stream-processing streaming-data
Last synced: 23 Mar 2024
![](https://github.com/bytewax.png)
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 23 Mar 2024
![](https://github.com/SETL-Framework.png)
https://github.com/Minyus/pipelinex
PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
data-engineering data-science deep-learning experimentation machine-learning pipeline
Last synced: 23 Mar 2024
![](https://github.com/Minyus.png)
https://github.com/dgarnitz/vectorflow
VectorFlow is a high-volume vector embedding pipeline that ingests raw data, transforms it into vectors, and writes it to a vector DB of your choice.
ai data-engineering embeddings machine-learning nlp vectors
Last synced: 21 Mar 2024
![](https://github.com/dgarnitz.png)
https://github.com/awslabs/aws-serverless-data-lake-framework
Enterprise-grade, production-hardened, serverless data lake on AWS
analytics aws best-practices data-engineering data-lake etl framework iac lake-formation serverless
Last synced: 19 Mar 2024
![](https://github.com/awslabs.png)
https://github.com/ris-tlp/audiophile-e2e-pipeline
Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.
airflow aws data-engineering metabase python terraform
Last synced: 18 Mar 2024
![](https://github.com/ris-tlp.png)
https://github.com/metarank/metarank
A low-code machine learning personalized ranking service for articles, listings, search results, and recommendations that boosts user engagement. A friendly Learn-to-Rank engine.
automl data-engineering data-science deep-learning feature-engineering feature-extraction kubernetes machine-learning neural-networks personalization ranking scala search
Last synced: 17 Mar 2024
![](https://github.com/metarank.png)
https://github.com/kennethleungty/Failed-ML
Compilation of high-profile real-world examples of failed machine learning projects
ai artificial-intelligence classification computer-vision data-engineering data-quality data-science deep-learning failed-data-science failed-machine-learning failed-ml fml forecasting machine-learning ml natural-language-processing production recsys regression
Last synced: 17 Mar 2024
![](https://github.com/kennethleungty.png)
https://github.com/twosigma/uberjob
uberjob is a Python package for building and running call graphs.
Last synced: 16 Mar 2024
![](https://github.com/twosigma.png)
https://github.com/blockchain-etl/awesome-bigquery-views
Useful SQL queries for Blockchain ETL datasets in BigQuery.
blockchain-analytics crypto cryptocurrency data-analytics data-engineering data-science gcp google-cloud google-cloud-platform on-chain-analysis web3
Last synced: 16 Mar 2024
![](https://github.com/blockchain-etl.png)
https://github.com/blockchain-etl/blockchain-etl-architecture
Blockchain ETL Architecture
apache-beam blockchain blockchain-analytics crypto cryptocurrency data-analytics data-engineering ethereum etl gcp gke google-bigquery google-cloud google-cloud-platform google-container-engine google-dataflow google-pubsub kubernetes on-chain-analysis real-time-analytics
Last synced: 16 Mar 2024
![](https://github.com/blockchain-etl.png)