awesome-modern-data-stack
A curated list of awesome tools, frameworks, and resources for the Modern Data Stack (MDS).
https://github.com/bricefotzo/awesome-modern-data-stack
-
Business Intelligence & Analytics
-
Commercial / Managed
-
Open Source
- Apache Superset - Modern data exploration and visualization platform.
- Metabase - Open-source business intelligence with SQL and visual query builder.
- Redash - Connect and query data sources, build dashboards.
- Lightdash - Open-source BI for dbt users.
- Evidence - Code-based BI with Markdown and SQL.
- Grafana - Open-source analytics and monitoring platform.
-
-
Community
-
Community Resources
- r/dataengineering - Reddit community.
- Data Engineering Wiki - Community-maintained wiki.
- Awesome Data Engineering - Another curated list.
- Modern Data Stack Glossary - Data terminology.
-
Conferences
- Snowflake Summit - Snowflake user conference.
-
Slack Communities
- Airbyte Slack - Data integration community.
- Apache Airflow Slack - Workflow orchestration.
- dbt Community Slack - 50k+ members discussing analytics engineering.
-
-
Data Catalog & Discovery
-
Commercial / Managed
-
Open Source
- DataHub - Open-source metadata platform by LinkedIn.
- Amundsen - Open-source data discovery platform by Lyft.
- OpenMetadata - Open-source metadata platform with discovery and governance.
-
-
Data Contracts
-
Commercial / Managed
- Soda Data Contracts - Define and verify data contracts.
- Bitol - Open-source data contract specification.
- DataContract CLI - CLI for managing data contracts.
- JSON Schema - Schema specification for JSON data.
- Protobuf - Protocol Buffers for schema definition.
-
-
Data Integration & Ingestion
-
Commercial / Managed
- Boomi - Modern data integration and agent management platform.
- Fivetran - Automated data integration with 500+ connectors. Industry leader in managed ELT.
- Stitch - Simple, extensible ETL built for developers. Part of Talend.
- Hevo Data - No-code data pipeline platform.
- Informatica - Enterprise data integration and management platform.
- Talend - Enterprise data integration suite.
-
Open Source
- Airbyte - Open-source data integration platform with 300+ connectors. ELT-first approach.
- dlt (data load tool) - Python library for data loading with automatic schema inference.
- Singer - Open-source standard for writing scripts that move data (taps and targets).
- Meltano - Open-source DataOps platform built on Singer. CLI-first, version-controlled pipelines.
- Ingestr - CLI tool to copy data between databases with a single command.
-
-
Data Lakes & Storage
-
Data Lake Formats
- Apache Parquet - Columnar storage file format.
- Apache ORC - Columnar storage format for Hadoop.
- Apache Avro - Row-based data serialization format.
- Lance - Modern columnar format for ML datasets.
-
Object Storage
- Amazon S3 - Industry-standard object storage by AWS.
- Google Cloud Storage - Object storage by Google Cloud.
- Azure Blob Storage - Object storage by Microsoft Azure.
- Cloudflare R2 - S3-compatible storage with zero egress fees.
-
-
Data Notebooks & Exploration
-
Commercial / Managed
- Google Colab - Free Jupyter notebooks by Google.
- Amazon SageMaker Studio - ML IDE by AWS.
- Hex - Collaborative data workspace with notebooks and apps.
- Deepnote - Collaborative data notebook for teams.
-
Open Source
-
-
Data Observability
-
DataOps & Version Control
-
Commercial / Managed
- DVC - Version control for data and ML models.
-
-
Data Orchestration
-
Commercial / Managed
- Astronomer - Managed Airflow platform.
- Dagster Cloud - Managed Dagster platform.
- Prefect Cloud - Managed Prefect platform.
- Google Cloud Composer - Managed Apache Airflow by Google.
- Amazon MWAA - Managed Airflow by AWS.
- Orchestra - Unified data orchestration platform.
-
Open Source
- Apache Airflow - Platform to programmatically author, schedule, and monitor workflows.
- Dagster - Cloud-native orchestration platform with software-defined assets.
- Prefect - Modern workflow orchestration with Python-native approach.
- Mage - Open-source data pipeline tool with notebook-style interface.
- Kestra - Event-driven orchestration platform with declarative YAML.
- Luigi - Python module for building complex pipelines (by Spotify).
- Argo Workflows - Kubernetes-native workflow engine.
-
-
Data Quality & Testing
-
Commercial / Managed
- Soda - Data quality platform with SodaCL language.
- Monte Carlo - Data observability platform with ML-powered anomaly detection.
-
Open Source
- Great Expectations - Python library for data validation and documentation.
- Elementary - Open-source data observability for dbt.
- Provero - Vendor-neutral, declarative data quality engine with YAML-based checks.
- dbt Tests - Built-in testing framework in dbt.
-
-
Data Sharing
-
Commercial / Managed
- Google Analytics Hub - Data exchange by Google Cloud.
- Snowflake Data Marketplace - Data sharing and marketplace by Snowflake.
- Databricks Delta Sharing - Open protocol for secure data sharing.
- AWS Data Exchange - Data marketplace by AWS.
-
-
Data Transformation
-
Distributed Computing
- Apache Spark - Unified analytics engine for large-scale data processing.
- Dask - Flexible parallel computing library for analytics.
- Ray - Unified framework for scaling AI and Python applications.
- Apache Flink - Stream and batch processing framework.
- Apache Beam - Unified programming model for batch and streaming.
- Asgarde - Java library for simplified error handling in Beam pipelines.
-
Python-Based Transformation
- Pandas - Data manipulation and analysis library for Python.
- DuckDB - In-process SQL OLAP database. Perfect for local data transformation.
- PySpark - Python API for Apache Spark.
- Vaex - Out-of-core DataFrames for large datasets.
- Ibis - Python DataFrame API that compiles to SQL.
- Fugue - Unified interface for distributed computing.
- Hamilton - Micro-framework for dataframe generation.
- Polars - Lightning-fast DataFrame library written in Rust.
-
SQL-Based Transformation
- dbt (data build tool) - Industry-standard SQL-first transformation tool. Version-controlled, tested, documented.
- SQLMesh - Data transformation framework with built-in data quality and CI/CD.
- SDF (Semantic Data Fabric) - SQL compiler for data transformation with static analysis.
- Coalesce - Data transformation platform purpose-built for Snowflake.
- Dataform - SQL-based data transformation (acquired by Google).
-
-
Data Warehouses & Lakehouses
-
Cloud Data Warehouses
- Google BigQuery - Serverless, highly scalable data warehouse by Google.
- Amazon Redshift - Fast, scalable data warehouse by AWS.
- Azure Synapse Analytics - Limitless analytics service by Microsoft.
- Databricks SQL - Serverless SQL analytics on the Lakehouse.
- Firebolt - Cloud data warehouse built for high-performance analytics.
- ClickHouse Cloud - Managed ClickHouse for real-time analytics.
- StarRocks - High-performance analytical database.
- MotherDuck - Serverless analytics powered by DuckDB.
- Snowflake - Cloud-native data warehouse with separation of storage and compute.
-
Lakehouse Platforms
- Databricks - Unified analytics platform combining data lake and warehouse.
- Apache Iceberg - Open table format for huge analytic datasets.
- Delta Lake - Open-source storage layer with ACID transactions on data lakes.
- Apache Hudi - Data lake platform for incremental data processing.
- Dremio - Lakehouse platform with Apache Iceberg support.
- Onehouse - Managed lakehouse platform built on Apache Hudi.
- Tabular - Managed Apache Iceberg service by its creators.
-
-
Feature Stores
-
Commercial / Managed
- Vertex AI Feature Store - Google Cloud feature store.
- Databricks Feature Store - Feature store in Databricks.
- Feast - Open-source feature store for ML.
- Hopsworks - Platform with feature store and MLOps.
- Amazon SageMaker Feature Store - AWS managed feature store.
-
-
Learning Resources
-
Blogs & Newsletters
- Blef.fr - The hub to explore Data News links.
- Joe Reis's Substack - Insights on data engineering.
- Airbyte Blog - Data integration and engineering content.
- DataStackGuide - Independent reviews of B2B data stack tools (CRMs, enrichment, BI, sales engagement) backed by analysis of 23,000+ job postings.
-
Books
- Fundamentals of Data Engineering - by Joe Reis & Matt Housley
- The Data Warehouse Toolkit - by Ralph Kimball
- Designing Data-Intensive Applications - by Martin Kleppmann
- Data Mesh - by Zhamak Dehghani
- Building a Scalable Data Warehouse with Data Vault 2.0 - by Dan Linstedt
-
Courses & Certifications
- DataCamp - Data science and engineering courses.
- DataTalks.Club - Free data engineering zoomcamp.
- Snowflake Training - Snowflake certifications.
- dbt Learn - Free dbt fundamentals course.
- Coursera Data Engineering - Google Cloud Data Engineering certificate.
- Databricks Training - Databricks certifications.
-
Podcasts
- DataGen - Interviews with French data practitioners & leaders.
- The Data Engineering Podcast - Interviews with data engineering practitioners.
- Data Engineering Show - Discussions on data engineering topics.
-
-
Metrics Layer & Semantic Layer
-
Commercial / Managed
- dbt Semantic Layer - Define metrics in dbt, query from any tool.
- Cube - Headless BI and semantic layer with caching.
- MetricFlow - Semantic layer engine (now part of dbt).
- Minerva (Airbnb) - Airbnb's internal metrics platform.
-
-
ML Platforms & MLOps
-
Commercial / Managed
- Google Vertex AI - Unified ML platform by Google Cloud.
- Databricks MLflow - Managed MLflow on Databricks.
- Amazon SageMaker - Fully managed ML service by AWS.
- Azure Machine Learning - ML platform by Microsoft.
-
Open Source
-
-
Query Engines
-
Change Data Capture (CDC)
- PrestoDB - Distributed SQL query engine by Meta.
- Apache Drill - Schema-free SQL query engine.
- Apache Impala - Massively parallel SQL query engine.
- ClickHouse - Column-oriented DBMS for OLAP.
- Apache Druid - Real-time analytics database.
-
-
Reverse ETL
-
Streaming & Real-Time
-
Change Data Capture (CDC)
- Boomi Data Integration - Automated CDC for operational databases and apps.
- Debezium - Open-source distributed CDC platform.
- Airbyte CDC - CDC connectors in Airbyte.
- Fivetran CDC - Managed CDC by Fivetran.
-
Message Brokers & Streaming Platforms
- Apache Kafka - Distributed event streaming platform.
- Confluent - Enterprise Kafka platform with managed cloud.
- Amazon Kinesis - Real-time streaming data service by AWS.
- Google Pub/Sub - Messaging service by Google Cloud.
- Azure Event Hubs - Big data streaming platform by Azure.
- RabbitMQ - Open-source message broker.
-
Stream Processing
- Apache Kafka Streams - Client library for stream processing.
- Apache Spark Streaming - Spark module for stream processing.
-
-
Uncategorized
-
Uncategorized
- querybear.com - Ask your data anything – with persistent memory, schema learning, and more.
-
-
Vector Databases
-
Commercial / Managed
- MongoDB - Managed vector store alongside your operational data in MongoDB Atlas.
- Pinecone - Managed vector database for similarity search.
- Weaviate - Open-source vector search engine.
- Qdrant - Open-source vector similarity search engine.
- Chroma - Open-source embedding database.
- pgvector - Vector similarity search for PostgreSQL.
- Elasticsearch - Search engine with vector search capabilities.
-