awesome-modern-data-stack

A curated list of awesome tools, frameworks, and resources for the Modern Data Stack (MDS).
https://github.com/bricefotzo/awesome-modern-data-stack


Business Intelligence & Analytics
- Commercial / Managed
  - Tableau - Visual analytics platform (Salesforce).
  - Power BI - Business analytics by Microsoft.
  - Qlik - Data analytics and business intelligence platform.
  - Preset - Managed Apache Superset.
  - Omni - Shared-model BI platform.
  - Holistics - Self-service BI platform with data modeling.
- Open Source
  - Apache Superset - Modern data exploration and visualization platform.
  - Metabase - Open-source business intelligence with SQL and visual query builder.
  - Redash - Connect and query data sources, build dashboards.
  - Lightdash - Open-source BI for dbt users.
  - Evidence - Code-based BI with Markdown and SQL.
  - Grafana - Open-source analytics and monitoring platform.
Community
- Community Resources
  - r/dataengineering - Reddit community.
  - Data Engineering Wiki - Community-maintained wiki.
  - Awesome Data Engineering - Another curated list.
  - Modern Data Stack Glossary - Data terminology.
- Conferences
  - Snowflake Summit - Snowflake user conference.
- Slack Communities
  - Airbyte Slack - Data integration community.
  - Apache Airflow Slack - Workflow orchestration.
Data Catalog & Discovery
- Commercial / Managed
  - Atlan - Active metadata platform with collaboration features.
  - Alation - Enterprise data intelligence platform.
  - Collibra - Data intelligence platform with governance and catalog.
- Open Source
  - DataHub - Open-source metadata platform by LinkedIn.
  - Amundsen - Open-source data discovery platform by Lyft.
  - OpenMetadata - Open-source metadata platform with discovery and governance.
Data Contracts
- Commercial / Managed
  - Soda Data Contracts - Define and verify data contracts.
  - Bitol - Open-source data contract specification.
  - DataContract CLI - CLI for managing data contracts.
  - JSON Schema - Schema specification for JSON data.
  - Protobuf - Protocol Buffers for schema definition.
Data Integration & Ingestion
- Commercial / Managed
  - Boomi - Modern data integration and agent management platform.
  - Fivetran - Automated data integration with 500+ connectors. Industry leader in managed ELT.
  - Stitch - Simple, extensible ETL built for developers. Part of Talend.
  - Hevo Data - No-code data pipeline platform.
  - Informatica - Enterprise data integration and management platform.
  - Talend - Enterprise data integration suite.
- Open Source
  - Airbyte - Open-source data integration platform with 300+ connectors. ELT-first approach.
  - dlt (data load tool) - Python library for data loading with automatic schema inference.
  - Singer - Open-source standard for writing scripts that move data (taps and targets).
  - Meltano - Open-source DataOps platform built on Singer. CLI-first, version-controlled pipelines.
  - Ingestr - CLI tool to copy data between databases with a single command.
Data Lakes & Storage
- Data Lake Formats
  - Apache Parquet - Columnar storage file format.
  - Apache ORC - Columnar storage format for Hadoop.
  - Apache Avro - Row-based data serialization format.
  - Lance - Modern columnar format for ML datasets.
- Object Storage
  - Amazon S3 - Industry-standard object storage by AWS.
  - Google Cloud Storage - Object storage by Google Cloud.
  - Azure Blob Storage - Object storage by Microsoft Azure.
  - Cloudflare R2 - S3-compatible storage with zero egress fees.
Data Notebooks & Exploration
- Commercial / Managed
  - Google Colab - Free Jupyter notebooks by Google.
  - Amazon SageMaker Studio - ML IDE by AWS.
  - Hex - Collaborative data workspace with notebooks and apps.
  - Deepnote - Collaborative data notebook for teams.
- Open Source
  - Jupyter - Web-based interactive computing platform.
  - Zeppelin - Web-based notebook for data analytics.
  - Marimo - Reactive Python notebook with reproducibility.
  - Streamlit - Python framework for data apps.
  - JupyterLab - Next-generation Jupyter interface.
Data Observability
- Commercial / Managed
  - Datadog - Monitoring and security platform for developers.
  - Datafold - Data reliability platform with data diff and regression testing.
  - Sifflet - Full-stack data observability platform.
DataOps & Version Control
- Commercial / Managed
  - DVC - Version control for data and ML models.
Data Orchestration
- Commercial / Managed
  - Astronomer - Managed Airflow platform.
  - Dagster Cloud - Managed Dagster platform.
  - Prefect Cloud - Managed Prefect platform.
  - Google Cloud Composer - Managed Apache Airflow by Google.
  - Amazon MWAA - Managed Airflow by AWS.
  - Orchestra - Unified data orchestration platform.
- Open Source
  - Apache Airflow - Platform to programmatically author, schedule, and monitor workflows.
  - Dagster - Cloud-native orchestration platform with software-defined assets.
  - Prefect - Modern workflow orchestration with Python-native approach.
  - Mage - Open-source data pipeline tool with notebook-style interface.
  - Kestra - Event-driven orchestration platform with declarative YAML.
  - Luigi - Python module for building complex pipelines (by Spotify).
  - Argo Workflows - Kubernetes-native workflow engine.
Data Quality & Testing
- Commercial / Managed
  - Soda - Data quality platform with SodaCL language.
  - Monte Carlo - Data observability platform with ML-powered anomaly detection.
- Open Source
  - Great Expectations - Python library for data validation and documentation.
  - Elementary - Open-source data observability for dbt.
  - Provero - Vendor-neutral, declarative data quality engine with YAML-based checks.
  - dbt Tests - Built-in testing framework in dbt.
Data Sharing
- Commercial / Managed
  - Google Analytics Hub - Data exchange by Google Cloud.
  - Snowflake Data Marketplace - Data sharing and marketplace by Snowflake.
  - Databricks Delta Sharing - Open protocol for secure data sharing.
  - AWS Data Exchange - Data marketplace by AWS.
Data Transformation
- Distributed Computing
  - Apache Spark - Unified analytics engine for large-scale data processing.
  - Dask - Flexible parallel computing library for analytics.
  - Ray - Unified framework for scaling AI and Python applications.
  - Apache Flink - Stream and batch processing framework.
  - Apache Beam - Unified programming model for batch and streaming.
  - Asgarde - Java library for simplified error handling in Beam pipelines.
- Python-Based Transformation
  - Pandas - Data manipulation and analysis library for Python.
  - DuckDB - In-process SQL OLAP database. Perfect for local data transformation.
  - PySpark - Python API for Apache Spark.
  - Vaex - Out-of-core DataFrames for large datasets.
  - Ibis - Python DataFrame API that compiles to SQL.
  - Fugue - Unified interface for distributed computing.
  - Hamilton - Micro-framework for dataframe generation.
  - Polars - Lightning-fast DataFrame library written in Rust.
- SQL-Based Transformation
  - dbt (data build tool) - Industry-standard SQL-first transformation tool. Version-controlled, tested, documented.
  - SQLMesh - Data transformation framework with built-in data quality and CI/CD.
  - SDF (Semantic Data Fabric) - SQL compiler for data transformation with static analysis.
  - Coalesce - Data transformation platform purpose-built for Snowflake.
  - Dataform - SQL-based data transformation (acquired by Google).
Data Warehouses & Lakehouses
- Cloud Data Warehouses
  - Google BigQuery - Serverless, highly scalable data warehouse by Google.
  - Amazon Redshift - Fast, scalable data warehouse by AWS.
  - Azure Synapse Analytics - Limitless analytics service by Microsoft.
  - Databricks SQL - Serverless SQL analytics on the Lakehouse.
  - Firebolt - Cloud data warehouse built for high-performance analytics.
  - ClickHouse Cloud - Managed ClickHouse for real-time analytics.
  - StarRocks - High-performance analytical database.
  - MotherDuck - Serverless analytics powered by DuckDB.
  - Snowflake - Cloud-native data warehouse with separation of storage and compute.
- Lakehouse Platforms
  - Databricks - Unified analytics platform combining data lake and warehouse.
  - Apache Iceberg - Open table format for huge analytic datasets.
  - Delta Lake - Open-source storage layer with ACID transactions on data lakes.
  - Apache Hudi - Data lake platform for incremental data processing.
  - Dremio - Lakehouse platform with Apache Iceberg support.
  - Onehouse - Managed lakehouse platform built on Apache Hudi.
  - Tabular - Managed Apache Iceberg service by its creators.
Feature Stores
- Commercial / Managed
  - Vertex AI Feature Store - Google Cloud feature store.
  - Databricks Feature Store - Feature store in Databricks.
  - Feast - Open-source feature store for ML.
  - Hopsworks - Platform with feature store and MLOps.
  - Amazon SageMaker Feature Store - AWS managed feature store.
Learning Resources
- Blogs & Newsletters
  - Blef.fr - The hub to explore Data News links.
  - Joe Reis's Substack - Insights on data engineering.
  - Airbyte Blog - Data integration and engineering content.
  - DataStackGuide - Independent reviews of B2B data stack tools (CRMs, enrichment, BI, sales engagement) backed by analysis of 23,000+ job postings.
  - dbt Blog - Articles on analytics engineering.
- Books
  - Fundamentals of Data Engineering - by Joe Reis & Matt Housley
  - The Data Warehouse Toolkit - by Ralph Kimball
  - Designing Data-Intensive Applications - by Martin Kleppmann
  - Data Mesh - by Zhamak Dehghani
  - Building a Scalable Data Warehouse with Data Vault 2.0 - by Dan Linstedt
- Courses & Certifications
  - DataCamp - Data science and engineering courses.
  - DataTalks.Club - Free data engineering zoomcamp.
  - Snowflake Training - Snowflake certifications.
  - dbt Learn - Free dbt fundamentals course.
- Podcasts
  - DataGen - Interviews with French data practitioners & leaders.
  - The Data Engineering Podcast - Interviews with data engineering practitioners.
  - Data Engineering Show - Discussions on data engineering topics.
Metrics Layer & Semantic Layer
- Commercial / Managed
  - dbt Semantic Layer - Define metrics in dbt, query from any tool.
  - Cube - Headless BI and semantic layer with caching.
  - MetricFlow - Semantic layer engine (now part of dbt).
  - Minerva (Airbnb) - Airbnb's internal metrics platform.
ML Platforms & MLOps
- Commercial / Managed
  - Google Vertex AI - Unified ML platform by Google Cloud.
  - Databricks MLflow - Managed MLflow on Databricks.
  - Amazon SageMaker - Fully managed ML service by AWS.
  - Azure Machine Learning - ML platform by Microsoft.
- Open Source
  - MLflow - Open-source platform for ML lifecycle management.
  - Kubeflow - ML toolkit for Kubernetes.
  - Metaflow - Framework for real-life data science (by Netflix).
  - ZenML - MLOps framework for reproducible pipelines.
Query Engines
- Open Source
  - PrestoDB - Distributed SQL query engine by Meta.
  - Apache Drill - Schema-free SQL query engine.
  - Apache Impala - Massively parallel SQL query engine.
  - ClickHouse - Column-oriented DBMS for OLAP.
  - Apache Druid - Real-time analytics database.
Reverse ETL
- Commercial / Managed
  - Census - Operational analytics platform for syncing data to business tools.
  - Hightouch - Data activation platform for reverse ETL.
  - Polytomic - Sync data bidirectionally between databases and SaaS.
Streaming & Real-Time
- Change Data Capture (CDC)
  - Boomi Data Integration - Automated CDC for operational databases and apps.
  - Debezium - Open-source distributed CDC platform.
  - Airbyte CDC - CDC connectors in Airbyte.
  - Fivetran CDC - Managed CDC by Fivetran.
- Message Brokers & Streaming Platforms
  - Apache Kafka - Distributed event streaming platform.
  - Confluent - Enterprise Kafka platform with managed cloud.
  - Amazon Kinesis - Real-time streaming data service by AWS.
  - Google Pub/Sub - Messaging service by Google Cloud.
  - Azure Event Hubs - Big data streaming platform by Azure.
  - RabbitMQ - Open-source message broker.
- Stream Processing
  - Apache Kafka Streams - Client library for stream processing.
  - Apache Spark Streaming - Spark module for stream processing.
Vector Databases
- Commercial / Managed
  - MongoDB - Managed vector store alongside your operational data in MongoDB Atlas.
  - Pinecone - Managed vector database for similarity search.
  - Weaviate - Open-source vector search engine.
  - Qdrant - Open-source vector similarity search engine.
  - Chroma - Open-source embedding database.
  - pgvector - Vector similarity search for PostgreSQL.
  - Elasticsearch - Search engine with vector search capabilities.
