awesome-modern-data-stack

A curated list of awesome tools, frameworks, and resources for the Modern Data Stack (MDS).
https://github.com/bricefotzo/awesome-modern-data-stack

  • Business Intelligence & Analytics

    • Commercial / Managed

      • Tableau - Visual analytics platform (Salesforce).
      • Power BI - Business analytics by Microsoft.
      • Qlik - Data analytics and business intelligence platform.
      • Preset - Managed Apache Superset.
      • Omni - Shared-model BI platform.
      • Holistics - Self-service BI platform with data modeling.
    • Open Source

      • Apache Superset - Modern data exploration and visualization platform.
      • Metabase - Open-source business intelligence with SQL and visual query builder.
      • Redash - Connect and query data sources, build dashboards.
      • Lightdash - Open-source BI for dbt users.
      • Evidence - Code-based BI with Markdown and SQL.
      • Grafana - Open-source analytics and monitoring platform.
  • Community

  • Data Catalog & Discovery

    • Commercial / Managed

      • Atlan - Active metadata platform with collaboration features.
      • Alation - Enterprise data intelligence platform.
      • Collibra - Data intelligence platform with governance and catalog.
    • Open Source

      • DataHub - Open-source metadata platform by LinkedIn.
      • Amundsen - Open-source data discovery platform by Lyft.
      • OpenMetadata - Open-source metadata platform with discovery and governance.
  • Data Contracts

  • Data Integration & Ingestion

    • Commercial / Managed

      • Boomi - Modern data integration and agent management platform.
      • Fivetran - Automated data integration with 500+ connectors. Industry leader in managed ELT.
      • Stitch - Simple, extensible ETL built for developers. Part of Talend.
      • Hevo Data - No-code data pipeline platform.
      • Informatica - Enterprise data integration and management platform.
      • Talend - Enterprise data integration suite.
    • Open Source

      • Airbyte - Open-source data integration platform with 300+ connectors. ELT-first approach.
      • dlt (data load tool) - Python library for data loading with automatic schema inference.
      • Singer - Open-source standard for writing scripts that move data (taps and targets).
      • Meltano - Open-source DataOps platform built on Singer. CLI-first, version-controlled pipelines.
      • Ingestr - CLI tool to copy data between databases with a single command.
  • Data Lakes & Storage

  • Data Notebooks & Exploration

  • Data Observability

    • Commercial / Managed

      • Datadog - Monitoring and security platform for developers.
      • Datafold - Data reliability platform with data diff and regression testing.
      • Sifflet - Full-stack data observability platform.
  • DataOps & Version Control

    • Commercial / Managed

      • DVC - Version control for data and ML models.
  • Data Orchestration

    • Commercial / Managed

    • Open Source

      • Apache Airflow - Platform to programmatically author, schedule, and monitor workflows.
      • Dagster - Cloud-native orchestration platform with software-defined assets.
      • Prefect - Modern workflow orchestration with Python-native approach.
      • Mage - Open-source data pipeline tool with notebook-style interface.
      • Kestra - Event-driven orchestration platform with declarative YAML.
      • Luigi - Python module for building complex pipelines (by Spotify).
      • Argo Workflows - Kubernetes-native workflow engine.
  • Data Quality & Testing

    • Commercial / Managed

      • Soda - Data quality platform with SodaCL language.
      • Monte Carlo - Data observability platform with ML-powered anomaly detection.
    • Open Source

      • Great Expectations - Python library for data validation and documentation.
      • Elementary - Open-source data observability for dbt.
      • Provero - Vendor-neutral, declarative data quality engine with YAML-based checks.
      • dbt Tests - Built-in testing framework in dbt.
  • Data Sharing

  • Data Transformation

    • Distributed Computing

      • Apache Spark - Unified analytics engine for large-scale data processing.
      • Dask - Flexible parallel computing library for analytics.
      • Ray - Unified framework for scaling AI and Python applications.
      • Apache Flink - Stream and batch processing framework.
      • Apache Beam - Unified programming model for batch and streaming.
      • Asgarde - Java library for simplified error handling in Beam pipelines.
    • Python-Based Transformation

      • Pandas - Data manipulation and analysis library for Python.
      • DuckDB - In-process SQL OLAP database. Well suited to local data transformation.
      • PySpark - Python API for Apache Spark.
      • Vaex - Out-of-core DataFrames for large datasets.
      • Ibis - Python DataFrame API that compiles to SQL.
      • Fugue - Unified interface for distributed computing.
      • Hamilton - Micro-framework for dataframe generation.
      • Polars - Lightning-fast DataFrame library written in Rust.
    • SQL-Based Transformation

      • dbt (data build tool) - Industry-standard SQL-first transformation tool. Version-controlled, tested, documented.
      • SQLMesh - Data transformation framework with built-in data quality and CI/CD.
      • SDF (Semantic Data Fabric) - SQL compiler for data transformation with static analysis.
      • Coalesce - Data transformation platform purpose-built for Snowflake.
      • Dataform - SQL-based data transformation (acquired by Google).
  • Data Warehouses & Lakehouses

    • Cloud Data Warehouses

    • Lakehouse Platforms

      • Databricks - Unified analytics platform combining data lake and warehouse.
      • Apache Iceberg - Open table format for huge analytic datasets.
      • Delta Lake - Open-source storage layer with ACID transactions on data lakes.
      • Apache Hudi - Data lake platform for incremental data processing.
      • Dremio - Lakehouse platform with Apache Iceberg support.
      • Onehouse - Managed lakehouse platform built on Apache Hudi.
      • Tabular - Managed Apache Iceberg service by its creators.
  • Feature Stores

  • Learning Resources

  • Metrics Layer & Semantic Layer

    • Commercial / Managed

  • ML Platforms & MLOps

  • Query Engines

  • Reverse ETL

    • Commercial / Managed

      • Census - Operational analytics platform for syncing data to business tools.
      • Hightouch - Data activation platform for reverse ETL.
      • Polytomic - Sync data bidirectionally between databases and SaaS.
  • Streaming & Real-Time

  • Uncategorized

    • Uncategorized

      • querybear.com - Ask your data anything – with persistent memory, schema learning, and more.
  • Vector Databases

    • Commercial / Managed

      • MongoDB - Managed vector store alongside your operational data in MongoDB Atlas.
      • Pinecone - Managed vector database for similarity search.
      • Weaviate - Open-source vector search engine.
      • Qdrant - Open-source vector similarity search engine.
      • Chroma - Open-source embedding database.
      • pgvector - Vector similarity search for PostgreSQL.
      • Elasticsearch - Search engine with vector search capabilities.