awesome-olap

A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.
https://github.com/samber/awesome-olap

OLAP Databases
- Managed cloud services
- Real-time analytics
  - Columnar databases with a shared-nothing architecture and sub-second response time. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.
  - Apache Doris
  - Apache Druid
  - Apache HBase
  - Apache Pinot
  - StarRocks
  - Dremio
- Search engines
  - Elasticsearch - Search and analytics engine based on Apache Lucene.
  - Meilisearch - Open-source search engine that aims to be a ready-to-go solution.
  - OpenSearch - Apache 2.0 fork of Elasticsearch.
  - Quickwit - Search engine on top of object storage, using shared-everything architecture.
  - Typesense - Open-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
- Hybrid OLAP/OLTP NewSQL (aka HTAP)
  - Citus - PostgreSQL-compatible distributed tables.
  - TiDB - MySQL-compatible SQL database that supports hybrid transactional and analytical processing workloads.
- Timeseries
  - Grafana Mimir - Prometheus-compatible TSDB on top of object storage.
  - TimescaleDB - PostgreSQL-compatible TSDB.
Readings
- Vector similarity search
  - HNSW
  - ANN (approximate nearest neighbor)
  - kNN (k nearest neighbor)
  - Faiss
- Blogs to follow
- Papers
- Architecture
- Data modeling
  - Schema evolution
  - CDC
- Index
- Vectorized query processing
- Querying
- Transactions
  - ACID properties
  - Serializable transaction
- Consensus
  - Paxos
  - Raft
- Challenging platforms
  - Datadog event store
  - Cloudflare logging
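The kNN and ANN entries under "Vector similarity search" above differ in exactness: kNN returns the true k closest vectors, while ANN structures such as HNSW trade some accuracy for speed. As a rough illustration only, here is a minimal sketch of exact kNN by brute force in plain Python. The function name `knn` and the sample points are assumptions made for this example and do not come from any library in the list.

```python
import math


def knn(query, vectors, k):
    """Return the indices of the k vectors closest to `query`.

    Exact brute-force search using Euclidean distance: every vector
    is compared with the query, so the cost grows linearly with the
    number of vectors.
    """
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


# Hypothetical sample data, for illustration only.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
nearest = knn((1.0, 1.0), points, k=2)
print(nearest)  # indices of the two closest points
```

An ANN index would avoid scanning every vector, at the cost of occasionally missing a true neighbor.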
Ingestion and querying
- In-memory processing
  - Apache Arrow Datafusion - High level SQL interface for Apache Arrow.
  - Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
  - Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
  - DuckDB - In-process SQL query engine for processing Parquet files. Integrates with Apache Arrow.
  - Pandas - Python data analysis and manipulation tool.
  - clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet and other files.
  - delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Stream processing
  - Apache Beam - Unified SDK for cross-language stream processing. Available in Go, Python, Java, Scala and TypeScript.
  - Apache Flink - Stateful stream processing.
  - Apache Kafka Streams - Stream processing.
  - Apache Spark Streaming - Stream processing on top of Spark.
  - Akka Streams - Stream processing.
  - Benthos - Go stream processing.
- Batch processing
  - Apache Spark
  - MapReduce
- Distributed SQL processing
  - Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
  - ksql - SQL interface for Kafka.
  - PrestoDB - Distributed SQL query engine.
  - Trino - Distributed SQL query engine. Fork of PrestoDB.
Data lake
- File formats and serialization
  - Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
  - Apache Avro - Row-oriented serialization for data streaming.
  - Apache ORC - Column-oriented serialization for data storage. Part of the Hadoop platform.
  - Apache Parquet - Column-oriented serialization for data storage.
  - Apache Thrift - Row-oriented serialization for RPC.
  - Google Protobuf - Row-oriented serialization for RPC.
  - Schema Registry - Centralized repository for validating row-oriented events. Part of the Kafka and Confluent platform.
  - Cap’n Proto - Row-oriented serialization with zero-copy access, as fast as mmap.
  - FlatBuffers - Row-oriented serialization with zero-copy access, as fast as mmap.
- Open table formats
- Metastore
  - AWS Glue
  - Databricks Unity Catalog
  - Hive Metastore - Component of Hadoop HiveServer2 that can be used standalone.
  - Nessie
- Object Storage
  - Apache HDFS - Hadoop distributed file system.
  - AWS S3
  - Azure Blob Storage
  - GCP Cloud Storage
  - MinIO - S3-compatible, self-hosted object storage.
- Codecs, encoding and compression
  - Bit packing
  - Brotli
  - Deflate
  - Delta
  - LZ4
  - RLE
  - Snappy
  - zstd
  - Gorilla
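Two of the encodings listed under "Codecs, encoding and compression" are simple enough to sketch directly. The following is a minimal illustration in plain Python, not the implementation used by any format or library above: RLE collapses runs of repeated values into (value, count) pairs, and delta encoding stores the first value followed by successive differences. All function names and sample data are assumptions made for this example.

```python
from itertools import accumulate, groupby


def rle_encode(values):
    """Run-length encoding: collapse each run into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Delta encoding: keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values as a running sum of the deltas."""
    return list(accumulate(deltas))


# Hypothetical column data, for illustration only.
statuses = ["ok", "ok", "ok", "err", "ok", "ok"]
timestamps = [1000, 1005, 1010, 1020]

encoded_statuses = rle_encode(statuses)
encoded_timestamps = delta_encode(timestamps)
print(encoded_statuses)
print(encoded_timestamps)
```

Both encodings tend to pay off on columnar data, where low-cardinality or slowly changing values sit next to each other; the small deltas can then be stored in fewer bits, for instance with bit packing.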
Brokers and distributed messaging
Scheduler
- Apache Airflow
- Dagster
ETL, ELT and reverse ETL
- Airbyte - ELT.
- Census - Reverse ETL.
- RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (data warehouse or SaaS).
Datasets
- awesome-public-datasets
- CommonCrawl
- Criteo
- Entso-e
- GitHub Archives
- Kaggle - Community-sourced datasets.
- NYC Taxi
Benchmark
- Jepsen - Testing of distributed databases, queues and consensus protocols.
- TPC family benchmarks - For big data databases.
👤 Contributors
- Contributors
💫 Show your support
- GitHub Sponsors
📝 License
- Samuel Berthe