Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-olap
A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.
https://github.com/samber/awesome-olap
Last synced: 4 days ago
-
OLAP Databases
-
Real-time analytics
- The following columnar databases use a shared-nothing architecture and provide sub-second response times. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.
- Apache Doris
- Apache Druid
- Apache HBase
- Apache Pinot
- StarRocks
- ClickHouse
-
Search engines
- Elasticsearch - Search and analytics engine based on Apache Lucene.
- Meilisearch - Open-source search engine that aims to be a ready-to-go solution.
- OpenSearch - Apache 2.0 fork of Elasticsearch.
- Quickwit - Search engine on top of object storage, using shared-everything architecture.
- Typesense - Open-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
-
NewSQL
-
Timeseries
- Grafana Mimir - Prometheus-compatible TSDB on top of object storage.
- TimescaleDB - PostgreSQL-compatible TSDB.
-
Managed cloud services
-
-
Data lake
-
File formats and serialization
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
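
The row-oriented versus column-oriented distinction used in the entries above can be illustrated with a minimal sketch in plain Python. It does not use any of the listed libraries; the sample records and the `to_columns` helper are hypothetical and exist only for illustration.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user": "a", "country": "FR", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "FR", "amount": 5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented access: every record is visited to aggregate one field.
row_total = sum(record["amount"] for record in rows)

# Column-oriented access: only the "amount" column is read.
column_total = sum(columns["amount"])
```

Both totals are the same; the difference is how much data must be touched to compute them, which is why analytical storage formats tend to favor the columnar layout while streaming and RPC formats favor rows.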
-
Open table formats
-
Metastore
- AWS Glue
- Databricks Unity Catalog
- Hive Metastore - Component of Hadoop HiveServer2 that can be used standalone.
- Nessie
-
Object Storage
- Apache HDFS - Hadoop distributed file system.
- AWS S3
- Azure Blob Storage
- GCP Cloud Storage
- MinIO - S3-compatible, self-hosted object storage.
-
Codecs, encoding and compression
-
-
Brokers and distributed messaging
-
-
Ingestion and querying
-
Stream processing
- Apache Beam - Unified SDK for cross-language stream processing. Available in Go, Python, Java, Scala and TypeScript.
- Apache Flink - Stateful stream processing.
- Apache Kafka Streams - Stream processing.
- Apache Spark Streaming - Stream processing on top of Spark.
- Akka Streams - Stream processing.
- Benthos - Go stream processing.
-
Batch processing
-
In-memory processing
- Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
- Apache Arrow DataFusion - High-level SQL interface for Apache Arrow.
- Delta Standalone - Standalone Delta Lake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL query engine that can query Parquet files directly. Interoperates with Apache Arrow.
- Pandas - Python data analysis and manipulation tool.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet and other files.
- delta-rs - Standalone Delta Lake driver for Python and Rust. Does not depend on Spark.
-
Distributed SQL processing
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- ksql - SQL interface for Kafka.
- PrestoDB - Distributed SQL query engine.
- Trino - Distributed SQL query engine. Fork of PrestoDB.
-
-
Scheduler
-
-
ETL, ELT and reverse ETL
-
- Airbyte - ELT.
- Census - Reverse ETL.
- RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (data warehouse or SaaS).
-
-
Datasets
-
- awesome-public-datasets
- CommonCrawl
- Criteo
- Entso-e
- GitHub Archives
- Kaggle - Community-sourced datasets.
- NYC Taxi
-
-
Benchmark
-
- Jepsen - Testing of distributed databases, queues and consensus protocols.
- TPC family benchmarks - Benchmarks for big data databases.
-
-
Readings
-
Papers
-
Architecture
-
Data modeling
-
Index
-
Vector similarity search
-
Vectorized query processing
-
Querying
-
Transactions
-
Consensus
-
Challenging platforms
-
Blogs to follow
- Engineering at Meta
- Engineering at Criteo
- Engineering at Uber
- Engineering at Airbnb
- Databricks
- Towards Data Science
-
-
Contributors
-
More
-
-
Show your support
-
More
-
-
License
-
More
-