awesome-dataops

A curated list of awesome DataOps tools
https://github.com/kelvins/awesome-dataops

Data Catalog
- Amundsen - Data discovery and metadata engine that improves productivity when working with data.
- Apache Atlas - Provides open metadata management and governance capabilities to build a data catalog.
- DataHub - LinkedIn's generalized metadata search & discovery tool.
- OpenMetadata - A single place to discover, collaborate, and get your data right.
- Marquez - Service for the collection, aggregation, and visualization of a data ecosystem's metadata.
- OpenLineage - Open standard for metadata and lineage collection.
- CKAN - Open-source DMS (data management system) for powering data hubs and data portals.
- Magda - A federated, open-source data catalog for all your big data and small data.
- Unity Catalog - Industry’s only universal catalog for data and AI.
Data Exploration
- Apache Zeppelin - Enables data-driven, interactive data analytics and collaborative documents.
- Jupyter Notebook - Web-based notebook environment for interactive computing.
- JupyterLab - The next-generation user interface for Project Jupyter.
- Polynote - The polyglot notebook with first-class Scala support.
- Jupytext - Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts.
Data Ingestion
- Amazon Kinesis - Easily collect, process, and analyze video and data streams in real time.
- Google PubSub - Ingest events for streaming into BigQuery, data lakes or operational databases.
- RabbitMQ - One of the most popular open source message brokers.
- Apache Gobblin - A framework that simplifies common aspects of big data such as data ingestion.
- Apache Kafka - Open-source distributed event streaming platform used by thousands of companies.
- Apache Pulsar - Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
- Embulk - A parallel bulk data loader that helps data transfer between various storages.
- Fluentd - Collects events from various data sources and writes them to files.
- Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
- Pravega - An open source distributed storage service implementing Streams.
Data Processing
- Apache Hadoop MapReduce - A framework for writing applications which process vast amounts of data.
- Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
- Apache Flink - An open source stream processing framework with powerful capabilities.
- Apache Nifi - An easy to use, powerful, and reliable system to process and distribute data.
- Apache Samza - A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
- Apache Spark - A unified analytics engine for large-scale data processing.
- Apache Storm - An open source distributed realtime computation system.
- Apache Tez - A generic data-processing pipeline engine envisioned as a low-level engine.
- Faust - A stream processing library, porting the ideas from Kafka Streams to Python.
Data Quality
- Great Expectations - A Python data validation framework that lets you test your data against declared expectations.
- JSON Schema - A vocabulary that allows you to annotate and validate JSON documents.
- Cerberus - Lightweight, extensible data validation library for Python.
- Cleanlab - Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers.
- DataProfiler - A Python library designed to make data analysis, monitoring, and sensitive data detection easy.
- Deequ - A library built on top of Apache Spark for measuring data quality in large datasets.
- SodaSQL - Data profiling, testing, and monitoring for SQL accessible data.
Data Visualization
- Count - SQL/drag-and-drop querying and visualisation tool based on notebooks.
- Data Studio - Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics.
- Metabase - The simplest, fastest way to get business intelligence and analytics to everyone.
- Redash - Connect to any data source, easily visualize, dashboard and share your data.
- Apache Superset - A modern data exploration and data visualization platform.
- Dash - Analytical Web Apps for Python, R, Julia, and Jupyter.
- HUE - A mature SQL Assistant for querying Databases & Data Warehouses.
- Lux - Fast and easy data exploration by automating the visualization and data analysis process.
Data Warehouse
- Amazon Redshift - Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
- Google BigQuery - Serverless, highly scalable, and cost-effective multicloud data warehouse.
- Apache Hive - Facilitates reading, writing, and managing large datasets residing in distributed storage.
- Apache Kylin - An open source, distributed analytical data warehouse for big data.
Database
- Columnar Database
  - Scylla - Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies.
  - Apache Cassandra - Open source column based DBMS designed to handle large amounts of data.
  - Apache Druid - Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
  - Apache HBase - An open-source, distributed, versioned, column-oriented store.
- Key-Value Database
  - DynamoDB - Fast, flexible NoSQL database service for single-digit millisecond performance at any scale.
  - Apache Accumulo - A sorted, distributed key-value store that provides robust and scalable data storage.
  - Dragonfly - A modern in-memory datastore, fully compatible with Redis and Memcached APIs.
  - etcd - Distributed reliable key-value store for the most critical data of a distributed system.
  - EVCache - A distributed in-memory data store for the cloud.
  - Memcached - A high performance multithreaded event-based key/value cache store.
  - Redis - An in-memory key-value database that persists on disk.
- Vector Database
  - Pinecone - Managed and distributed vector similarity search used with a lightweight SDK.
  - Qdrant - An open source vector similarity search engine with extended filtering support.
- Document-Oriented Database
  - Apache CouchDB - An open-source document-oriented NoSQL database, implemented in Erlang.
  - Elasticsearch - A distributed document oriented database with a RESTful search engine.
  - MongoDB - A cross-platform document database that uses JSON-like documents with optional schemas.
  - RethinkDB - The first open-source scalable database built for realtime applications.
- Graph Database
  - ArangoDB - A scalable open-source multi-model database natively supporting graph, document and search.
  - Memgraph - An open source graph database, built for real-time streaming data, compatible with Neo4j.
  - Neo4j - A high performance graph store with all the features expected of a mature and robust database.
  - Titan - A highly scalable graph database optimized for storing and querying large graphs.
  - Age - A multi-model database that supports both graph and relational data models.
  - JanusGraph - Manage large graphs with billions of data distributed across a multi-machine cluster.
- Relational Database
  - CockroachDB - A distributed database designed to build, scale, and manage data-intensive apps.
  - Crate - A distributed SQL database that makes it simple to store and analyze massive amounts of data.
  - MariaDB - A replacement of MySQL with more features, new storage engines and better performance.
  - MySQL - One of the most popular open source transactional databases.
  - PostgreSQL - An advanced RDBMS that supports an extended subset of the SQL standard.
  - RQLite - A lightweight, distributed relational database, which uses SQLite as its storage engine.
  - SQLite - A popular choice as embedded database software for local/client storage.
- Time Series Database
  - Akumuli - Can be used to capture, store and process time-series data in real-time.
  - Atlas - An in-memory dimensional time series database.
  - InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
  - QuestDB - An open source SQL database designed to process time series data, faster.
  - TimescaleDB - Open-source time-series SQL database optimized for fast ingest and complex queries.
File System
- Amazon Simple Storage Service (S3) - Object storage built to retrieve any amount of data from anywhere.
- Apache Hadoop Distributed File System (HDFS) - A distributed file system.
- Google Cloud Storage (GCS) - Object storage for companies of all sizes, to store any amount of data.
- SeaweedFS - A fast distributed storage system for blobs, objects, files, and data lake.
- Alluxio - A virtual distributed storage system.
- GlusterFS - A software defined distributed storage that can scale to several petabytes.
- LakeFS - Open source tool that transforms your object storage into a Git-like repository.
- LizardFS - A highly reliable, scalable and efficient distributed file system.
- MinIO - High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API.
- Swift - A distributed object storage system designed to scale from a single machine to thousands of servers.
Metadata Service
- Hive Metastore - Service that stores metadata related to Apache Hive and other services.
- Metacat - Provides you information about what data you have, where it resides and how to process it.
SQL Query Engine
- Dremio - Power high-performing BI dashboards and interactive analytics directly on data lake.
- Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
- Apache Impala - Lightning-fast, distributed SQL queries for petabytes of data.
- Presto - A distributed SQL query engine for big data.
- Trino - A fast distributed SQL query engine for big data analytics.
Books
Other Lists
Slack
- Delta Lake Workspace
- Trino Workspace
Data Serialization
  - Apache Parquet - A columnar storage format which provides efficient storage and encoding of data.
  - Apache Avro - A data serialization system which is compact, fast and provides rich data structures.
  - Apache ORC - A self-describing type-aware columnar file format designed for Hadoop workloads.
  - Kryo - A fast and efficient binary object graph serialization framework for Java.
  - ProtoBuf - Language-neutral, platform-neutral, extensible mechanism for serializing structured data.
- Data Compression
  - Pigz - A parallel implementation of gzip for modern multi-processor, multi-core machines.
  - Snappy - Open source compression library that is fast, stable and robust.
- Data Table Format
  - Apache Hudi - Manages the storage of large analytical datasets on DFS.
  - Apache Iceberg - Open table format for huge analytic datasets.
  - Delta Lake - An open source project that enables building a Lakehouse architecture on top of data lakes.
Data Workflow
- Luigi - Python module that helps you build complex pipelines of batch jobs.
- Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
- Apache Oozie - An extensible, scalable and reliable system to manage complex Hadoop workloads.
- Azkaban - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
- Dagster - An orchestration platform for the development, production, and observation of data assets.
- Prefect - A workflow management system, designed for modern infrastructure.
Logging and Monitoring
- Grafana - Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more.
- Loki - A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
- Prometheus - A monitoring system and time series database.
- Whylogs - A tool for creating data logs, enabling monitoring for data drift and data quality issues.
