awesome-hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
https://github.com/youngwookim/awesome-hadoop


Libraries and Tools
- Spring for Apache Hadoop
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- snakebite - A pure Python HDFS client
- Apache Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Apache Superset (incubating) - Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
- Schema Registry UI - Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster.
Hadoop
- Apache Hadoop - Apache Hadoop
- Apache Hadoop Ozone - An Object Store for Apache Hadoop
- Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- White Elephant - Hadoop log aggregator and dashboard
- Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
- Apache Ignite - Distributed in-memory platform
YARN
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
NoSQL
- Apache HBase - Apache HBase
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- happybase - A developer-friendly Python library to interact with Apache HBase.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
SQL on Hadoop
- Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
- Apache Phoenix
- Lingual - SQL interface for Cascading (MR/Tez job generator)
- Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
- Apache Tajo - Data warehouse system for Apache Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Apache Trafodion
- Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
Data Management
- Apache Calcite - A Dynamic Data Management Framework
- Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
- Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
Workflow, Lifecycle and Governance
- Apache Oozie - Apache Oozie
- Azkaban
- Apache Falcon - Data management and processing platform
- Apache NiFi - A dataflow system
- Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
- Luigi - Python package that helps you build complex pipelines of batch jobs
Data Ingestion and Integration
- Apache Flume - Apache Flume
- Apache Sqoop - Apache Sqoop
- Apache Kafka - Apache Kafka
- Gobblin from LinkedIn - Universal data ingestion framework for Hadoop
DSL
- Apache Pig - Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Realtime Data Processing
- Apache Storm
- Apache Samza
- Apache Spark
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
- Apache Pulsar (incubating) - Apache Pulsar (incubating) is a highly scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
- Apache Druid (incubating) - A high-performance, column-oriented, distributed data store.
Distributed Computing and Programming
- Apache Spark
- Spark Packages - A community index of packages for Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
- Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
- Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.
- SparkHub - A community site for Apache Spark
Packaging, Provisioning and Monitoring
- Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari - Apache Ambari
- Ganglia Monitoring System
- Apache ZooKeeper - Apache ZooKeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Logit.io - Send logs from Hadoop to Elasticsearch for monitoring and alerting.
Search
- Elasticsearch
- Apache Solr - Apache Solr is an open source search platform built upon a Java library called Lucene.
Search Engine Framework
- Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.
Security
- Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Sentry - An authorization module for Hadoop
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
Benchmark
- Big Data Benchmark
- HiBench
Machine learning and Big Data analytics
- Apache Mahout
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHadoop
- Apache Lens
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
- BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
- Apache Hivemall (incubating) - Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.
Websites
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop illuminated - Open Source Hadoop Book
- AWS BigData Blog
- Hadoop360
- How to monitor Hadoop metrics
Presentations
- Hadoop Performance at LinkedIn
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Docker based Hadoop provisioning
Books
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop YARN
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Hadoop and Big Data Events
- ApacheCon
- Strata + Hadoop World
- DataWorks Summit
- awesome-awesomeness
