awesome-hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
https://github.com/eric-erki/awesome-hadoop

Hadoop and Big Data Events
- awesome-awesomeness
- DataWorks Summit
- Strata + Hadoop World
Libraries and Tools
- Apache Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- Spring for Apache Hadoop
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
- hdfs - A native Go client for HDFS
- snakebite - A pure Python HDFS client
- Schema Registry UI - Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster.
Machine learning and Big Data analytics
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
- RHadoop
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
- Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
SQL on Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
- Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
- Lingual - SQL interface for Cascading (MR/Tez job generator)
Search
- Elasticsearch
- Banana - Kibana port for Apache Solr
NoSQL
- OpenTSDB - The Scalable Time Series Database
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- hindex - Secondary Index for HBase
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- happybase - A developer-friendly Python library to interact with Apache HBase.
Data Management
- Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
- Confluent Schema Registry for Kafka - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
- Hortonworks Schema Registry - Schema Registry is a framework to build metadata repositories.
Security
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
- Project Rhino - Intel's open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address security and compliance challenges, and contribute the code back to Apache.
Benchmark
- Big Data Benchmark
- Big-Bench
- YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
- HiBench
Websites
- How to monitor Hadoop metrics
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop illuminated - Open Source Hadoop Book
- Apache Hadoop YARN: Yet Another Resource Negotiator
Distributed Computing and Programming
- Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.
- SparkHub - A community site for Apache Spark
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
Books
- Hadoop in Action, Second Edition
- Apache Hadoop YARN
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop Operations
- Hadoop: The Definitive Guide
Realtime Data Processing
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
- Apache Storm
Misc.
- Hive-Sharp
- .Net FlumeNG Clients
- PyHive - Python interface to Hive and Presto
- Flume RabbitMQ source and sink
- shib - WebUI for query engines: Hive and Presto
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for Hive and hive-service
- Flume MongoDB Sink
- Flume UDP Source
DSL
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
- varaha - Machine learning and natural language processing with Apache Pig
- packetpig - Open Source Big Data Security Analytics
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- Lipstick - Pig workflow visualization tool. [Introducing Lipstick on A(pache) Pig](http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html)
Data Ingestion and Integration
- Suro - Netflix's distributed Data Pipeline
- Gobblin from LinkedIn - Universal data ingestion framework for Hadoop
Hadoop
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- Genie - Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- hadoopy - Python MapReduce library written in Cython.
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
- White Elephant - Hadoop log aggregator and dashboard
Workflow, Lifecycle and Governance
- Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
YARN
- mpich2-yarn - Running MPICH2 on YARN
Packaging, Provisioning and Monitoring
- ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
- Ganglia Monitoring System
Presentations
- Hadoop Performance at LinkedIn
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Docker based Hadoop provisioning

Programming Languages

Java 12 JavaScript 3 Go 2 C 1 Python 1 C# 1 Clojure 1 Ruby 1
