awesome-hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
https://github.com/eric-erki/awesome-hadoop

Libraries and Tools
- Spring for Apache Hadoop
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- Apache Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
SQL on Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
Websites
- The Hadoop Ecosystem Table
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Hadoop illuminated - Open Source Hadoop Book
- How to monitor Hadoop metrics
- Hadoop Weekly
Books
- Hadoop in Action, Second Edition
- Hadoop Operations
Hadoop and Big Data Events
- DataWorks Summit
- awesome-awesomeness
NoSQL
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
Data Management
- Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
Realtime Data Processing
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
Distributed Computing and Programming
- Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications that require fine-grained interaction with many Spark contexts can be built on top of Apache Spark.
- SparkHub - A community site for Apache Spark
Search
- Elasticsearch
Security
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
- Project Rhino - Intel's open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address security and compliance challenges, and contribute the code back to Apache.
Benchmark
- Big Data Benchmark
- Big-Bench
Machine learning and Big Data analytics
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHadoop
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.
- BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
Misc.
- Hive-Sharp
