awesome-hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
https://github.com/youngwookim/awesome-hadoop

  • Hadoop and Big Data Events

  • Libraries and Tools

  • Machine learning and Big Data analytics

    • MLlib - MLlib is Apache Spark's scalable machine learning library.
    • R - R is a free software environment for statistical computing and graphics.
    • BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
    • RHadoop
    • Apache Lens
    • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
    • Apache Hivemall (incubating) - Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.
    • Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
  • SQL on Hadoop

    • Apache Drill - Schema-free SQL Query Engine
    • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
    • Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
    • Apache Trafodion
    • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.
    • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • NoSQL

    • OpenTSDB - The Scalable Time Series Database
    • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
    • hindex - Secondary Index for HBase
    • Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
    • Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
    • happybase - A developer-friendly Python library to interact with Apache HBase.
  • Data Management

    • Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
    • Apache Atlas - Metadata tagging and lineage capture supporting complex business data taxonomies
    • Confluent Schema registry for Kafka - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
    • Hortonworks Schema Registry - Schema Registry is a framework to build metadata repositories.
  • Security

    • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
    • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
    • Apache Sentry - An authorization module for Hadoop
  • Benchmark

    • Big Data Benchmark
    • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
    • HiBench
  • Websites

  • Packaging, Provisioning and Monitoring

    • Logit.io - Send logs from Hadoop to Elasticsearch for monitoring and alerting.
    • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
    • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
    • Ganglia Monitoring System
  • Hadoop

    • Apache Hadoop Ozone - An Object Store for Apache Hadoop
    • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
    • Apache Kylin - Apache Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
    • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
    • Genie - Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, to manage multiple Hadoop resources, and to perform job submissions across them.
    • hadoopy - Python MapReduce library written in Cython.
    • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
    • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
    • pydoop - Pydoop is a package that provides a Python API for Hadoop.
    • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
    • White Elephant - Hadoop log aggregator and dashboard
  • YARN

    • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
    • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
    • mpich2-yarn - Running MPICH2 on Yarn
  • Workflow, Lifecycle and Governance

    • Luigi - Python package that helps you build complex pipelines of batch jobs
    • Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
  • DSL

    • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
    • seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
    • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
    • varaha - Machine learning and natural language processing with Apache Pig
    • packetpig - Open Source Big Data Security Analytics
    • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
    • Lipstick - Pig workflow visualization tool. [Introducing Lipstick on A(pache) Pig](http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html)
  • Realtime Data Processing

    • Apache Pulsar (incubating) - Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
    • Apache Druid (incubating) - A high-performance, column-oriented, distributed data store.
    • Apache Storm
  • Distributed Computing and Programming

    • Spark Packages - A community index of packages for Apache Spark
    • Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications that require fine-grained interaction with many Spark contexts can be built on top of Apache Spark.
    • SparkHub - A community site for Apache Spark
    • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
  • Presentations

  • Books

  • Data Ingestion and Integration

  • Misc.