Projects in Awesome Lists tagged with hadoop

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 11 Nov 2024

https://github.com/apache/hadoop-hdfs

Mirror of Apache Hadoop HDFS

hadoop

Last synced: 28 Sep 2024

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 30 Dec 2024

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 30 Oct 2024

https://github.com/aliyun/aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

aliyun datasources e-mapreduce hadoop kafka spark

Last synced: 31 Dec 2024

https://github.com/apache/hadoop-common

Mirror of Apache Hadoop common

hadoop

Last synced: 28 Sep 2024

https://github.com/nielsbasjes/logparser

Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Flink, Beam, Storm, Drill, ...

apache beam drill flink hadoop hive httpd java logformat nginx parse parser

Last synced: 27 Dec 2024

https://github.com/cubefs/shuttle

Shuttle: Highly Available, High-Performance Remote Shuffle Service

distributed hadoop remote shuffle spark

Last synced: 20 Dec 2024

https://github.com/avast/hdfs-shell

HDFS Shell is an HDFS manipulation tool to work with functions integrated in Hadoop DFS

big-data cli cli-application hadoop hdfs hdfs-manipulation linux shell

Last synced: 19 Dec 2024

https://github.com/sunchao/parquet-rs

Apache Parquet implementation in Rust

hadoop parquet rust

Last synced: 25 Nov 2024

https://github.com/51zero/eel-sdk

Big Data Toolkit for the JVM

big-data etl hadoop hive kafka kudu orc parquet scala

Last synced: 02 Jan 2025

https://github.com/zuinnote/hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark

Last synced: 02 Jan 2025

https://github.com/jcrist/skein

A tool and library for easily deploying applications on Apache YARN

apache-yarn cluster deployment hadoop hdfs python

Last synced: 28 Dec 2024

https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter

Exports Hadoop HDFS content statistics to Prometheus

hadoop hadoop-fsimage hdfs hdfs-metrics monitoring prometheus-exporter

Last synced: 05 Nov 2024

https://github.com/touero/ctenopharyngodon-idella

Distributed crawling, with Hadoop MapReduce, of data on all Chinese universities.

fastapi hadoop hadoop-mapreduce java mapreduce maven scraping

Last synced: 29 Dec 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 02 Jan 2025

https://github.com/isxcode/spark-yun

Big data computing platform based on Spark (Zhiqingyun: an ultra-lightweight big data computing platform / data middle platform)

docker hadoop hive platform saas spark

Last synced: 28 Dec 2024

https://github.com/gtkcyber/griffon-vm

Griffon Data Science Virtual Machine

apache-drill apache-spark big-data data-science database elasticsearch hadoop jupyter-notebook mysql node-js python r ruby scala virtual-machine

Last synced: 12 Oct 2024

https://github.com/GridProtectionAlliance/openPDC

Open Source Phasor Data Concentrator

bpa-pdc-stream complex-event-processing hadoop iec61850 ieee-1344 ieee-c37118 naspi openpdc pdc phasor-data-concentrator phasor-measurement-unit pmu stream-processing stream-processing-engine streaming-data synchrophasor time-series

Last synced: 08 Nov 2024

https://github.com/qihoo360/xlearning-xdml

extremely distributed machine learning

ai distributed hadoop hazelcast kudu machine-learning parameter-server spark

Last synced: 14 Nov 2024

https://github.com/apache/calcite-avatica-go

Apache Calcite Go

big-data calcite geospatial hadoop java sql

Last synced: 27 Dec 2024

https://github.com/harisekhon/knowledge-base

Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public

aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark

Last synced: 30 Dec 2024

https://github.com/smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data

Last synced: 27 Dec 2024

https://github.com/paypal/nnanalytics

NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.

fsimage hadoop hdfs metadata namespace scanner utility

Last synced: 28 Dec 2024

https://github.com/233zzh/TitanDataOperationSystem

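The best big data project. The "Titan Data Operation System" is a full-stack, closed-loop system: a web system for data visualization, a flume-kafka-flume chain for reading logs, a data warehouse designed in Hive, Spark code for transforming data between warehouse tables and migrating ADS-layer tables to MySQL, and Azkaban for scheduling timed jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, Spring Boot, Bootstrap, ECharts, etc.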

azkaban flume hadoop hive kafka spark

Last synced: 30 Oct 2024

https://github.com/apache/hadoop-mapreduce

Mirror of Apache Hadoop MapReduce

hadoop

Last synced: 28 Sep 2024

https://github.com/mmolimar/kafka-connect-fs

Kafka Connect FileSystem Connector

apache-kafka azure-storage confluent files filesystem ftp gcp hadoop hadoop-filesystem hdfs kafka kafka-connect kafka-connect-fs kafka-connector s3

Last synced: 17 Nov 2024

https://github.com/gateway-experiments/hadoop-yarn-api-python-client

Python client for Hadoop® YARN API

hacktoberfest hadoop yarn

Last synced: 09 Nov 2024

https://github.com/rdblue/s3committer

Hadoop output committers for S3

hadoop netflix outputcommitter s3

Last synced: 06 Nov 2024

https://github.com/feng-li/Distributed-Statistical-Computing

Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 30 Oct 2024

https://github.com/dimajix/flowman

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql

Last synced: 28 Dec 2024

https://github.com/lewuathe/docker-hadoop-cluster

Multi-node cluster on Docker for personal development.

docker hadoop

Last synced: 12 Nov 2024

https://github.com/harrisiirak/webhdfs

Node.js WebHDFS REST API client

hadoop javascript node-webhdfs webhdfs

Last synced: 01 Jan 2025

https://github.com/iamabug/BigDataParty

All-in-One Dockerfile for big data components

big-data dockerfile hadoop kafka spark

Last synced: 12 Nov 2024

https://github.com/criteo/tf-yarn

Train TensorFlow models on YARN in just a few lines of code!

hadoop tensorflow yarn

Last synced: 01 Jan 2025

https://github.com/yahoo/hive-funnel-udf

Hive UDFs for funnel analysis

analytics funnel hadoop hive hive-udf udf

Last synced: 13 Nov 2024

https://github.com/apache/doris-website

Apache Doris Website

analytics apache big-data data-warehousing database datalake dbms distributed-system doris hadoop hive hudi iceberg mpp olap ssb tpch vectorized

Last synced: 01 Jan 2025

https://github.com/snowch/movie-recommender-demo

This project walks through how you can create recommendations using Apache Spark machine learning. There are a number of Jupyter notebooks that you can run on IBM Data Science Experience, and there is a live demo of a movie recommendation web application you can interact with. The demo also uses IBM Message Hub (Kafka) to push application events to a topic, where they are consumed by a Spark Streaming job running on IBM BigInsights (Hadoop).

alternating-least-squares biginsights bluemix bokeh cloudant collaborative-filtering dsx hadoop hive ibm-biginsights ibm-bluemix jupyter-notebook kafka machine-learning messagehub notebook python-flask-application redis spark spark-streaming

Last synced: 17 Nov 2024

https://github.com/spencertipping/ni

Say "ni" to data of any size

big-data datascience hadoop perl pipeline ssh visualization

Last synced: 25 Oct 2024

https://github.com/seznam/euphoria

Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine independent programming model which can express both batch and stream transformations.

apache-flink apache-spark batch-processing big-data hadoop hdfs java-api kafka streaming-data unified-bigdata-processing

Last synced: 19 Dec 2024

https://github.com/huangyueranbbc/SparkDemo

Complete Spark example code (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 30 Oct 2024

https://github.com/ianmcook/implyr

SQL backend to dplyr for Impala

apache dplyr dplyr-sql-backends hadoop impala jdbc odbc r sql tidyverse

Last synced: 31 Dec 2024

https://github.com/flipkart-incubator/hbase-orm

A production-grade HBase ORM library that makes accessing HBase clean, fast and fun (Can also be used as Bigtable ORM)

bigtable bigtable-orm cloud-bigtable hadoop hbase hbase-orm mapreduce object-mapping orm

Last synced: 16 Nov 2024

https://github.com/P7h/docker-spark

Docker image for Apache Spark

docker hadoop java scala spark

Last synced: 29 Oct 2024

https://github.com/coxautomotivedatasolutions/waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

data-engineering hadoop scala spark

Last synced: 12 Oct 2024

https://github.com/cloudposse/terraform-aws-emr-cluster

Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS

emr emr-cluster emr-notebooks emrfs hadoop hcl2 hive presto spark terraform terraform-aws terraform-module terraform-modules

Last synced: 28 Dec 2024

https://github.com/s911415/apache-hadoop-3.1.0-winutils

HADOOP 3.1.0 winutils

apache-hadoop hadoop native winutils

Last synced: 30 Oct 2024

https://github.com/shifuml/guagua

An iterative computing framework for both Hadoop MapReduce and Hadoop YARN.

hadoop in-memory iterative machine-learning yarn

Last synced: 10 Oct 2024

https://github.com/impetus/jumbune

Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,

aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster

Last synced: 14 Nov 2024

https://github.com/nielsbasjes/splittablegzip

Splittable Gzip codec for Hadoop

codec gzip gzip-codec gzipped-files hadoop mapreduce-java pig spark splittable

Last synced: 01 Jan 2025

https://github.com/thomasweise/distributedcomputingexamples

Example codes for my Distributed Computing course at Hefei University.

axis2 c communication distributed-computing glassfish hadoop html java java-rmi java-servlet javascript javaserver-pages json-rpc jsp mpi servlet-container socket web-services xml xml-document

Last synced: 09 Nov 2024

https://github.com/groda/big_data

Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.

apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio

Last synced: 31 Dec 2024

https://github.com/zuinnote/hadoopoffice

HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)

analyze-office-documents bigdata excel flink hadoop hadoop-ecosystem hadoopoffice hive office poi spark

Last synced: 14 Oct 2024

https://github.com/mooseburger1/springboard-data-science-immersive

convolutional-neural-networks data-science deep-learning deep-neural-networks eda h5 hadoop nlp opencv pyspark python sql statistical-analysis statistical-inference statistical-modeling tensorboard tensorflow time-series-analysis time-series-prediction web-scraping

Last synced: 24 Nov 2024

https://github.com/longshilin/hdfs-netdisc

Hadoop-based distributed cloud storage system

hadoop hadoop-filesystem hdfs hdfs-netdisc netdisk

Last synced: 10 Nov 2024

https://github.com/vivek-bombatkar/mylearningnotes

Because it's never too late to start taking notes and make them public...

blockchain hadoop hive pandas python spark sparkml

Last synced: 31 Dec 2024

https://github.com/rubenafo/docker-spark-cluster

A Spark cluster setup running on Docker containers

big-data docker docker-image hadoop openjdk scala spark

Last synced: 13 Oct 2024

https://github.com/zhuyuqing/bestconf

A tool automatically improving the performance of large-scale systems by finding better configuration settings

benchmark cassandra configuration hadoop hive mysql optimization performance spark tomcat tuning

Last synced: 05 Nov 2024

https://github.com/googlecloudplatform/serverless-spark-workshop

Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service

apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases

Last synced: 07 Oct 2024

https://github.com/turboway/pybigdata

Working with various big data components using Python

elasticsearch hadoop hbase hive impala kafka mapreduce spark

Last synced: 15 Nov 2024

https://github.com/damiencarol/jsr203-hadoop

A Java NIO file system provider for HDFS

hadoop hdfs java nio

Last synced: 29 Dec 2024

https://github.com/punit-naik/mlhadoop

This repository contains Machine-Learning MapReduce codes for Hadoop which are written from scratch (without using any package or library). E.g. Prediction (Linear and Logistic Regression), Clustering (K-Means), Classification (KNN) etc.

hadoop java machine-learning

Last synced: 15 Nov 2024

https://github.com/myamafuj/hadoop-hive-spark-docker

Hadoop-Hive-Spark cluster + Jupyter on Docker

docker hadoop hive jupyter jupyter-notebook pyspark spark

Last synced: 11 Nov 2024

https://github.com/v5tech/cloud

Cloud computing: environment setup and configuration files for Hadoop, Hive, Hue, Oozie, Sqoop, HBase, and ZooKeeper

flume flume-ng hadoop hbase hive hue oozie pig sqoop zookeeper

Last synced: 09 Nov 2024

https://github.com/dimajix/spark-training

Repository used for Spark Trainings

hadoop hadoop-training hive pyspark python scala spark spark-ml spark-streaming spark-training sqoop

Last synced: 09 Nov 2024

https://github.com/Cigna/ibis

IBIS is a workflow creation-engine that abstracts the Hadoop internals of ingesting RDBMS data.

cigna hadoop hadoop-ecosystem hadoop-framework ibis ingestion oozie sqoop sqoop2 workflow workflow-automation workflow-scheduler

Last synced: 27 Nov 2024

https://github.com/maicius/weblogsanalysissystem

A big data platform for analyzing web access logs

echarts hadoop hbase scala spark

Last synced: 11 Nov 2024

https://github.com/terascope/teraslice

Scalable data processing pipelines in JavaScript

elasticsearch hadoop hdfs json kafka

Last synced: 27 Dec 2024

https://github.com/pnavaro/big-data

Python tools for big data

dask data-science hadoop jupyter-book notebooks python spark

Last synced: 02 Nov 2024

https://github.com/wzdnzd/bigdata-notes

BigData Learning Notes

bigdata hadoop spark

Last synced: 01 Jan 2025

https://github.com/pbwebmedia/yarn-prometheus-exporter

Export Hadoop YARN (resource-manager) metrics in Prometheus format

apache apache-hadoop exporter hadoop metrics prometheus resource-manager yarn yarn-hadoop-cluster

Last synced: 19 Dec 2024

https://github.com/asdf2014/yuzhouwan

Code Library for My Blog

ai algorithm bigdata clojure druid elasticsearch go hadoop hbase java python scala spark tensorflow yuzhouwan zookeeper

Last synced: 01 Jan 2025

https://github.com/palantir/hadoop-crypto

Library for per-file client-side encryption in Hadoop FileSystems such as HDFS or S3.

hadoop hadoop-crypto hadoop-filesystem octo-correct-managed

Last synced: 31 Dec 2024

https://github.com/josonle/bigdata-learning

Big data learning, mainly covering Kafka, ZooKeeper, Hive, HBase, and Spark

hadoop hive java kafka scala spark zookeeper

Last synced: 25 Nov 2024

https://github.com/pierrekieffer/docker-spark-yarn-cluster

Multi-node Hadoop cluster on Docker with Spark 2.4.1 on YARN

cluster docker hadoop spark yarn yarn-hadoop-cluster

Last synced: 02 Nov 2024

https://github.com/coxautomotivedatasolutions/spark-distcp

A re-implementation of Hadoop DistCP in Apache Spark

apache-spark data-engineering distcp hadoop spark

Last synced: 12 Oct 2024

https://github.com/niqdev/devops

DevOps

ansible cassandra docker hadoop kafka kubernetes mapreduce oozie spark vagrant zeppelin zookeeper

Last synced: 06 Nov 2024

https://github.com/jehiah/gomrjob

gomrjob - a Go Framework for Hadoop Map Reduce Jobs

dataproc go hadoop mapreduce mrjob

Last synced: 27 Oct 2024

https://github.com/aikuyun/bigdata-doc

Big data study notes, learning roadmaps, and a collection of technical case studies.

bigdata flink hadoop hdfs hive kafka mapreduce

Last synced: 30 Oct 2024

https://github.com/dbiir/paraflow

A real-time analytical system for ID-associated data

hadoop kafka orc parquet presto spark-sql

Last synced: 21 Nov 2024

https://github.com/melin/spark-jobserver

REST job server for Apache Spark

hadoop hive java kerberos kubernetes spark yarn

Last synced: 05 Nov 2024

https://github.com/LB-Yu/data-systems-learning

Learning summary and examples about data systems.

antlr big-data calcite distributed-systems flink hadoop hbase spark

Last synced: 05 Nov 2024

https://github.com/rootsongjc/magpie

Yarn on Docker - Managing Hadoop Yarn cluster with Docker Swarm.

containers docker hadoop swarm yarn

Last synced: 27 Oct 2024

https://github.com/jacobstanley/hadoop-tools

Tools for working with Hadoop, written with performance in mind.

hadoop haskell hdfs

Last synced: 14 Nov 2024

https://github.com/bytedance/clickhouse_hadoop

Import data from ClickHouse to Hadoop with pure SQL

clickhouse hadoop

Last synced: 15 Nov 2024

https://github.com/basin-etl/basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

emr etl hadoop informatica odi pipeline pyspark spark

Last synced: 09 Nov 2024

https://github.com/pippozq/hadoop-ansible

Install a Hadoop cluster with Ansible

ansible hadoop

Last synced: 23 Nov 2024

https://github.com/kakao/cmux

A set of commands for managing CDH clusters using Cloudera Manager REST API.

cdh fzf hadoop hbase

Last synced: 19 Nov 2024

https://github.com/agile-lab-dev/darwin

Avro Schema Evolution made easy

avro avro-schema hadoop hbase scala schema-evolution spark

Last synced: 14 Oct 2024

https://github.com/whitfin/efflux

Easy Hadoop Streaming and MapReduce interfaces in Rust

hadoop mapreduce processing

Last synced: 16 Nov 2024

https://github.com/oeljeklaus-you/loganalyzehelper

Data-cleaning program for a forum log analysis system (includes an IP rule library, UDF development, MapReduce programs, and log data)

hadoop java

Last synced: 05 Nov 2024

https://github.com/apache/doris-thirdparty

Self-managed thirdparty dependencies for Apache Doris

analytics big-data data-warehousing database datalake dbms distributed-database hadoop hive hudi iceberg mpp olap real-time sql ssb tpch vectorized

Last synced: 01 Jan 2025

https://github.com/absaoss/enceladus

Dynamic Conformance Engine

bigdata datalake hadoop mongodb scala spark spring

Last synced: 19 Dec 2024

https://github.com/openucx/sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer

apache-spark big-data hadoop hpc rdma spark

Last synced: 10 Nov 2024

https://github.com/agile-lab-dev/wasp

WASP is a framework to build complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn