Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with hadoop
A curated list of projects in awesome lists tagged with hadoop.
https://github.com/Chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 11 Nov 2024
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 30 Dec 2024
https://github.com/dsaidgovsg/airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Last synced: 30 Oct 2024
https://github.com/aliyun/aliyun-emapreduce-datasources
Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
aliyun datasources e-mapreduce hadoop kafka spark
Last synced: 31 Dec 2024
https://github.com/cubefs/shuttle
Shuttle: Highly Available, High-Performance Remote Shuffle Service
distributed hadoop remote shuffle spark
Last synced: 20 Dec 2024
https://github.com/avast/hdfs-shell
HDFS Shell is an HDFS manipulation tool for working with the functions integrated in Hadoop DFS
big-data cli cli-application hadoop hdfs hdfs-manipulation linux shell
Last synced: 19 Dec 2024
https://github.com/sunchao/parquet-rs
Apache Parquet implementation in Rust
Last synced: 25 Nov 2024
https://github.com/zuinnote/hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark
Last synced: 02 Jan 2025
https://github.com/jcrist/skein
A tool and library for easily deploying applications on Apache YARN
apache-yarn cluster deployment hadoop hdfs python
Last synced: 28 Dec 2024
https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter
Exports Hadoop HDFS content statistics to Prometheus
hadoop hadoop-fsimage hdfs hdfs-metrics monitoring prometheus-exporter
Last synced: 05 Nov 2024
https://github.com/touero/ctenopharyngodon-idella
Distributed crawling of data from all Chinese universities using Hadoop MapReduce.
fastapi hadoop hadoop-mapreduce java mapreduce maven scraping
Last synced: 29 Dec 2024
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 02 Jan 2025
https://github.com/gtkcyber/griffon-vm
Griffon Data Science Virtual Machine
apache-drill apache-spark big-data data-science database elasticsearch hadoop jupyter-notebook mysql node-js python r ruby scala virtual-machine
Last synced: 12 Oct 2024
https://github.com/GridProtectionAlliance/openPDC
Open Source Phasor Data Concentrator
bpa-pdc-stream complex-event-processing hadoop iec61850 ieee-1344 ieee-c37118 naspi openpdc pdc phasor-data-concentrator phasor-measurement-unit pmu stream-processing stream-processing-engine streaming-data synchrophasor time-series
Last synced: 08 Nov 2024
https://github.com/qihoo360/xlearning-xdml
extremely distributed machine learning
ai distributed hadoop hazelcast kudu machine-learning parameter-server spark
Last synced: 14 Nov 2024
https://github.com/apache/calcite-avatica-go
Apache Calcite Go
big-data calcite geospatial hadoop java sql
Last synced: 27 Dec 2024
https://github.com/harisekhon/knowledge-base
Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP, etc.; gradually porting my large private knowledge base to the public
aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark
Last synced: 30 Dec 2024
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 27 Dec 2024
https://github.com/233zzh/TitanDataOperationSystem
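The best big data project. Titan Data Operation System is a full-stack, closed-loop system: a web system for data visualization, log collection via Flume-Kafka-Flume, a data warehouse designed in Hive, Spark code for transforming data between warehouse tables and migrating ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, Spring Boot, Bootstrap, ECharts, and others.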
azkaban flume hadoop hive kafka spark
Last synced: 30 Oct 2024
https://github.com/apache/hadoop-mapreduce
Mirror of Apache Hadoop MapReduce
Last synced: 28 Sep 2024
https://github.com/mmolimar/kafka-connect-fs
Kafka Connect FileSystem Connector
apache-kafka azure-storage confluent files filesystem ftp gcp hadoop hadoop-filesystem hdfs kafka kafka-connect kafka-connect-fs kafka-connector s3
Last synced: 17 Nov 2024
https://github.com/gateway-experiments/hadoop-yarn-api-python-client
Python client for Hadoop® YARN API
Last synced: 09 Nov 2024
https://github.com/rdblue/s3committer
Hadoop output committers for S3
hadoop netflix outputcommitter s3
Last synced: 06 Nov 2024
https://github.com/feng-li/Distributed-Statistical-Computing
Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)
hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models
Last synced: 30 Oct 2024
https://github.com/dimajix/flowman
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql
Last synced: 28 Dec 2024
https://github.com/lewuathe/docker-hadoop-cluster
Multi-node cluster on Docker for personal development.
Last synced: 12 Nov 2024
https://github.com/harrisiirak/webhdfs
Node.js WebHDFS REST API client
hadoop javascript node-webhdfs webhdfs
Last synced: 01 Jan 2025
https://github.com/iamabug/BigDataParty
All-in-One Dockerfile for big data components
big-data dockerfile hadoop kafka spark
Last synced: 12 Nov 2024
https://github.com/criteo/tf-yarn
Train TensorFlow models on YARN in just a few lines of code!
Last synced: 01 Jan 2025
https://github.com/apache/doris-website
Apache Doris Website
analytics apache big-data data-warehousing database datalake dbms distributed-system doris hadoop hive hudi iceberg mpp olap ssb tpch vectorized
Last synced: 01 Jan 2025
https://github.com/snowch/movie-recommender-demo
This project walks through how you can create recommendations using Apache Spark machine learning. There are a number of Jupyter notebooks that you can run on IBM Data Science Experience, and there is a live demo of a movie recommendation web application you can interact with. The demo also uses IBM Message Hub (Kafka) to push application events to a topic, where they are consumed by a Spark Streaming job running on IBM BigInsights (Hadoop).
alternating-least-squares biginsights bluemix bokeh cloudant collaborative-filtering dsx hadoop hive ibm-biginsights ibm-bluemix jupyter-notebook kafka machine-learning messagehub notebook python-flask-application redis spark spark-streaming
Last synced: 17 Nov 2024
https://github.com/spencertipping/ni
Say "ni" to data of any size
big-data datascience hadoop perl pipeline ssh visualization
Last synced: 25 Oct 2024
https://github.com/seznam/euphoria
Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine-independent programming model that can express both batch and stream transformations.
apache-flink apache-spark batch-processing big-data hadoop hdfs java-api kafka streaming-data unified-bigdata-processing
Last synced: 19 Dec 2024
https://github.com/huangyueranbbc/SparkDemo
Complete Spark example code demo (Java, Scala)
bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp
Last synced: 30 Oct 2024
https://github.com/flipkart-incubator/hbase-orm
A production-grade HBase ORM library that makes accessing HBase clean, fast and fun (can also be used as a Bigtable ORM)
bigtable bigtable-orm cloud-bigtable hadoop hbase hbase-orm mapreduce object-mapping orm
Last synced: 16 Nov 2024
https://github.com/coxautomotivedatasolutions/waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
data-engineering hadoop scala spark
Last synced: 12 Oct 2024
https://github.com/cloudposse/terraform-aws-emr-cluster
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
emr emr-cluster emr-notebooks emrfs hadoop hcl2 hive presto spark terraform terraform-aws terraform-module terraform-modules
Last synced: 28 Dec 2024
https://github.com/s911415/apache-hadoop-3.1.0-winutils
HADOOP 3.1.0 winutils
apache-hadoop hadoop native winutils
Last synced: 30 Oct 2024
https://github.com/shifuml/guagua
An iterative computing framework for both Hadoop MapReduce and Hadoop YARN.
hadoop in-memory iterative machine-learning yarn
Last synced: 10 Oct 2024
https://github.com/impetus/jumbune
Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,
aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster
Last synced: 14 Nov 2024
https://github.com/nielsbasjes/splittablegzip
Splittable Gzip codec for Hadoop
codec gzip gzip-codec gzipped-files hadoop mapreduce-java pig spark splittable
Last synced: 01 Jan 2025
https://github.com/thomasweise/distributedcomputingexamples
Example code for my Distributed Computing course at Hefei University.
axis2 c communication distributed-computing glassfish hadoop html java java-rmi java-servlet javascript javaserver-pages json-rpc jsp mpi servlet-container socket web-services xml xml-document
Last synced: 09 Nov 2024
https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio
Last synced: 31 Dec 2024
https://github.com/zuinnote/hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
analyze-office-documents bigdata excel flink hadoop hadoop-ecosystem hadoopoffice hive office poi spark
Last synced: 14 Oct 2024
https://github.com/mooseburger1/springboard-data-science-immersive
convolutional-neural-networks data-science deep-learning deep-neural-networks eda h5 hadoop nlp opencv pyspark python sql statistical-analysis statistical-inference statistical-modeling tensorboard tensorflow time-series-analysis time-series-prediction web-scraping
Last synced: 24 Nov 2024
https://github.com/longshilin/hdfs-netdisc
A distributed cloud storage system based on Hadoop 🌴
hadoop hadoop-filesystem hdfs hdfs-netdisc netdisk
Last synced: 10 Nov 2024
https://github.com/vivek-bombatkar/mylearningnotes
Because it's never too late to start taking notes and make them public...
blockchain hadoop hive pandas python spark sparkml
Last synced: 31 Dec 2024
https://github.com/rubenafo/docker-spark-cluster
A Spark cluster setup running on Docker containers
big-data docker docker-image hadoop openjdk scala spark
Last synced: 13 Oct 2024
https://github.com/zhuyuqing/bestconf
A tool automatically improving the performance of large-scale systems by finding better configuration settings
benchmark cassandra configuration hadoop hive mysql optimization performance spark tomcat tuning
Last synced: 05 Nov 2024
https://github.com/googlecloudplatform/serverless-spark-workshop
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases
Last synced: 07 Oct 2024
https://github.com/turboway/pybigdata
Use Python to work with various big data components
elasticsearch hadoop hbase hive impala kafka mapreduce spark
Last synced: 15 Nov 2024
https://github.com/damiencarol/jsr203-hadoop
A Java NIO file system provider for HDFS
Last synced: 29 Dec 2024
https://github.com/punit-naik/mlhadoop
This repository contains machine learning MapReduce code for Hadoop, written from scratch (without using any package or library), e.g. prediction (linear and logistic regression), clustering (k-means), classification (KNN), etc.
Last synced: 15 Nov 2024
https://github.com/myamafuj/hadoop-hive-spark-docker
Hadoop-Hive-Spark cluster + Jupyter on Docker
docker hadoop hive jupyter jupyter-notebook pyspark spark
Last synced: 11 Nov 2024
https://github.com/dimajix/spark-training
Repository used for Spark Trainings
hadoop hadoop-training hive pyspark python scala spark spark-ml spark-streaming spark-training sqoop
Last synced: 09 Nov 2024
https://github.com/Cigna/ibis
IBIS is a workflow creation-engine that abstracts the Hadoop internals of ingesting RDBMS data.
cigna hadoop hadoop-ecosystem hadoop-framework ibis ingestion oozie sqoop sqoop2 workflow workflow-automation workflow-scheduler
Last synced: 27 Nov 2024
https://github.com/terascope/teraslice
Scalable data processing pipelines in JavaScript
elasticsearch hadoop hdfs json kafka
Last synced: 27 Dec 2024
https://github.com/pnavaro/big-data
Python tools for big data
dask data-science hadoop jupyter-book notebooks python spark
Last synced: 02 Nov 2024
https://github.com/pbwebmedia/yarn-prometheus-exporter
Export Hadoop YARN (resource-manager) metrics in prometheus format
apache apache-hadoop exporter hadoop metrics prometheus resource-manager yarn yarn-hadoop-cluster
Last synced: 19 Dec 2024
https://github.com/palantir/hadoop-crypto
Library for per-file client-side encryption in Hadoop FileSystems such as HDFS or S3.
hadoop hadoop-crypto hadoop-filesystem octo-correct-managed
Last synced: 31 Dec 2024
https://github.com/pierrekieffer/docker-spark-yarn-cluster
Multi-node Hadoop cluster on Docker with Spark 2.4.1 on YARN
cluster docker hadoop spark yarn yarn-hadoop-cluster
Last synced: 02 Nov 2024
https://github.com/coxautomotivedatasolutions/spark-distcp
A re-implementation of Hadoop DistCP in Apache Spark
apache-spark data-engineering distcp hadoop spark
Last synced: 12 Oct 2024
https://github.com/melin/spark-jobserver
REST job server for Apache Spark
hadoop hive java kerberos kubernetes spark yarn
Last synced: 05 Nov 2024
https://github.com/LB-Yu/data-systems-learning
Learning summary and examples about data systems.
antlr big-data calcite distributed-systems flink hadoop hbase spark
Last synced: 05 Nov 2024
https://github.com/rootsongjc/magpie
YARN on Docker - Managing a Hadoop YARN cluster with Docker Swarm.
containers docker hadoop swarm yarn
Last synced: 27 Oct 2024
https://github.com/jacobstanley/hadoop-tools
Tools for working with Hadoop, written with performance in mind.
Last synced: 14 Nov 2024
https://github.com/bytedance/clickhouse_hadoop
Import data from ClickHouse to Hadoop with pure SQL
Last synced: 15 Nov 2024
https://github.com/basin-etl/basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
emr etl hadoop informatica odi pipeline pyspark spark
Last synced: 09 Nov 2024
https://github.com/pippozq/hadoop-ansible
Install a Hadoop cluster with Ansible
Last synced: 23 Nov 2024
https://github.com/kakao/cmux
A set of commands for managing CDH clusters using Cloudera Manager REST API.
Last synced: 19 Nov 2024
https://github.com/agile-lab-dev/darwin
Avro Schema Evolution made easy
avro avro-schema hadoop hbase scala schema-evolution spark
Last synced: 14 Oct 2024
https://github.com/whitfin/efflux
Easy Hadoop Streaming and MapReduce interfaces in Rust
Last synced: 16 Nov 2024
https://github.com/oeljeklaus-you/loganalyzehelper
Data-cleaning program for a forum log analysis system (includes an IP rule library, UDF development, MapReduce programs, and log data)
Last synced: 05 Nov 2024
https://github.com/apache/doris-thirdparty
Self-managed thirdparty dependencies for Apache Doris
analytics big-data data-warehousing database datalake dbms distributed-database hadoop hive hudi iceberg mpp olap real-time sql ssb tpch vectorized
Last synced: 01 Jan 2025
https://github.com/openucx/sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing the UCX communication layer
apache-spark big-data hadoop hpc rdma spark
Last synced: 10 Nov 2024
https://github.com/agile-lab-dev/wasp
WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.
akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn
Last synced: 01 Jan 2025
https://github.com/izeigerman/akkeeper
An easy way to deploy your Akka services to a distributed environment.
akka deployment distributed-actors distributed-systems hadoop monitoring yarn
Last synced: 09 Nov 2024
https://github.com/clusterdock/clusterdock
clusterdock is a framework for creating Docker-based container clusters
Last synced: 26 Oct 2024
https://github.com/kairen/learning-spark
A tidied-up collection of Spark and Hadoop tutorials.
bigdata data-science hadoop spark
Last synced: 30 Oct 2024