Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Projects in Awesome Lists tagged with hadoop

A curated list of projects in awesome lists tagged with hadoop.

https://github.com/skyleaworlder/hadoop-cfg

:elephant: Quick-Start scripts. *.sh about Hadoop 2.10.1 config on Ubuntu 20.04

hadoop

Last synced: 15 Nov 2024

https://github.com/khinshankhan/nlp-tf-idf-hadoop

NLP analysis of Term Frequency - Inverse Document Frequency using Hadoop

hadoop mapreduce nlp tf-idf

Last synced: 19 Jan 2025

https://github.com/HwiLu/Hadoop-cluster

Notes on big data components

hadoop hbase hive yarn

Last synced: 24 Oct 2024

https://github.com/steveloughran/validate-hadoop-client-artifacts

Build/validate Hadoop RCs. Moved into Apache Hadoop itself.

hadoop

Last synced: 15 Nov 2024

https://github.com/mitre/clusterconf

Manage Hadoop cluster configurations

hadoop hadoop-cluster r r-package rstats

Last synced: 09 Nov 2024

https://github.com/JHM9191/Smart_Inventory_Manager

This is an IoT project repository. The topic is Smart Inventory Management using a load cell sensor that detects the current weight of the IoT container that we made.

arduino aws firebase hadoop iot loadcell r-programming sensor tcpip

Last synced: 13 Nov 2024

https://github.com/mesmacosta/hive-custom-hook

Example of how to implement a Hive hook

hadoop hive hive-hook java metadata-extraction

Last synced: 11 Nov 2024

https://github.com/serenasensini/docker-apogeo

Repo containing the examples from the book "Docker", published by Apogeo. A guide to deploying applications in software containers, available from 24 September 2020!

apogeo docker flask hadoop kafka laravel nodejs sentiment-analysis sqlite

Last synced: 20 Nov 2024

https://github.com/majidgolshadi/knowledge

software technology documents

hadoop knowledge mongodb zookeeper

Last synced: 29 Dec 2024

https://github.com/davidov541/hadooponvagrant

Collection of Vagrant boxes that make setting up a mini-cluster simple

hadoop kerberos vagrant vagrant-boxes

Last synced: 14 Jan 2025

https://github.com/harshoza36/movielens_pyspark

MovieLens Dataset analysis using Hadoop and Pyspark

big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql

Last synced: 10 Jan 2025

https://github.com/michabirklbauer/mahout_docker

Running Apache Mahout in Docker.

apache docker dockerfile hadoop mahout maven spark

Last synced: 04 Jan 2025

https://github.com/yadvi12/automating-hadoop-cluster-on-aws-cloud-using-terraform

This repository is a part of our Final Year Minor/ Major Project in College.

automation aws big-data cloud-computing devops hadoop terraform

Last synced: 24 Jan 2025

https://github.com/multivacplatform/multivac-hdfs-c

Connect C/C++ applications to HDFS managed by Cloudera/CDH

c c-plus-plus cdh5 cloudera hadoop hdfs

Last synced: 12 Jan 2025

https://github.com/zurfyx/cassandra-hadoop-example

Cassandra Hadoop Example

cassandra hadoop mapreduce nodejs

Last synced: 11 Dec 2024

https://github.com/rosacarla/dio-cloud-data-engineer

Collects activities and projects completed during the Cognizant Cloud Data Engineer #2 bootcamp, organized by DIO Inc.

etl hadoop hive impala linux mongodb python sql

Last synced: 06 Jan 2025

https://github.com/chaokunyang/athena

A task scheduler for spark, flink, mapreduce, java, python, bash

flink hadoop mapreduce spark task-manager task-scheduler

Last synced: 19 Nov 2024

https://github.com/sivakumar-mahalingam/mercury

Collection of UDFs for Hive

hadoop hive udf udf-library

Last synced: 22 Jan 2025

https://github.com/timvisee/hhs-p7-movie-recommendation-engine

:movie_camera: Big data project for college (HHS) period 7

algorithm hadoop recommendation-engine spark

Last synced: 15 Jan 2025

https://github.com/highoncarbs/hadoopwithpy

:elephant: :heavy_plus_sign: :snake: Learning Hadoop with Python

flask hadoop hadoop-mapreduce hadoop-streaming python recommender-system

Last synced: 26 Jan 2025

https://github.com/dineshchitlangia/ambari-service-check

Ambari Service Check is a shell script utility to invoke service check for some or all components on the stack

ambari curl hadoop

Last synced: 23 Nov 2024

https://github.com/riskiq/solr-map-reduce

Utilities for creation of Solr indexes using mapreduce

hadoop solr

Last synced: 05 Nov 2024

https://github.com/pingsutw/hello-submarine

This repo is for beginners who want to learn and use Submarine

docker hadoop kubernetes pytorch submarine tensorflow

Last synced: 16 Oct 2024

https://github.com/prabaprakash/hadoop-2.3

Hadoop 2.3 for Windows x64

hadoop hadoop-mapreduce java

Last synced: 14 Nov 2024

https://github.com/touero/rhodeinae

A Java program for remotely operating HBase tasks.

hadoop hbase java maven

Last synced: 25 Jan 2025

https://github.com/conema/spark-terraform

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform

aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform

Last synced: 20 Nov 2024

https://github.com/stefan-schroedl/pigrank

Apache Pig UDFs for ranking (ndcg, mrr, jaccard coefficient, cosine similarity, rank-biased overlap)

cosine-similarity dcg hadoop map-reduce mrr pig ranking

Last synced: 16 Jan 2025

https://github.com/noseparte/depthsearch

A search engine project built with Hadoop, based on Spring Boot.

hadoop java lucence mongodb redis

Last synced: 26 Dec 2024

https://github.com/pirate-emperor/bigdata-pipeline

BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.

airflow airflow-dags airflow-docker big-data data-lake data-lakestore data-warehouse dbt dbt-core distributed-computing docker docker-compose hadoop hive hiveql kudu mysql mysql-server trino trino-cli

Last synced: 31 Jan 2025

https://github.com/jldbc/big-data

Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark

big-data hadoop pagerank recommender-system spark

Last synced: 04 Jan 2025

https://github.com/dimajix/docker-spark

Repository for building Docker containers for Spark

cluster docker hadoop spark

Last synced: 05 Jan 2025

https://github.com/geekalexis/search-engine

A distributed, RESTful search engine powered by AWS

aws hadoop search-engine webapp

Last synced: 12 Jan 2025

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 15 Dec 2024

https://github.com/keith-ratcliffe/cloud-basher

Automated, pluggable standup/teardown of cloud schtuff

accumulo automated bash hadoop install nifi

Last synced: 09 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/burhanahmed1/big-data-analytics

Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.

apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql

Last synced: 11 Oct 2024

https://github.com/zncdatadev/kubedoop

A modular open source big data platform built on Kubernetes and the cloud-native ecosystem, serving as the base for DataOps/MLOps (LLMOps)

bigdata cloud-native data-platform dataops hadoop kubernetes llmops mlops

Last synced: 19 Nov 2024

https://github.com/dimajix/docker-hive

Docker container running the Hive Metastore

docker hadoop hive

Last synced: 05 Jan 2025

https://github.com/omar-besbes/football-big-data

This is a comprehensive solution for real-time football analytics, leveraging Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.

batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn

Last synced: 20 Jan 2025

https://github.com/codito/hadoop-expt

Experiments with Hadoop cluster setups in Docker

docker docker-compose hadoop hadoop-cluster hadoop-docker

Last synced: 10 Nov 2024

https://github.com/madjar/fhue

Hue is annoying to use. F hue

hadoop haskell hdfs hue

Last synced: 17 Dec 2024

https://github.com/meijies/hadoop-performance-summary

Hadoop performance summary, no guesswork; work in progress

hadoop hdfs hive performance turning

Last synced: 06 Dec 2024

https://github.com/kmohamedalie/big-data-hadoop-spark-lab

Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼

big-data coursera data-engineering docker hadoop ibm kubernetes spark

Last synced: 02 Jan 2025

https://github.com/chanran/statvisit

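Records user behavior data while browsing web pages, such as clicks on a link in the page. The data is saved to local log files and then collected and processed by Flume, or uploaded to HDFS by a Linux scheduled task. Daily statistics (PV, UV) are then generated via HQL queries and saved to the relational database MySQL, and the statistics can be browsed on the website.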

hadoop java node-schedule nodejs

Last synced: 16 Jan 2025

https://github.com/mcddhub/mcdd-big-data-study

Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

big-data data-processing docker flink hadoop kafka spark zookeeper

Last synced: 10 Oct 2024

https://github.com/sandeepkundalwal/advanced-computer-science-practicum

[CS515: Advanced Computer Science Practicum] This repo contains all the assignments of CS515 offered at IIT Mandi by Dr. Sriram Kailasam & Dr. Manas Thakur during Fall Session 2022.

fork-join hadoop java mapreduce scheme-programming-language thread-pool threads

Last synced: 07 Dec 2024

https://github.com/zejnilovic/hadoop-docker

Hadoop 2.7.5 in Docker

docker hadoop

Last synced: 30 Dec 2024

https://github.com/zncdatadev/hdfs-operator

Apache Hadoop HDFS operator for the Kubernetes Data Stack

hadoop hdfs k8s kubernetes

Last synced: 09 Oct 2024

https://github.com/krishnadey30/newsheadlines

This repository has code that extracts meaningful information from a news headline dataset.

hadoop hadoop-mapreduce mapreduce-python news-dataset python

Last synced: 24 Jan 2025

https://github.com/hamzahamidi/map-reduce-sample

MapReduce exercises sample

hadoop hadoop-mapreduce java

Last synced: 06 Jan 2025

https://github.com/figuran04/big-data

📃 Big Data Practicum

anaconda big data hadoop hive mongodb pig spark

Last synced: 01 Nov 2024

https://github.com/rui-exe/feup-oakmont

Building a stock broker web application using Apache HBase, FastAPI and React.js

fastapi finance hadoop happybase hbase java non-relational-database python python3 react reactjs stock-broker stock-market wide-column-database zookeeper

Last synced: 08 Nov 2024

https://github.com/oracle-quickstart/oci-hadoop

Terraform module to deploy Hadoop on Oracle Cloud Infrastructure (OCI)

cloud hadoop oci oracle oracle-led terraform

Last synced: 07 Nov 2024

https://github.com/vicentebolea/hadoop-apriori

Apriori algorithm implemented in Hadoop

apriori hadoop

Last synced: 15 Jan 2025

https://github.com/nathanhowell/tfrecords-hadoop

A Hadoop OutputFormat for writing compressed TFRecords

hadoop java scala tensorflow tfrecords

Last synced: 27 Dec 2024

https://github.com/jms0522/hadoop_system

✅ Sets up a Hadoop ecosystem and builds a pipeline.

hadoop pipeline spark

Last synced: 11 Oct 2024

https://github.com/divinenaman/dbscan-mapreduce

DBSCAN implementation on mapreduce

dbscan-clustering hadoop java mapreduce-java

Last synced: 17 Dec 2024

https://github.com/yashindane/web-menu

:globe_with_meridians: Automate Docker, Kubernetes, Hadoop and AWS using voice commands!

ansible automation aws docker hadoop kubernetes

Last synced: 13 Jan 2025

https://github.com/rootsongjc/hadoop-cluster-monitor

Hadoop cluster monitor and alert

hadoop monitor

Last synced: 20 Dec 2024

https://github.com/leovct/hidoop

:elephant: Simple Big Data platform running MapReduce applications, inspired by Hadoop

big-data cluster hadoop hdfs mapreduce-applications

Last synced: 08 Nov 2024

https://github.com/gaelfoppolo/self-service-data-analytics

Data analysis made for business users

aws big-data data-analytics hadoop spark

Last synced: 08 Dec 2024

https://github.com/tck1/hadoop-mapreduce-example

Application implementing MapReduce techniques using Hadoop

hadoop java mapreduce

Last synced: 27 Jan 2025

https://github.com/yinfuyuan/docker-bigdata

This is a project created to build a big data cluster.

apache docker docker-compose hadoop hbase kafak zookeeper

Last synced: 23 Dec 2024

https://github.com/shathor/gaia-cluster

Provides a scaffold to easily build a cluster to query the data from ESA's Gaia satellite. Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population.

apache-cassandra apache-spark astronomy big-data bigdata cassandra cluster distributed-computing esa hadoop java java-8 machine-learning map-reduce

Last synced: 21 Jan 2025

https://github.com/badoo/hadoop-xargs

Util to run heterogeneous applications on Hadoop synchronously

hadoop java spark

Last synced: 12 Nov 2024

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 02 Nov 2024

https://github.com/gmartinezramirez-old/data-science-portafolio

:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, Scikit, Keras. Includes Classification, Regression, Time series, NLP, Deep learning, among others.

data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow

Last synced: 05 Dec 2024

https://github.com/risdenk/s3a-localstack

Testing of Apache Hadoop S3A with Localstack

hadoop localstack s3

Last synced: 06 Dec 2024

https://github.com/mukjepscarlet/bilibili-predict-recommend

[Big data course assignment] Bilibili assistant: video recommendation + trending prediction

bilibili flask hadoop html javascript prediction pyspark python recommendation spark

Last synced: 18 Jan 2025

https://github.com/starhe/balm

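A big data PaaS component adapter built on the Spring Boot family. One-click adaptation for DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j, operated through standard REST interfaces; simple to use and convenient for secondary development and integration.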

clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks

Last synced: 21 Dec 2024

https://github.com/hexnn/balm

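A big data PaaS component adapter built on the Spring Boot family. One-click adaptation for DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch, operated through standard REST interfaces and SQL statements; simple to use and convenient for secondary development and rapid integration.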

clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks

Last synced: 21 Dec 2024

https://github.com/saagarjha/hdshell

CLI wrapper for HDFS

cli hadoop hdfs macos

Last synced: 16 Dec 2024

https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024

An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024

dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark

Last synced: 21 Jan 2025

https://github.com/machinecyc/environmentsetting

Common Tools Installation Files in Data Analysis, Machine Learning, and Deep Learning

airflow docker docker-compose docker-image dockerhub git hadoop issues mysql python3 rabbitmq splunk tensorflow-gpu ubuntu virtualbox vscode

Last synced: 05 Dec 2024

https://github.com/kuro337/scalamono

Scala Monorepo Tooling for Kafka, Opensearch, Spark, Redpanda, Hadoop - and Lang Reference.

data database duckdb hadoop kafka redpanda sdala spark

Last synced: 14 Jan 2025

https://github.com/sameetasadullah/find-max-temperature-using-mapreduce-hadoop

Program coded in Java language to find max temperature in a large file using Hadoop MapReduce

hadoop hadoop-mapreduce java linux max-temperature ubuntu

Last synced: 21 Jan 2025

https://github.com/sameetasadullah/count-words-using-mapreduce-hadoop

Program coded in Java language to count words in a large file using Hadoop MapReduce

count-words hadoop hadoop-mapreduce java linux ubuntu

Last synced: 21 Jan 2025

https://github.com/sameetasadullah/check-keywords-using-mapreduce-hadoop

Program coded in Java language to find different types of keywords in a large file using Hadoop MapReduce

hadoop hadoop-mapreduce java linux ubuntu

Last synced: 21 Jan 2025

https://github.com/oracle-quickstart/oci-hortonworks

Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)

cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform

Last synced: 07 Nov 2024