Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with hadoop
A curated list of projects in awesome lists tagged with hadoop.
https://github.com/skyleaworlder/hadoop-cfg
:elephant: Quick-Start scripts. *.sh about Hadoop 2.10.1 config on Ubuntu 20.04
Last synced: 15 Nov 2024
https://github.com/khinshankhan/nlp-tf-idf-hadoop
NLP analysis of Term Frequency - Inverse Document Frequency using Hadoop
Last synced: 19 Jan 2025
https://github.com/steveloughran/validate-hadoop-client-artifacts
build/validate hadoop RCs. moved into apache hadoop itself.
Last synced: 15 Nov 2024
https://github.com/mitre/clusterconf
Manage Hadoop cluster configurations
hadoop hadoop-cluster r r-package rstats
Last synced: 09 Nov 2024
https://github.com/JHM9191/Smart_Inventory_Manager
This is an IoT project repository. The topic is Smart Inventory Management using a load cell sensor that detects the current weight of the IoT container that we made
arduino aws firebase hadoop iot loadcell r-programming sensor tcpip
Last synced: 13 Nov 2024
https://github.com/mesmacosta/hive-custom-hook
Example of how to implement a Hive hook
hadoop hive hive-hook java metadata-extraction
Last synced: 11 Nov 2024
https://github.com/serenasensini/docker-apogeo
Repo containing the examples from the book "Docker", published by Apogeo. A guide to deploying applications in software containers, available from 24 September 2020!
apogeo docker flask hadoop kafka laravel nodejs sentiment-analysis sqlite
Last synced: 20 Nov 2024
https://github.com/majidgolshadi/knowledge
software technology documents
hadoop knowledge mongodb zookeeper
Last synced: 29 Dec 2024
https://github.com/davidov541/hadooponvagrant
Collection of vagrant boxes which makes setting up a mini-cluster simple
hadoop kerberos vagrant vagrant-boxes
Last synced: 14 Jan 2025
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/yadvi12/automating-hadoop-cluster-on-aws-cloud-using-terraform
This repository is part of our Final Year Minor/Major Project in College.
automation aws big-data cloud-computing devops hadoop terraform
Last synced: 24 Jan 2025
https://github.com/multivacplatform/multivac-hdfs-c
Connect C/C++ applications to HDFS managed by Cloudera/CDH
c c-plus-plus cdh5 cloudera hadoop hdfs
Last synced: 12 Jan 2025
https://github.com/ahmetfurkandemir/minio-hive-example
Kubernetes Hive Minio connection example
apache-hive hadoop hive hive-metastore hive-server k8s kubernetes kubernetes-cluster kubernetes-deployment minio postgresql s3 s3-bucket
Last synced: 17 Jan 2025
https://github.com/zurfyx/cassandra-hadoop-example
Cassandra Hadoop Example
cassandra hadoop mapreduce nodejs
Last synced: 11 Dec 2024
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/sivakumar-mahalingam/mercury
Collection of UDFs for Hive
Last synced: 22 Jan 2025
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
:movie_camera: Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025
https://github.com/highoncarbs/hadoopwithpy
:elephant: :heavy_plus_sign: :snake: Learning Hadoop with Python
flask hadoop hadoop-mapreduce hadoop-streaming python recommender-system
Last synced: 26 Jan 2025
https://github.com/dineshchitlangia/ambari-service-check
Ambari Service Check is a shell script utility to invoke service checks for some or all components on the stack
Last synced: 23 Nov 2024
https://github.com/riskiq/solr-map-reduce
Utilities for creation of Solr indexes using mapreduce
Last synced: 05 Nov 2024
https://github.com/pingsutw/hello-submarine
This repo is for beginners who want to learn and use Submarine
docker hadoop kubernetes pytorch submarine tensorflow
Last synced: 16 Oct 2024
https://github.com/touero/rhodeinae
A Java program for remotely operating HBase tasks.
Last synced: 25 Jan 2025
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/stefan-schroedl/pigrank
Apache Pig UDFs for ranking (ndcg, mrr, jaccard coefficient, cosine similarity, rank-biased overlap)
cosine-similarity dcg hadoop map-reduce mrr pig ranking
Last synced: 16 Jan 2025
https://github.com/pirate-emperor/bigdata-pipeline
BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.
airflow airflow-dags airflow-docker big-data data-lake data-lakestore data-warehouse dbt dbt-core distributed-computing docker docker-compose hadoop hive hiveql kudu mysql mysql-server trino trino-cli
Last synced: 31 Jan 2025
https://github.com/jldbc/big-data
Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark
big-data hadoop pagerank recommender-system spark
Last synced: 04 Jan 2025
https://github.com/dimajix/docker-spark
Repository for building Docker containers for Spark
Last synced: 05 Jan 2025
https://github.com/geekalexis/search-engine
A distributed, RESTful search engine powered by AWS
aws hadoop search-engine webapp
Last synced: 12 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 15 Dec 2024
https://github.com/mikeroyal/apache-hadoop-guide
Apache Hadoop Guide
hadoop hadoop-cluster hadoop-filesystem hadoop-hdfs hadoop-mapreduce
Last synced: 12 Dec 2024
https://github.com/janheinrichmerker/song-analysis
Analysing the Million Song Dataset.
big-data data-analysis data-science hadoop hadoop-mapreduce java kotlin songs
Last synced: 24 Dec 2024
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 11 Oct 2024
https://github.com/zncdatadev/kubedoop
A modular open source big data platform built on Kubernetes and the cloud-native ecosystem, serving as the base for DataOps/MLOps (LLMOps)
bigdata cloud-native data-platform dataops hadoop kubernetes llmops mlops
Last synced: 19 Nov 2024
https://github.com/spineo/ansible-aws-instance
Launch AWS Instances Using Ansible
accumulo ansible ansible-inventory ansible-playbook ansible-template aws aws-ec2 hadoop perl python python3 zookeeper
Last synced: 23 Jan 2025
https://github.com/dimajix/docker-hive
Docker container running the Hive Metastore
Last synced: 05 Jan 2025
https://github.com/omar-besbes/football-big-data
This is a comprehensive solution for real-time football analytics. It uses Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, along with a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025
https://github.com/codito/hadoop-expt
Experiments with Hadoop cluster setups in Docker
docker docker-compose hadoop hadoop-cluster hadoop-docker
Last synced: 10 Nov 2024
https://github.com/meijies/hadoop-performance-summary
Hadoop performance summary, no guesswork; work in progress
hadoop hdfs hive performance turning
Last synced: 06 Dec 2024
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/chanran/statvisit
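Records user behavior data while browsing web pages, such as clicks on a link in the page. The data is saved to local log files, then collected and processed by Flume, or uploaded by a Linux scheduled task, into HDFS. HQL queries then generate daily statistics (PV, UV), which are saved to a MySQL relational database and can also be viewed on the website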
hadoop java node-schedule nodejs
Last synced: 16 Jan 2025
https://github.com/sergiomt/centorion
Configurable Vagrant development virtual machine on CentOs 7
cassandra centos centos7 cinnamon dcevm development docker hadoop hbase installer lamp openldap phpldapadmin phppgadmin postgresql tomcat vagrant vm zookeeper
Last synced: 14 Jan 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 10 Oct 2024
https://github.com/icai/whybug
fundebug
bigdata bug-tracker elk-stack flink hadoop miniprogram tracking
Last synced: 08 Jan 2025
https://github.com/sandeepkundalwal/advanced-computer-science-practicum
[CS515: Advanced Computer Science Practicum] This repo contains all the assignments of CS515, offered at IIT Mandi by Dr. Sriram Kailasam & Dr. Manas Thakur during the Fall Session 2022.
fork-join hadoop java mapreduce scheme-programming-language thread-pool threads
Last synced: 07 Dec 2024
https://github.com/zncdatadev/hdfs-operator
Apache Hadoop HDFS operator for the Kubernetes Data Stack
Last synced: 09 Oct 2024
https://github.com/krishnadey30/newsheadlines
This repository contains code that extracts meaningful information from a news headline dataset.
hadoop hadoop-mapreduce mapreduce-python news-dataset python
Last synced: 24 Jan 2025
https://github.com/hamzahamidi/map-reduce-sample
MapReduce exercises sample
Last synced: 06 Jan 2025
https://github.com/open-datastudio/hive-metastore
Hive metastore on Staroid
hadoop hive hive-metastore kubernetes spark staroid
Last synced: 18 Nov 2024
https://github.com/rui-exe/feup-oakmont
Building a stock broker web application using Apache HBase, FastAPI and React.js
fastapi finance hadoop happybase hbase java non-relational-database python python3 react reactjs stock-broker stock-market wide-column-database zookeeper
Last synced: 08 Nov 2024
https://github.com/oracle-quickstart/oci-hadoop
Terraform module to deploy Hadoop on Oracle Cloud Infrastructure (OCI)
cloud hadoop oci oracle oracle-led terraform
Last synced: 07 Nov 2024
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/extwiii/bigdata-uc.san.diego
Unlock Value in Massive Datasets - UC San Diego
big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark
Last synced: 28 Jan 2025
https://github.com/dhchenx/catla
The Catla Project
big-data catla derivative-free-optimization hadoop java mapreduce parameter self-tuning-system
Last synced: 29 Jan 2025
https://github.com/vicentebolea/hadoop-apriori
Apriori algorithm implemented in Hadoop
Last synced: 15 Jan 2025
https://github.com/nathanhowell/tfrecords-hadoop
A Hadoop OutputFormat for writing compressed TFRecords
hadoop java scala tensorflow tfrecords
Last synced: 27 Dec 2024
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/divinenaman/dbscan-mapreduce
DBSCAN implementation on mapreduce
dbscan-clustering hadoop java mapreduce-java
Last synced: 17 Dec 2024
https://github.com/yashindane/web-menu
:globe_with_meridians: Automate Docker, Kubernetes, Hadoop and AWS using voice commands!
ansible automation aws docker hadoop kubernetes
Last synced: 13 Jan 2025
https://github.com/rootsongjc/hadoop-cluster-monitor
Hadoop cluster monitor and alert
Last synced: 20 Dec 2024
https://github.com/leovct/hidoop
:elephant: Simple Big Data platform running MapReduce applications, inspired by Hadoop
big-data cluster hadoop hdfs mapreduce-applications
Last synced: 08 Nov 2024
https://github.com/gaelfoppolo/self-service-data-analytics
Data analysis made for business users
aws big-data data-analytics hadoop spark
Last synced: 08 Dec 2024
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/tck1/hadoop-mapreduce-example
Application implementing MapReduce techniques using Hadoop
Last synced: 27 Jan 2025
https://github.com/yinfuyuan/docker-bigdata
This is a project created to build a big data cluster.
apache docker docker-compose hadoop hbase kafak zookeeper
Last synced: 23 Dec 2024
https://github.com/shathor/gaia-cluster
Provides a scaffold to easily build a cluster to query the data from ESA's Gaia satellite. Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population.
apache-cassandra apache-spark astronomy big-data bigdata cassandra cluster distributed-computing esa hadoop java java-8 machine-learning map-reduce
Last synced: 21 Jan 2025
https://github.com/badoo/hadoop-xargs
Utility to run heterogeneous applications on Hadoop synchronously
Last synced: 12 Nov 2024
https://github.com/nbfujx/hadoop-learn-demo
hadoop hadoop-hdfs hadoop-mapreduce
Last synced: 08 Jan 2025
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 02 Nov 2024
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, Tensorflow, Scikit, Keras. Includes Classification, Regression, Time series, NLP, Deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/mikeroyal/apache-storm-guide
Apache Storm Guide
batch-processing data-science dataprocessing hadoop real-time storm storm-topology
Last synced: 12 Dec 2024
https://github.com/risdenk/s3a-localstack
Testing of Apache Hadoop S3A with Localstack
Last synced: 06 Dec 2024
https://github.com/mukjepscarlet/bilibili-predict-recommend
[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction
bilibili flask hadoop html javascript prediction pyspark python recommendation spark
Last synced: 18 Jan 2025
https://github.com/starhe/balm
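A big data PaaS component adapter built on the Spring Boot family of frameworks. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j with one click and operates through standard REST interfaces. Simple and easy to use, and convenient for secondary development and integration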
clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks
Last synced: 21 Dec 2024
https://github.com/hexnn/balm
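A big data PaaS component adapter built on the Spring Boot family of frameworks. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch with one click and operates through standard REST interfaces and SQL statements. Simple and easy to use, and convenient for secondary development and rapid integration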
clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks
Last synced: 21 Dec 2024
https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024
An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024
dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark
Last synced: 21 Jan 2025
https://github.com/machinecyc/environmentsetting
Common Tools Installation Files in Data Analysis, Machine Learning, and Deep Learning
airflow docker docker-compose docker-image dockerhub git hadoop issues mysql python3 rabbitmq splunk tensorflow-gpu ubuntu virtualbox vscode
Last synced: 05 Dec 2024
https://github.com/salma-mamdoh/data-ingestion-pipelines
My tasks from the Big Data training at Samsung Innovation Campus
apache-flume apache-kafka apache-sqoop big-data hadoop lunix mariadb-database pipeline python
Last synced: 30 Dec 2024
https://github.com/sameetasadullah/find-max-temperature-using-mapreduce-hadoop
Program written in Java to find the maximum temperature in a large file using Hadoop MapReduce
hadoop hadoop-mapreduce java linux max-temperature ubuntu
Last synced: 21 Jan 2025
https://github.com/sameetasadullah/count-words-using-mapreduce-hadoop
Program written in Java to count words in a large file using Hadoop MapReduce
count-words hadoop hadoop-mapreduce java linux ubuntu
Last synced: 21 Jan 2025
https://github.com/sameetasadullah/check-keywords-using-mapreduce-hadoop
Program written in Java to find different types of keywords in a large file using Hadoop MapReduce
hadoop hadoop-mapreduce java linux ubuntu
Last synced: 21 Jan 2025
https://github.com/oracle-quickstart/oci-hortonworks
Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)
cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform
Last synced: 07 Nov 2024