Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with hadoop-hdfs
A curated list of projects in awesome lists tagged with hadoop-hdfs.
https://github.com/seaweedfs/seaweedfs
SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
blob-storage cloud-drive distributed-file-system distributed-storage distributed-systems erasure-coding fuse hadoop-hdfs hdfs kubernetes object-storage posix replication s3 s3-storage seaweedfs tiered-file-system
Last synced: 16 Dec 2024
https://github.com/obenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 19 Dec 2024
https://github.com/OBenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 07 Nov 2024
https://github.com/morphl-ai/morphl-community-edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
artificial-intelligence cassandra conversion-rate-optimization data-driven-design front-end-development hadoop-hdfs kubernetes machine-learning morphl-platform pipeline product-development pyspark user-experience
Last synced: 18 Dec 2024
https://github.com/ibm/sparksql-for-hbase
Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers.
apache-spark hadoop-hdfs hbase ibmcode nosql spark sql
Last synced: 12 Oct 2024
https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio
Last synced: 17 Dec 2024
https://github.com/ahmetfurkandemir/data-engineering-project-with-hdfs-and-kafka
Data Engineering Project with Hadoop HDFS and Kafka
data data-engineer data-engineering data-engineering-pipeline docker docker-compose hadoop hadoop-filesystem hadoop-hdfs hdfs hdfs-client hdfs-dfs kafka kafka-consumer kafka-producer kafka-ui kafkaui pipline python python-hdfs-client
Last synced: 16 Nov 2024
https://github.com/ren294/covid-data-process
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql
Last synced: 11 Oct 2024
https://github.com/ren294/log-analysis-project
This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.
apache-kafka apache-nifi apache-spark big-data big-data-analytics cassandra cassandra-driver data-engineering data-science grafana hadoop hadoop-hdfs hive powerbi spark-rdd spark-sql spark-streaming
Last synced: 11 Oct 2024
https://github.com/mahmoud-nfz/football-big-data
This is a comprehensive solution for real-time football analytics, leveraging Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, a custom-built search engine, and Next.js for data visualization.
hadoop hadoop-hdfs kafka nextjs rethinkdb search-engine spark spark-streaming t3-stack
Last synced: 10 Oct 2024
https://github.com/mgarralda/hadoop-spark-cluster
Repository containing Docker images for creating a Spark cluster on Hadoop YARN.
hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker
Last synced: 11 Nov 2024
https://github.com/benjdiasaad/mapreduce_k-means
Implementation of the k-means clustering algorithm using the Hadoop framework, version 3.1.3 (MapReduce).
big-data hadoop-hdfs hadoop-mapreduce kmeans-clustering mapreduce-java unsupervised-clustering
Last synced: 19 Nov 2024
https://github.com/nbfujx/hadoop-learn-demo
hadoop hadoop-hdfs hadoop-mapreduce
Last synced: 11 Nov 2024
https://github.com/benjdiasaad/mapreduce_wordcount
Creating a Hadoop Java program: a word occurrence counter. If you want to compile the code manually on the Hadoop virtual machine, you will need to copy this code into the VM.
eclipse-ide hadoop-hdfs hadoop-mapreduce java-8
Last synced: 19 Nov 2024
https://github.com/mikeroyal/apache-hadoop-guide
Apache Hadoop Guide
hadoop hadoop-cluster hadoop-filesystem hadoop-hdfs hadoop-mapreduce
Last synced: 12 Dec 2024
https://github.com/benjdiasaad/mapredcuce_analyse_vente
Creating a Hadoop Java program: sales analysis.
eclipse-ide hadoop-hdfs hadoop-mapreduce java jdk-8
Last synced: 19 Nov 2024
https://github.com/vibhuti03/hadoop-administration-analysis
Setting up a cluster and analyzing the Aadhar dataset using Apache Hive.
aadhar-dataset cluster hadoop hadoop-administration-analysis hadoop-hdfs hive nonhacluster performing-analysis
Last synced: 13 Nov 2024
https://github.com/evegen55/mastering-spark
Mastering Spark
apache-spark hadoop-filesystem hadoop-hdfs multilayer-perceptron-network production-ready spark-ml
Last synced: 21 Nov 2024
https://github.com/29dch/hadoop-hdfs-mapreduce-examples
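Operating on HDFS files with the Java API, a MapReduce-based word frequency counting program and its refactoring, and applying the Combiner and Partitioner components in MapReduce programming.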
Last synced: 11 Nov 2024
https://github.com/abroniewski/idlecompute-data-management-architecture
Implementation of a big data management and analysis backbone architecture using PySpark for distributed and scalable data ingestion and MLlib for machine learning analysis. Part of Big Data Management and Analytics (BDMA) program.
bdma big-data big-data-analytics bigdata dataops hadoop-hdfs machine-learning parquet pipeline pyspark-mllib
Last synced: 12 Nov 2024
https://github.com/mikeroyal/apache-pig-guide
Apache Pig Guide
hadoop-hdfs hadoop-mapreduce hdfs mapreduce pig yarn
Last synced: 12 Dec 2024
https://github.com/fbraza/scala-dfs-lib
DFS-Lib is a Scala-flavoured API to the Hadoop Java filesystem API.
hadoop-filesystem hadoop-hdfs hdfs scala
Last synced: 27 Nov 2024
https://github.com/vigneshss-07/bigdata_technologies
This repo contains all technical knowledge and implementation of big data technologies.
big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql
Last synced: 15 Nov 2024
https://github.com/vigneshss-07/data-engineering
This repo contains details related to data engineering tech stacks.
gcp hadoop-hdfs hive pyspark scala spark sql
Last synced: 15 Nov 2024
https://github.com/stefanofioravanzo/evolving-wikipedia-graph
Distributed processing of Wikipedia history files using Hadoop and Spark
distributed-processing hadoop-hdfs spark wikipedia
Last synced: 18 Nov 2024
https://github.com/ankit21111/sparnordetl
ETL pipeline for Spar Nord Bank for analyzing the refill frequency of ATMs all over Europe.
amazon-redshift hadoop-hdfs python sql sqoop-import
Last synced: 18 Nov 2024
https://github.com/murshidazher/terraform-hdp
👷 An hdp-terraform setup for big data analytics
aws bigdata ec2 hadoop-hdfs hbase hdp hive hortonworks-sandbox sandbox terraform
Last synced: 08 Nov 2024
https://github.com/kriss024/hadoop
Hadoop and Hive fundamental commands
hadoop hadoop-filesystem hadoop-hdfs hive
Last synced: 25 Nov 2024
https://github.com/ineerav/sparkini
Base Docker Compose setup for a local data engineering environment.
Last synced: 11 Oct 2024
https://github.com/spineo/hadoop-app
ansible ansible-inventory ansible-playbook hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce hdfs yarn
Last synced: 23 Nov 2024
https://github.com/spineo/accumulo-hdfs-zookeeper
Create a storage cluster running Accumulo on HDFS and Zookeeper for node management.
accumulo accumulo-hdfs-zookeeper ansible ansible-inventory ansible-playbooks cluster hadoop hadoop-hdfs hdfs zookeeper
Last synced: 23 Nov 2024
https://github.com/devlucho/spark-procesamiento-en-batch
This project uses PySpark to analyze student data from a CSV file stored in HDFS.
apache-spark hadoop-hdfs pyspark python3
Last synced: 19 Dec 2024
https://github.com/divinenaman/mapreduce-matrix-multipy
A Python implementation of matrix multiplication using the Hadoop Streaming API
hadoop hadoop-hdfs hadoop-mapreduce python
Last synced: 17 Dec 2024
https://github.com/shortthirdman/apache-hadoop-nativelib
Apache Hadoop NativeLib Build for 64-bit (x86_64)
apache-hadoop hadoop hadoop-hdfs hadoop-mapreduce hadoop-nativelib
Last synced: 19 Nov 2024
https://github.com/vinceecws/project-1
A project that involves manipulating unstructured CSV data with Hadoop's HDFS & Hive, additionally performing queries using SparkSQL
apache-spark hadoop-hdfs hadoop-hive sbt scala
Last synced: 20 Nov 2024
https://github.com/vladd12/big-data-practice
Introduction to Big Data with Hadoop
hadoop hadoop-hdfs hadoop-mapreduce pig-latin
Last synced: 29 Nov 2024
https://github.com/ankit21111/patient-alert-etl
The Patient Alert ETL 🚑 project creates a real-time data pipeline to monitor vital health parameters from IoT devices in hospitals. Using Apache Kafka, Spark, and HBase, it processes streaming data and sends immediate alerts via Amazon SNS when vitals exceed normal thresholds, enhancing patient care through timely interventions.
apache-kafka apache-spark awssns hadoop-hdfs hbase hive java-8 mysql python3 rdbms sqoop
Last synced: 14 Dec 2024
https://github.com/vaxdata22/nosql-and-big-data-demonstration
This is a fun assignment task I undertook to explore the world of NoSQL and Big Data technologies.
apache-hive cassandra-cql cypher-query-language data-warehouse hadoop-hdfs json mongodb neo4j nosql-databases redis
Last synced: 21 Dec 2024
https://github.com/amirhnajafiz-university/s7cc03
Third project of Cloud Computing course.
big-data hadoop hadoop-hdfs mapreduce python python3 spark
Last synced: 06 Nov 2024
https://github.com/pawsanie/pyspark_universal_dq_report
The script reads the dataset at the given path and selects the columns passed as an argument for the specified dates. It then saves the report to the specified HDFS path.
data-quality data-quality-checks data-quality-monitoring dq hadoop hadoop-hdfs hdfs pyspark python python-3 python-script python3
Last synced: 09 Nov 2024
https://github.com/kumarvna/terraform-azurerm-hdinsight
Terraform module to create Azure HDInsight, a managed, full-spectrum, open-source analytics service. This module creates Apache Hadoop, Apache Spark, Apache HBase, Interactive Query (Apache Hive LLAP), and Apache Kafka clusters.
apache-hive-cluster azure azure-hdinsight hadoop-cluster hadoop-filesystem hadoop-hdfs hbase-cluster hdinsight-cluster hdinsight-hadoop-cluster hdinsight-hbase-cluster hdinsight-interactive-query-cluster hdinsight-kafka-cluster hdinsight-spark-cluster kafka-cluster spark-cluster spark-clusters terraform terraform-module
Last synced: 08 Nov 2024
https://github.com/cevheri/hadoop.3-config
My Apache Hadoop 3 config files.
hadoop hadoop-conf hadoop-core hadoop-filesystem hadoop-hdfs hadoop-mapreduce linux-bash pom-xml
Last synced: 09 Nov 2024
https://github.com/cevheri/hadoop-mr-example-currency
Hadoop MapReduce example that reads currency.txt, with a driver, mapper, and reducer.
hadoop hadoop-filesystem hadoop-hdfs hadoop-mapreduce java maven
Last synced: 09 Nov 2024
https://github.com/venkat-a/exploratory-data-analysis-eda-using-pyspark
Leverage the power of Apache Spark for large-scale data processing and analysis
dataframes descriptive-statistics hadoop-hdfs matplotlib plotly-express pyspark-python seaborn sql statistical-analysis visualization
Last synced: 10 Nov 2024
https://github.com/aymane-maghouti/mobile-data-hive-insights
This project demonstrates the process of extracting data from a MySQL database, transferring it using Apache Sqoop, storing it in a Hive data warehouse (the data is actually stored in the Hadoop Distributed File System, HDFS), and performing analysis using Hive Query Language (HiveQL), a language close to SQL. The data is then visualized in Power BI.
apache-sqoop data data-integration data-visualization hadoop-hdfs hivedb hiveql powerbi
Last synced: 16 Nov 2024
https://github.com/prakhar-ff13/hadoop
This repository contains Hadoop ecosystem files (code, data, README, etc.).
flume-ng hadoop hadoop-filesystem hadoop-hdfs hadoop-mapreduce hive java mapreduce-java oozie-mapreduce pig yarn yarn-hadoop-cluster
Last synced: 30 Nov 2024
https://github.com/xpcosmos/data-lake-prime
This project aims to simulate and configure a Distributed File System using Hadoop HDFS. For this project, 3 machines were created: 1 Master Node and 2 Worker Nodes.
hadoop hadoop-cluster hadoop-hdfs hdfs network
Last synced: 14 Nov 2024
https://github.com/madhurimarawat/big-data-analytics
This repository demonstrates big data processing, visualization, and machine learning using tools such as Hadoop, Spark, Kafka, and Python.
big-data big-data-analytics big-data-analytics-techniques hadoop-hdfs hadoop-installation hadoop-mapreduce python
Last synced: 14 Nov 2024
https://github.com/ilieschibane/projet-iot-cloud-bigdata
Implementation of a pipeline for predicting Parkinson's disease using IoT, cloud, and big data tools.
big-data cassandra cloud flask hadoop-hdfs iot kafka machine-learning mongodb mqtt python rest-api sickit-learn spark
Last synced: 19 Nov 2024