Projects in Awesome Lists tagged with hadoop-hdfs

https://github.com/seaweedfs/seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lakes, built for billions of files. The blob store has O(1) disk seek and cloud tiering. The filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, and Erasure Coding.

blob-storage cloud-drive distributed-file-system distributed-storage distributed-systems erasure-coding fuse hadoop-hdfs hdfs kubernetes object-storage posix replication s3 s3-storage seaweedfs tiered-file-system

Last synced: 16 Dec 2024

https://github.com/obenner/data-engineering-interview-questions

More than 2,000 data engineering interview questions.

airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql

Last synced: 19 Dec 2024

https://github.com/morphl-ai/morphl-community-edition

MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization

artificial-intelligence cassandra conversion-rate-optimization data-driven-design front-end-development hadoop-hdfs kubernetes machine-learning morphl-platform pipeline product-development pyspark user-experience

Last synced: 18 Dec 2024

https://github.com/ibm/sparksql-for-hbase

Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers.

apache-spark hadoop-hdfs hbase ibmcode nosql spark sql

Last synced: 12 Oct 2024

https://github.com/groda/big_data

Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.

apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio

Last synced: 17 Dec 2024

https://github.com/ahmetfurkandemir/data-engineering-project-with-hdfs-and-kafka

Data Engineering Project with Hadoop HDFS and Kafka

data data-engineer data-engineering data-engineering-pipeline docker docker-compose hadoop hadoop-filesystem hadoop-hdfs hdfs hdfs-client hdfs-dfs kafka kafka-consumer kafka-producer kafka-ui kafkaui pipline python python-hdfs-client

Last synced: 16 Nov 2024

https://github.com/ren294/covid-data-process

This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.

airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql

Last synced: 11 Oct 2024

https://github.com/ren294/log-analysis-project

This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.

apache-kafka apache-nifi apache-spark big-data big-data-analytics cassandra cassandra-driver data-engineering data-science grafana hadoop hadoop-hdfs hive powerbi spark-rdd spark-sql spark-streaming

Last synced: 11 Oct 2024

https://github.com/mahmoud-nfz/football-big-data

This is a comprehensive solution for real-time football analytics, using Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, a custom-built search engine, and Next.js for data visualization.

hadoop hadoop-hdfs kafka nextjs rethinkdb search-engine spark spark-streaming t3-stack

Last synced: 10 Oct 2024

https://github.com/mgarralda/hadoop-spark-cluster

Repository containing Docker images for creating a Spark cluster on Hadoop YARN.

hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker

Last synced: 11 Nov 2024

https://github.com/benjdiasaad/mapreduce_k-means

Implementation of the k-means clustering algorithm using the Hadoop framework, version 3.1.3 (MapReduce).

big-data hadoop-hdfs hadoop-mapreduce kmeans-clustering mapreduce-java unsupervised-clustering

Last synced: 19 Nov 2024

https://github.com/nbfujx/hadoop-learn-demo

hadoop hadoop-hdfs hadoop-mapreduce

Last synced: 11 Nov 2024

https://github.com/benjdiasaad/mapreduce_wordcount

Creating a Hadoop Java program: a word occurrence counter. If you want to compile the code manually on the Hadoop virtual machine, you will need to copy this code into the VM.

eclipse-ide hadoop-hdfs hadoop-mapreduce java-8

Last synced: 19 Nov 2024

https://github.com/mikeroyal/apache-hadoop-guide

Apache Hadoop Guide

hadoop hadoop-cluster hadoop-filesystem hadoop-hdfs hadoop-mapreduce

Last synced: 12 Dec 2024

https://github.com/benjdiasaad/mapredcuce_analyse_vente

Creating a Hadoop Java program: sales analysis.

eclipse-ide hadoop-hdfs hadoop-mapreduce java jdk-8

Last synced: 19 Nov 2024

https://github.com/vibhuti03/hadoop-administration-analysis

Setting up a cluster and analyzing the Aadhar dataset using Apache Hive.

aadhar-dataset cluster hadoop hadoop-administration-analysis hadoop-hdfs hive nonhacluster performing-analysis

Last synced: 13 Nov 2024

https://github.com/evegen55/mastering-spark

Mastering Spark

apache-spark hadoop-filesystem hadoop-hdfs multilayer-perceptron-network production-ready spark-ml

Last synced: 21 Nov 2024

https://github.com/29dch/hadoop-hdfs-mapreduce-examples

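Operating on HDFS files with the Java API; a MapReduce-based word frequency counting program and its refactoring; MapReduce programming with the Combiner and Partitioner components.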

hadoop-hdfs hadoop-mapreduce

Last synced: 11 Nov 2024

https://github.com/abroniewski/idlecompute-data-management-architecture

Implementation of a big data management and analysis backbone architecture using PySpark for distributed and scalable data ingestion and MLlib for machine learning analysis. Part of the Big Data Management and Analytics (BDMA) program.

bdma big-data big-data-analytics bigdata dataops hadoop-hdfs machine-learning parquet pipeline pyspark-mllib

Last synced: 12 Nov 2024

https://github.com/mikeroyal/apache-pig-guide

Apache Pig Guide

hadoop-hdfs hadoop-mapreduce hdfs mapreduce pig yarn

Last synced: 12 Dec 2024

https://github.com/fbraza/scala-dfs-lib

DFS-Lib is a Scala-flavoured API for the Hadoop Java FileSystem API.

hadoop-filesystem hadoop-hdfs hdfs scala

Last synced: 27 Nov 2024

https://github.com/vigneshss-07/bigdata_technologies

This repo contains all technical knowledge and implementation of big data technologies.

big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql

Last synced: 15 Nov 2024

https://github.com/vigneshss-07/data-engineering

This repo contains details related to data engineering tech stacks.

gcp hadoop-hdfs hive pyspark scala spark sql

Last synced: 15 Nov 2024

https://github.com/stefanofioravanzo/evolving-wikipedia-graph

Distributed processing of Wikipedia history files using Hadoop and Spark

distributed-processing hadoop-hdfs spark wikipedia

Last synced: 18 Nov 2024

https://github.com/ankit21111/sparnordetl

ETL pipeline for Spar Nord Bank for analyzing the refilling frequency of ATMs all over Europe.

amazon-redshift hadoop-hdfs python sql sqoop-import

Last synced: 18 Nov 2024

https://github.com/murshidazher/terraform-hdp

👷 An HDP Terraform setup for big data analytics

aws bigdata ec2 hadoop-hdfs hbase hdp hive hortonworks-sandbox sandbox terraform

Last synced: 08 Nov 2024

https://github.com/kriss024/hadoop

Hadoop and Hive fundamental commands

hadoop hadoop-filesystem hadoop-hdfs hive

Last synced: 25 Nov 2024

https://github.com/ineerav/sparkini

Base Docker Compose setup for a local data engineering environment.

docker hadoop-hdfs hue spark

Last synced: 11 Oct 2024

https://github.com/spineo/hadoop-app

ansible ansible-inventory ansible-playbook hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce hdfs yarn

Last synced: 23 Nov 2024

https://github.com/spineo/accumulo-hdfs-zookeeper

Create a storage cluster running Accumulo on HDFS and Zookeeper for node management.

accumulo accumulo-hdfs-zookeeper ansible ansible-inventory ansible-playbooks cluster hadoop hadoop-hdfs hdfs zookeeper

Last synced: 23 Nov 2024

https://github.com/devlucho/spark-procesamiento-en-batch

This project uses PySpark to analyze student data from a CSV file stored in HDFS.

apache-spark hadoop-hdfs pyspark python3

Last synced: 19 Dec 2024

https://github.com/divinenaman/mapreduce-matrix-multipy

A Python implementation of matrix multiplication using the Hadoop Streaming API

hadoop hadoop-hdfs hadoop-mapreduce python

Last synced: 17 Dec 2024

https://github.com/shortthirdman/apache-hadoop-nativelib

Apache Hadoop NativeLib Build for 64-bit (x86_64)

apache-hadoop hadoop hadoop-hdfs hadoop-mapreduce hadoop-nativelib

Last synced: 19 Nov 2024

https://github.com/vinceecws/project-1

A project that involves manipulating unstructured CSV data with Hadoop's HDFS & Hive, additionally performing queries using SparkSQL

apache-spark hadoop-hdfs hadoop-hive sbt scala

Last synced: 20 Nov 2024

https://github.com/vladd12/big-data-practice

Introduction to Big Data with Hadoop

hadoop hadoop-hdfs hadoop-mapreduce pig-latin

Last synced: 29 Nov 2024

https://github.com/ankit21111/patient-alert-etl

The Patient Alert ETL 🚑 project creates a real-time data pipeline to monitor vital health parameters from IoT devices in hospitals. Using Apache Kafka, Spark, and HBase, it processes streaming data and sends immediate alerts via Amazon SNS when vitals exceed normal thresholds, enhancing patient care through timely interventions.

apache-kafka apache-spark awssns hadoop-hdfs hbase hive java-8 mysql python3 rdbms sqoop

Last synced: 14 Dec 2024

https://github.com/vaxdata22/nosql-and-big-data-demonstration

This is a fun assignment task I undertook to explore the world of NoSQL and Big Data technologies.

apache-hive cassandra-cql cypher-query-language data-warehouse hadoop-hdfs json mongodb neo4j nosql-databases redis

Last synced: 21 Dec 2024

https://github.com/amirhnajafiz-university/s7cc03

Third project of Cloud Computing course.

big-data hadoop hadoop-hdfs mapreduce python python3 spark

Last synced: 06 Nov 2024

https://github.com/pawsanie/pyspark_universal_dq_report

The script reads the dataset at the given path and selects the columns passed as an argument for the specified dates. It then saves the report to the specified HDFS path.

data-quality data-quality-checks data-quality-monitoring dq hadoop hadoop-hdfs hdfs pyspark python python-3 python-script python3

Last synced: 09 Nov 2024

https://github.com/kumarvna/terraform-azurerm-hdinsight

Terraform module to create Azure HDInsight, a managed, full-spectrum, open-source analytics service. This module creates Apache Hadoop, Apache Spark, Apache HBase, Interactive Query (Apache Hive LLAP), and Apache Kafka clusters.

apache-hive-cluster azure azure-hdinsight hadoop-cluster hadoop-filesystem hadoop-hdfs hbase-cluster hdinsight-cluster hdinsight-hadoop-cluster hdinsight-hbase-cluster hdinsight-interactive-query-cluster hdinsight-kafka-cluster hdinsight-spark-cluster kafka-cluster spark-cluster spark-clusters terraform terraform-module

Last synced: 08 Nov 2024

https://github.com/cevheri/hadoop.3-config

My Apache Hadoop 3 config files.

hadoop hadoop-conf hadoop-core hadoop-filesystem hadoop-hdfs hadoop-mapreduce linux-bash pom-xml

Last synced: 09 Nov 2024

https://github.com/cevheri/hadoop-mr-example-currency

Hadoop MapReduce example that reads currency.txt, with a driver, mapper, and reducer.

hadoop hadoop-filesystem hadoop-hdfs hadoop-mapreduce java maven

Last synced: 09 Nov 2024

https://github.com/venkat-a/exploratory-data-analysis-eda-using-pyspark

Leverage the power of Apache Spark for large-scale data processing and analysis

dataframes descriptive-statistics hadoop-hdfs matplotlib plotly-express pyspark-python seaborn sql statistical-analysis visualization

Last synced: 10 Nov 2024

https://github.com/aymane-maghouti/mobile-data-hive-insights

This project demonstrates the process of extracting data from a MySQL database, transferring it using Apache Sqoop, storing it in a Hive data warehouse (the data is actually stored in the Hadoop Distributed File System (HDFS)), and performing analysis using Hive Query Language (HiveQL), a language close to SQL. The data is then visualized in Power BI.

apache-sqoop data data-integration data-visualization hadoop-hdfs hivedb hiveql powerbi

Last synced: 16 Nov 2024

https://github.com/prakhar-ff13/hadoop

This repository contains Hadoop ecosystem files (code, data, README, etc.).

flume-ng hadoop hadoop-filesystem hadoop-hdfs hadoop-mapreduce hive java mapreduce-java oozie-mapreduce pig yarn yarn-hadoop-cluster

Last synced: 30 Nov 2024

https://github.com/xpcosmos/data-lake-prime

This project aims to simulate and configure a Distributed File System using Hadoop HDFS. For this project, 3 machines were created: 1 Master Node and 2 Worker Nodes.

hadoop hadoop-cluster hadoop-hdfs hdfs network

Last synced: 14 Nov 2024

https://github.com/madhurimarawat/big-data-analytics

This repository demonstrates big data processing, visualization, and machine learning using tools such as Hadoop, Spark, Kafka, and Python.

big-data big-data-analytics big-data-analytics-techniques hadoop-hdfs hadoop-installation hadoop-mapreduce python

Last synced: 14 Nov 2024

https://github.com/ilieschibane/projet-iot-cloud-bigdata

Implementation of a pipeline for predicting Parkinson's disease using IoT, cloud, and big data tools.

big-data cassandra cloud flask hadoop-hdfs iot kafka machine-learning mongodb mqtt python rest-api sickit-learn spark

Last synced: 19 Nov 2024