An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with hdfs

A curated list of projects in awesome lists tagged with hdfs.

https://github.com/seaweedfs/seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

blob-storage cloud-drive distributed-file-system distributed-storage distributed-systems erasure-coding fuse hadoop-hdfs hdfs kubernetes object-storage posix replication s3 s3-storage seaweedfs tiered-file-system

Last synced: 16 Dec 2025

https://github.com/juicedata/juicefs

JuiceFS is a distributed POSIX file system built on top of Redis and S3.

bigdata cloud-native distributed-systems filesystem go golang hdfs object-storage posix redis s3 storage

Last synced: 12 May 2025

https://github.com/wangzhiwubigdata/god-of-bigdata

Focused on big data learning and interview preparation; the road to becoming a big data master starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 13 May 2025

https://github.com/wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interview preparation; the road to becoming a big data master starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 27 Mar 2025

https://github.com/piskvorky/smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

boto bz2 file gzip-stream hacktoberfest hdfs python s3 streaming streaming-data webhdfs

Last synced: 11 Dec 2025

https://github.com/RaRe-Technologies/smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

boto bz2 file gzip-stream hacktoberfest hdfs python s3 streaming streaming-data webhdfs

Last synced: 31 Mar 2025

https://github.com/water8394/bigdata-interview

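:dart: :star2: [Big data interview questions] A collection of big data interview questions gathered from the web, together with the author's own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.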

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 15 May 2025

https://github.com/water8394/BigData-Interview

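:dart: :star2: [Big data interview questions] A collection of big data interview questions gathered from the web, together with the author's own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.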

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 27 Mar 2025

https://github.com/collabh/bigdata-growth

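A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.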

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 14 May 2025

https://github.com/collabH/bigdata-growth

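A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.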

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 28 Mar 2025

https://github.com/colinmarc/hdfs

A native go client for HDFS

commandline go hdfs

Last synced: 12 May 2025

https://github.com/wgzhao/addax

A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly

clickhouse database etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 06 Oct 2025

https://github.com/wgzhao/Addax

A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly

clickhouse database etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 14 Apr 2025

https://github.com/spotify/snakebite

A pure python HDFS client

hdfs python python-hdfs-client

Last synced: 20 Oct 2025

https://github.com/harisekhon/devops-python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 13 Jun 2025

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 11 Apr 2025

https://github.com/dromara/cloudeon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 15 May 2025

https://github.com/dromara/CloudEon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 04 Apr 2025

https://github.com/uber/storagetapper

StorageTapper is a scalable realtime MySQL change data streaming, logical backup and logical replication service

avro cdc clickhouse etl hdfs json kafka msgpack mysql postgresql s3

Last synced: 11 Jun 2025

https://github.com/datawhalechina/juicy-bigdata

🎉🎉🐳 Datawhale Introduction to Big Data Processing tutorial | An opening course for the big data technology track 🎉🎉

bigdata hadoop hbase hdfs hive mapreduce spark

Last synced: 09 Apr 2025

https://github.com/Eugene-Mark/bigdata-file-viewer

A cross-platform (Windows, Mac, Linux) desktop application for viewing common big data binary formats such as Parquet, ORC, and AVRO. Supports the local file system, HDFS, AWS S3, Azure Blob Storage, etc.

avro bigdata hdfs orc parquet

Last synced: 20 Nov 2025

https://github.com/mtth/hdfs

API and command line interface for HDFS

cli hdfs python

Last synced: 15 May 2025

https://github.com/rumbledb/rumble

⛈️ RumbleDB 1.23.0 "Mountain Ash" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Last synced: 03 Aug 2025

https://github.com/RumbleDB/rumble

Quick start: pip install jsoniq ⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy datasets (JSON, text, CSV, Parquet, Delta...) | Data Lakehouse with Updates, Scripting, Declarative Machine Learning and more

azure csv data-science dataframes delta-lake hdfs json jsoniq lakehouse machine-learning nested parquet query query-engine s3 scale schemaless spark svm text

Last synced: 20 Nov 2025

https://github.com/breuner/elbencho

A distributed storage benchmark for file systems, object stores & block devices with support for GPUs

benchmark block-storage deep-learning distributed file-systems fio gpu hdfs ior linux live-stats mdtest nvme parallel s3 storage windows

Last synced: 28 Dec 2025

https://github.com/tiledb-inc/tiledb-py

Python interface to the TileDB storage engine

array hdfs numpy python s3 storage-manager tiledb

Last synced: 15 May 2025

https://github.com/paddlepaddle/elasticctr

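ElasticCTR, the PaddlePaddle elastic computing recommender system, is an open-source, enterprise-grade recommender system solution based on Kubernetes. It combines high-accuracy CTR models refined over time in Baidu's business scenarios, the large-scale distributed training capability of the open-source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters. It helps users deploy a recommender system in a Kubernetes environment with one click, offers high performance, industrial-grade deployment, and an end-to-end experience, and, as an open-source suite, supports in-depth secondary development.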

ctr hdfs k8s personalization ranking recommender-system

Last synced: 21 Aug 2025

https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter

Exports Hadoop HDFS content statistics to Prometheus

hadoop hadoop-fsimage hdfs hdfs-metrics monitoring prometheus-exporter

Last synced: 15 Sep 2025

https://github.com/d2iq-archive/dcos-commons

DC/OS SDK is a collection of tools, libraries, and documentation for easy integration of technologies such as Kafka, Cassandra, HDFS, Spark, and TensorFlow with DC/OS.

cassandra dcos dcos-data-services-guild declarative elasticsearch hdfs kafka kubernetes mesos stateful-containers tensorflow

Last synced: 26 Mar 2025

https://github.com/avast/hdfs-shell

HDFS Shell is an HDFS manipulation tool for working with the functions integrated in Hadoop DFS

big-data cli cli-application hadoop hdfs hdfs-manipulation linux shell

Last synced: 26 Oct 2025

https://github.com/jcrist/skein

A tool and library for easily deploying applications on Apache YARN

apache-yarn cluster deployment hadoop hdfs python

Last synced: 05 Apr 2025

https://github.com/paypal/nnanalytics

NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.

fsimage hadoop hdfs metadata namespace scanner utility

Last synced: 09 May 2025

https://github.com/TileDB-Inc/TileDB-R

R interface to TileDB: The Modern Database

array hdfs r s3 storage-manager tiledb

Last synced: 13 Jul 2025

https://github.com/tiledb-inc/tiledb-r

R interface to TileDB: The Modern Database

array hdfs r s3 storage-manager tiledb

Last synced: 12 Apr 2025

https://github.com/luckyzxl2016/cloud-note

A distributed cloud notes application (modeled on an existing cloud notes product), with data stored in Redis and HBase

hbase hdfs linux nginx redis ssm tomcat web

Last synced: 21 Mar 2025

https://github.com/coderayzhang/cloud-note

A distributed cloud notes application (modeled on an existing cloud notes product), with data stored in Redis and HBase

hbase hdfs linux nginx redis ssm tomcat web

Last synced: 24 Oct 2025

https://github.com/harisekhon/devops-perl-tools

25+ DevOps CLI Tools - Anonymizer, SQL ReCaser (MySQL, PostgreSQL, AWS Redshift, Snowflake, Apache Drill, Hive, Impala, Cassandra CQL, Microsoft SQL Server, Oracle, Couchbase N1QL, Dockerfiles), Hadoop HDFS & Hive tools, Solr/SolrCloud CLI, Nginx stats & HTTP(S) URL watchers for load-balanced web farms, Linux tools etc.

anonymize apache-drill cassandra couchbase docker hacktoberfest hadoop hbase hdfs hive kerberos linux mysql neo4j nginx recaser solr solrcloud sql

Last synced: 13 Jun 2025

https://github.com/starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

bigquery data-engineering data-integration data-pipeline etl hdfs redshift snowflake spark synapse

Last synced: 05 Apr 2025

https://github.com/seznam/euphoria

Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine independent programming model which can express both batch and stream transformations.

apache-flink apache-spark batch-processing big-data hadoop hdfs java-api kafka streaming-data unified-bigdata-processing

Last synced: 21 Aug 2025

https://github.com/dbiir/rainbow

A data layout optimization framework for wide tables stored on HDFS. See rainbow's webpage

column-store data-analytics data-layout hdfs sql wide-table

Last synced: 30 Jun 2025

https://github.com/longshilin/hdfs-netdisc

A Hadoop-based distributed cloud storage system :palm_tree:

bigdata filesystem hadoop hadoop-filesystem hdfs hdfs-client hdfs-netdisc netdisk

Last synced: 07 Apr 2025

https://github.com/marcelmay/hfsa

Hadoop FSImage Analyzer (HFSA)

fsimage hadoop hdfs tool

Last synced: 15 Sep 2025

https://github.com/fluent/fluent-plugin-webhdfs

Hadoop WebHDFS output plugin for Fluentd

fluentd fluentd-plugin hadoop hdfs

Last synced: 05 Jul 2025

https://github.com/ascrus/getl

A tool for developing and testing ETL and ELT processes for automating the capture, delivery and processing of information in data warehouses on the MicroFocus Vertica platform.

csv dsl elt etl excel hdfs hive impala json kafka sql unit-testing vertica xml

Last synced: 14 Jun 2025

https://github.com/damiencarol/jsr203-hadoop

A Java NIO file system provider for HDFS

hadoop hdfs java nio

Last synced: 07 Apr 2025

https://github.com/tiledb-inc/tiledb-go

Go Interface to the TileDB storage manager

array go golang golang-library hdfs s3 storage-manager tiledb

Last synced: 20 Aug 2025

https://github.com/terascope/teraslice

Scalable data processing pipelines in JavaScript

elasticsearch hadoop hdfs json kafka

Last synced: 04 Apr 2025

https://github.com/criteo/cluster-pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster

conda-pack hdfs pex pyspark s3 skein

Last synced: 05 Apr 2025

https://github.com/ibmstreams/samples

This repository contains open-source sample applications for IBM Streams.

database geofence geofencing hdfs healthcare ibm-streams samples stream-processing text-analytics timeseries

Last synced: 15 Jul 2025

https://github.com/aikuyun/bigdata-doc

Big data study notes, learning roadmap, and a collection of technical case studies.

bigdata flink hadoop hdfs hive kafka mapreduce

Last synced: 07 Nov 2025

https://github.com/jacobstanley/hadoop-tools

Tools for working with Hadoop, written with performance in mind.

hadoop haskell hdfs

Last synced: 11 Dec 2025

https://github.com/zongxr/bigdata-competition

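Third-prize solution for the national big data competition and second-prize solution for the provincial competition. One-click scripts for installing a big data environment that automatically deploy a cluster, including ZooKeeper, Hadoop, MySQL, Hive, Spark, and some base environments. Tested on real servers with excellent results; only minimal manual input, such as entering passwords, is required, which frees up the labor otherwise needed for installation, deployment, and configuration. Also includes several Scala examples that use Spark for data preparation.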

bigdata hadoop hdfs hive mysql scala shell spark wordcount zookeeper

Last synced: 14 Apr 2025

https://github.com/gchq/gaffer-docker

Gaffer Docker images and associated Helm charts for deploying on Kubernetes

accumulo docker gaffer hdfs helm

Last synced: 09 Jul 2025

https://github.com/oracle/oci-hdfs-connector

HDFS Connector for Oracle Cloud Infrastructure

cloud hdfs oracle-cloud

Last synced: 13 Apr 2025

https://github.com/agile-lab-dev/wasp

WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest a huge amount of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn

Last synced: 09 Apr 2025

https://github.com/sergio11/document_search_engine_architecture

📄🚀 Unleash a powerful Document Search Engine with Apache NiFi for lightning-fast, comprehensive text indexing and search.

consul docker elasticsearch feign-client hdfs kafka keycloak kibana logstash mongodb nifi nifi-templates rabbitmq spring-boot spring-cloud-gateway spring-cloud-stream stomp stompwebsocket tika tika-server

Last synced: 13 Aug 2025

https://github.com/orangedrk/javanotes

Java backend study notes, covering Linux, Maven, Git, internet architecture, the big data ecosystem, and more

flume git hadoop hbase hdfs hive javaee javase kafka linux mapreduce maven mybatis mycat rabbitmq redis spring spring-boot springcloud zookeeper

Last synced: 07 Oct 2025

https://github.com/sansa-stack/sansa-notebooks

Interactive Spark Notebooks for running SANSA examples.

hdfs hue machine-learning notebook owl rdf sansa semantics spark zeppelin

Last synced: 14 Apr 2025

https://github.com/astrolabsoftware/spark-fits

FITS data source for Spark SQL and DataFrames

apache-spark fits fitsio hdfs pyspark scala spark-sql

Last synced: 29 Oct 2025

https://github.com/aphp/py-hdfs-mount

Mount HDFS with fuse, works with kerberos!

fuse hadoop hdfs kerberos mount mount-hdfs

Last synced: 18 Jul 2025

https://github.com/singgel/bigdata-skilltree

[Yiche] Demos for Spark, Flink, HBase, Hive, and Flume that integrate some native Hadoop APIs (HDFS and MapReduce; only these two for now); also tests some exception-handling behavior

hadoop hbase hdfs hive kylin mapreduce scala spark

Last synced: 12 Apr 2025

https://github.com/dayyass/pydfs

Distributed File System written in Python

distributed-systems filesystem hadoop hdfs mapreduce python

Last synced: 13 Apr 2025

https://github.com/dmwm/cmsspark

General purpose framework to run CMS experiment workflows on HDFS/Spark platform

analytics bigdata cms-framework hdfs spark

Last synced: 14 Apr 2025

https://github.com/manuparra/masterdatcom_bdcc_practice

Practice and Workshop on BigData and Cloud Computing using Docker Containers and OpenNebula. HDFS, hadoop and spark+R

bigdata cloudcomputing containers docker hadoop hdfs linux opennebula practices spark sparkr

Last synced: 12 Apr 2025

https://github.com/stefen-taime/etl-data-pipeline-rdbms-to-hdfs-using-airflow-apache-sqoop-spark-postgres-and-hive

This project aims to move the data from a Relational database system (RDBMS) to a Hadoop file system (HDFS)

airflow big-data data docker-compose etl-pipeline hdfs hive infrastructure-as-code rdbms spark sql sqoop

Last synced: 03 Jul 2025

https://github.com/aymane-maghouti/big-data-project

This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.

apache-airflow apache-kafka apache-spark batch-processing big-data-projects hbase hdfs ingestion java lambda-architecture machine-learning postgresql-database powerbi pyspark python spring-boot streaming

Last synced: 29 Oct 2025

https://github.com/manuparra/masterdegreecc_practice

Workshop for the UGR Professional Master's in Computer Science. Cloud Computing course.

cloudcomputing cluster docker docker-cluster docker-container hadoop hadoop-cluster hdfs opennebula practice virtual-machine

Last synced: 12 Apr 2025

https://github.com/wuzhouhui/ngx-hdfs

Nginx on HDFS Module

hdfs nginx

Last synced: 26 Apr 2025

https://github.com/ibmstreams/streamsx.hdfs

This toolkit provides operators and functions for interacting with Hadoop File System.

hadoop hdfs ibm-streams java stream-processing toolkit

Last synced: 09 Sep 2025

https://github.com/tencentyun/hdfs_to_cos_tools

A tool for copying data from HDFS to COS

cos hdfs

Last synced: 27 Apr 2025

https://github.com/fasouto/webhdfspy

Python wrapper to access Hadoop HDFS REST API

hadoop-filesystem hdfs python wrapper

Last synced: 19 Apr 2025

https://github.com/risdenk/webhdfs-dotnet

WebHDFS API for .Net

csharp dotnet hadoop hdfs knox webhdfs

Last synced: 15 Apr 2025

https://github.com/nikoshet/monitoring-spark-on-docker

Spark Monitoring With Prometheus And Grafana Using Docker

docker docker-compose grafana hadoop hdfs monitoring node-exporter prometheus spark

Last synced: 24 Jul 2025