Projects in Awesome Lists tagged with hdfs
A curated list of projects in awesome lists tagged with hdfs.
https://github.com/seaweedfs/seaweedfs
SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
blob-storage cloud-drive distributed-file-system distributed-storage distributed-systems erasure-coding fuse hadoop-hdfs hdfs kubernetes object-storage posix replication s3 s3-storage seaweedfs tiered-file-system
Last synced: 16 Dec 2025
https://github.com/ceph/ceph
Ceph is a distributed object, block, and file storage platform
block-storage cloud-storage distributed-file-system distributed-storage erasure-coding fuse hdfs high-performance highly-available iscsi kubernetes nfs nvme-over-fabrics object-store posix replication s3 smb software-defined-storage storage
Last synced: 13 May 2025
https://github.com/juicedata/juicefs
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
bigdata cloud-native distributed-systems filesystem go golang hdfs object-storage posix redis s3 storage
Last synced: 12 May 2025
https://github.com/piskvorky/smart_open
Utils for streaming large files (S3, HDFS, gzip, bz2...)
boto bz2 file gzip-stream hacktoberfest hdfs python s3 streaming streaming-data webhdfs
Last synced: 11 Dec 2025
https://github.com/RaRe-Technologies/smart_open
Utils for streaming large files (S3, HDFS, gzip, bz2...)
boto bz2 file gzip-stream hacktoberfest hdfs python s3 streaming streaming-data webhdfs
Last synced: 31 Mar 2025
https://github.com/tiledb-inc/tiledb
The Universal Storage Engine
arrays data-analysis data-science dataframes dense-data hdfs s3 s3-storage scientific-computing sparse-arrays sparse-data storage-engine tiledb
Last synced: 13 May 2025
https://github.com/TileDB-Inc/TileDB
The Universal Storage Engine
arrays data-analysis data-science dataframes dense-data hdfs s3 s3-storage scientific-computing sparse-arrays sparse-data storage-engine tiledb
Last synced: 28 Mar 2025
https://github.com/spotify/snakebite
A pure python HDFS client
hdfs python python-hdfs-client
Last synced: 20 Oct 2025
https://github.com/harisekhon/devops-python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 13 Jun 2025
https://github.com/HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 11 Apr 2025
https://github.com/Stratio/sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
analytics hdfs kafka lambda olap real-time scala spark spark-streaming sparksql sparta stratio stratio-sparta streaming streaming-data triggers workflow
Last synced: 09 May 2025
https://github.com/lensesio/kafka-connect-ui
Web tool for Kafka Connect
cassandra documentdb elasticsearch hdfs influxdb jms kafka kafka-connect mqtt redis s3 twitter
Last synced: 04 Apr 2025
https://github.com/dromara/cloudeon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 15 May 2025
https://github.com/dromara/CloudEon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 04 Apr 2025
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 05 Apr 2025
https://github.com/uber/storagetapper
StorageTapper is a scalable realtime MySQL change data streaming, logical backup and logical replication service
avro cdc clickhouse etl hdfs json kafka msgpack mysql postgresql s3
Last synced: 11 Jun 2025
https://github.com/curvineio/curvine
High-performance distributed cache system, built in Rust.
ai ai-infra bigdata cache-storage cloud-native hdfs high-performance-computing io rust s3 shuffle spark train-acceleration
Last synced: 11 Aug 2025
https://github.com/divolte/divolte-collector
Divolte Collector
analytics analytics-tracking avro clickstream divolte-collector gcs hdfs java kafka pubsub
Last synced: 17 Dec 2025
https://github.com/rumbledb/rumble
⛈️ RumbleDB 1.23.0 "Mountain Ash" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml
Last synced: 03 Aug 2025
https://github.com/RumbleDB/rumble
Quick start: pip install jsoniq ⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy datasets (JSON, text, CSV, Parquet, Delta...) | Data Lakehouse with Updates, Scripting, Declarative Machine Learning and more
azure csv data-science dataframes delta-lake hdfs json jsoniq lakehouse machine-learning nested parquet query query-engine s3 scale schemaless spark svm text
Last synced: 20 Nov 2025
https://github.com/breuner/elbencho
A distributed storage benchmark for file systems, object stores & block devices with support for GPUs
benchmark block-storage deep-learning distributed file-systems fio gpu hdfs ior linux live-stats mdtest nvme parallel s3 storage windows
Last synced: 28 Dec 2025
https://github.com/tiledb-inc/tiledb-py
Python interface to the TileDB storage engine
array hdfs numpy python s3 storage-manager tiledb
Last synced: 15 May 2025
https://github.com/helyim/helyim
SeaweedFS implemented in pure Rust
cuda dpdk erasure-coding hdfs iouring kernel-bypass object-storage rdma s3 spdk webdav
Last synced: 03 Oct 2025
https://github.com/paddlepaddle/elasticctr
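ElasticCTR, the PaddlePaddle elastic-computing recommender system, is an open-source, enterprise-grade recommender system solution based on Kubernetes. It combines high-accuracy CTR models continuously refined in Baidu's business scenarios, the large-scale distributed training capability of the open-source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters, helping users deploy a recommender system on Kubernetes in one step. It offers high performance, industrial-grade deployment, and an end-to-end experience, and as an open-source kit it supports in-depth custom development.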
ctr hdfs k8s personalization ranking recommender-system
Last synced: 21 Aug 2025
https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter
Exports Hadoop HDFS content statistics to Prometheus
hadoop hadoop-fsimage hdfs hdfs-metrics monitoring prometheus-exporter
Last synced: 15 Sep 2025
https://github.com/d2iq-archive/dcos-commons
DC/OS SDK is a collection of tools, libraries, and documentation for easy integration of technologies such as Kafka, Cassandra, HDFS, Spark, and TensorFlow with DC/OS.
cassandra dcos dcos-data-services-guild declarative elasticsearch hdfs kafka kubernetes mesos stateful-containers tensorflow
Last synced: 26 Mar 2025
https://github.com/avast/hdfs-shell
HDFS Shell is an HDFS manipulation tool for working with the functions integrated in Hadoop DFS
big-data cli cli-application hadoop hdfs hdfs-manipulation linux shell
Last synced: 26 Oct 2025
https://github.com/jcrist/skein
A tool and library for easily deploying applications on Apache YARN
apache-yarn cluster deployment hadoop hdfs python
Last synced: 05 Apr 2025
https://github.com/linkedin/dynamometer
A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
hadoop hadoop-filesystem hadoop-framework hadoop-hdfs hdfs hdfs-dfs performance-analysis performance-metrics performance-test performance-testing scale scale-up testing testing-tools
Last synced: 17 Aug 2025
https://github.com/mmolimar/kafka-connect-fs
Kafka Connect FileSystem Connector
apache-kafka azure-storage confluent files filesystem ftp gcp hadoop hadoop-filesystem hdfs kafka kafka-connect kafka-connect-fs kafka-connector s3
Last synced: 11 May 2025
https://github.com/TileDB-Inc/TileDB-R
R interface to TileDB: The Modern Database
array hdfs r s3 storage-manager tiledb
Last synced: 13 Jul 2025
https://github.com/tiledb-inc/tiledb-r
R interface to TileDB: The Modern Database
array hdfs r s3 storage-manager tiledb
Last synced: 12 Apr 2025
https://github.com/ahmetfurkandemir/data-engineering-project-with-hdfs-and-kafka
Data Engineering Project with Hadoop HDFS and Kafka
data data-engineer data-engineering data-engineering-pipeline docker docker-compose hadoop hadoop-filesystem hadoop-hdfs hdfs hdfs-client hdfs-dfs kafka kafka-consumer kafka-producer kafka-ui kafkaui pipline python python-hdfs-client
Last synced: 15 Apr 2025
https://github.com/harisekhon/devops-perl-tools
25+ DevOps CLI Tools - Anonymizer, SQL ReCaser (MySQL, PostgreSQL, AWS Redshift, Snowflake, Apache Drill, Hive, Impala, Cassandra CQL, Microsoft SQL Server, Oracle, Couchbase N1QL, Dockerfiles), Hadoop HDFS & Hive tools, Solr/SolrCloud CLI, Nginx stats & HTTP(S) URL watchers for load-balanced web farms, Linux tools etc.
anonymize apache-drill cassandra couchbase docker hacktoberfest hadoop hbase hdfs hive kerberos linux mysql neo4j nginx recaser solr solrcloud sql
Last synced: 13 Jun 2025
https://github.com/starlake-ai/starlake
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
bigquery data-engineering data-integration data-pipeline etl hdfs redshift snowflake spark synapse
Last synced: 05 Apr 2025
https://github.com/seznam/euphoria
Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine independent programming model which can express both batch and stream transformations.
apache-flink apache-spark batch-processing big-data hadoop hdfs java-api kafka streaming-data unified-bigdata-processing
Last synced: 21 Aug 2025
https://github.com/dbiir/rainbow
A data layout optimization framework for wide tables stored on HDFS.
column-store data-analytics data-layout hdfs sql wide-table
Last synced: 30 Jun 2025
https://github.com/longshilin/hdfs-netdisc
A distributed cloud storage system based on Hadoop 🌴
bigdata filesystem hadoop hadoop-filesystem hdfs hdfs-client hdfs-netdisc netdisk
Last synced: 07 Apr 2025
https://github.com/monix/monix-connect
A set of connectors for Monix. 🔛
aws connectors dynamodb elasticsearch google-cloud-storage hdfs mongodb monix parquet reactive-streams redis s3 scala sqs workflow
Last synced: 21 Jul 2025
https://github.com/fluent/fluent-plugin-webhdfs
Hadoop WebHDFS output plugin for Fluentd
fluentd fluentd-plugin hadoop hdfs
Last synced: 05 Jul 2025
https://github.com/ait-aecid/anomaly-detection-log-datasets
Analysis scripts for log data sets used in anomaly detection.
anomaly-detection bgl hadoop hdfs log-data logs machine-learning python review semi-supervised sequences survey unsupervised
Last synced: 10 Apr 2025
https://github.com/ascrus/getl
A tool for developing and testing ETL and ELT processes for automating the capture, delivery and processing of information in data warehouses on the MicroFocus Vertica platform.
csv dsl elt etl excel hdfs hive impala json kafka sql unit-testing vertica xml
Last synced: 14 Jun 2025
https://github.com/damiencarol/jsr203-hadoop
A Java NIO file system provider for HDFS
Last synced: 07 Apr 2025
https://github.com/tiledb-inc/tiledb-go
Go Interface to the TileDB storage manager
array go golang golang-library hdfs s3 storage-manager tiledb
Last synced: 20 Aug 2025
https://github.com/terascope/teraslice
Scalable data processing pipelines in JavaScript
elasticsearch hadoop hdfs json kafka
Last synced: 04 Apr 2025
https://github.com/criteo/cluster-pack
A library on top of either pex or conda-pack to make your Python code easily available on a cluster
conda-pack hdfs pex pyspark s3 skein
Last synced: 05 Apr 2025
https://github.com/wittline/apache-spark-docker
Dockerizing an Apache Spark Standalone Cluster
apache-spark dataengineer dataengineering docker docker-compose hadoop-cluster hadoop-docker hdfs hive hive-metastore hue pyspark
Last synced: 13 Apr 2025
https://github.com/ibmstreams/samples
This repository contains open-source sample applications for IBM Streams.
database geofence geofencing hdfs healthcare ibm-streams samples stream-processing text-analytics timeseries
Last synced: 15 Jul 2025
https://github.com/canelmas/kafka-connect-field-and-time-partitioner
Kafka Connect Store Partitioner by custom fields and time
hdfs kafka kafka-connect kafka-connect-hdfs kafka-connect-s3 kafka-connector kafka-connectors partitioner s3 s3-storage
Last synced: 10 Apr 2025
https://github.com/jacobstanley/hadoop-tools
Tools for working with Hadoop, written with performance in mind.
Last synced: 11 Dec 2025
https://github.com/zongxr/bigdata-competition
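Third-prize solution for the national big data competition and second-prize solution for the provincial competition. One-click installation scripts for a big data environment that automatically deploy a cluster, including ZooKeeper, Hadoop, MySQL, Hive, Spark, and some base environments. Tested on real servers with excellent results, requiring only minimal manual input such as passwords. This frees up the manpower otherwise needed for installation, deployment, and configuration. Several Scala examples are also included, using Spark for data preparation.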
bigdata hadoop hdfs hive mysql scala shell spark wordcount zookeeper
Last synced: 14 Apr 2025
https://github.com/kmgowda/sbk
Storage Benchmark Kit
distributed-storage dockers filesystem grafana hdfs kafka latency mysql nats-streaming performance-benchmarking pravega prometheus-metrics pulsar sbk storage-device storage-driver throughput
Last synced: 05 Apr 2025
https://github.com/oracle/oci-hdfs-connector
HDFS Connector for Oracle Cloud Infrastructure
Last synced: 13 Apr 2025
https://github.com/agile-lab-dev/wasp
WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.
akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn
Last synced: 09 Apr 2025
https://github.com/sergio11/document_search_engine_architecture
📄🚀 Unleash a powerful Document Search Engine with Apache NiFi for lightning-fast, comprehensive text indexing and search.
consul docker elasticsearch feign-client hdfs kafka keycloak kibana logstash mongodb nifi nifi-templates rabbitmq spring-boot spring-cloud-gateway spring-cloud-stream stomp stompwebsocket tika tika-server
Last synced: 13 Aug 2025
https://github.com/confluentinc/kafka-connect-hdfs
Kafka Connect HDFS connector
apache-kafka big-data confluent hadoop hdfs kafka kafka-connect-hdfs kafka-connector streaming
Last synced: 11 May 2025
https://github.com/astrolabsoftware/spark-fits
FITS data source for Spark SQL and DataFrames
apache-spark fits fitsio hdfs pyspark scala spark-sql
Last synced: 29 Oct 2025
https://github.com/aphp/py-hdfs-mount
Mount HDFS with fuse, works with kerberos!
fuse hadoop hdfs kerberos mount mount-hdfs
Last synced: 18 Jul 2025
https://github.com/dayyass/pydfs
Distributed File System written in Python
distributed-systems filesystem hadoop hdfs mapreduce python
Last synced: 13 Apr 2025
https://github.com/lovenui/customer-viewership-realtime-analysis
crm-analytics etl hadoop hdfs hive kafka nifi spark
Last synced: 13 May 2025
https://github.com/manuparra/taller_sparkr
SparkR workshop for the Jornadas de Usuarios de R (R Users Conference)
artificial-intelligence bigdata data-analysis data-mining hdfs ipynb machine-learning-algorithms r rstudio spark sparklyr sparkr
Last synced: 17 Jul 2025
https://github.com/dmwm/cmsspark
General purpose framework to run CMS experiment workflows on HDFS/Spark platform
analytics bigdata cms-framework hdfs spark
Last synced: 14 Apr 2025
https://github.com/manuparra/masterdatcom_bdcc_practice
Practice and workshop on big data and cloud computing using Docker containers and OpenNebula. HDFS, Hadoop, and Spark+R
bigdata cloudcomputing containers docker hadoop hdfs linux opennebula practices spark sparkr
Last synced: 12 Apr 2025
https://github.com/stefen-taime/etl-data-pipeline-rdbms-to-hdfs-using-airflow-apache-sqoop-spark-postgres-and-hive
This project aims to move data from a relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS)
airflow big-data data docker-compose etl-pipeline hdfs hive infrastructure-as-code rdbms spark sql sqoop
Last synced: 03 Jul 2025
https://github.com/aymane-maghouti/big-data-project
This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.
apache-airflow apache-kafka apache-spark batch-processing big-data-projects hbase hdfs ingestion java lambda-architecture machine-learning postgresql-database powerbi pyspark python spring-boot streaming
Last synced: 29 Oct 2025
https://github.com/manuparra/masterdegreecc_practice
Workshop for the UGR Professional Master's in Computer Science. Cloud Computing course.
cloudcomputing cluster docker docker-cluster docker-container hadoop hadoop-cluster hdfs opennebula practice virtual-machine
Last synced: 12 Apr 2025
https://github.com/ibmstreams/streamsx.hdfs
This toolkit provides operators and functions for interacting with the Hadoop File System.
hadoop hdfs ibm-streams java stream-processing toolkit
Last synced: 09 Sep 2025
https://github.com/fasouto/webhdfspy
Python wrapper to access Hadoop HDFS REST API
hadoop-filesystem hdfs python wrapper
Last synced: 19 Apr 2025
https://github.com/nikoshet/monitoring-spark-on-docker
Spark Monitoring With Prometheus And Grafana Using Docker
docker docker-compose grafana hadoop hdfs monitoring node-exporter prometheus spark
Last synced: 24 Jul 2025
https://github.com/ditectrev/amazon-web-services-certified-aws-certified-data-analytics-das-c01-practice-tests-exams-question
⛳️ PASS: Amazon Web Services Certified (AWS Certified) Data Analytics Specialty (DAS-C01) by learning based on our Questions & Answers (Q&A) Practice Tests Exams.
amazon-athena amazon-aurora amazon-cloudwatch amazon-ec2 amazon-emr amazon-quicksight amazon-rds amazon-s3 apache-kafka apache-spark aws aws-certified aws-data-analytics aws-glue aws-lambda das-c01 hdfs practice-exam practice-exams practice-test
Last synced: 25 Apr 2025