Projects in Awesome Lists tagged with hadoop

https://github.com/donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano

Last synced: 12 May 2025

https://github.com/spotify/luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

hadoop luigi orchestration-framework python scheduling

Last synced: 12 May 2025

https://github.com/tencent/apijson

🏆 Real-time, coding-free, powerful and secure ORM 🚀 providing APIs and docs without any backend coding; the frontend (client) can customize the data and structure of the returned JSON

baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb

Last synced: 13 May 2025

https://github.com/Tencent/APIJSON

🏆 Real-time, coding-free, powerful and secure ORM 🚀 providing APIs and docs without any backend coding; the frontend (client) can customize the data and structure of the returned JSON

baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb

Last synced: 01 Apr 2025

https://github.com/heibaiying/bigdata-notes

A beginner's guide to big data ⭐

azkaban big-data bigdata flume hadoop hbase hdfs hive kafka mapreduce phoenix scala spark sqoop storm yarn zookeeper

Last synced: 25 Apr 2025

https://github.com/prestodb/presto

The official home of the Presto distributed SQL query engine for big data

big-data data hadoop hive java lakehouse presto query sql

Last synced: 12 May 2025

https://github.com/heibaiying/BigData-Notes

A beginner's guide to big data ⭐

azkaban big-data bigdata flume hadoop hbase hdfs hive kafka mapreduce phoenix scala spark sqoop storm yarn zookeeper

Last synced: 24 Mar 2025

https://github.com/apache/hadoop

Apache Hadoop

hadoop

Last synced: 12 May 2025

https://github.com/deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark

Last synced: 12 May 2025

https://github.com/apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 12 May 2025

https://github.com/apache/incubator-doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 14 Dec 2024

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 12 May 2025

https://github.com/wangzhiwubigdata/god-of-bigdata

Focused on big data learning and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 13 May 2025

https://github.com/wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 27 Mar 2025

https://github.com/linkedin/school-of-sre

At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.

git hadoop linux mysql networking nosql python security sre system-design

Last synced: 14 May 2025

https://github.com/h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark

Last synced: 12 May 2025

https://github.com/alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 12 May 2025

https://github.com/Alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 26 Mar 2025

https://github.com/harisekhon/devops-bash-tools

1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..

api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform

Last synced: 23 Apr 2025

https://github.com/HariSekhon/DevOps-Bash-tools

1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..

api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform

Last synced: 02 Apr 2025

https://github.com/apache/hive

Apache Hive

apache big-data database hadoop hive java sql

Last synced: 14 May 2025

https://github.com/apache/ignite

Apache Ignite

big-data cache cloud data-management-platform database distributed-sql-database hadoop ignite in-memory-computing in-memory-database iot network-client network-server osgi sql

Last synced: 14 May 2025

https://github.com/apache/calcite

Apache Calcite

big-data calcite geospatial hadoop java sql

Last synced: 12 May 2025

https://github.com/tomwhite/hadoop-book

Example source code accompanying O'Reilly's "Hadoop: The Definitive Guide" by Tom White

book hadoop o-reilly

Last synced: 14 May 2025

https://github.com/webankfintech/dataspherestudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 14 May 2025

https://github.com/WeBankFinTech/DataSphereStudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 14 Mar 2025

https://github.com/apache/nutch

Apache Nutch is an extensible and scalable web crawler

apache crawling hadoop java nutch web-crawler

Last synced: 13 May 2025

https://github.com/luckyzxl2016/movie_recommend

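A Spark-based movie recommendation system, comprising a web crawler project, a website, an admin back-end system, and the Spark recommendation system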

hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven

Last synced: 15 May 2025

https://github.com/LuckyZXL2016/Movie_Recommend

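A Spark-based movie recommendation system, comprising a web crawler project, a website, an admin back-end system, and the Spark recommendation system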

hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven

Last synced: 26 Mar 2025

https://github.com/moran1607/bigdataguide

Big data learning: learn big data from scratch, including study videos for each learning stage and interview materials

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 14 May 2025

https://github.com/MoRan1607/BigDataGuide

Big data learning: learn big data from scratch, including study videos for each learning stage and interview materials

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 04 Apr 2025

https://github.com/geekyouth/szt-bigdata

Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟

cdh6 clickhouse docker elasticsearch flink hadoop hbase hive kafka kibana kylin mongodb mysql phoenix redis scala spark springboot szt-bigdata zookeeper

Last synced: 14 Apr 2025

https://github.com/geekyouth/SZT-bigdata

Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟

cdh6 clickhouse docker elasticsearch flink hadoop hbase hive kafka kibana kylin mongodb mysql phoenix redis scala spark springboot szt-bigdata zookeeper

Last synced: 28 Mar 2025

https://github.com/big-data-europe/docker-hadoop

Apache Hadoop docker image

docker docker-hadoop hadoop hadoop-cluster hadoop-docker

Last synced: 14 May 2025

https://github.com/apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 13 May 2025

https://github.com/dahuoyzs/javapdf

🍣 100 Java technical e-books in PDF (take pride in downloading and reading, not in merely liking and bookmarking)

hadoop java-pdf jvm mysql mysql-innodb

Last synced: 15 May 2025

https://github.com/cdarlint/winutils

winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows

binaries hadoop winutils

Last synced: 14 May 2025

https://github.com/apache/drill

Apache Drill is a distributed MPP query layer for self-describing data

big-data drill hadoop hive java jdbc parquet sql

Last synced: 13 May 2025

https://github.com/gchq/gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 12 May 2025

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 04 May 2025

https://github.com/qihoo360/hbox

AI on Hadoop

ai caffe deeplearning hadoop machinelearning mxnet tensorflow yarn

Last synced: 15 May 2025

https://github.com/Qihoo360/hbox

AI on Hadoop

ai caffe deeplearning hadoop machinelearning mxnet tensorflow yarn

Last synced: 27 Mar 2025

https://github.com/water8394/bigdata-interview

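🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from the web, with the author's own summarized answers. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.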

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 15 May 2025

https://github.com/water8394/BigData-Interview

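🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from the web, with the author's own summarized answers. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.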

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 27 Mar 2025

https://github.com/collabh/bigdata-growth

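A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.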

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 14 May 2025

https://github.com/collabH/bigdata-growth

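A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.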

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 28 Mar 2025

https://github.com/apache/carbondata

High performance data store solution

apache big-data carbondata data-format hadoop java scala spark

Last synced: 13 May 2025

https://github.com/harisekhon/dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 14 May 2025

https://github.com/obenner/data-engineering-interview-questions

More than 2000 data engineer interview questions.

airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql

Last synced: 14 May 2025

https://github.com/OBenner/data-engineering-interview-questions

More than 2000 data engineer interview questions.

airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql

Last synced: 10 Apr 2025

https://github.com/HariSekhon/Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 03 Apr 2025

https://github.com/wgzhao/addax

A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly

clickhouse database etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 14 May 2025

https://github.com/wgzhao/Addax

A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL seamlessly

clickhouse database etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 14 Apr 2025

https://github.com/dtstack/taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 15 May 2025

https://github.com/DTStack/Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 27 Mar 2025

https://github.com/HariSekhon/nagios-plugins

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper

Last synced: 09 May 2025

https://github.com/harisekhon/nagios-plugins

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper

Last synced: 14 May 2025

https://github.com/teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 15 May 2025

https://github.com/Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 06 Apr 2025

https://github.com/oeljeklaus-you/useractionanalyzeplatform

Big data platform for e-commerce user behavior analysis

accumulator hadoop java kyro spark spark-sql sparkjava

Last synced: 16 May 2025

https://github.com/apache/ozone

Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

big-data hadoop kubernetes object-store s3 storage

Last synced: 11 Apr 2025

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 11 Apr 2025

https://github.com/tony-framework/TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow

Last synced: 20 Apr 2025

https://github.com/tony-framework/tony

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow

Last synced: 12 Apr 2025

https://github.com/WeBankFinTech/WeDataSphere

WeDataSphere is a financial grade, one-stop big data platform suite.

analytics bigdata data-analysis datafabric datagovernance dataspherestudio exchangis flink hadoop hive ide linkis prophecis qualitis schedulis scriptis spark streamis visualis

Last synced: 27 Mar 2025

https://github.com/absaoss/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 16 May 2025

https://github.com/cerndb/dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow

Last synced: 22 Jan 2025

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 04 Apr 2025

https://github.com/linkedin/venice

Venice, Derived Data Platform for Planet-Scale Workloads.

ai database hadoop kafka ml

Last synced: 27 Apr 2025

https://github.com/Esri/gis-tools-for-hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.

hadoop spatial-analysis

Last synced: 30 Mar 2025

https://github.com/raray-chuan/xichuan_note

xichuan's study notes and summaries, covering Java, Spring, other commonly used Java frameworks, and big data components 📚

bigdata elk flink hadoop hbase hive java juc jvm kafaka kafka redis spark spring springcloud zabbix zookeeper

Last synced: 05 Apr 2025

https://github.com/apache/tez

Apache Tez

apache big-data hadoop java tez

Last synced: 14 May 2025

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 23 Mar 2025

https://github.com/houshanren/big_data_architect_skills

Skills a big data architect should master

analytics bigdata hadoop skills spark xuan-xing

Last synced: 05 Apr 2025

https://github.com/dromara/cloudeon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 15 May 2025

https://github.com/dromara/CloudEon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 04 Apr 2025

https://github.com/fabiogjardim/bigdata_docker

Big Data Ecosystem Docker

hadoop hbase hdfs hive hue jupyter-notebook metabase mongo mysql nifi presto spark zookeeper

Last synced: 04 Apr 2025

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 15 May 2025

https://github.com/hortonworks/cloudbreak

CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.

big-data cloud cloudera deployment hacktoberfest hadoop java

Last synced: 09 May 2025

https://github.com/cwensel/cascading

Cascading is a feature rich API for defining and executing complex and fault tolerant data processing flows locally or on a cluster.

hadoop java mapreduce tez

Last synced: 15 May 2025

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 06 Apr 2025

https://github.com/tencent/caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

containerd docker hadoop kubernetes runtime yarn

Last synced: 05 Apr 2025

https://github.com/tirthajyoti/spark-with-python

Fundamentals of Spark with Python (using PySpark), code examples

analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql

Last synced: 05 Apr 2025

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 07 Apr 2025

https://github.com/datawhalechina/juicy-bigdata

🎉🎉🐳 Datawhale Introduction to Big Data Processing tutorial | The opening course of the big data technology track 🎉🎉

bigdata hadoop hbase hdfs hive mapreduce spark

Last synced: 09 Apr 2025

https://github.com/sakserv/hadoop-mini-clusters

hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE

hadoop hadoop-mini-clusters ide java test-automation

Last synced: 04 Apr 2025

https://github.com/florent37/android-nosql

Lightweight, simple structured NoSQL database for Android

android cassandra cassandra-database data db elastic firebase hadoop local mongo mongodb nosql path preferences saver shared simple sql uri

Last synced: 19 Jan 2025

https://github.com/googleclouddataproc/hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.

bigquery google-cloud-dataproc hadoop hadoop-filesystem hadoop-hcfs

Last synced: 13 May 2025

https://github.com/GoogleCloudDataproc/hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.

bigquery google-cloud-dataproc hadoop hadoop-filesystem hadoop-hcfs

Last synced: 14 Mar 2025

https://github.com/brndnmtthws/facebook-hive-udfs

Facebook's Hive UDFs

hadoop hive udf udf-libraries

Last synced: 05 Apr 2025

https://github.com/wavestone-cdt/hadoop-attack-library

A collection of pentest tools and resources targeting Hadoop environments

bigdata hadoop pentest

Last synced: 10 Apr 2025

https://github.com/apache/calcite-avatica

Apache Calcite Avatica

big-data calcite geospatial hadoop java sql

Last synced: 13 May 2025

https://github.com/shifuml/shifu

An end-to-end machine learning and data mining framework on Hadoop

bigdata end-to-end-machine-learning gbdt hadoop machine-learning neural-network pipeline random-forest shifu

Last synced: 05 Apr 2025

https://github.com/oeljeklaus-you/javaorbigdata-interview

A compilation of interview knowledge points for Java developers and big data developers

bigdata hadoop interview java spark storm

Last synced: 08 May 2025

https://github.com/jasonTangxd/recommendSys

Recommendation project (real-time and offline recommendation)

hadoop kafka mahot storm toos

Last synced: 04 May 2025

https://github.com/mellanox/sparkrdma

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark

Last synced: 22 Jan 2025

https://github.com/huangfox/dpkb

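A summary of big data topics, including distributed storage engines, distributed compute engines, and data warehouse construction. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse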

flink hadoop hbase hive presto spark

Last synced: 27 Mar 2025

https://github.com/apache/incubator-wayang

Apache Wayang (incubating) is the first cross-platform data processing system.

apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark

Last synced: 15 May 2025

https://github.com/HariSekhon/HAProxy-configs

80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.

apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper

Last synced: 07 Apr 2025

https://github.com/harisekhon/haproxy-configs

80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.

apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper

Last synced: 09 Apr 2025