Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with hadoop
A curated list of projects in awesome lists tagged with hadoop.
https://github.com/donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano
Last synced: 14 Jan 2025
https://github.com/spotify/luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
hadoop luigi orchestration-framework python scheduling
Last synced: 13 Jan 2025
https://github.com/tencent/apijson
🏆 Real-time, coding-free, powerful and secure ORM 🚀 providing APIs and docs without any backend coding; the data and structure of the returned JSON can be customized by frontend (client) users
baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb
Last synced: 13 Jan 2025
https://github.com/Tencent/APIJSON
🏆 Real-time, coding-free, powerful and secure ORM 🚀 providing APIs and docs without any backend coding; the data and structure of the returned JSON can be customized by frontend (client) users
baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb
Last synced: 02 Nov 2024
https://github.com/deeplearning4j/deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark
Last synced: 13 Jan 2025
https://github.com/trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino
Last synced: 13 Jan 2025
https://github.com/linkedin/school-of-sre
At LinkedIn, we use this curriculum to onboard our entry-level talent into the SRE role.
git hadoop linux mysql networking nosql python security sre system-design
Last synced: 14 Jan 2025
https://github.com/h2oai/h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark
Last synced: 13 Jan 2025
https://github.com/alluxio/alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem
Last synced: 13 Jan 2025
https://github.com/Alluxio/alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem
Last synced: 29 Oct 2024
https://github.com/harisekhon/devops-bash-tools
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..
api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform
Last synced: 14 Jan 2025
https://github.com/HariSekhon/DevOps-Bash-tools
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..
api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform
Last synced: 03 Nov 2024
https://github.com/apache/ignite
Apache Ignite
big-data cache cloud data-management-platform database distributed-sql-database hadoop ignite in-memory-computing in-memory-database iot network-client network-server osgi sql
Last synced: 14 Jan 2025
https://github.com/apache/calcite
Apache Calcite
big-data calcite geospatial hadoop java sql
Last synced: 13 Jan 2025
https://github.com/tomwhite/hadoop-book
Example source code accompanying O'Reilly's "Hadoop: The Definitive Guide" by Tom White
Last synced: 29 Nov 2024
https://github.com/webankfintech/dataspherestudio
DataSphereStudio is a one-stop data application development and management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin
Last synced: 14 Jan 2025
https://github.com/WeBankFinTech/DataSphereStudio
DataSphereStudio is a one-stop data application development and management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin
Last synced: 26 Oct 2024
https://github.com/apache/nutch
Apache Nutch is an extensible and scalable web crawler
apache crawling hadoop java nutch web-crawler
Last synced: 14 Jan 2025
https://github.com/luckyzxl2016/movie_recommend
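A Spark-based movie recommendation system, comprising a crawler project, a website, a back-office management system, and a Spark recommendation system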
hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven
Last synced: 18 Jan 2025
https://github.com/LuckyZXL2016/Movie_Recommend
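A Spark-based movie recommendation system, comprising a crawler project, a website, a back-office management system, and a Spark recommendation system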
hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven
Last synced: 29 Oct 2024
https://github.com/geekyouth/szt-bigdata
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
cdh6 clickhouse docker elasticsearch flink hadoop hbase hive kafka kibana kylin mongodb mysql phoenix redis scala spark springboot szt-bigdata zookeeper
Last synced: 17 Jan 2025
https://github.com/geekyouth/SZT-bigdata
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
cdh6 clickhouse docker elasticsearch flink hadoop hbase hive kafka kibana kylin mongodb mysql phoenix redis scala spark springboot szt-bigdata zookeeper
Last synced: 31 Oct 2024
https://github.com/big-data-europe/docker-hadoop
Apache Hadoop docker image
docker docker-hadoop hadoop hadoop-cluster hadoop-docker
Last synced: 17 Jan 2025
https://github.com/apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 14 Jan 2025
https://github.com/dahuoyzs/javapdf
🍣 100 Java e-books: technical books in PDF (take pride in downloading and reading them, not in merely liking and bookmarking)
hadoop java-pdf jvm mysql mysql-innodb
Last synced: 18 Jan 2025
https://github.com/cdarlint/winutils
winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows
Last synced: 17 Jan 2025
https://github.com/gchq/Gaffer
A large-scale entity and relation database supporting aggregation of properties
accumulo aggregation big-data graph graph-database hadoop hbase parquet spark
Last synced: 13 Nov 2024
https://github.com/gchq/gaffer
A large-scale entity and relation database supporting aggregation of properties
accumulo aggregation big-data graph graph-database hadoop hbase parquet spark
Last synced: 14 Jan 2025
https://github.com/qihoo360/hbox
AI on Hadoop
ai caffe deeplearning hadoop machinelearning mxnet tensorflow yarn
Last synced: 16 Jan 2025
https://github.com/Qihoo360/hbox
AI on Hadoop
ai caffe deeplearning hadoop machinelearning mxnet tensorflow yarn
Last synced: 30 Oct 2024
https://github.com/apache/carbondata
High performance data store solution
apache big-data carbondata data-format hadoop java scala spark
Last synced: 14 Jan 2025
https://github.com/DTStack/Taier
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system
Last synced: 30 Oct 2024
https://github.com/harisekhon/dockerfiles
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper
Last synced: 16 Jan 2025
https://github.com/HariSekhon/Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper
Last synced: 04 Nov 2024
https://github.com/dtstack/taier
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system
Last synced: 16 Jan 2025
https://github.com/obenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 16 Jan 2025
https://github.com/wgzhao/addax
Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
clickhouse database etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino
Last synced: 16 Jan 2025
https://github.com/harisekhon/nagios-plugins
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper
Last synced: 16 Jan 2025
https://github.com/HariSekhon/nagios-plugins
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper
Last synced: 16 Nov 2024
https://github.com/OBenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 07 Nov 2024
https://github.com/teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 17 Jan 2025
https://github.com/Teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 05 Nov 2024
https://github.com/wgzhao/Addax
Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
clickhouse data-integrity database datax etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino
Last synced: 08 Nov 2024
https://github.com/oeljeklaus-you/useractionanalyzeplatform
Big data platform for analyzing e-commerce user behavior
accumulator hadoop java kyro spark spark-sql sparkjava
Last synced: 20 Jan 2025
https://github.com/apache/ozone
Scalable, redundant, and distributed object store for Apache Hadoop
big-data hadoop kubernetes object-store s3 storage
Last synced: 16 Jan 2025
https://github.com/HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 07 Nov 2024
https://github.com/tony-framework/TonY
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow
Last synced: 09 Nov 2024
https://github.com/tony-framework/tony
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow
Last synced: 20 Jan 2025
https://github.com/WeBankFinTech/WeDataSphere
WeDataSphere is a financial grade, one-stop big data platform suite.
analytics bigdata data-analysis datafabric datagovernance dataspherestudio exchangis flink hadoop hive ide linkis prophecis qualitis schedulis scriptis spark streamis visualis
Last synced: 30 Oct 2024
https://github.com/cerndb/dist-keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow
Last synced: 28 Sep 2024
https://github.com/absaoss/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 19 Jan 2025
https://github.com/AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 05 Nov 2024
https://github.com/Esri/gis-tools-for-hadoop
The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
Last synced: 01 Nov 2024
https://github.com/uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
avro-schema data-lake hadoop ingest-data schema-format spark
Last synced: 28 Oct 2024
https://github.com/dromara/cloudeon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 18 Jan 2025
https://github.com/dromara/CloudEon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 05 Nov 2024
https://github.com/hortonworks/cloudbreak
CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.
big-data cloud cloudera deployment hacktoberfest hadoop java
Last synced: 15 Nov 2024
https://github.com/kanyun-inc/ytk-learn
Ytk-learn is a distributed machine learning library that implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark
Last synced: 14 Jan 2025
https://github.com/cwensel/cascading
Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing flows locally or on a cluster.
Last synced: 17 Jan 2025
https://github.com/tencent/caelus
Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs
containerd docker hadoop kubernetes runtime yarn
Last synced: 18 Jan 2025
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 19 Jan 2025
https://github.com/elasticluster/elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark
Last synced: 06 Nov 2024
https://github.com/sakserv/hadoop-mini-clusters
hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE
hadoop hadoop-mini-clusters ide java test-automation
Last synced: 19 Jan 2025
https://github.com/GoogleCloudDataproc/hadoop-connectors
Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
bigquery google-cloud-dataproc hadoop hadoop-filesystem hadoop-hcfs
Last synced: 25 Oct 2024
https://github.com/googleclouddataproc/hadoop-connectors
Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
bigquery google-cloud-dataproc hadoop hadoop-filesystem hadoop-hcfs
Last synced: 15 Jan 2025
https://github.com/apache/calcite-avatica
Apache Calcite Avatica
big-data calcite geospatial hadoop java sql
Last synced: 14 Jan 2025
https://github.com/wavestone-cdt/hadoop-attack-library
A collection of pentest tools and resources targeting Hadoop environments
Last synced: 18 Nov 2024
https://github.com/shifuml/shifu
An end-to-end machine learning and data mining framework on Hadoop
bigdata end-to-end-machine-learning gbdt hadoop machine-learning neural-network pipeline random-forest shifu
Last synced: 20 Jan 2025
https://github.com/harisekhon/haproxy-configs
80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.
apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper
Last synced: 15 Jan 2025
https://github.com/HariSekhon/HAProxy-configs
80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.
apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper
Last synced: 06 Nov 2024
https://github.com/apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark
Last synced: 18 Jan 2025
https://github.com/chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 16 Jan 2025