Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

Projects in Awesome Lists tagged with hadoop

A curated list of projects in awesome lists tagged with hadoop .

https://github.com/donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano

Last synced: 29 Sep 2024

https://github.com/spotify/luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

hadoop luigi orchestration-framework python scheduling

Last synced: 29 Sep 2024

https://github.com/tencent/apijson

🏆 实时 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构 🏆 Real-Time coding-free, powerful and secure ORM 🚀 providing APIs and Docs without coding by Backend, and the returned JSON of API can be customized by Frontend(Client) users

baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb

Last synced: 29 Sep 2024

https://github.com/Tencent/APIJSON

🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.

baas clickhouse crud databricks elasticsearch hadoop hive influxdb low-code lowcode milvus nocode oracle postgresql postgresql-database serverless snowflake sqlserver tdengine tidb

Last synced: 01 Aug 2024

https://github.com/prestodb/presto

The official home of the Presto distributed SQL query engine for big data

big-data data hadoop hive java lakehouse presto query sql

Last synced: 29 Sep 2024

https://github.com/apache/hadoop

Apache Hadoop

hadoop

Last synced: 29 Sep 2024

https://github.com/deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark

Last synced: 28 Sep 2024

https://github.com/eclipse/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark

Last synced: 31 Jul 2024

https://github.com/apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 29 Sep 2024

https://github.com/apache/incubator-doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 04 Aug 2024

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 29 Sep 2024

https://github.com/wangzhiwubigdata/god-of-bigdata

专注大数据学习面试,大数据成神之路开启。Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 28 Sep 2024

https://github.com/wangzhiwubigdata/God-Of-BigData

专注大数据学习面试,大数据成神之路开启。Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 31 Jul 2024

https://github.com/linkedin/school-of-sre

At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.

git hadoop linux mysql networking nosql python security sre system-design

Last synced: 30 Sep 2024

https://github.com/alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 29 Sep 2024

https://github.com/Alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 31 Jul 2024

https://github.com/h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark

Last synced: 28 Sep 2024

https://github.com/harisekhon/devops-bash-tools

1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..

api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform

Last synced: 26 Sep 2024

https://github.com/tomwhite/hadoop-book

Example source code accompanying O'Reilly's "Hadoop: The Definitive Guide" by Tom White

book hadoop o-reilly

Last synced: 30 Sep 2024

https://github.com/webankfintech/dataspherestudio

DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 28 Sep 2024

https://github.com/WeBankFinTech/DataSphereStudio

DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 30 Jul 2024

https://github.com/apache/nutch

Apache Nutch is an extensible and scalable web crawler

apache crawling hadoop java nutch web-crawler

Last synced: 30 Sep 2024

https://github.com/luckyzxl2016/movie_recommend

基于Spark的电影推荐系统,包含爬虫项目、web网站、后台管理系统以及spark推荐系统

hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven

Last synced: 27 Sep 2024

https://github.com/LuckyZXL2016/Movie_Recommend

基于Spark的电影推荐系统,包含爬虫项目、web网站、后台管理系统以及spark推荐系统

hadoop hive mysql nginx scala scrapy spark-mllib spark-streaming ssm-maven

Last synced: 31 Jul 2024

https://github.com/HariSekhon/DevOps-Bash-tools

1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, tmux..

api aws bash ci cloudera devops docker gcp git github hacktoberfest hadoop jenkins kafka kubernetes linux mysql perl postgresql terraform

Last synced: 01 Aug 2024

https://github.com/moran1607/bigdataguide

大数据学习,从零开始学习大数据,包含大数据学习各阶段学习视频、面试资料

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 28 Sep 2024

https://github.com/MoRan1607/BigDataGuide

大数据学习,从零开始学习大数据,包含大数据学习各阶段学习视频、面试资料

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 01 Aug 2024

https://github.com/dahuoyzs/javapdf

🍣100本 Java电子书 技术书籍PDF(以下载阅读为荣,以点赞收藏为耻)

hadoop java-pdf jvm mysql mysql-innodb

Last synced: 30 Sep 2024

https://github.com/apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 28 Sep 2024

https://github.com/cdarlint/winutils

winutils.exe hadoop.dll and hdfs.dll binaries for hadoop windows

binaries hadoop winutils

Last synced: 30 Sep 2024

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 02 Aug 2024

https://github.com/gchq/gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 28 Sep 2024

https://github.com/water8394/bigdata-interview

:dart: :star2:[大数据面试题]分享自己在网络上收集的大数据相关的面试题以及自己的答案总结.目前包含Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper框架的面试题知识总结

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 27 Sep 2024

https://github.com/water8394/BigData-Interview

:dart: :star2:[大数据面试题]分享自己在网络上收集的大数据相关的面试题以及自己的答案总结.目前包含Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper框架的面试题知识总结

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 31 Jul 2024

https://github.com/collabh/bigdata-growth

大数据知识仓库涉及到数据仓库建模、实时计算、大数据、数据中台、系统设计、Java、算法等。

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 28 Sep 2024

https://github.com/apache/carbondata

High performance data store solution

apache big-data carbondata data-format hadoop java scala spark

Last synced: 29 Sep 2024

https://github.com/collabH/bigdata-growth

大数据知识仓库涉及到数据仓库建模、实时计算、大数据、数据中台、系统设计、Java、算法等。

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 31 Jul 2024

https://github.com/dtstack/taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 28 Sep 2024

https://github.com/DTStack/Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 31 Jul 2024

https://github.com/harisekhon/dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 28 Sep 2024

https://github.com/HariSekhon/Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 01 Aug 2024

https://github.com/harisekhon/nagios-plugins

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper

Last synced: 29 Sep 2024

https://github.com/HariSekhon/nagios-plugins

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

aws cassandra cloud cloudera consul docker elasticsearch hacktoberfest hadoop hbase jenkins kafka kubernetes linux mysql nagios-plugins rabbitmq redis solr zookeeper

Last synced: 03 Aug 2024

https://github.com/teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 28 Sep 2024

https://github.com/wgzhao/Addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.

clickhouse data-integrity database datax etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 01 Aug 2024

https://github.com/wgzhao/addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.

clickhouse data-integrity database datax etl excel hadoop hdfs hive impala influxdb kudu mysql oracle postgresql sqlserver trino

Last synced: 28 Sep 2024

https://github.com/Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 01 Aug 2024

https://github.com/oeljeklaus-you/useractionanalyzeplatform

电商用户行为分析大数据平台

accumulator hadoop java kyro spark spark-sql sparkjava

Last synced: 28 Sep 2024

https://github.com/apache/ozone

Scalable, redundant, and distributed object store for Apache Hadoop

big-data hadoop kubernetes object-store s3 storage

Last synced: 30 Sep 2024

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 01 Aug 2024

https://github.com/tony-framework/TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow

Last synced: 02 Aug 2024

https://github.com/tony-framework/tony

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.

deep-learning hadoop hadoop-yarn horovod machine-learning tensorflow

Last synced: 26 Sep 2024

https://github.com/cerndb/dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow

Last synced: 28 Sep 2024

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 01 Aug 2024

https://github.com/Esri/gis-tools-for-hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.

hadoop spatial-analysis

Last synced: 01 Aug 2024

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 31 Jul 2024

https://github.com/apache/tez

Apache Tez

apache big-data hadoop java tez

Last synced: 30 Sep 2024

https://github.com/linkedin/venice

Venice, Derived Data Platform for Planet-Scale Workloads.

ai database hadoop kafka ml

Last synced: 02 Aug 2024

https://github.com/dromara/cloudeon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 27 Sep 2024

https://github.com/dromara/CloudEon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 01 Aug 2024

https://github.com/hortonworks/cloudbreak

CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.

big-data cloud cloudera deployment hacktoberfest hadoop java

Last synced: 03 Aug 2024

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library which implements most of popular machine learning algorithms(GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 04 Aug 2024

https://github.com/tencent/caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

containerd docker hadoop kubernetes runtime yarn

Last synced: 27 Sep 2024

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 01 Aug 2024

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 01 Aug 2024

https://github.com/sakserv/hadoop-mini-clusters

hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE

hadoop hadoop-mini-clusters ide java test-automation

Last synced: 30 Sep 2024

https://github.com/googleclouddataproc/hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.

bigquery google-cloud-dataproc hadoop hadoop-filesystem hadoop-hcfs

Last synced: 29 Sep 2024

https://github.com/wavestone-cdt/hadoop-attack-library

A collection of pentest tools and resources targeting Hadoop environments

bigdata hadoop pentest

Last synced: 03 Aug 2024

https://github.com/apache/calcite-avatica

Apache Calcite Avatica

big-data calcite geospatial hadoop java sql

Last synced: 30 Sep 2024

https://github.com/HariSekhon/HAProxy-configs

80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.

apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper

Last synced: 01 Aug 2024

https://github.com/harisekhon/haproxy-configs

80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.

apache-drill cassandra cloudera elasticsearch hacktoberfest hadoop haproxy hbase hive influxdb mapr mysql nosql opentsdb postgresql presto prometheus redis solrcloud zookeeper

Last synced: 29 Sep 2024

https://github.com/huangfox/dpkb

大数据相关内容汇总,包括分布式存储引擎、分布式计算引擎、数仓建设等。关键词:Hadoop、HBase、ES、Kudu、Hive、Presto、Spark、Flink、Kylin、ClickHouse

flink hadoop hbase hive presto spark

Last synced: 31 Jul 2024

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 02 Aug 2024

https://github.com/apache/hadoop-hdfs

Mirror of Apache Hadoop HDFS

hadoop

Last synced: 28 Sep 2024

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on Linked In Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 28 Sep 2024

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 31 Jul 2024

https://github.com/aliyun/aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

aliyun datasources e-mapreduce hadoop kafka spark

Last synced: 26 Sep 2024

https://github.com/apache/hadoop-common

Mirror of Apache Hadoop common

hadoop

Last synced: 28 Sep 2024

https://github.com/nielsbasjes/logparser

Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Flink, Beam, Storm, Drill, ...

apache beam drill flink hadoop hive httpd java logformat nginx parse parser

Last synced: 27 Sep 2024

https://github.com/jcrist/skein

A tool and library for easily deploying applications on Apache YARN

apache-yarn cluster deployment hadoop hdfs python

Last synced: 07 Aug 2024

https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter

Exports Hadoop HDFS content statistics to Prometheus

hadoop hadoop-fsimage hdfs hdfs-metrics monitoring prometheus-exporter

Last synced: 01 Aug 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 28 Sep 2024