Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/vector4wang/quick-spark-process

:star2::star2::star2: Examples for learning Spark

java spark springboot-spark

Last synced: 03 Jul 2024

https://github.com/huangyueranbbc/SparkDemo

Complete Spark example code (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 03 Jul 2024

https://github.com/yanagishima/yanagishima

Web UI for Trino, Hive and SparkSQL

elasticsearch hive spark trino

Last synced: 03 Jul 2024

https://github.com/lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 03 Jul 2024

https://github.com/DTStack/Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 03 Jul 2024

https://github.com/XuefengHuang/RecommendationSystem

Book recommender system using collaborative filtering based on Spark

collaborative-filtering python-flask recommendation-system spark

Last synced: 02 Jul 2024

https://github.com/P7h/docker-spark

:ship: Docker image for Apache Spark

docker hadoop java scala spark

Last synced: 02 Jul 2024

https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 02 Jul 2024

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 02 Jul 2024

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 27 Jun 2024

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 27 Jun 2024

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 24 Jun 2024

https://github.com/broadinstitute/gatk

Official code repository for GATK versions 4 and up

bioinformatics dna gatk genome genomics ngs science sequencing spark

Last synced: 23 Jun 2024

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 21 Jun 2024

https://github.com/IBM/elasticsearch-spark-recommender

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

elasticsearch ibmcode jupyter python spark

Last synced: 20 Jun 2024

https://github.com/CognonicLabs/awesome-AI-kubernetes

:snowflake: :whale: Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc

ai analytics big-data cognitive-science data-science docker kubeflow kubernetes kubernetes-ai kubernetes-analytics kubernetes-data-science kubernetes-ml ml pachyderm python-ml scala seldon-core spark spark-kubernetes spark-ml

Last synced: 20 Jun 2024

https://github.com/alexarchambault/ammonite-spark

Run spark calculations from Ammonite

ammonite scala spark

Last synced: 17 Jun 2024

https://github.com/piotr-kalanski/data-quality-monitoring

Data Quality Monitoring Tool

data-quality monitoring scala spark

Last synced: 17 Jun 2024

https://github.com/dstlry/dstlr

scalable knowledge graph construction from unstructured text

corenlp neo4j spark

Last synced: 17 Jun 2024

https://github.com/camposvinicius/aws-etl

This is an ETL application on AWS that uses general open sales and customer data, available here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several CSV files to which transformations are applied.

airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark

Last synced: 16 Jun 2024

https://github.com/iamabug/BigDataParty

All-in-One Dockerfile for big data components

big-data dockerfile hadoop kafka spark

Last synced: 16 Jun 2024

https://github.com/innat/ML-Resource

A concise resource repository for machine learning

data-analysis data-science deep-learning kaggle machine-learning python spark

Last synced: 16 Jun 2024

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 15 Jun 2024

https://github.com/mdrakiburrahman/sgx-pyspark-sql-demo

Demonstrating Confidential Analytics on Azure SGX VMs with Apache Spark and SCONE.

azure azure-sql-database docker kubernetes sgx spark

Last synced: 15 Jun 2024

https://github.com/mc2-project/opaque-sql

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 15 Jun 2024

https://github.com/a616567126/gpt-web-java

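AI chatbot based on JDK 8! WeChat official account: Midjourney image generation and redemption codes. Web: supports ChatGPT, Midjourney image generation, Stable Diffusion image generation, redemption codes, EPay payments, official-account traffic referral, and email registration 🔥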

bard-api chatgpt google midjourney-api spark stable-diffusion

Last synced: 14 Jun 2024

https://github.com/Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 14 Jun 2024

https://github.com/kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow-datafusion big-data data-engineering execution-engine rust spark sql

Last synced: 13 Jun 2024

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 13 Jun 2024

https://github.com/dynamicheart/DTSS

Distributed Transaction Settlement System

hadoop kafka se347 spark zookeeper

Last synced: 12 Jun 2024

https://github.com/awslabs/data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform

Last synced: 12 Jun 2024

https://github.com/Anant/Cassandra.Realtime

Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink

akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming

Last synced: 12 Jun 2024

https://github.com/deanwampler/spark-scala-tutorial

A free tutorial for Apache Spark.

jupyter scala spark tutorial

Last synced: 12 Jun 2024

https://github.com/xubo245/SparkLearning

Learning Apache Spark, including code and data. Most parts can run locally.

learning ml spark sparkcore

Last synced: 11 Jun 2024

https://github.com/HariSekhon/Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 11 Jun 2024

https://github.com/x4ax/lxss-install-zeppelin

Step by step guide on how to install Zeppelin 0.7.3 on Linux subsystem (WSL) for Windows 10

hadoop linux-subsystem lxss spark wsl zeppelin

Last synced: 10 Jun 2024

https://github.com/thangdnsf/BigCLAM-ApacheSpark

Overlapping community detection in large-scale networks using the BigCLAM model, built on Apache Spark

apache-spark bigclam bigclam-model community-detection graph-mining graphx large-scale latex machine-learning scala scale-networks spark

Last synced: 09 Jun 2024

https://github.com/collabH/bigdata-growth

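Big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.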

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 08 Jun 2024

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 07 Jun 2024

https://github.com/neoremind/kraps-rpc

An RPC framework leveraging the Spark RPC module

rpc spark

Last synced: 07 Jun 2024

https://github.com/NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 07 Jun 2024

https://github.com/huangfox/dpkb

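A summary of big data topics, including distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse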

flink hadoop hbase hive presto spark

Last synced: 07 Jun 2024

https://github.com/wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 07 Jun 2024

https://github.com/jaceklaskowski/spark-kubernetes-book

The Internals of Spark on Kubernetes

apache-spark book internals kubernetes spark

Last synced: 07 Jun 2024

https://github.com/melin/spark-jobserver

REST job server for Apache Spark

hadoop hive java kerberos kubernetes spark yarn

Last synced: 07 Jun 2024

https://github.com/fiatjaf/kwh

webln browser extension for lightningd/eclair/ptarmigan

c-lightning eclair lightning-network lightningd ptarmigan spark web-extension webln

Last synced: 07 Jun 2024

https://github.com/MoRan1607/BigDataGuide

Big data learning: learn big data from scratch, including learning videos for each stage of study and interview materials

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 07 Jun 2024

https://github.com/zhaoyachao/zdh_web

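Big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, private-domain marketing, and more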

bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl

Last synced: 07 Jun 2024

https://github.com/bytedance/CloudShuffleService

Cloud Shuffle Service(CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 07 Jun 2024

https://github.com/apache/incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 07 Jun 2024

https://github.com/datavane/datasophon

The next generation of cloud-native big data management expert, aiming to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.

cloudnative doris easy-to-use kubernetes spark yarn

Last synced: 07 Jun 2024

https://github.com/duhanmin/bigdata-sql-parser

Data lineage: supports lineage for Spark SQL, Hive SQL, PostgreSQL SQL, Presto SQL, MySQL SQL, TiDB SQL, Flink SQL, and DataX, plus lineage parsing of spark/flink jar run commands; supports WITH syntax

datax flink hive mysql postgresql presto spark tidb trino

Last synced: 07 Jun 2024

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 07 Jun 2024

https://github.com/Nosto/spartann

Hyper performant kNN using Annoy for Apache Spark.

ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark

Last synced: 07 Jun 2024

https://github.com/microsoft/Azure-Synapse-Content-Recommendations-Solution-Accelerator

This is a solution accelerator for creating personalized content recommendations based on user activity.

azure-synapse-analytics power-bi spark

Last synced: 04 Jun 2024

https://github.com/microsoft/hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 04 Jun 2024

https://github.com/henridf/apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

data-frame node spark

Last synced: 03 Jun 2024

https://github.com/apachecn/.github

ApacheCN open source organization: announcements, introduction, members, events, and contact channels

dl ml python pytorch solidity spark

Last synced: 03 Jun 2024

https://github.com/flyteorg/flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows

Last synced: 02 Jun 2024

https://github.com/projectnessie/nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

aws-lambda data git iceberg java spark

Last synced: 02 Jun 2024

https://github.com/minio/sidekick

High Performance HTTP Sidecar Load Balancer

bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark

Last synced: 02 Jun 2024

https://github.com/ging/fiware-cosmos

The Cosmos Generic Enabler enables an easier BigData analysis over context integrated with some of the most popular BigData platforms.

analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine

Last synced: 01 Jun 2024

https://github.com/ballista-compute/ballista

Distributed compute platform implemented in Rust, and powered by Apache Arrow.

arrow dataframe datafusion distributed java jvm kotlin kubernetes rust scala spark

Last synced: 01 Jun 2024

https://github.com/felipekunzler/frequent-itemset-mining-spark

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

apriori dfps fp-growth rapriori scala spark yafim

Last synced: 01 Jun 2024

https://github.com/felipekunzler/spark-twitter-analysis

Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.

java reactjs scala spark

Last synced: 31 May 2024

https://github.com/Angel-ML/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming

Last synced: 31 May 2024

https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark

Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300

data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming

Last synced: 31 May 2024

https://github.com/feng-li/Distributed-Statistical-Computing

Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 31 May 2024

https://github.com/zhonghuasheng/Tutorial

Summary of the full-stack backend (Java, Golang) knowledge architecture

emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial

Last synced: 31 May 2024

https://github.com/aalansehaiyang/technology-talk

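[Big Tech Interview Column] A technical guide for Java programmers, with interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!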

dubbo es6 git hbase java kafka mycat spark spring springboot

Last synced: 31 May 2024

https://github.com/zhisheng17/flink-learning

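Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, source code analysis, and more. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, and Table API & SQL, as well as case studies of large-scale production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Big Data Real-Time Computing Engine: Flink in Action and Performance Optimization".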

clickhouse elasticsearch flink hbase influxdb kafka loki mysql opentsdb rabbitmq redis rocketmq spark stream-processing streaming

Last synced: 31 May 2024

https://github.com/apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 31 May 2024

https://github.com/XZB-1248/Spark

✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to control and monitor all your devices from anywhere.

dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell

Last synced: 31 May 2024

https://github.com/water8394/BigData-Interview

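:dart: :star2: [Big Data Interview Questions] A collection of big data interview questions gathered from the web, with my own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks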

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 31 May 2024

https://github.com/liyupi/sql-generator

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (logic-heavy, UI-light) that is good for practice.

ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3

Last synced: 30 May 2024

https://github.com/miztiik/s3-to-rds-with-glue

Extract, transform, and load data for analytic processing using AWS Glue

cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark

Last synced: 27 May 2024

https://github.com/AuFeld/Data_Engineering_Projects

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark

Last synced: 27 May 2024