An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/combust/mleap

MLeap: Deploy ML Pipelines to Production

data-pipelines python scala scikit-learn spark tensorflow transformers

Last synced: 14 May 2025

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 15 May 2025

https://github.com/kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

big-data datafusion rust-lang spark

Last synced: 14 May 2025

https://github.com/apache/carbondata

High performance data store solution

apache big-data carbondata data-format hadoop java scala spark

Last synced: 13 May 2025

https://github.com/jupyter-incubator/sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

cluster jupyter jupyter-notebook kerberos kernel livy magic notebook pandas-dataframe pyspark spark sql-query

Last synced: 13 May 2025

https://github.com/harisekhon/dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 14 May 2025

https://github.com/HariSekhon/Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 03 Apr 2025

https://github.com/databricks/learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming

Last synced: 14 May 2025

https://github.com/dtstack/taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 15 May 2025

https://github.com/DTStack/Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 27 Mar 2025

https://github.com/mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark

big-data big-data-analytics data-algorithms pyspark spark spark-dataframes spark-rdd

Last synced: 14 May 2025

https://github.com/datavane/datasophon

The next generation of cloud-native big data management expert, aiming to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.

cloudnative doris easy-to-use kubernetes spark yarn

Last synced: 15 May 2025

https://github.com/apachecn/spark-doc-zh

Chinese edition of the official Apache Spark documentation

big-data documentation java spark

Last synced: 07 Apr 2025

https://github.com/projectnessie/nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

aws-lambda data git iceberg java spark

Last synced: 13 May 2025

https://github.com/teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 15 May 2025

https://github.com/Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 06 Apr 2025

https://github.com/jacksu/utils4s

A collection of test cases and related materials gathered while using Scala and Spark

akka breeze json4s scala scala-demo scala-spark spark spark-streaming

Last synced: 16 May 2025

https://github.com/SnappyDataInc/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 31 Mar 2025

https://github.com/graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

apache-spark big-data connected-components dataframe dataframes graphs network-motif network-motifs networks spark

Last synced: 14 May 2025

https://github.com/tibcosoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 15 May 2025

https://github.com/TIBCOSoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 14 May 2025

https://github.com/oeljeklaus-you/useractionanalyzeplatform

Big data platform for e-commerce user behavior analysis

accumulator hadoop java kyro spark spark-sql sparkjava

Last synced: 16 May 2025

https://github.com/josonle/coding-now

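Study notes, along with e-books and video resources I have gone through and a collection of blogs, websites, and tools I consider worthwhile. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more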

bigdata coding ebook-collection hadoop-hive java linux notes spark

Last synced: 16 May 2025

https://github.com/bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

avro big-data bioinformatics genomics java parquet python r scala spark

Last synced: 12 May 2025

https://github.com/twosigma/flint

A Time Series Library for Apache Spark

spark timeseries

Last synced: 12 Apr 2025

https://github.com/deanwampler/spark-scala-tutorial

A free tutorial for Apache Spark.

jupyter scala spark tutorial

Last synced: 16 May 2025

https://github.com/h2oai/sparkling-water

Sparkling Water provides H2O functionality inside a Spark cluster

big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark

Last synced: 13 May 2025

https://github.com/redislabs/spark-redis

A connector for Spark that allows reading from and writing to a Redis cluster

dataframe java redis spark

Last synced: 14 May 2025

https://github.com/apache/celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

bigdata shuffle spark

Last synced: 14 May 2025

https://github.com/apache/datafusion-comet

Apache DataFusion Comet Spark Accelerator

arrow datafusion rust spark

Last synced: 14 May 2025

https://github.com/RedisLabs/spark-redis

A connector for Spark that allows reading from and writing to a Redis cluster

dataframe java redis spark

Last synced: 28 Mar 2025

https://github.com/wzhe06/sparkctr

CTR prediction model based on Spark (LR, GBDT, DNN)

computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib

Last synced: 13 Apr 2025

https://github.com/apache/incubator-livy

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

apachelivy bigdata livy spark

Last synced: 12 May 2025

https://github.com/wzhe06/SparkCTR

CTR prediction model based on Spark (LR, GBDT, DNN)

computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib

Last synced: 17 Dec 2024

https://github.com/fayson/cdhproject

Usage of the various Hadoop components, continuously updated

java scala spark

Last synced: 16 May 2025

https://github.com/nvidia/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 14 May 2025

https://github.com/pingcap/tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV

bigdata spark tidb tikv

Last synced: 14 May 2025

https://github.com/typelevel/frameless

Expressive types for Spark.

fp functional-programming scala spark typelevel

Last synced: 14 May 2025

https://github.com/NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 04 Apr 2025

https://github.com/IBM/elasticsearch-spark-recommender

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

elasticsearch ibmcode jupyter python spark

Last synced: 03 May 2025

https://github.com/delta-io/delta-sharing

An open protocol for secure data sharing

big-data data-sharing delta-lake pandas spark

Last synced: 13 May 2025

https://github.com/jadianes/spark-movie-lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

big-data bigdata flask movie-recommendation movielens-dataset python spark

Last synced: 12 Apr 2025

https://github.com/WeBankFinTech/Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin

Last synced: 23 Nov 2024

https://github.com/lyhue1991/eat_pyspark_in_10_days

pyspark🍒🥭 is delicious, just eat it!😋😋

pyspark spark

Last synced: 04 Apr 2025

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 11 Apr 2025

https://github.com/cdapio/cdap

An open source framework for building data analytic applications.

cdap dataset integration java java-8 mapreduce middleware platform python spark spark-streaming unified

Last synced: 13 May 2025

https://github.com/mrpowers-io/spark-daria

Essential Spark extensions and helper methods ✨😲

dataframe spark

Last synced: 14 Apr 2025

https://github.com/lucacanali/sparkmeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 14 May 2025

https://github.com/a616567126/gpt-web-java

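A JDK 8-based AI chatbot! Supports Midjourney image generation and card-key redemption through a WeChat official account; the web client supports ChatGPT, Midjourney image generation, Stable Diffusion image generation, card-key redemption, EasyPay payments, driving traffic to the official account, and email registration 🔥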

bard-api chatgpt google midjourney-api spark stable-diffusion

Last synced: 21 Apr 2025

https://github.com/miguno/kafka-storm-starter

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm

Last synced: 22 Jan 2025

https://github.com/mongodb/mongo-spark

The MongoDB Spark Connector

connector mongo-spark mongodb spark spark-packages

Last synced: 14 May 2025

https://github.com/LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 25 Nov 2024

https://github.com/metabrainz/listenbrainz-server

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

big-data database listenbrainz-server music python react spark typescript web

Last synced: 17 Apr 2025

https://github.com/deanwampler/JustEnoughScalaForSpark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

jupyter scala spark tutorial

Last synced: 14 Mar 2025

https://github.com/deanwampler/justenoughscalaforspark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

jupyter scala spark tutorial

Last synced: 04 Apr 2025

https://github.com/jupyter-server/enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.

enterprise gateway hacktoberfest jupyter jupyter-enterprise-gateway jupyter-kernels jupyter-notebook kernel kubernetes remote-kernels spark spark-on-kubernetes yarn

Last synced: 16 May 2025

https://github.com/yanagishima/yanagishima

Web UI for Trino, Hive and SparkSQL

elasticsearch hive spark trino

Last synced: 27 Mar 2025

https://github.com/absaoss/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 16 May 2025

https://github.com/xubo245/SparkLearning

Learning Apache Spark, including code and data. Most parts can run locally.

learning ml spark sparkcore

Last synced: 28 Mar 2025

https://github.com/awslabs/data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform

Last synced: 15 Apr 2025

https://github.com/yotpoltd/metorikku

A simplified, lightweight ETL Framework based on Apache Spark

big-data distributed-computing etl etl-framework etl-pipeline scala spark sql

Last synced: 06 Apr 2025

https://github.com/igniterealtime/spark

Cross-platform real-time collaboration client optimized for business and organizations.

collaboration cross-platform jabber java openfire spark xmpp xmpp-client

Last synced: 14 May 2025

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 04 Apr 2025

https://github.com/mvillarrealb/docker-spark-cluster

A simple Spark standalone cluster for your testing environment purposes

bigdata developer-tools docker-compose spark

Last synced: 16 May 2025

https://github.com/capitalone/datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

compare dask data data-science dataframes fugue numpy pandas polars pyspark python snowflake snowpark spark

Last synced: 14 May 2025

https://github.com/minio/sidekick

High Performance HTTP Sidecar Load Balancer

bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark

Last synced: 30 Mar 2025

https://github.com/housepower/clickhouse-native-jdbc

ClickHouse Native Protocol JDBC implementation

analytics clickhouse clickhouse-client database jdbc spark tcp-protocol

Last synced: 27 Feb 2025

https://github.com/housepower/ClickHouse-Native-JDBC

ClickHouse Native Protocol JDBC implementation

analytics clickhouse clickhouse-client database jdbc spark tcp-protocol

Last synced: 03 May 2025

https://github.com/ankurchavda/streamify

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

airflow data-engineering dbt gcp kafka python spark

Last synced: 27 Nov 2024

https://github.com/josephmachado/beginner_de_project

Beginner data engineering project - batch edition

airflow database docker emr engineering etl python redshift redshift-cluster spark

Last synced: 15 May 2025

https://github.com/running-elephant/moonbox

Moonbox is a DVtaaS (Data Virtualization as a Service) Platform

data-virtualization hive kudu moonbox spark virtual-database

Last synced: 04 Apr 2025

https://github.com/nightscape/spark-excel

A Spark plugin for reading and writing Excel files

data-frame etl excel scala spark

Last synced: 15 May 2025

https://github.com/raray-chuan/xichuan_note

xichuan's study summary notes, covering Java, Spring, other commonly used Java frameworks, and big data components 📚

bigdata elk flink hadoop hbase hive java juc jvm kafaka kafka redis spark spring springcloud zabbix zookeeper

Last synced: 05 Apr 2025

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 23 Mar 2025

https://github.com/Kotlin/kotlin-spark-api

This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

bigdata kotlin nullability scala spark

Last synced: 13 May 2025