An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/WeBankFinTech/WeBank-all-Project

A collection of the addresses of all projects that WeBank has participated in or established.

ai bigdata blockchain could dpr fate finance frontend linkis spark

Last synced: 27 Mar 2025

https://github.com/JahstreetOrg/spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo

helm history-server jupyter kubernetes livy spark

Last synced: 08 May 2025

https://github.com/ClickHouse/spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

arrow clickhouse datasourcev2 grpc http spark

Last synced: 03 May 2025

https://github.com/clickhouse/spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

arrow clickhouse datasourcev2 grpc http spark

Last synced: 12 Apr 2025

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 16 May 2025

https://github.com/G-Research/spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

gr-oss java pyspark python scala spark

Last synced: 15 Mar 2025

https://github.com/dvgodoy/handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes

exploratory-data-analysis imputation outlier-detection pandas pyspark python spark visualization

Last synced: 05 Apr 2025

https://github.com/g-research/spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

gr-oss java pyspark python scala spark

Last synced: 07 Apr 2025

https://github.com/karakanb/vue-info-card

Simple and beautiful card component with an elegant spark line, for VueJS.

card card-component component info-card spark vue vue-components vuejs vuejs2

Last synced: 07 Apr 2025

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML: feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 22 Jan 2025

https://github.com/syzer/js-spark

Realtime calculation distributed system. AKA distributed lodash

distributed distributed-computing multicore realtime spark

Last synced: 09 Apr 2025

https://github.com/adtech-labs/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 09 Apr 2025

https://github.com/apple/batch-processing-gateway

The gateway component to make Spark on K8s much easier for Spark users.

batch-processing k8s kubernetes spark

Last synced: 13 Apr 2025

https://github.com/ChatLunaLab/chatluna

A bot plugin for LLM chat services with multi-platform model integration, extensibility, and various output formats

ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript

Last synced: 07 Dec 2024

https://github.com/vericast/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 09 Jan 2025

https://github.com/swoop-inc/spark-alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

data-engineering data-science scala spark

Last synced: 07 May 2025

https://github.com/josephmachado/data_engineering_best_practices

Sample project to demonstrate data engineering best practices

data-engineering delta-lake etl great-expectations minio pyspark spark

Last synced: 15 Apr 2025

https://github.com/polomarcus/spark-structured-streaming-examples

Spark Structured Streaming / Kafka / Cassandra / Elastic

cassandra kafka spark spark-sql structured-streaming

Last synced: 10 Apr 2025

https://github.com/mc2-project/opaque-sql

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 28 Mar 2025

https://github.com/leobenkel/zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 07 Apr 2025

https://github.com/leobenkel/Zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 20 Apr 2025

https://github.com/benfradet/spark-kafka-writer

Write your Spark data to Kafka seamlessly

kafka spark

Last synced: 06 Apr 2025

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 27 Mar 2025

https://github.com/capeprivacy/cape-dataframes

Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark

Last synced: 06 Apr 2025

https://github.com/yaooqinn/spark-authorizer

A Spark SQL extension that provides SQL Standard Authorization for Apache Spark | This repo has been contributed to Apache Kyuubi | The project has migrated to Apache Kyuubi

acl hive ranger ranger-hive-plugin spark

Last synced: 13 Apr 2025

https://github.com/krishnan-r/sparkmonitor

Monitor Apache Spark from Jupyter Notebook

extension jupyter spark

Last synced: 22 Jan 2025

https://github.com/linkedin/lift

The LinkedIn Fairness Toolkit (LiFT) is a Scala/Spark library that enables the measurement of fairness in large scale machine learning workflows.

fairness fairness-ai fairness-ml linkedin machine-learning scala spark

Last synced: 21 Mar 2025

https://github.com/aliyun/aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

aliyun datasources e-mapreduce hadoop kafka spark

Last synced: 07 Apr 2025

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 27 Nov 2024

https://github.com/saurfang/spark-tsne

Distributed t-SNE via Apache Spark

spark tsne

Last synced: 13 Apr 2025

https://github.com/harisekhon/knowledge-base

Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public

aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark

Last synced: 05 Apr 2025

https://github.com/davidzajac1/zillacode

Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake

aws coding-interview dbt docker github-actions leetcode pandas pyspark python react snowflake spark terraform

Last synced: 04 Apr 2025

https://github.com/radanalyticsio/spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 07 May 2025

https://github.com/cubefs/shuttle

Shuttle: Highly Available, High-Performance Remote Shuffle Service

distributed hadoop remote shuffle spark

Last synced: 20 Dec 2024

https://github.com/sparkling-graph/sparkling-graph

SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.

approximation big-data coarsing comunity-detection-methods dsl graph graph-algorithms heuristics link-predication machine-learning measure network-analysis spark vertex

Last synced: 24 Apr 2025

https://github.com/helgeho/archivespark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 05 Apr 2025

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 08 Apr 2025

https://github.com/henridf/apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

data-frame node spark

Last synced: 01 Apr 2025

https://github.com/absaoss/cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

cobol cobol-parser copybook ebcdic etl mainframe scalable spark

Last synced: 09 Apr 2025

https://github.com/sansa-stack/sansa-stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 04 Apr 2025

https://github.com/SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 20 Nov 2024

https://github.com/eto-ai/rikai

Parquet-based ML data format optimized for working with unstructured data

deep-learning machine-learning pytorch spark tensorflow

Last synced: 07 Apr 2025

https://github.com/zuinnote/hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark

Last synced: 13 Apr 2025

https://github.com/virtuslab/iskra

Typesafe wrapper for Apache Spark DataFrame API

scala scala3 spark

Last synced: 05 Apr 2025

https://github.com/llm-red-team/spark-free-api

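🚀 Reverse-engineered API for the iFlytek Spark large model (specialty: office assistant). Supports high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn dialogue, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.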

chat-api chatbot chatgpt-api iflytek llm spark spark-ai

Last synced: 04 Apr 2025

https://github.com/easysql/easy_sql

A library developed to ease the data ETL development process.

clickhouse etl postgres postgresql python spark sql

Last synced: 16 May 2025

https://github.com/gvcgo/gvc

Geek's valuable collection. A cross-platform supertool that brings convenience to coding.

asciinema auto-install browser chatgpt cloc cross-platform docker environment g go gvm languages spark tools version webdav

Last synced: 29 Apr 2025

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 13 Apr 2025

https://github.com/isxcode/spark-yun

Big data computing platform based on Spark <至轻云: an ultra-lightweight big data computing platform / data middle platform>

docker hadoop hive platform saas spark

Last synced: 28 Dec 2024

https://github.com/clustering4ever/clustering4ever

C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 23 Feb 2025

https://github.com/Clustering4Ever/Clustering4Ever

C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 13 May 2025

https://github.com/apache/spark-docker

Official Dockerfile for Apache Spark

big-data java jdbc python r scala spark sql

Last synced: 04 Apr 2025

https://github.com/kavgan/phrase-at-scale

Detect common phrases in large amounts of text using a data-driven approach. The size of discovered phrases can be arbitrary. Can be used for languages other than English.

collocation-extraction multiword-expressions multiword-extraction natural-language-processing nlp nlp-machine-learning phrase-discovery phrase-extraction pyspark spark

Last synced: 26 Mar 2025

https://github.com/jaegertracing/spark-dependencies

Spark job for dependency links

jaegertracing spark

Last synced: 04 Apr 2025

https://github.com/lichaojacobs/java_learning_practice

The road to advanced Java: frequently asked interview algorithms, Akka, multithreading, NIO, Netty, SpringBoot, Spark && Flink, and more

algorithm flink java netty spark spring web

Last synced: 19 Dec 2024

https://github.com/memverge/splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

apache-spark bigdata disaggregation elasticity java scala shuffle spark storage

Last synced: 05 Apr 2025

https://github.com/apache/spark-website

Apache Spark Website

big-data java jdbc python r scala spark sql

Last synced: 15 May 2025

https://github.com/mkuthan/example-spark-kafka

Apache Spark and Apache Kafka integration example

kafka spark spark-streaming

Last synced: 07 Apr 2025

https://github.com/smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data

Last synced: 13 Apr 2025

https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker

One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)

docker docker-compose kafka kafka-spark kafka-spark-streaming kafka-zeppelin spark spark-kafka spark-streaming-kafka spark-zeppelin streaming zeppelin

Last synced: 08 Apr 2025

https://github.com/jleetutorial/sparktutorial

Source code for James Lee's Apache Spark with Java course

bigdata spark

Last synced: 09 Apr 2025

https://github.com/shjwudp/c4-dataset-script

Inspired by Google's C4, this is a series of Colossal Clean data-cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods used in MassiveText.

commoncrawl dataset massivetext nlp python spark

Last synced: 02 Dec 2024

https://github.com/alexarchambault/ammonite-spark

Run Spark calculations from Ammonite

ammonite scala spark

Last synced: 17 Mar 2025

https://github.com/233zzh/TitanDataOperationSystem

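The best big data project. The Titan Data Operation System is a full-stack, closed-loop system: a web system for data visualization, log ingestion through Flume-Kafka-Flume, a data warehouse designed in Hive, Spark code that transforms data between warehouse tables and migrates ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, SpringBoot, Bootstrap, Echart, and others.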

azkaban flume hadoop hive kafka spark

Last synced: 27 Mar 2025

https://github.com/utdemir/distributed-dataset

A distributed data processing framework in Haskell.

aws-lambda data-processing distributed haskell spark

Last synced: 16 Mar 2025

https://github.com/rstudio/bigdataclass

Two-day workshop that covers how to use R to interact with databases and Spark

big-data db dbi dbplyr r spark

Last synced: 14 Apr 2025

https://github.com/indix/schemer

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

avro graphql-api json parquet schema-inference schema-registry spark tsv

Last synced: 12 Feb 2025

https://github.com/innat/ML-Resource

A concise resource repository for machine learning

data-analysis data-science deep-learning kaggle machine-learning python spark

Last synced: 29 Apr 2025

https://github.com/JaryZhen/rulegin

A lightweight rule engine system based on a JavaScript engine, refactored from the open-source IoT project thingboard

grpc-java javascript kafka netty spark sping zk

Last synced: 27 Mar 2025

https://github.com/AdaCore/RecordFlux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

ada binary-parser communication-protocol formal-methods formal-specification formal-verification parser protocol-parser protocol-specification python spark

Last synced: 14 Mar 2025

https://github.com/hurence/logisland

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

analytics big-data cassandra complex-event-processing elasticsearch influxdb kafka kafka-streams pattern-recognition solr spark stream-processing

Last synced: 07 Apr 2025

https://github.com/rstudio-conf-2020/big-data

:wrench: Use dplyr to analyze Big Data :elephant:

databases dplyr r rstudio spark sparklyr workshop

Last synced: 14 Apr 2025

https://github.com/sjrusso8/spark-connect-rs

Apache Spark Connect Client for Rust

grpc-client spark spark-connect spark-sql

Last synced: 16 May 2025

https://github.com/commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

apache-parquet aws-athena columnar-storage commoncrawl spark sql

Last synced: 25 Nov 2024

https://github.com/apache/spark-kubernetes-operator

Apache Spark Kubernetes Operator

java kubernetes spark

Last synced: 05 Apr 2025

https://github.com/trK54Ylmz/kafka-spark-streaming-example

Simple example of Spark Streaming over a Kafka topic

java kafka spark stream-processing

Last synced: 02 Apr 2025

https://github.com/trk54ylmz/kafka-spark-streaming-example

Simple example of Spark Streaming over a Kafka topic

java kafka spark stream-processing

Last synced: 12 May 2025

https://github.com/feng-li/Distributed-Statistical-Computing

Teaching Materials for Distributed Statistical Computing (teaching materials for distributed big data computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 26 Mar 2025

https://github.com/vspiewak/twitter-sentiment-analysis

Streaming tweets with Spark, language detection & sentiment analysis, dashboard with Kibana

dashboard kibana nlp scala sentiment-analysis spark tiwtter

Last synced: 22 Apr 2025

https://github.com/iimeta/fastapi-admin

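Enterprise-grade system for rapid LLM API integration. Supports OpenAI, Azure, ERNIE Bot, iFlytek Spark, Tongyi Qianwen, Zhipu GLM, Gemini, DeepSeek, Anthropic Claude, and OpenAI-format models. Clean page style; lightweight, efficient, and stable; supports one-click Docker deployment.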

api chatgpt deepseek ernie-bot fast fastapi glm gpt gpt-4 openai qwen realtime spark

Last synced: 16 May 2025