Apache Spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-05-27 00:23:34 UTC
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 10 May 2025
https://github.com/JahstreetOrg/spark-on-kubernetes-helm
Spark on Kubernetes infrastructure Helm charts repo
helm history-server jupyter kubernetes livy spark
Last synced: 08 May 2025
https://github.com/ClickHouse/spark-clickhouse-connector
Spark ClickHouse Connector built on the DataSourceV2 API
arrow clickhouse datasourcev2 grpc http spark
Last synced: 03 May 2025
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 16 May 2025
https://github.com/dvgodoy/handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
exploratory-data-analysis imputation outlier-detection pandas pyspark python spark visualization
Last synced: 05 Apr 2025
https://github.com/karakanb/vue-info-card
Simple and beautiful card component with an elegant spark line, for VueJS.
card card-component component info-card spark vue vue-components vuejs vuejs2
Last synced: 07 Apr 2025
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, Information Gain selection, Distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 22 Jan 2025
https://github.com/syzer/js-spark
Realtime calculation distributed system. AKA distributed lodash
distributed distributed-computing multicore realtime spark
Last synced: 09 Apr 2025
https://github.com/adtech-labs/spylon-kernel
Jupyter kernel for Scala and Spark
jupyter-kernels kernel metakernel scala spark team-platform
Last synced: 09 Apr 2025
https://github.com/apple/batch-processing-gateway
The gateway component to make Spark on K8s much easier for Spark users.
batch-processing k8s kubernetes spark
Last synced: 13 Apr 2025
https://github.com/ChatLunaLab/chatluna
A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript
Last synced: 07 Dec 2024
https://github.com/vericast/spylon-kernel
Jupyter kernel for Scala and Spark
jupyter-kernels kernel metakernel scala spark team-platform
Last synced: 09 Jan 2025
https://github.com/swoop-inc/spark-alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
data-engineering data-science scala spark
Last synced: 07 May 2025
https://github.com/josephmachado/data_engineering_best_practices
Sample project to demonstrate data engineering best practices
data-engineering delta-lake etl great-expectations minio pyspark spark
Last synced: 15 Apr 2025
https://github.com/nareshk1290/udacity-data-engineering
Udacity Data Engineering Nano Degree (DEND)
airflow aws cassandra etl postgresql redshift s3 spark star-schema udacity-dend
Last synced: 10 Apr 2025
https://github.com/polomarcus/spark-structured-streaming-examples
Spark Structured Streaming / Kafka / Cassandra / Elastic
cassandra kafka spark spark-sql structured-streaming
Last synced: 10 Apr 2025
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 15 Apr 2025
https://github.com/locationtech-labs/geopyspark
GeoTrellis for PySpark
big-data geospatial geotrellis python spark tile-server
Last synced: 27 Nov 2024
https://github.com/mc2-project/opaque-sql
An encrypted data analytics platform
analytics enclave machine-learning privacy security spark spark-sql
Last synced: 28 Mar 2025
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 22 Jan 2025
https://github.com/leobenkel/Zparkio
Boilerplate framework to use Spark and ZIO together.
boiler-plate functional-programming helpers scala spark template zio
Last synced: 20 Apr 2025
https://github.com/benfradet/spark-kafka-writer
Write your Spark data to Kafka seamlessly
Last synced: 06 Apr 2025
https://github.com/dsaidgovsg/airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Last synced: 27 Mar 2025
https://github.com/capeprivacy/cape-dataframes
Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.
collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark
Last synced: 06 Apr 2025
https://github.com/yaooqinn/spark-authorizer
A Spark SQL extension which provides SQL Standard Authorization for Apache Spark | This repo has been contributed to Apache Kyuubi | The project has migrated to Apache Kyuubi
acl hive ranger ranger-hive-plugin spark
Last synced: 13 Apr 2025
https://github.com/krishnan-r/sparkmonitor
Monitor Apache Spark from Jupyter Notebook
Last synced: 22 Jan 2025
https://github.com/linkedin/lift
The LinkedIn Fairness Toolkit (LiFT) is a Scala/Spark library that enables the measurement of fairness in large scale machine learning workflows.
fairness fairness-ai fairness-ml linkedin machine-learning scala spark
Last synced: 21 Mar 2025
https://github.com/aliyun/aliyun-emapreduce-datasources
Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
aliyun datasources e-mapreduce hadoop kafka spark
Last synced: 07 Apr 2025
https://github.com/unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
data-engineer data-pipeline data-science luigi machine-learning rest-api spark
Last synced: 27 Nov 2024
https://github.com/harisekhon/knowledge-base
Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark
Last synced: 05 Apr 2025
https://github.com/davidzajac1/zillacode
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
aws coding-interview dbt docker github-actions leetcode pandas pyspark python react snowflake spark terraform
Last synced: 04 Apr 2025
https://github.com/baghelamit/iot-traffic-monitor
cassandra java kafka spark spring-boot
Last synced: 09 Apr 2025
https://github.com/radanalyticsio/spark-operator
Operator for managing the Spark clusters on Kubernetes and OpenShift.
apache-spark kubernetes kubernetes-operator openshift spark
Last synced: 07 May 2025
https://github.com/cubefs/shuttle
Shuttle: Highly Available, High-Performance Remote Shuffle Service
distributed hadoop remote shuffle spark
Last synced: 20 Dec 2024
https://github.com/sparkling-graph/sparkling-graph
SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.
approximation big-data coarsing comunity-detection-methods dsl graph graph-algorithms heuristics link-predication machine-learning measure network-analysis spark vertex
Last synced: 24 Apr 2025
https://github.com/qubole/spark-on-lambda
Apache Spark on AWS Lambda
apache-spark aws aws-cloud aws-lambda big-data lambda serverless spark
Last synced: 07 Apr 2025
https://github.com/helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark internet-archive spark spark-framework warc web-archiving webarchive
Last synced: 08 Apr 2025
https://github.com/henridf/apache-spark-node
Node.js bindings for Apache Spark DataFrame APIs
Last synced: 01 Apr 2025
https://github.com/absaoss/cobrix
A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
cobol cobol-parser copybook ebcdic etl mainframe scalable spark
Last synced: 09 Apr 2025
https://github.com/SANSA-Stack/SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 20 Nov 2024
https://github.com/eto-ai/rikai
Parquet-based ML data format optimized for working with unstructured data
deep-learning machine-learning pytorch spark tensorflow
Last synced: 07 Apr 2025
https://github.com/zuinnote/hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark
Last synced: 13 Apr 2025
https://github.com/virtuslab/iskra
Typesafe wrapper for Apache Spark DataFrame API
Last synced: 05 Apr 2025
https://github.com/edoardottt/spark-ar-creators
List of 9500 (and counting) Spark AR Creators. Open an issue or contact me if you want to be added.❤️
3d ar ar-studio augmented-reality augmented-reality-applications facebook facebook-face-recognition facebook-filter filter filters instagram instagram-feed photos spark spark-ar spark-ar-creators spark-ar-studio sparkar virtual-reality vr
Last synced: 05 Apr 2025
https://github.com/qubole/kinesis-sql
Kinesis Connector for Structured Streaming
kinesis real-time-processing spark spark-streaming spark-structured-streaming structured-streaming
Last synced: 08 Apr 2025
https://github.com/llm-red-team/spark-free-api
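🚀 Reverse-engineered API for the iFlytek Spark large model (specialty: office assistant). Supports high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn dialogue, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.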
chat-api chatbot chatgpt-api iflytek llm spark spark-ai
Last synced: 04 Apr 2025
https://github.com/easysql/easy_sql
A library developed to ease the data ETL development process.
clickhouse etl postgres postgresql python spark sql
Last synced: 16 May 2025
https://github.com/gvcgo/gvc
Geek's valuable collection. A cross-platform supertool that brings convenience to coding.
asciinema auto-install browser chatgpt cloc cross-platform docker environment g go gvm languages spark tools version webdav
Last synced: 29 Apr 2025
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 13 Apr 2025
https://github.com/Clustering4Ever/Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 13 May 2025
https://github.com/kavgan/phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
collocation-extraction multiword-expressions multiword-extraction natural-language-processing nlp nlp-machine-learning phrase-discovery phrase-extraction pyspark spark
Last synced: 26 Mar 2025
https://github.com/jaegertracing/spark-dependencies
Spark job for dependency links
Last synced: 04 Apr 2025
https://github.com/memverge/splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
apache-spark bigdata disaggregation elasticity java scala shuffle spark storage
Last synced: 05 Apr 2025
https://github.com/alanchn31/Movalytics-Data-Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
airflow analytics aws-redshift aws-s3 data-engineer-nanodegree data-engineering data-engineering-pipeline data-modelling data-warehouse-cloud docker movie-database movie-recommendation movie-reviews pyspark python3 redshift spark sql udacity
Last synced: 04 Dec 2024
https://github.com/mkuthan/example-spark-kafka
Apache Spark and Apache Kafka integration example
Last synced: 07 Apr 2025
https://github.com/Qihoo360/XLearning-XDML
extremely distributed machine learning
ai distributed hadoop hazelcast kudu machine-learning parameter-server spark
Last synced: 28 Mar 2025
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 13 Apr 2025
https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker
One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)
docker docker-compose kafka kafka-spark kafka-spark-streaming kafka-zeppelin spark spark-kafka spark-streaming-kafka spark-zeppelin streaming zeppelin
Last synced: 08 Apr 2025
https://github.com/jleetutorial/sparktutorial
Source code for James Lee's Apache Spark with Java course
Last synced: 09 Apr 2025
https://github.com/shjwudp/c4-dataset-script
Inspired by Google C4, this is a series of data cleaning scripts focused on CommonCrawl data processing, including the Chinese data processing and cleaning methods in MassiveText.
commoncrawl dataset massivetext nlp python spark
Last synced: 02 Dec 2024
https://github.com/alexarchambault/ammonite-spark
Run spark calculations from Ammonite
Last synced: 17 Mar 2025
https://github.com/233zzh/TitanDataOperationSystem
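The best big data project. The Titan Data Operation System is a full-stack, closed-loop system: a web system for data visualization, flume-kafka-flume for reading logs, a data warehouse designed in Hive, Spark code for transforming data between warehouse tables and migrating ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, SpringBoot, Bootstrap, Echart, and others.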
azkaban flume hadoop hive kafka spark
Last synced: 27 Mar 2025
https://github.com/dataeval/dingo
Dingo: A Comprehensive Data Quality Evaluation Tool
data-evaluation data-quality data-science data-validation dataquality datascience deepseek gpt llm openai opencompass spark vlm
Last synced: 06 Apr 2025
https://github.com/utdemir/distributed-dataset
A distributed data processing framework in Haskell.
aws-lambda data-processing distributed haskell spark
Last synced: 16 Mar 2025
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 16 May 2025
https://github.com/vivek-bombatkar/spark-with-python---my-learning-notes-
ETL pipeline using pyspark (Spark - Python)
apache-spark catalyst-optimizer python spark tungsten
Last synced: 14 Feb 2025
https://github.com/indix/schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
avro graphql-api json parquet schema-inference schema-registry spark tsv
Last synced: 12 Feb 2025
https://github.com/innat/ML-Resource
A concise resource repository for machine learning
data-analysis data-science deep-learning kaggle machine-learning python spark
Last synced: 29 Apr 2025
https://github.com/JaryZhen/rulegin
A lightweight rule engine system based on a JavaScript engine, refactored from the open-source IoT project thingboard
grpc-java javascript kafka netty spark sping zk
Last synced: 27 Mar 2025
https://github.com/AdaCore/RecordFlux
Formal specification and generation of verifiable binary parsers, message generators and protocol state machines
ada binary-parser communication-protocol formal-methods formal-specification formal-verification parser protocol-parser protocol-specification python spark
Last synced: 14 Mar 2025
https://github.com/sparkdesignsystem/spark-design-system
Spark Design System
design-patterns design-system hacktoberfest spark
Last synced: 05 Apr 2025
https://github.com/izhangzhihao/real-time-data-warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
cdc change-data-capture data-warehouse data-warehousing datalake debezium delta delta-lake deltalake elasticsearch flink flink-sql hoodie hudi iceberg kafka real-time-data-warehouse spark spark-sql sql
Last synced: 20 Dec 2024
https://github.com/hurence/logisland
Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.
analytics big-data cassandra complex-event-processing elasticsearch influxdb kafka kafka-streams pattern-recognition solr spark stream-processing
Last synced: 07 Apr 2025
https://github.com/sjrusso8/spark-connect-rs
Apache Spark Connect Client for Rust
grpc-client spark spark-connect spark-sql
Last synced: 16 May 2025
https://github.com/commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
apache-parquet aws-athena columnar-storage commoncrawl spark sql
Last synced: 25 Nov 2024
https://github.com/apache/spark-kubernetes-operator
Apache Spark Kubernetes Operator
Last synced: 05 Apr 2025
https://github.com/trK54Ylmz/kafka-spark-streaming-example
Simple example for Spark Streaming over a Kafka topic
java kafka spark stream-processing
Last synced: 02 Apr 2025
https://github.com/feng-li/Distributed-Statistical-Computing
Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)
hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models
Last synced: 26 Mar 2025
https://github.com/vspiewak/twitter-sentiment-analysis
Streaming tweets with spark, language detection & sentiment analysis, dashboard with Kibana
dashboard kibana nlp scala sentiment-analysis spark tiwtter
Last synced: 22 Apr 2025
https://github.com/trannhatnguyen2/nyc_taxi_data_pipeline
Nyc_Taxi_Data_Pipeline - DE Project
airflow dbt debezium docker great-expectations kafka minio postgresql spark trino
Last synced: 06 Apr 2025