Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/adtech-labs/spylon-kernel
Jupyter kernel for scala and spark
jupyter-kernels kernel metakernel scala spark team-platform
Last synced: 16 Jan 2025
https://github.com/ChatLunaLab/chatluna
A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript
Last synced: 07 Dec 2024
https://github.com/swoop-inc/spark-alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
data-engineering data-science scala spark
Last synced: 16 Jan 2025
https://github.com/vericast/spylon-kernel
Jupyter kernel for scala and spark
jupyter-kernels kernel metakernel scala spark team-platform
Last synced: 09 Jan 2025
https://github.com/nareshk1290/udacity-data-engineering
Udacity Data Engineering Nano Degree (DEND)
airflow aws cassandra etl postgresql redshift s3 spark star-schema udacity-dend
Last synced: 16 Jan 2025
https://github.com/ClickHouse/spark-clickhouse-connector
Spark ClickHouse Connector built on the DataSourceV2 API
arrow clickhouse datasourcev2 grpc http spark
Last synced: 12 Nov 2024
https://github.com/apple/batch-processing-gateway
The gateway component to make Spark on K8s much easier for Spark users.
batch-processing k8s kubernetes spark
Last synced: 16 Jan 2025
https://github.com/polomarcus/spark-structured-streaming-examples
Spark Structured Streaming / Kafka / Cassandra / Elastic
cassandra kafka spark spark-sql structured-streaming
Last synced: 16 Jan 2025
https://github.com/locationtech-labs/geopyspark
GeoTrellis for PySpark
big-data geospatial geotrellis python spark tile-server
Last synced: 27 Nov 2024
https://github.com/mc2-project/opaque-sql
An encrypted data analytics platform
analytics enclave machine-learning privacy security spark spark-sql
Last synced: 31 Oct 2024
https://github.com/setl-framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 16 Jan 2025
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 22 Jan 2025
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 08 Nov 2024
https://github.com/leobenkel/Zparkio
Boilerplate framework to use Spark and ZIO together.
boiler-plate functional-programming helpers scala spark template zio
Last synced: 09 Nov 2024
https://github.com/leobenkel/zparkio
Boilerplate framework to use Spark and ZIO together.
boiler-plate functional-programming helpers scala spark template zio
Last synced: 22 Jan 2025
https://github.com/benfradet/spark-kafka-writer
Write your Spark data to Kafka seamlessly
Last synced: 21 Jan 2025
https://github.com/capeprivacy/cape-dataframes
Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.
collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark
Last synced: 14 Nov 2024
https://github.com/krishnan-r/sparkmonitor
Monitor Apache Spark from Jupyter Notebook
Last synced: 22 Jan 2025
https://github.com/yaooqinn/spark-authorizer
A Spark SQL extension which provides SQL Standard Authorization for Apache Spark | This repo is contributed to Apache Kyuubi | The project has been migrated to Apache Kyuubi
acl hive ranger ranger-hive-plugin spark
Last synced: 16 Jan 2025
https://github.com/dsaidgovsg/airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Last synced: 30 Oct 2024
https://github.com/aliyun/aliyun-emapreduce-datasources
Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
aliyun datasources e-mapreduce hadoop kafka spark
Last synced: 21 Jan 2025
https://github.com/unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
data-engineer data-pipeline data-science luigi machine-learning rest-api spark
Last synced: 27 Nov 2024
https://github.com/baghelamit/iot-traffic-monitor
cassandra java kafka spark spring-boot
Last synced: 15 Jan 2025
https://github.com/radanalyticsio/spark-operator
Operator for managing the Spark clusters on Kubernetes and OpenShift.
apache-spark kubernetes kubernetes-operator openshift spark
Last synced: 16 Jan 2025
https://github.com/cubefs/shuttle
Shuttle: Highly Available, High-Performance Remote Shuffle Service
distributed hadoop remote shuffle spark
Last synced: 20 Dec 2024
https://github.com/sparkling-graph/sparkling-graph
SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.
approximation big-data coarsing comunity-detection-methods dsl graph graph-algorithms heuristics link-predication machine-learning measure network-analysis spark vertex
Last synced: 17 Jan 2025
https://github.com/qubole/spark-on-lambda
Apache Spark on AWS Lambda
apache-spark aws aws-cloud aws-lambda big-data lambda serverless spark
Last synced: 17 Jan 2025
https://github.com/helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark internet-archive spark spark-framework warc web-archiving webarchive
Last synced: 06 Nov 2024
https://github.com/henridf/apache-spark-node
Node.js bindings for Apache Spark DataFrame APIs
Last synced: 02 Nov 2024
https://github.com/SANSA-Stack/SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 20 Nov 2024
https://github.com/sansa-stack/sansa-stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 17 Jan 2025
https://github.com/eto-ai/rikai
Parquet-based ML data format optimized for working with unstructured data
deep-learning machine-learning pytorch spark tensorflow
Last synced: 21 Jan 2025
https://github.com/absaoss/cobrix
A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
cobol cobol-parser copybook ebcdic etl mainframe scalable spark
Last synced: 19 Jan 2025
https://github.com/zuinnote/hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark
Last synced: 16 Jan 2025
https://github.com/virtuslab/iskra
Typesafe wrapper for Apache Spark DataFrame API
Last synced: 19 Jan 2025
https://github.com/qubole/kinesis-sql
Kinesis Connector for Structured Streaming
kinesis real-time-processing spark spark-streaming spark-structured-streaming structured-streaming
Last synced: 21 Jan 2025
https://github.com/edoardottt/spark-ar-creators
List of 9500 (and counting) Spark AR Creators. Open an issue or contact me if you want to be added.❤️
3d ar ar-studio augmented-reality augmented-reality-applications facebook facebook-face-recognition facebook-filter filter filters instagram instagram-feed photos spark spark-ar spark-ar-creators spark-ar-studio sparkar virtual-reality vr
Last synced: 12 Jan 2025
https://github.com/gvcgo/gvc
Geek's valuable collection. A cross-platform supertool that brings convenience to coding.
asciinema auto-install browser chatgpt cloc cross-platform docker environment g go gvm languages spark tools version webdav
Last synced: 11 Nov 2024
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 16 Jan 2025
https://github.com/harisekhon/knowledge-base
Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark
Last synced: 21 Jan 2025
https://github.com/easysql/easy_sql
A library developed to ease the data ETL development process.
clickhouse etl postgres postgresql python spark sql
Last synced: 19 Jan 2025
https://github.com/Clustering4Ever/Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 18 Nov 2024
https://github.com/clustering4ever/clustering4ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 14 Oct 2024
https://github.com/davidzajac1/zillacode
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
aws coding-interview dbt docker github-actions leetcode pandas pyspark python react snowflake spark terraform
Last synced: 19 Jan 2025
https://github.com/memverge/splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
apache-spark bigdata disaggregation elasticity java scala shuffle spark storage
Last synced: 18 Jan 2025
https://github.com/jaegertracing/spark-dependencies
Spark job for dependency links
Last synced: 18 Jan 2025
https://github.com/alanchn31/Movalytics-Data-Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
airflow analytics aws-redshift aws-s3 data-engineer-nanodegree data-engineering data-engineering-pipeline data-modelling data-warehouse-cloud docker movie-database movie-recommendation movie-reviews pyspark python3 redshift spark sql udacity
Last synced: 04 Dec 2024
https://github.com/kavgan/phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
collocation-extraction multiword-expressions multiword-extraction natural-language-processing nlp nlp-machine-learning phrase-discovery phrase-extraction pyspark spark
Last synced: 30 Oct 2024
https://github.com/mkuthan/example-spark-kafka
Apache Spark and Apache Kafka integration example
Last synced: 06 Nov 2024
https://github.com/qihoo360/xlearning-xdml
extremely distributed machine learning
ai distributed hadoop hazelcast kudu machine-learning parameter-server spark
Last synced: 14 Nov 2024
https://github.com/Qihoo360/XLearning-XDML
extremely distributed machine learning
ai distributed hadoop hazelcast kudu machine-learning parameter-server spark
Last synced: 31 Oct 2024
https://github.com/llm-red-team/spark-free-api
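🚀 Reverse-engineered API for the iFlytek Spark large language model (specialty: office assistant). Supports high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn dialogue, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.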
chat-api chatbot chatgpt-api iflytek llm spark spark-ai
Last synced: 19 Jan 2025
https://github.com/jleetutorial/sparktutorial
Source code for James Lee's Apache Spark with Java course
Last synced: 15 Jan 2025
https://github.com/shjwudp/c4-dataset-script
Inspired by Google's C4, this is a series of Colossal Clean data-cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.
commoncrawl dataset massivetext nlp python spark
Last synced: 02 Dec 2024
https://github.com/alexarchambault/ammonite-spark
Run spark calculations from Ammonite
Last synced: 18 Jan 2025
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 17 Jan 2025
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 22 Jan 2025
https://github.com/233zzh/TitanDataOperationSystem
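The best big data project. "Titan Data Operation System" is a full-stack, closed-loop system: a web system for data visualization, log ingestion via Flume-Kafka-Flume, a data warehouse designed in Hive, Spark code for transforming between warehouse tables and migrating ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, Spring Boot, Bootstrap, ECharts, etc.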
azkaban flume hadoop hive kafka spark
Last synced: 30 Oct 2024
https://github.com/utdemir/distributed-dataset
A distributed data processing framework in Haskell.
aws-lambda data-processing distributed haskell spark
Last synced: 27 Oct 2024
https://github.com/indix/schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
avro graphql-api json parquet schema-inference schema-registry spark tsv
Last synced: 11 Oct 2024
https://github.com/innat/ML-Resource
A concise resource repository for machine learning
data-analysis data-science deep-learning kaggle machine-learning python spark
Last synced: 11 Nov 2024
https://github.com/JaryZhen/rulegin
A lightweight rule engine system based on a JavaScript engine, refactored from the open-source IoT project thingboard
grpc-java javascript kafka netty spark sping zk
Last synced: 30 Oct 2024
https://github.com/hurence/logisland
Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of ready-to-use processors, data sources and sinks is available.
analytics big-data cassandra complex-event-processing elasticsearch influxdb kafka kafka-streams pattern-recognition solr spark stream-processing
Last synced: 21 Jan 2025
https://github.com/izhangzhihao/real-time-data-warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
cdc change-data-capture data-warehouse data-warehousing datalake debezium delta delta-lake deltalake elasticsearch flink flink-sql hoodie hudi iceberg kafka real-time-data-warehouse spark spark-sql sql
Last synced: 20 Dec 2024
https://github.com/trK54Ylmz/kafka-spark-streaming-example
Simple example for Spark Streaming over Kafka topic
java kafka spark stream-processing
Last synced: 03 Nov 2024
https://github.com/trk54ylmz/kafka-spark-streaming-example
Simple example for Spark Streaming over Kafka topic
java kafka spark stream-processing
Last synced: 18 Nov 2024
https://github.com/sparkdesignsystem/spark-design-system
Spark Design System
design-patterns design-system hacktoberfest spark
Last synced: 19 Jan 2025
https://github.com/vivek-bombatkar/spark-with-python---my-learning-notes-
ETL pipeline using pyspark (Spark - Python)
apache-spark catalyst-optimizer python spark tungsten
Last synced: 12 Oct 2024
https://github.com/commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
apache-parquet aws-athena columnar-storage commoncrawl spark sql
Last synced: 25 Nov 2024
https://github.com/feng-li/Distributed-Statistical-Computing
Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)
hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models
Last synced: 30 Oct 2024
https://github.com/vspiewak/twitter-sentiment-analysis
Streaming tweets with spark, language detection & sentiment analysis, dashboard with Kibana
dashboard kibana nlp scala sentiment-analysis spark tiwtter
Last synced: 25 Dec 2024
https://github.com/holdenk/sparkprojecttemplate.g8
Template for Spark Projects
Last synced: 19 Dec 2024
https://github.com/AdaCore/RecordFlux
Formal specification and generation of verifiable binary parsers, message generators and protocol state machines
ada binary-parser communication-protocol formal-methods formal-specification formal-verification parser protocol-parser protocol-specification python spark
Last synced: 26 Oct 2024
https://github.com/jgperrin/net.jgp.books.spark.ch01
Spark in Action, 2nd edition - chapter 1 - Introduction
apache-spark java java8 manning spark sparkwithjava
Last synced: 19 Dec 2024
https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker
One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)
docker docker-compose kafka kafka-spark kafka-spark-streaming kafka-zeppelin spark spark-kafka spark-streaming-kafka spark-zeppelin streaming zeppelin
Last synced: 06 Nov 2024
https://github.com/jgperrin/net.jgp.labs.spark
Apache Spark examples exclusively in Java
data-ingestion dataframe ingestion java spark udf
Last synced: 16 Nov 2024
https://github.com/dstlry/dstlr
scalable knowledge graph construction from unstructured text
Last synced: 11 Nov 2024
https://github.com/sjrusso8/spark-connect-rs
Apache Spark Connect Client for Rust
grpc-client spark spark-connect spark-sql
Last synced: 22 Jan 2025
https://github.com/dimajix/flowman
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql
Last synced: 18 Jan 2025
https://github.com/chermenin/spark-states
Custom state store providers for Apache Spark
apache apache-spark spark spark-streaming spark-structured-streaming state state-store stateful structured-streaming
Last synced: 12 Oct 2024
https://github.com/itsjafer/jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark
Last synced: 16 Jan 2025
https://github.com/aehrc/pathling
Tools that make it easier to use FHIR® and clinical terminology within data analytics, built on Apache Spark.
analytics fhir spark standards terminology
Last synced: 22 Jan 2025
https://github.com/exacaster/lighter
REST API for Apache Spark on K8S or YARN
apache-spark jupyter k8s livy spark sparkmagic yarn
Last synced: 19 Jan 2025
https://github.com/polyaxon/mloperator
Machine learning operator & controller for Kubernetes
dask deep-learning k8s keras kubernetes kubernetes-operator machine-learning mlops mpi mxnet notebook pytorch scikit-learn spark tensorboard tensorflow xgboost
Last synced: 19 Jan 2025
https://github.com/tiledb-inc/tiledb-vcf
Efficient variant-call data storage and retrieval library using the TileDB storage library.
bioinformatics data-science genomics gwas python spark tiledb variant-calling vcf
Last synced: 21 Jan 2025
https://github.com/asavinov/prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow
Last synced: 07 Nov 2024
https://github.com/logicalclocks/maggy
Distribution transparent Machine Learning experiments on Apache Spark
ablation ablation-studies ablation-study automl blackbox-optimization hyperparameter-optimization hyperparameter-search hyperparameter-tuning spark
Last synced: 16 Jan 2025
https://github.com/trannhatnguyen2/nyc_taxi_data_pipeline
Nyc_Taxi_Data_Pipeline - DE Project
airflow dbt debezium docker great-expectations kafka minio postgresql spark trino
Last synced: 21 Jan 2025
https://github.com/iamabug/BigDataParty
All-in-One Dockerfile for big data components
big-data dockerfile hadoop kafka spark
Last synced: 12 Nov 2024
https://github.com/cretueusebiu/laravel-spark-google2fa
Google Authenticator support for Laravel Spark
authenticator laravel laravel-spark php spark
Last synced: 17 Nov 2024
https://github.com/flint-bot/flint
Webex Bot SDK for Node.js (deprecated in favor of https://github.com/webex/webex-bot-node-framework)
Last synced: 19 Dec 2024