Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/adtech-labs/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 16 Jan 2025

https://github.com/ChatLunaLab/chatluna

A bot plugin for LLM chat services with multi-platform model integration, extensibility, and various output formats

ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript

Last synced: 07 Dec 2024

https://github.com/vericast/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 09 Jan 2025

https://github.com/swoop-inc/spark-alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

data-engineering data-science scala spark

Last synced: 16 Jan 2025

https://github.com/apple/batch-processing-gateway

The gateway component to make Spark on K8s much easier for Spark users.

batch-processing k8s kubernetes spark

Last synced: 16 Jan 2025

https://github.com/ClickHouse/spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

arrow clickhouse datasourcev2 grpc http spark

Last synced: 12 Nov 2024

https://github.com/polomarcus/spark-structured-streaming-examples

Spark Structured Streaming / Kafka / Cassandra / Elastic

cassandra kafka spark spark-sql structured-streaming

Last synced: 16 Jan 2025

https://github.com/mc2-project/opaque-sql

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 31 Oct 2024

https://github.com/leobenkel/Zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 09 Nov 2024

https://github.com/leobenkel/zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 22 Jan 2025

https://github.com/benfradet/spark-kafka-writer

Write your Spark data to Kafka seamlessly

kafka spark

Last synced: 21 Jan 2025

https://github.com/capeprivacy/cape-dataframes

Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark

Last synced: 14 Nov 2024

https://github.com/krishnan-r/sparkmonitor

Monitor Apache Spark from Jupyter Notebook

extension jupyter spark

Last synced: 22 Jan 2025

https://github.com/yaooqinn/spark-authorizer

A Spark SQL extension that provides SQL Standard Authorization for Apache Spark | This repo has been contributed to Apache Kyuubi | The project has migrated to Apache Kyuubi

acl hive ranger ranger-hive-plugin spark

Last synced: 16 Jan 2025

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow Docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 30 Oct 2024

https://github.com/aliyun/aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

aliyun datasources e-mapreduce hadoop kafka spark

Last synced: 21 Jan 2025

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 27 Nov 2024

https://github.com/saurfang/spark-tsne

Distributed t-SNE via Apache Spark

spark tsne

Last synced: 07 Nov 2024

https://github.com/radanalyticsio/spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 16 Jan 2025

https://github.com/cubefs/shuttle

Shuttle: Highly Available, High-Performance Remote Shuffle Service

distributed hadoop remote shuffle spark

Last synced: 20 Dec 2024

https://github.com/sparkling-graph/sparkling-graph

SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.

approximation big-data coarsing comunity-detection-methods dsl graph graph-algorithms heuristics link-predication machine-learning measure network-analysis spark vertex

Last synced: 17 Jan 2025

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 06 Nov 2024

https://github.com/henridf/apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

data-frame node spark

Last synced: 02 Nov 2024

https://github.com/SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 20 Nov 2024

https://github.com/sansa-stack/sansa-stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 17 Jan 2025

https://github.com/eto-ai/rikai

Parquet-based ML data format optimized for working with unstructured data

deep-learning machine-learning pytorch spark tensorflow

Last synced: 21 Jan 2025

https://github.com/zuinnote/hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive

bigdata bitcoin blockchain cryptoledger ethereum flink hadoop hive spark

Last synced: 16 Jan 2025

https://github.com/absaoss/cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

cobol cobol-parser copybook ebcdic etl mainframe scalable spark

Last synced: 19 Jan 2025

https://github.com/virtuslab/iskra

Typesafe wrapper for Apache Spark DataFrame API

scala scala3 spark

Last synced: 19 Jan 2025

https://github.com/gvcgo/gvc

Geek's valuable collection. A cross-platform supertool that brings convenience to coding.

asciinema auto-install browser chatgpt cloc cross-platform docker environment g go gvm languages spark tools version webdav

Last synced: 11 Nov 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 16 Jan 2025

https://github.com/harisekhon/knowledge-base

Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public

aws azure bash bigdata cicd cloud devops elasticsearch gcp git groovy hadoop java jvm performance-tuning python scripting solr solrcloud spark

Last synced: 21 Jan 2025

https://github.com/easysql/easy_sql

A library developed to ease the data ETL development process.

clickhouse etl postgres postgresql python spark sql

Last synced: 19 Jan 2025

https://github.com/isxcode/spark-yun

Big data computing platform based on Spark <至轻云: an ultra-lightweight big data computing platform / data middle platform>

docker hadoop hive platform saas spark

Last synced: 28 Dec 2024

https://github.com/clustering4ever/clustering4ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 14 Oct 2024

https://github.com/Clustering4Ever/Clustering4Ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 18 Nov 2024

https://github.com/davidzajac1/zillacode

Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake

aws coding-interview dbt docker github-actions leetcode pandas pyspark python react snowflake spark terraform

Last synced: 19 Jan 2025

https://github.com/lichaojacobs/java_learning_practice

The road to advanced Java: frequently asked interview algorithms, Akka, multithreading, NIO, Netty, Spring Boot, Spark & Flink, and more

algorithm flink java netty spark spring web

Last synced: 19 Dec 2024

https://github.com/memverge/splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

apache-spark bigdata disaggregation elasticity java scala shuffle spark storage

Last synced: 18 Jan 2025

https://github.com/jaegertracing/spark-dependencies

Spark job for dependency links

jaegertracing spark

Last synced: 18 Jan 2025

https://github.com/apache/spark-website

Apache Spark Website

big-data java jdbc python r scala spark sql

Last synced: 17 Jan 2025

https://github.com/kavgan/phrase-at-scale

Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

collocation-extraction multiword-expressions multiword-extraction natural-language-processing nlp nlp-machine-learning phrase-discovery phrase-extraction pyspark spark

Last synced: 30 Oct 2024

https://github.com/mkuthan/example-spark-kafka

Apache Spark and Apache Kafka integration example

kafka spark spark-streaming

Last synced: 06 Nov 2024

https://github.com/llm-red-team/spark-free-api

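🚀 Reverse-engineered API for the iFlytek Spark large language model (specialty: office assistant). Supports high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn dialogue, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.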

chat-api chatbot chatgpt-api iflytek llm spark spark-ai

Last synced: 19 Jan 2025

https://github.com/jleetutorial/sparktutorial

Source code for James Lee's Apache Spark with Java course

bigdata spark

Last synced: 15 Jan 2025

https://github.com/shjwudp/c4-dataset-script

Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

commoncrawl dataset massivetext nlp python spark

Last synced: 02 Dec 2024

https://github.com/alexarchambault/ammonite-spark

Run Spark calculations from Ammonite

ammonite scala spark

Last synced: 18 Jan 2025

https://github.com/apache/spark-docker

Official Dockerfile for Apache Spark

big-data java jdbc python r scala spark sql

Last synced: 19 Jan 2025

https://github.com/rstudio/bigdataclass

Two-day workshop that covers how to use R to interact with databases and Spark

big-data db dbi dbplyr r spark

Last synced: 15 Oct 2024

https://github.com/smart-data-lake/smart-data-lake

Smart Automation Tool for building modern Data Lakes and Data Pipelines

data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data

Last synced: 17 Jan 2025

https://github.com/233zzh/TitanDataOperationSystem

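The best big data project. The "Titan Data Operation System" is a full-stack, closed-loop system: a web system for data visualization, a flume-kafka-flume pipeline for reading logs, a data warehouse designed in Hive, Spark code for transformations between warehouse tables and for migrating ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, Spring Boot, Bootstrap, ECharts, and others.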

azkaban flume hadoop hive kafka spark

Last synced: 30 Oct 2024

https://github.com/utdemir/distributed-dataset

A distributed data processing framework in Haskell.

aws-lambda data-processing distributed haskell spark

Last synced: 27 Oct 2024

https://github.com/indix/schemer

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

avro graphql-api json parquet schema-inference schema-registry spark tsv

Last synced: 11 Oct 2024

https://github.com/innat/ML-Resource

A concise resource repository for machine learning

data-analysis data-science deep-learning kaggle machine-learning python spark

Last synced: 11 Nov 2024

https://github.com/JaryZhen/rulegin

A lightweight rule engine system based on a JavaScript engine, refactored from the open-source IoT project thingboard

grpc-java javascript kafka netty spark sping zk

Last synced: 30 Oct 2024

https://github.com/hurence/logisland

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

analytics big-data cassandra complex-event-processing elasticsearch influxdb kafka kafka-streams pattern-recognition solr spark stream-processing

Last synced: 21 Jan 2025

https://github.com/rstudio-conf-2020/big-data

:wrench: Use dplyr to analyze Big Data :elephant:

databases dplyr r rstudio spark sparklyr workshop

Last synced: 15 Oct 2024

https://github.com/trk54ylmz/kafka-spark-streaming-example

Simple example for Spark Streaming over a Kafka topic

java kafka spark stream-processing

Last synced: 18 Nov 2024

https://github.com/trK54Ylmz/kafka-spark-streaming-example

Simple example for Spark Streaming over a Kafka topic

java kafka spark stream-processing

Last synced: 03 Nov 2024

https://github.com/commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

apache-parquet aws-athena columnar-storage commoncrawl spark sql

Last synced: 25 Nov 2024

https://github.com/vspiewak/twitter-sentiment-analysis

Streaming tweets with Spark, language detection & sentiment analysis, dashboard with Kibana

dashboard kibana nlp scala sentiment-analysis spark tiwtter

Last synced: 25 Dec 2024

https://github.com/feng-li/Distributed-Statistical-Computing

Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 30 Oct 2024

https://github.com/AdaCore/RecordFlux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

ada binary-parser communication-protocol formal-methods formal-specification formal-verification parser protocol-parser protocol-specification python spark

Last synced: 26 Oct 2024

https://github.com/holdenk/sparkprojecttemplate.g8

Template for Spark Projects

apachespark g8 spark

Last synced: 19 Dec 2024

https://github.com/jgperrin/net.jgp.books.spark.ch01

Spark in Action, 2nd edition - chapter 1 - Introduction

apache-spark java java8 manning spark sparkwithjava

Last synced: 19 Dec 2024

https://github.com/ethicalml/kafka-spark-streaming-zeppelin-docker

One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager)

docker docker-compose kafka kafka-spark kafka-spark-streaming kafka-zeppelin spark spark-kafka spark-streaming-kafka spark-zeppelin streaming zeppelin

Last synced: 06 Nov 2024

https://github.com/jgperrin/net.jgp.labs.spark

Apache Spark examples exclusively in Java

data-ingestion dataframe ingestion java spark udf

Last synced: 16 Nov 2024

https://github.com/saurfang/sbt-spark-submit

sbt plugin for spark-submit

sbt spark

Last synced: 07 Nov 2024

https://github.com/qubole/spark-acid

ACID Data Source for Apache Spark based on Hive ACID

acid big-data hive hive-acid spark

Last synced: 21 Nov 2024

https://github.com/dstlry/dstlr

scalable knowledge graph construction from unstructured text

corenlp neo4j spark

Last synced: 11 Nov 2024

https://github.com/sjrusso8/spark-connect-rs

Apache Spark Connect Client for Rust

grpc-client spark spark-connect spark-sql

Last synced: 22 Jan 2025

https://github.com/dimajix/flowman

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql

Last synced: 18 Jan 2025

https://github.com/aehrc/pathling

Tools that make it easier to use FHIR® and clinical terminology within data analytics, built on Apache Spark.

analytics fhir spark standards terminology

Last synced: 22 Jan 2025

https://github.com/itsjafer/jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark

Last synced: 16 Jan 2025

https://github.com/exacaster/lighter

REST API for Apache Spark on K8S or YARN

apache-spark jupyter k8s livy spark sparkmagic yarn

Last synced: 19 Jan 2025

https://github.com/tiledb-inc/tiledb-vcf

Efficient variant-call data storage and retrieval library using the TileDB storage library.

bioinformatics data-science genomics gwas python spark tiledb variant-calling vcf

Last synced: 21 Jan 2025

https://github.com/asavinov/prosto

Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow

Last synced: 07 Nov 2024

https://github.com/iamabug/BigDataParty

All-in-One Dockerfile for big data components

big-data dockerfile hadoop kafka spark

Last synced: 12 Nov 2024

https://github.com/cretueusebiu/laravel-spark-google2fa

Google Authenticator support for Laravel Spark

authenticator laravel laravel-spark php spark

Last synced: 17 Nov 2024

https://github.com/flint-bot/flint

Webex Bot SDK for Node.js (deprecated in favor of https://github.com/webex/webex-bot-node-framework)

cisco spark

Last synced: 19 Dec 2024