Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

GitHub: https://github.com/topics/spark
Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
Repo: https://github.com/apache/spark
Created by: Matei Zaharia
Released: May 26, 2014
Related Topics: scala, hadoop
Aliases: apache-spark
Last updated: 2025-01-22 00:29:18 UTC

https://github.com/lucidworks/spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.

solr spark

Last synced: 15 Nov 2024

https://github.com/mrpowers-io/spark-fast-tests

Apache Spark testing helpers (dependency free & works with ScalaTest, uTest, and MUnit)

spark testing-framework

Last synced: 20 Jan 2025

https://github.com/supercowpowers/zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis

Last synced: 19 Jan 2025

https://github.com/datavane/datavines

Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.

dataobservability dataprofile dataquality datascience doris metadata spark

Last synced: 18 Jan 2025

https://github.com/kevinschaich/pyspark-cheatsheet

🐍 Quick reference guide to common patterns & functions in PySpark.

cheat cheatsheet cheatsheets data data-science docs documentation guide guides pyspark pyspark-tutorial quickstart reference references spark spark-sql

Last synced: 31 Oct 2024

https://github.com/SuperCowPowers/zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis

Last synced: 27 Nov 2024

https://github.com/microsoft/hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 17 Jan 2025

https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 16 Jan 2025

https://github.com/japila-books/spark-structured-streaming-internals

The Internals of Spark Structured Streaming

apache-spark book internals mkdocs-material spark structured-streaming

Last synced: 19 Jan 2025

https://github.com/cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

apache-spark big-data pyspark spark

Last synced: 12 Oct 2024

https://github.com/USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler

Last synced: 29 Oct 2024

https://github.com/zhaoyachao/zdh_web

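A big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, and private-domain marketing.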

bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl

Last synced: 05 Nov 2024

https://github.com/gacwr/openuba

A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]

analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning nodejs react security siem sklearn spark tensorflow threathunting uba ueba user-behaviour

Last synced: 17 Jan 2025

https://github.com/kevinliao159/mydatascienceportfolio

Applying Data Science and Machine Learning to Solve Real World Business Problems

api data-science data-visualization machine-learning neural-networks nlp recommendation-system spark

Last synced: 22 Jan 2025

https://github.com/fabiogjardim/bigdata_docker

Big Data Ecosystem Docker

hadoop hbase hdfs hive hue jupyter-notebook metabase mongo mysql nifi presto spark zookeeper

Last synced: 18 Jan 2025

https://github.com/apache/incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 18 Jan 2025

https://github.com/teeyog/IQL

An ad hoc query service based on the Spark SQL engine.

spark sparksql

Last synced: 30 Oct 2024

https://github.com/googleclouddataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 16 Jan 2025

https://github.com/IBM/data-prep-kit

Open source project for data preparation for LLM application builders

code-quality data data-prep data-preparation data-preprocessing data-preprocessing-pipelines datacuration datarecipes deduplication finetuning large-language-models large-scale-data-processing llm llmapps malware python ray spark

Last synced: 11 Jan 2025

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 19 Jan 2025

https://github.com/XuefengHuang/RecommendationSystem

Book recommender system using collaborative filtering based on Spark

collaborative-filtering python-flask recommendation-system spark

Last synced: 29 Oct 2024

https://github.com/groupon/sparklint

A tool for monitoring and tuning Spark jobs for efficiency.

performance-analysis scala spark

Last synced: 12 Jan 2025

https://github.com/GoogleCloudDataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 30 Sep 2024

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library which implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 21 Jan 2025

https://github.com/tirthajyoti/spark-with-python

Fundamentals of Spark with Python (using PySpark), code examples

analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql

Last synced: 19 Jan 2025

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 22 Jan 2025

https://github.com/twosigma/Cook

Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark

cluster gke kubernetes mesos scheduler spark

Last synced: 26 Oct 2024

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 06 Nov 2024

https://github.com/jorgebucaran/spark.fish

▁▂▄▆▇█▇▆▄▂▁

fish fish-plugin spark

Last synced: 17 Jan 2025

https://github.com/miguno/wirbelsturm

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant

Last synced: 22 Jan 2025

https://github.com/alshdavid/crayon-router

Simple framework agnostic UI router for SPAs

react router spark svelte svelte-v3 vue

Last synced: 22 Jan 2025

https://github.com/lightbend/cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.

akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes

Last synced: 17 Jan 2025

https://github.com/sderosiaux/every-single-day-i-tldr

A daily digest of the articles or videos I've found interesting, that I want to share with you.

akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch

Last synced: 16 Jan 2025

https://github.com/neo4j/neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

bolt cypher hacktoberfest neo4j-connector neo4j-driver spark

Last synced: 18 Jan 2025

https://github.com/oap-project/raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.

ray spark

Last synced: 15 Nov 2024

https://github.com/kamu-data/kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network

blockchain data-as-code data-management data-science datafusion flink jupyter kamu open-data open-data-fabric spark sql

Last synced: 18 Jan 2025

https://github.com/baghelamit/video-stream-analytics

java kafka opencv spark

Last synced: 21 Jan 2025

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 17 Jan 2025

https://github.com/aws/sagemaker-spark

A Spark library for Amazon SageMaker.

amazon-sagemaker aws machine-learning python sagemaker scala spark

Last synced: 16 Jan 2025

https://github.com/DTStack/dt-sql-parser

SQL Parsers for BigData, built with antlr4.

antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino

Last synced: 02 Nov 2024

https://github.com/dtstack/dt-sql-parser

SQL Parsers for BigData, built with antlr4.

antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino

Last synced: 17 Jan 2025

https://github.com/datawhalechina/juicy-bigdata

🎉🎉🐳 Datawhale Introduction to Big Data Processing tutorial | The opening course in the big data technology track 🎉🎉

bigdata hadoop hbase hdfs hive mapreduce spark

Last synced: 22 Jan 2025

https://github.com/spotify/big-data-rosetta-code

Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code

bigdata scala scalding scio spark

Last synced: 19 Jan 2025

https://github.com/zero-one-group/geni

A Clojure dataframe library that runs on Spark

big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark

Last synced: 22 Jan 2025

https://github.com/azure/azure-event-hubs

☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs

amqp apache azure c dotnet event-hubs eventhub eventhubs go golang java messaging microsoft node node-js nodejs python spark stream streaming

Last synced: 16 Jan 2025

https://github.com/Ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 25 Nov 2024

https://github.com/ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 19 Jan 2025

https://github.com/hbase-rdd/hbase-rdd

Spark RDD to read, write and delete from HBase

hbase scala spark

Last synced: 21 Jan 2025

https://github.com/xd-deng/spark-practice

Apache Spark (PySpark) Practice on Real Data

pyspark spark

Last synced: 21 Jan 2025

https://github.com/projectglow/glow

An open-source toolkit for large-scale genomic analysis

delta genomics gwas machine-learning population-genetics regression spark

Last synced: 25 Nov 2024

https://github.com/hydrospheredata/hydro-serving

MLOps Platform

machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow

Last synced: 22 Jan 2025

https://github.com/Hydrospheredata/hydro-serving

MLOps Platform

machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow

Last synced: 27 Oct 2024

https://github.com/jaceklaskowski/spark-workshop

Apache Spark™ and Scala Workshops

apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop

Last synced: 19 Jan 2025

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 27 Nov 2024

https://github.com/piercingdan/spark-jupyter-aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 03 Jan 2025

https://github.com/jelmerk/hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

algorithm java k-nearest-neighbors knn-search pyspark scala spark

Last synced: 20 Jan 2025

https://github.com/WeBankFinTech/Visualis

Visualis is a BI tool for data visualization. It provides financial-grade data visualization capabilities on the basis of data security and permissions, based on the open source project Davinci contributed by CreditEase.

appjoint datasource dataspherestudio davinci linkis scriptis spark superset tableau visualization

Last synced: 31 Oct 2024

https://github.com/melin/superior-sql-parser

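An antlr4-based SQL parser for multiple databases that extracts metadata from SQL. It can be used in many data platform scenarios: extracting metadata from DDL statements, SQL permission checks, table-level lineage, SQL syntax validation, and more. Supports Spark, Flink, Gauss, StarRocks, Oracle, MySQL, PostgreSQL, SQL Server, DB2, and others.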

flink gauss lineage metadata mysql parser postgres spark sql starrocks

Last synced: 05 Nov 2024

https://github.com/oap-project/gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.

arrow native-kernels native-sql-engine spark vectorized-simd-optimizations

Last synced: 27 Oct 2024

https://github.com/bytedance/cloudshuffleservice

Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 21 Jan 2025

https://github.com/flyteorg/flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows

Last synced: 15 Jan 2025

https://github.com/mlwhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 20 Jan 2025

https://github.com/oeljeklaus-you/javaorbigdata-interview

A collection of interview topics for Java developers and big data developers

bigdata hadoop interview java spark storm

Last synced: 17 Jan 2025

https://github.com/MLWhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 13 Nov 2024

https://github.com/locationtech/rasterframes

Geospatial Raster support for Spark DataFrames

earth-observation geotrellis image-processing machine-learning scala spark spark-ml sparksql

Last synced: 22 Jan 2025

https://github.com/tencent/firestorm

Firestorm is a Remote Shuffle Service, and provides the capability for Apache Spark and Apache Hadoop MapReduce applications to store shuffle data on remote servers

mapreduce remoteshuffle shuffle spark

Last synced: 22 Jan 2025

https://github.com/FirelyTeam/spark

Firely and Incendi's open source FHIR server

c-sharp docker dstu2 fhir fhir-api fhir-server fhir-spec fhir-specification r4 spark spark-fhir-server stu3

Last synced: 28 Oct 2024

https://github.com/bytedance/CloudShuffleService

Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 05 Nov 2024

https://github.com/paypal/gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata

Last synced: 19 Jan 2025

https://github.com/mellanox/sparkrdma

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark

Last synced: 22 Jan 2025

https://github.com/saurfang/spark-knn

k-Nearest Neighbors algorithm on Spark

knn spark

Last synced: 21 Jan 2025

https://github.com/azure/azure-event-hubs-spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

apache apache-spark azure bigdata connector continuous databricks event-hubs eventhubs ingestion kafka microsoft real-time scala spark spark-streaming stream streaming structured-streaming

Last synced: 17 Jan 2025

https://github.com/mgalarnyk/installations_mac_ubuntu_windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

anaconda aws-ec2 ec2-instance python rstudio spark

Last synced: 21 Jan 2025

https://github.com/mGalarnyk/Installations_Mac_Ubuntu_Windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

anaconda aws-ec2 ec2-instance python rstudio spark

Last synced: 27 Nov 2024

https://github.com/absaoss/abris

Avro SerDe for Apache Spark structured APIs.

avro avro-schema kafka schema-registry spark

Last synced: 18 Jan 2025

https://github.com/adidas/lakehouse-engine

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark

Last synced: 17 Jan 2025

https://github.com/apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

big-data data-orchestration etl graph graph-analysis graph-storage pyspark spark

Last synced: 22 Jan 2025

https://github.com/ondra-m/ruby-spark

Ruby wrapper for Apache Spark

distributed rdd ruby ruby-spark spark

Last synced: 21 Jan 2025

https://github.com/mkuthan/example-spark

Spark, Spark Streaming and Spark SQL unit testing strategies

spark spark-streaming testing

Last synced: 16 Jan 2025

https://github.com/iimeta/fastapi

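智元 Fast API is a one-stop API management system that unifies the format, conventions, and management of various LLM APIs to deliver the best possible functionality, performance, and user experience.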

api chatgpt ernie-bot fast fastapi glm gpt gpt-4 openai qwen realtime spark

Last synced: 26 Nov 2024

https://github.com/apache/incubator-wayang

Apache Wayang (incubating) is the first cross-platform data processing system.

apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark

Last synced: 18 Jan 2025

https://github.com/zio/zio-protoquill

Quill for Scala 3

cassandra jdbc language-integrated-query linq postgresql scala spark sparksql sql

Last synced: 18 Jan 2025

https://github.com/neoremind/kraps-rpc

An RPC framework leveraging the Spark RPC module

rpc spark

Last synced: 21 Jan 2025

https://github.com/mahmoudparsian/data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

algorithms bigdata data data-abstractions data-algorithms data-transformation dataframes design design-patterns machine-learning mappers mapreduce monoid partitioning-algorithms pyspark python rdd reducers spark transformations

Last synced: 15 Jan 2025

https://github.com/dylan-profiler/visions

Type System for Data Analysis in Python

data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system

Last synced: 17 Jan 2025

https://github.com/qihoo360/xsql

Unified SQL Analytics Engine Based on SparkSQL

datasource elasticsearch federation hive spark sql

Last synced: 22 Jan 2025

https://github.com/huangfox/dpkb

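A roundup of big data topics, including distributed storage engines, distributed compute engines, and data warehouse construction. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse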

flink hadoop hbase hive presto spark

Last synced: 30 Oct 2024

https://github.com/dfdx/spark.jl

Julia binding for Apache Spark

big-data julia spark

Last synced: 22 Jan 2025

https://github.com/chatlunalab/chatluna

A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats

ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript

Last synced: 20 Jan 2025

https://github.com/azure/azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark

Last synced: 19 Jan 2025

https://github.com/WeBankFinTech/WeBank-all-Project

A collection of links to all the projects that WeBank has founded or participated in.

ai bigdata blockchain could dpr fate finance frontend linkis spark

Last synced: 30 Oct 2024

https://github.com/Azure/azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark

Last synced: 17 Nov 2024

https://github.com/JahstreetOrg/spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo

helm history-server jupyter kubernetes livy spark

Last synced: 15 Nov 2024

https://github.com/g-research/spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

gr-oss java pyspark python scala spark

Last synced: 19 Jan 2025

https://github.com/clickhouse/spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

arrow clickhouse datasourcev2 grpc http spark

Last synced: 17 Jan 2025

https://github.com/karakanb/vue-info-card

Simple and beautiful card component with an elegant spark line, for VueJS.

card card-component component info-card spark vue vue-components vuejs vuejs2

Last synced: 21 Jan 2025

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML: feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark