Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/lucidworks/spark-solr
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Last synced: 15 Nov 2024
https://github.com/mrpowers-io/spark-fast-tests
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Last synced: 20 Jan 2025
https://github.com/supercowpowers/zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis
Last synced: 19 Jan 2025
https://github.com/datavane/datavines
Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.
dataobservability dataprofile dataquality datascience doris metadata spark
Last synced: 18 Jan 2025
https://github.com/kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
cheat cheatsheet cheatsheets data data-science docs documentation guide guides pyspark pyspark-tutorial quickstart reference references spark spark-sql
Last synced: 31 Oct 2024
https://github.com/SuperCowPowers/zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis
Last synced: 27 Nov 2024
https://github.com/microsoft/hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
acceleration analytics big-data databases indexing spark
Last synced: 17 Jan 2025
https://github.com/japila-books/spark-structured-streaming-internals
The Internals of Spark Structured Streaming
apache-spark book internals mkdocs-material spark structured-streaming
Last synced: 19 Jan 2025
https://github.com/cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
apache-spark big-data pyspark spark
Last synced: 12 Oct 2024
https://github.com/USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler
Last synced: 29 Oct 2024
https://github.com/zhaoyachao/zdh_web
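A big data collection and extraction platform. zdh_web is the visual management platform for the zdh series of services, with modules for data collection, scheduling, permissions, approval workflows, and private-domain marketing.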
bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl
Last synced: 05 Nov 2024
https://github.com/gacwr/openuba
A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning nodejs react security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 17 Jan 2025
https://github.com/kevinliao159/mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
api data-science data-visualization machine-learning neural-networks nlp recommendation-system spark
Last synced: 22 Jan 2025
https://github.com/apache/incubator-uniffle
Uniffle is a high performance, general purpose Remote Shuffle Service.
mapreduce remote-shuffle-service rss shuffle spark tez
Last synced: 18 Jan 2025
https://github.com/teeyog/IQL
An ad hoc query service based on the Spark SQL engine.
Last synced: 30 Oct 2024
https://github.com/googleclouddataproc/spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark
Last synced: 16 Jan 2025
https://github.com/IBM/data-prep-kit
Open source project for data preparation for LLM application builders
code-quality data data-prep data-preparation data-preprocessing data-preprocessing-pipelines datacuration datarecipes deduplication finetuning large-language-models large-scale-data-processing llm llmapps malware python ray spark
Last synced: 11 Jan 2025
https://github.com/XuefengHuang/RecommendationSystem
Book recommender system using collaborative filtering based on Spark
collaborative-filtering python-flask recommendation-system spark
Last synced: 29 Oct 2024
https://github.com/groupon/sparklint
A tool for monitoring and tuning Spark jobs for efficiency.
performance-analysis scala spark
Last synced: 12 Jan 2025
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark
Last synced: 30 Sep 2024
https://github.com/kanyun-inc/ytk-learn
Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark
Last synced: 21 Jan 2025
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 19 Jan 2025
https://github.com/datamechanics/delight
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui
Last synced: 22 Jan 2025
https://github.com/twosigma/Cook
Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark
cluster gke kubernetes mesos scheduler spark
Last synced: 26 Oct 2024
https://github.com/elasticluster/elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark
Last synced: 06 Nov 2024
https://github.com/miguno/wirbelsturm
[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant
Last synced: 22 Jan 2025
https://github.com/lightbend/cloudflow
Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.
akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes
Last synced: 17 Jan 2025
https://github.com/sderosiaux/every-single-day-i-tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch
Last synced: 16 Jan 2025
https://github.com/neo4j/neo4j-spark-connector
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
bolt cypher hacktoberfest neo4j-connector neo4j-driver spark
Last synced: 18 Jan 2025
https://github.com/oap-project/raydp
RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Last synced: 15 Nov 2024
https://github.com/kamu-data/kamu-cli
Next-generation decentralized data lakehouse and a multi-party stream processing network
blockchain data-as-code data-management data-science datafusion flink jupyter kamu open-data open-data-fabric spark sql
Last synced: 18 Jan 2025
https://github.com/microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy-to-use experience to help with creation, editing and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data
Last synced: 17 Jan 2025
https://github.com/aws/sagemaker-spark
A Spark library for Amazon SageMaker.
amazon-sagemaker aws machine-learning python sagemaker scala spark
Last synced: 16 Jan 2025
https://github.com/DTStack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 02 Nov 2024
https://github.com/dtstack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 17 Jan 2025
https://github.com/zero-one-group/geni
A Clojure dataframe library that runs on Spark
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark
Last synced: 22 Jan 2025
https://github.com/Ibotta/sk-dist
Distributed scikit-learn meta-estimators in PySpark
data-science machine-learning ml scikit-learn spark
Last synced: 25 Nov 2024
https://github.com/ibotta/sk-dist
Distributed scikit-learn meta-estimators in PySpark
data-science machine-learning ml scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/hbase-rdd/hbase-rdd
Spark RDD to read, write and delete from HBase
Last synced: 21 Jan 2025
https://github.com/xd-deng/spark-practice
Apache Spark (PySpark) Practice on Real Data
Last synced: 21 Jan 2025
https://github.com/projectglow/glow
An open-source toolkit for large-scale genomic analysis
delta genomics gwas machine-learning population-genetics regression spark
Last synced: 25 Nov 2024
https://github.com/hydrospheredata/hydro-serving
MLOps Platform
machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow
Last synced: 22 Jan 2025
https://github.com/Hydrospheredata/hydro-serving
MLOps Platform
machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow
Last synced: 27 Oct 2024
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 19 Jan 2025
https://github.com/PiercingDan/spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 27 Nov 2024
https://github.com/piercingdan/spark-jupyter-aws
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 03 Jan 2025
https://github.com/jelmerk/hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
algorithm java k-nearest-neighbors knn-search pyspark scala spark
Last synced: 20 Jan 2025
https://github.com/WeBankFinTech/Visualis
Visualis is a BI tool for data visualization. It provides financial-grade data visualization capabilities on the basis of data security and permissions, based on the open source project Davinci contributed by CreditEase.
appjoint datasource dataspherestudio davinci linkis scriptis spark superset tableau visualization
Last synced: 31 Oct 2024
https://github.com/oap-project/gazelle_plugin
Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
arrow native-kernels native-sql-engine spark vectorized-simd-optimizations
Last synced: 27 Oct 2024
https://github.com/bytedance/cloudshuffleservice
Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Last synced: 21 Jan 2025
https://github.com/flyteorg/flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows
Last synced: 15 Jan 2025
https://github.com/mlwhiz/data_science_blogs
A repository to keep track of all the code that I end up writing for my blog posts.
blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost
Last synced: 20 Jan 2025
https://github.com/MLWhiz/data_science_blogs
A repository to keep track of all the code that I end up writing for my blog posts.
blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost
Last synced: 13 Nov 2024
https://github.com/locationtech/rasterframes
Geospatial Raster support for Spark DataFrames
earth-observation geotrellis image-processing machine-learning scala spark spark-ml sparksql
Last synced: 22 Jan 2025
https://github.com/tencent/firestorm
Firestorm is a Remote Shuffle Service that lets Apache Spark and Apache Hadoop MapReduce applications store shuffle data on remote servers.
mapreduce remoteshuffle shuffle spark
Last synced: 22 Jan 2025
https://github.com/FirelyTeam/spark
Firely and Incendi's open source FHIR server
c-sharp docker dstu2 fhir fhir-api fhir-server fhir-spec fhir-specification r4 spark spark-fhir-server stu3
Last synced: 28 Oct 2024
https://github.com/bytedance/CloudShuffleService
Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Last synced: 05 Nov 2024
https://github.com/paypal/gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata
Last synced: 19 Jan 2025
https://github.com/mellanox/sparkrdma
This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark
Last synced: 22 Jan 2025
https://github.com/saurfang/spark-knn
k-Nearest Neighbors algorithm on Spark
Last synced: 21 Jan 2025
https://github.com/azure/azure-event-hubs-spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
apache apache-spark azure bigdata connector continuous databricks event-hubs eventhubs ingestion kafka microsoft real-time scala spark spark-streaming stream streaming structured-streaming
Last synced: 17 Jan 2025
https://github.com/mgalarnyk/installations_mac_ubuntu_windows
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
anaconda aws-ec2 ec2-instance python rstudio spark
Last synced: 21 Jan 2025
https://github.com/mGalarnyk/Installations_Mac_Ubuntu_Windows
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
anaconda aws-ec2 ec2-instance python rstudio spark
Last synced: 27 Nov 2024
https://github.com/absaoss/abris
Avro SerDe for Apache Spark structured APIs.
avro avro-schema kafka schema-registry spark
Last synced: 18 Jan 2025
https://github.com/adidas/lakehouse-engine
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.
big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark
Last synced: 17 Jan 2025
https://github.com/apache/incubator-graphar
An open source, standard data file format for graph data storage and retrieval.
big-data data-orchestration etl graph graph-analysis graph-storage pyspark spark
Last synced: 22 Jan 2025
https://github.com/ondra-m/ruby-spark
Ruby wrapper for Apache Spark
distributed rdd ruby ruby-spark spark
Last synced: 21 Jan 2025
https://github.com/mkuthan/example-spark
Spark, Spark Streaming and Spark SQL unit testing strategies
Last synced: 16 Jan 2025
https://github.com/apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark
Last synced: 18 Jan 2025
https://github.com/zio/zio-protoquill
Quill for Scala 3
cassandra jdbc language-integrated-query linq postgresql scala spark sparksql sql
Last synced: 18 Jan 2025
https://github.com/neoremind/kraps-rpc
An RPC framework leveraging the Spark RPC module
Last synced: 21 Jan 2025
https://github.com/mahmoudparsian/data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
algorithms bigdata data data-abstractions data-algorithms data-transformation dataframes design design-patterns machine-learning mappers mapreduce monoid partitioning-algorithms pyspark python rdd reducers spark transformations
Last synced: 15 Jan 2025
https://github.com/dylan-profiler/visions
Type System for Data Analysis in Python
data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system
Last synced: 17 Jan 2025
https://github.com/qihoo360/xsql
Unified SQL Analytics Engine Based on SparkSQL
datasource elasticsearch federation hive spark sql
Last synced: 22 Jan 2025
https://github.com/chatlunalab/chatluna
A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
ai bot chatbot chatglm chatgpt claude gemini gpt gpt-4o koishi langchain llm openai plugin qq-bot qwen rwkv spark typescript
Last synced: 20 Jan 2025
https://github.com/azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 19 Jan 2025
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 17 Nov 2024
https://github.com/JahstreetOrg/spark-on-kubernetes-helm
Spark on Kubernetes infrastructure Helm charts repo
helm history-server jupyter kubernetes livy spark
Last synced: 15 Nov 2024
https://github.com/clickhouse/spark-clickhouse-connector
Spark ClickHouse Connector built on the DataSourceV2 API
arrow clickhouse datasourcev2 grpc http spark
Last synced: 17 Jan 2025
https://github.com/karakanb/vue-info-card
Simple and beautiful card component with an elegant spark line, for VueJS.
card card-component component info-card spark vue vue-components vuejs vuejs2
Last synced: 21 Jan 2025
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML covering feature clean-up, a feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 22 Jan 2025
https://github.com/dvgodoy/handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
exploratory-data-analysis imputation outlier-detection pandas pyspark python spark visualization
Last synced: 20 Jan 2025
https://github.com/syzer/js-spark
Realtime calculation distributed system. AKA distributed lodash
distributed distributed-computing multicore realtime spark
Last synced: 16 Jan 2025
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 20 Jan 2025