Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/hashicorp/nomad-spark
DEPRECATED: Apache Spark with native support for Nomad as a scheduler
Last synced: 21 Jan 2025
https://github.com/pierrekieffer/docker-spark-yarn-cluster
Docker multi-node Hadoop cluster with Spark 2.4.1 on YARN
cluster docker hadoop spark yarn yarn-hadoop-cluster
Last synced: 02 Nov 2024
https://github.com/benfradet/struct-type-encoder
Deriving Spark DataFrame schemas from case classes
Last synced: 28 Oct 2024
https://github.com/tharwaninitin/etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex Auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, Databases, SFTP servers, On-Prem Systems and more.
aws bigquery dataproc etl etl-framework etl-pipeline gcp gcs redis s3 scala spark zio
Last synced: 22 Jan 2025
https://github.com/absaoss/hyperdrive
Extensible streaming ingestion pipeline on top of Apache Spark
apache-spark framework ingestion kafka pipeline spark spark-structured-streaming streaming streaming-etl
Last synced: 12 Oct 2024
https://github.com/g-research/spark-dgraph-connector
A connector for Apache Spark and PySpark to Dgraph databases.
Last synced: 20 Dec 2024
https://github.com/coxautomotivedatasolutions/spark-distcp
A re-implementation of Hadoop DistCP in Apache Spark
apache-spark data-engineering distcp hadoop spark
Last synced: 12 Oct 2024
https://github.com/xskipper-io/xskipper
An Extensible Data Skipping Framework
data-skipping indexing scala spark
Last synced: 11 Oct 2024
https://github.com/univalence/zio-spark
A functional wrapper around Spark to make it work with ZIO
Last synced: 16 Jan 2025
https://github.com/ypriverol/spark-java8
Java 8 and Spark learning through examples
dataset java lambda learning-spark spark
Last synced: 28 Oct 2024
https://github.com/spektom/spark-flamegraph
Easy CPU Profiling for Apache Spark applications
apache-spark cpu-profiling flamegraph spark
Last synced: 19 Nov 2024
https://github.com/manuel-lang/data-engineering-nanodegree
Solutions to all projects of Udacity's Data Engineering Nanodegree: Data Modeling with Postgres & Cassandra, Data Warehouse with Redshift, Data Lake with Spark, and Data Pipeline with Airflow.
airflow cassandra data-engineering postgresql redshift spark udacity udacity-data-engineer-nanodegree
Last synced: 13 Nov 2024
https://github.com/supercowpowers/workbench
Workbench: An easy to use Python API for creating and deploying AWS SageMaker Models
aws big-data data-engineering machine-learning pandas python spark
Last synced: 22 Jan 2025
https://github.com/flipkart-incubator/spark-transformers
Spark-Transformers: Library for exporting Apache Spark MLlib models for use in any Java application with no other dependencies.
apache-spark data-pipelines export java machine-learning machine-learning-algorithms machine-learning-library mllib scala spark transformers
Last synced: 11 Oct 2024
https://github.com/tdebatty/spark-knn-graphs
Spark algorithms for building k-nn graphs
algorithm knn-graphs lsh-superbit nearest-neighbor-search nn-descent processing-knn-graphs spark spark-knn-graphs
Last synced: 15 Nov 2024
https://github.com/zuinnote/spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library
datasource excel hadoopoffice read spark write xls xlsx
Last synced: 03 Dec 2024
https://github.com/supercowpowers/sageworks
SageWorks: An easy to use Python API for creating and deploying AWS SageMaker Models
aws big-data data-engineering machine-learning pandas python spark
Last synced: 16 Dec 2024
https://github.com/lresende/ansible-kubernetes-cluster
Ansible roles to deploy Kubernetes, JupyterHub, Jupyter Enterprise Gateway and Spark on Kubernetes cluster
ansible ansible-roles cloud deploy-kubernetes deploying-kubernetes elyra enterprise-gateway jupyter-enterprise-gateway jupyterhub jupyterlab kubernetes kubernetes-cluster kubernetes-deployment rhel spark spark-on-kubernetes
Last synced: 13 Oct 2024
https://github.com/LB-Yu/data-systems-learning
Learning summary and examples about data systems.
antlr big-data calcite distributed-systems flink hadoop hbase spark
Last synced: 05 Nov 2024
https://github.com/rstudio/graphframes
R Interface for GraphFrames
graphframes graphs pagerank rstats spark sparklyr
Last synced: 10 Nov 2024
https://github.com/melin/spark-jobserver
REST job server for Apache Spark
hadoop hive java kerberos kubernetes spark yarn
Last synced: 05 Nov 2024
https://github.com/vector4wang/quick-spark-process
:star2::star2::star2: Examples for learning Spark
Last synced: 28 Oct 2024
https://github.com/AI-team-UoA/GeoTriples
Publishing Big Geospatial data as Linked Open Geospatial Data
geospatial rdf semantic-web spark
Last synced: 04 Nov 2024
https://github.com/garystafford/emr-demo
Project files for the post: Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce.
amazon-emr aws elastic-map-reduce emr-demo pyspark spark
Last synced: 06 Dec 2024
https://github.com/java-edge/spark-mllib-tutorial
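A comprehensive walkthrough of the basic algorithms in Spark MLlib, the machine learning library of the Spark big data framework, with a complete set of test files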
bigdata machine-learning mllib spark
Last synced: 28 Oct 2024
https://github.com/googlecloudplatform/spark-on-k8s-gcp-examples
Example Spark applications that run on Kubernetes and access GCP products, e.g., GCS, BigQuery, and Cloud PubSub
bigquery cloud-pubsub gcs gcs-connector kubernetes spark
Last synced: 22 Jan 2025
https://github.com/paulk-asert/groovy-data-science
Some Data Science examples using Groovy
beakerx commons-math constraint-programming data-science deep-learning groovy image-recognition kmeans-clustering linear-programming linear-regression mxnet natural natural-language-processing spark
Last synced: 01 Nov 2024
https://github.com/heartsavior/spark-sql-kafka-offset-committer
Kafka offset committer for structured streaming query
kafka spark structured-streaming
Last synced: 28 Oct 2024
https://github.com/tupol/spark-utils
Basic framework utilities to quickly start writing production ready Apache Spark applications
apache-spark convenience data-sink data-source framework scala spark spark-applications spark-streaming
Last synced: 19 Dec 2024
https://github.com/absaoss/spark-hats
Nested array transformation helper extensions for Apache Spark
arrays nested-structures scala schema spark
Last synced: 07 Nov 2024
https://github.com/basin-etl/basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
emr etl hadoop informatica odi pipeline pyspark spark
Last synced: 09 Nov 2024
https://github.com/oracle-samples/oracle-dataflow-samples
Examples demonstrating how to use OCI Data Flow
dataflow java oracle-cloud oracle-cloud-infrastructure paas python scala serverless spark
Last synced: 17 Jan 2025
https://github.com/rainmaker712/nlp_ryan
Study for Natural Language Processing & Deep Learning Framework
chatbot deep-learning machine-comprehension machine-learning nlp python pytorch scala spark tensorflow
Last synced: 13 Nov 2024
https://github.com/yaooqinn/spark-postgres
PostgreSQL and GreenPlum Data Source for Apache Spark
greenplum postgres postgresql spark sparksql transactional
Last synced: 15 Oct 2024
https://github.com/wh1isper/sparglim
Sparglim✨ makes PySpark apps configurable and Spark Connect Server deployment easier!
jupyter-magic pyspark spark spark-connect spark-connect-server spark-on-kubernetes spark-sql
Last synced: 16 Jan 2025
https://github.com/hussein-awala/spark-on-k8s
A Python package to submit and manage Apache Spark applications on Kubernetes.
airflow kubernetes python spark
Last synced: 21 Jan 2025
https://github.com/mjhea0/flask-spark-docker
Just a boilerplate for PySpark and Flask
docker flask pyspark python redis-queue spark
Last synced: 28 Oct 2024
https://github.com/joomcode/trace-analysis
Library for performance bottleneck detection and optimization efficiency prediction
jaeger opentracing optimization performance spark
Last synced: 09 Nov 2024
https://github.com/mozilla/telemetry-batch-view
A Scala framework to build derived datasets, aka batch views, of Telemetry data.
bigdata biggest-data dataset mozilla scala spark telemetry
Last synced: 01 Nov 2024
https://github.com/spratiher9/sparkdataset
Instant search for and access to many datasets in Pyspark.
benchmark benchmark-framework data data-analysis data-mining dataengineering dataset datasets easy-access-application instantsearch pyspark python python3 quickstart r spark standard
Last synced: 06 Dec 2024
https://github.com/agile-lab-dev/darwin
Avro Schema Evolution made easy
avro avro-schema hadoop hbase scala schema-evolution spark
Last synced: 14 Oct 2024
https://github.com/weaviate/spark-connector
Weaviate connector for Apache Spark
Last synced: 14 Nov 2024
https://github.com/ksindi/kafka-compose
:musical_score: Docker compose files for various kafka stacks
avro docker-compose kafka kafka-connect pyspark python spark twitter
Last synced: 12 Nov 2024
https://github.com/tomaztk/spark-for-data-engineers
Apache Spark for data engineers
apache-spark data-engineers pyspark python r r-language rspark spark
Last synced: 19 Nov 2024
https://github.com/fiatjaf/kwh
webln browser extension for lightningd/eclair/ptarmigan
c-lightning eclair lightning-network lightningd ptarmigan spark web-extension webln
Last synced: 17 Jan 2025
https://github.com/lewuathe/dllib
dllib is a distributed deep learning library running on Apache Spark
deep-learning mllib scala spark
Last synced: 12 Nov 2024
https://github.com/music-of-the-ainur/almaren-framework
The Almaren Framework provides a simplified, consistent, minimalistic layer over Apache Spark while still allowing you to take advantage of native Apache Spark features. You can still combine it with standard Spark code.
Last synced: 21 Jan 2025
https://github.com/Anant/Cassandra.Realtime
Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink
akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming
Last synced: 08 Nov 2024
https://github.com/snowplow/snowplow-rdb-loader
Stores Snowplow enriched events in Redshift, Snowflake and Databricks
Last synced: 16 Nov 2024
https://github.com/dbt-labs/spark-utils
Utility functions for dbt projects running on Spark
Last synced: 12 Nov 2024
https://github.com/learningjournal/spark-streaming-in-scala
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming
Last synced: 19 Nov 2024
https://github.com/openucx/sparkucx
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
apache-spark big-data hadoop hpc rdma spark
Last synced: 10 Nov 2024
https://github.com/agile-lab-dev/wasp
WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn
Last synced: 01 Jan 2025
https://github.com/endymecy/algorithmsonspark
Some popular algorithms (DBSCAN, kNN, FM, etc.) on Spark
dbscan factorization-machines knn spark
Last synced: 25 Nov 2024
https://github.com/souvik-databricks/dlt-with-debug
A lightweight helper utility that lets developers do interactive pipeline development with a single source code base for both DLT runs and non-DLT interactive notebook runs.
big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark
Last synced: 01 Nov 2024
https://github.com/giantcroc/featuretoolsonspark
A simplified version of featuretools for Spark
automated-feature-engineering automated-machine-learning automl deep-feature-synthesis feature-engineering featuretools machine-learning python spark
Last synced: 12 Oct 2024
https://github.com/isarn/isarn-sketches-spark
Routines and data structures for using isarn-sketches idiomatically in Apache Spark
aggregator apache-spark data-sketches data-sketching dataframe dataframes dataset datasets feature-importance pyspark python scala sketching-algorithm spark spark-ml t-digest udaf variable-importance
Last synced: 12 Oct 2024
https://github.com/cretueusebiu/laravel-spark-camera
Profile Photo Camera support for Laravel Spark
camera laravel laravel-spark php spark
Last synced: 17 Nov 2024
https://github.com/laravel/spark-aurelius-mollie
Laravel Spark, Mollie edition
laravel mollie saas spark subscription-billing
Last synced: 07 Oct 2024
https://github.com/kairen/learning-spark
Tidy up Spark and Hadoop tutorials.
bigdata data-science hadoop spark
Last synced: 30 Oct 2024
https://github.com/debussy-labs/debussy_concert
Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.
airflow airflow-operators airflow-plugin big-data-platform bigquery data-architecture data-engineering data-pipeline dataform dataproc dbt gcp google-cloud mssql mysql postgresql spark sql workflow
Last synced: 10 Jan 2025
https://github.com/mu-sigma/analysis-pipelines
Enables data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. Data scientists can use tools of their choice through an R interface, and compose interoperable pipelines between R, Spark, and Python.
analysis-pipeline interoperable-pipelines python r spark
Last synced: 09 Nov 2024
https://github.com/oracle/spark-oracle
On-the-fly translation of Spark programs to run natively on your Oracle DB. Your Spark programs require no changes.
Last synced: 06 Nov 2024
https://github.com/indix/sparkplug
Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
Last synced: 07 Nov 2024
https://github.com/jaceklaskowski/kubernetes-100days
Notes from 100 days with Kubernetes
100days google-kubernetes-engine kubernetes minikube spark
Last synced: 08 Nov 2024
https://github.com/projectnessie/nessie-demos
Demos for Nessie. Nessie provides Git-like capabilities for your Data Lake.
binder iceberg jupyter-notebooks nessie spark
Last synced: 12 Nov 2024
https://github.com/kpolley/relk
RELK -- The Research Elastic Stack (Kafka, Beats, Zookeeper, Logstash, ElasticSearch, Kibana, Spark, & Jupyter -- All in Docker)
beats docker elastic elasticsearch elk elk-stack es filebeats jupyter jupyter-lab jupyter-notebook kafka kibana logstash pyspark python spark zookeeper
Last synced: 11 Oct 2024
https://github.com/xqnwang/darima
Distributed ARIMA Models
arima distributed-computing spark time-series-forecasting
Last synced: 30 Oct 2024
https://github.com/bbenzikry/spark-eks
Examples and custom spark images for working with the spark-on-k8s operator on AWS
aws docker dockerfile eks eks-cluster glue-catalog kubernetes kubernetes-operator metastore spark
Last synced: 27 Oct 2024
https://github.com/fsanaulla/chronicler-spark
InfluxDB connector to Apache Spark on top of Chronicler
chronicler dataframe influxdb rdd scala spark streaming
Last synced: 31 Oct 2024
https://github.com/vesoft-inc/nebula-exchange
NebulaGraph Exchange is an Apache Spark application to parse data from different sources to NebulaGraph in a distributed environment. It supports both batch and streaming data in various formats and sources including other Graph Databases, RDBMS, Data warehouses, NoSQL, Message Bus, File systems, etc.
data-import data-pipeline etl graph-database hacktoberfest nebulagraph spark
Last synced: 07 Nov 2024
https://github.com/fsanaulla/chronicler
Scala toolchain for InfluxDB
akka-http async-http-client chronicler influxdb macros scala spark udp url-connection
Last synced: 16 Jan 2025
https://github.com/ing-bank/spark-matcher
Record matching and entity resolution at scale in Spark
deduplication entity-resolution record-linkage spark
Last synced: 08 Nov 2024
https://github.com/faviovazquez/odsc_india_2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
data datascience deeplearning optimus pyspark spark
Last synced: 09 Nov 2024
https://github.com/uniai-lab/uniai-maas
An opensource AI & model as a service platform.
ai chatglm chatgpt gpt kimichat midjourney moonshot spark stability-ai uniai
Last synced: 10 Nov 2024
https://github.com/geotrellis/geotrellis-pointcloud
GeoTrellis PointCloud library to work with any pointcloud data on Spark
geotrellis gis hacktoberfest pdal pointcloud scala spark
Last synced: 11 Nov 2024
https://github.com/bnosac/spark.sas7bdat
Read in SAS data in parallel into Apache Spark
Last synced: 11 Nov 2024
https://github.com/propelledanalytics/sparksql.jl
SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.
apachespark julia-language julialang spark
Last synced: 11 Oct 2024
https://github.com/leehuwuj/olh
Open source stack lakehouse
bigdata dataplatform deltalake kubernetes lakehouse spark
Last synced: 22 Jan 2025
https://github.com/timgent/data-flare
Data quality control tool built on Spark and Deequ
Last synced: 16 Nov 2024
https://github.com/drkostas/hgn
Hybrid Girvan Newman. Code for the "A Distributed Hybrid Community Detection Methodology for Social Networks" paper.
apache-spark community-detection distributed girvan-newman graphframes paper-implementations papers-with-code social-networks spark
Last synced: 28 Oct 2024
https://github.com/semyonsinchenko/tsumugi-spark
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
data-quality deequ pyspark spark
Last synced: 10 Oct 2024
https://github.com/cheng-lin-li/spark
Python 2.7 code and learning notes for Spark 2.1.1
als alternating-least-squares apriori-algorithm apriori-son cosine-similarity kmeans kmeans-clustering map-reduce minhash minhash-lsh-algorithm python27 savasere-omiecinski-and-navathe spark tf-idf uv-decomposition
Last synced: 20 Jan 2025
https://github.com/wittline/pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.
aws aws-emr big-data big-data-analytics dataengineering ec2-spot ec2-spot-instances emr-cluster pyspark python spark wordcloud-generator
Last synced: 14 Oct 2024
https://github.com/alonsodomin/sbt-spark
Simple SBT plugin to configure Spark applications
Last synced: 09 Nov 2024
https://github.com/zongxr/bigdata-competition
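Third-prize solution for the National Big Data Competition and second-prize solution for the provincial round. A one-click script for installing a big data environment that automatically deploys the cluster, including ZooKeeper, Hadoop, MySQL, Hive, Spark, and some base environment components. Tested on real servers with excellent results, it needs only minimal manual intervention such as entering passwords, which saves the manual effort of installation, deployment, and configuration. Several Scala examples that use Spark for data preparation are also included.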
bigdata hadoop hdfs hive mysql scala shell spark wordcount zookeeper
Last synced: 15 Nov 2024
https://github.com/hibayesian/spark-word2vec
A parallel implementation of word2vec based on Spark
machine-learning spark word2vec
Last synced: 23 Nov 2024
https://github.com/absaoss/pramen
Resilient data pipeline framework running on Apache Spark
big-data data-pipeline etl hacktoberfest scala spark
Last synced: 19 Dec 2024