Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/hashicorp/nomad-spark

DEPRECATED: Apache Spark with native support for Nomad as a scheduler

nomad scheduler spark

Last synced: 21 Jan 2025

https://github.com/pierrekieffer/docker-spark-yarn-cluster

Multi-node Hadoop cluster on Docker with Spark 2.4.1 on YARN

cluster docker hadoop spark yarn yarn-hadoop-cluster

Last synced: 02 Nov 2024

https://github.com/benfradet/struct-type-encoder

Deriving Spark DataFrame schemas from case classes

spark sparksql

Last synced: 28 Oct 2024

https://github.com/tharwaninitin/etlflow

EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, databases, SFTP servers, on-prem systems and more.

aws bigquery dataproc etl etl-framework etl-pipeline gcp gcs redis s3 scala spark zio

Last synced: 22 Jan 2025

https://github.com/absaoss/hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark

apache-spark framework ingestion kafka pipeline spark spark-structured-streaming streaming streaming-etl

Last synced: 12 Oct 2024

https://github.com/josonle/bigdata-learning

Big data learning, mainly covering Kafka, ZooKeeper, Hive, HBase, and Spark

hadoop hive java kafka scala spark zookeeper

Last synced: 25 Nov 2024

https://github.com/g-research/spark-dgraph-connector

A connector for Apache Spark and PySpark to Dgraph databases.

dgraph gr-oss pyspark spark

Last synced: 20 Dec 2024

https://github.com/coxautomotivedatasolutions/spark-distcp

A re-implementation of Hadoop DistCP in Apache Spark

apache-spark data-engineering distcp hadoop spark

Last synced: 12 Oct 2024

https://github.com/xskipper-io/xskipper

An Extensible Data Skipping Framework

data-skipping indexing scala spark

Last synced: 11 Oct 2024

https://github.com/ypriverol/spark-java8

Java 8 and Spark learning through examples

dataset java lambda learning-spark spark

Last synced: 28 Oct 2024

https://github.com/spektom/spark-flamegraph

Easy CPU Profiling for Apache Spark applications

apache-spark cpu-profiling flamegraph spark

Last synced: 19 Nov 2024

https://github.com/univalence/zio-spark

A functional wrapper around Spark to make it work with ZIO

scala spark zio zio-spark

Last synced: 16 Jan 2025

https://github.com/manuel-lang/data-engineering-nanodegree

Solutions to all projects of Udacity's Data Engineering Nanodegree: Data Modeling with Postgres & Cassandra, Data Warehouse with Redshift, Data Lake with Spark, and Data Pipeline with Airflow.

airflow cassandra data-engineering postgresql redshift spark udacity udacity-data-engineer-nanodegree

Last synced: 13 Nov 2024

https://github.com/supercowpowers/workbench

Workbench: An easy to use Python API for creating and deploying AWS SageMaker Models

aws big-data data-engineering machine-learning pandas python spark

Last synced: 22 Jan 2025

https://github.com/flipkart-incubator/spark-transformers

Spark-Transformers: Library for exporting Apache Spark MLLIB models to use them in any Java application with no other dependencies.

apache-spark data-pipelines export java machine-learning machine-learning-algorithms machine-learning-library mllib scala spark transformers

Last synced: 11 Oct 2024

https://github.com/zuinnote/spark-hadoopoffice-ds

A Spark datasource for the HadoopOffice library

datasource excel hadoopoffice read spark write xls xlsx

Last synced: 03 Dec 2024

https://github.com/supercowpowers/sageworks

SageWorks: An easy to use Python API for creating and deploying AWS SageMaker Models

aws big-data data-engineering machine-learning pandas python spark

Last synced: 16 Dec 2024

https://github.com/LB-Yu/data-systems-learning

Learning summary and examples about data systems.

antlr big-data calcite distributed-systems flink hadoop hbase spark

Last synced: 05 Nov 2024

https://github.com/rstudio/graphframes

R Interface for GraphFrames

graphframes graphs pagerank rstats spark sparklyr

Last synced: 10 Nov 2024

https://github.com/melin/spark-jobserver

REST job server for Apache Spark

hadoop hive java kerberos kubernetes spark yarn

Last synced: 05 Nov 2024

https://github.com/vector4wang/quick-spark-process

🌟🌟🌟 Examples for learning Spark

java spark springboot-spark

Last synced: 28 Oct 2024

https://github.com/AI-team-UoA/GeoTriples

Publishing Big Geospatial data as Linked Open Geospatial Data

geospatial rdf semantic-web spark

Last synced: 04 Nov 2024

https://github.com/sepatel/tekniq

A framework designed around Kotlin providing Restful HTTP Client, JDBC DSL, Loading Cache, Configurations, Validations, and more

config dsl gradle java jdbc kotlin rest spark

Last synced: 18 Dec 2024

https://github.com/garystafford/emr-demo

Project files for the post: Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce.

amazon-emr aws elastic-map-reduce emr-demo pyspark spark

Last synced: 06 Dec 2024

https://github.com/java-edge/spark-mllib-tutorial

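A comprehensive walkthrough of the basic algorithms in Spark MLlib, the machine learning library of the Spark big data framework, with a complete set of test files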

bigdata machine-learning mllib spark

Last synced: 28 Oct 2024

https://github.com/googlecloudplatform/spark-on-k8s-gcp-examples

Example Spark applications that run on Kubernetes and access GCP products, e.g., GCS, BigQuery, and Cloud PubSub

bigquery cloud-pubsub gcs gcs-connector kubernetes spark

Last synced: 22 Jan 2025

https://github.com/heartsavior/spark-sql-kafka-offset-committer

Kafka offset committer for structured streaming query

kafka spark structured-streaming

Last synced: 28 Oct 2024

https://github.com/tupol/spark-utils

Basic framework utilities to quickly start writing production ready Apache Spark applications

apache-spark convenience data-sink data-source framework scala spark spark-applications spark-streaming

Last synced: 19 Dec 2024

https://github.com/absaoss/spark-hats

Nested array transformation helper extensions for Apache Spark

arrays nested-structures scala schema spark

Last synced: 07 Nov 2024

https://github.com/basin-etl/basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

emr etl hadoop informatica odi pipeline pyspark spark

Last synced: 09 Nov 2024

https://github.com/oracle-samples/oracle-dataflow-samples

Examples demonstrating how to use OCI Data Flow

dataflow java oracle-cloud oracle-cloud-infrastructure paas python scala serverless spark

Last synced: 17 Jan 2025

https://github.com/rainmaker712/nlp_ryan

Study for Natural Language Processing & Deep Learning Framework

chatbot deep-learning machine-comprehension machine-learning nlp python pytorch scala spark tensorflow

Last synced: 13 Nov 2024

https://github.com/yaooqinn/spark-postgres

PostgreSQL and GreenPlum Data Source for Apache Spark

greenplum postgres postgresql spark sparksql transactional

Last synced: 15 Oct 2024

https://github.com/wh1isper/sparglim

Sparglim✨ makes PySpark apps configurable and Spark Connect Server deployment easier!

jupyter-magic pyspark spark spark-connect spark-connect-server spark-on-kubernetes spark-sql

Last synced: 16 Jan 2025

https://github.com/hussein-awala/spark-on-k8s

A Python package to submit and manage Apache Spark applications on Kubernetes.

airflow kubernetes python spark

Last synced: 21 Jan 2025

https://github.com/mjhea0/flask-spark-docker

Just a boilerplate for PySpark and Flask

docker flask pyspark python redis-queue spark

Last synced: 28 Oct 2024

https://github.com/joomcode/trace-analysis

Library for performance bottleneck detection and optimization efficiency prediction

jaeger opentracing optimization performance spark

Last synced: 09 Nov 2024

https://github.com/mozilla/telemetry-batch-view

A Scala framework to build derived datasets, aka batch views, of Telemetry data.

bigdata biggest-data dataset mozilla scala spark telemetry

Last synced: 01 Nov 2024

https://github.com/felixcheung/vagrant-projects

Vagrant projects for various use-cases with Spark, Zeppelin, IPython / Jupyter, SparkR

cassandra ipython jupyter python r spark vagrant zeppelin

Last synced: 12 Oct 2024

https://github.com/agile-lab-dev/darwin

Avro Schema Evolution made easy

avro avro-schema hadoop hbase scala schema-evolution spark

Last synced: 14 Oct 2024

https://github.com/ksindi/kafka-compose

🎼 Docker Compose files for various Kafka stacks

avro docker-compose kafka kafka-connect pyspark python spark twitter

Last synced: 12 Nov 2024

https://github.com/weaviate/spark-connector

Weaviate connector for Apache Spark

spark vector-search weaviate

Last synced: 14 Nov 2024

https://github.com/fiatjaf/kwh

webln browser extension for lightningd/eclair/ptarmigan

c-lightning eclair lightning-network lightningd ptarmigan spark web-extension webln

Last synced: 17 Jan 2025

https://github.com/lewuathe/dllib

dllib is a distributed deep learning library running on Apache Spark

deep-learning mllib scala spark

Last synced: 12 Nov 2024

https://github.com/music-of-the-ainur/almaren-framework

The Almaren Framework provides a simplified, consistent, minimalistic layer over Apache Spark while still allowing you to take advantage of native Apache Spark features. You can still combine it with standard Spark code.

spark

Last synced: 21 Jan 2025

https://github.com/Anant/Cassandra.Realtime

Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink

akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming

Last synced: 08 Nov 2024

https://github.com/dbt-labs/spark-utils

Utility functions for dbt projects running on Spark

dbt macros spark

Last synced: 12 Nov 2024

https://github.com/absaoss/enceladus

Dynamic Conformance Engine

bigdata datalake hadoop mongodb scala spark spring

Last synced: 19 Dec 2024

https://github.com/snowplow/snowplow-rdb-loader

Stores Snowplow enriched events in Redshift, Snowflake and Databricks

redshift scala snowplow spark

Last synced: 16 Nov 2024

https://github.com/anant/cassandra.realtime

Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink

akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming

Last synced: 18 Nov 2024

https://github.com/learningjournal/spark-streaming-in-scala

Apache Spark 3 - Structured Streaming Course Material

apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming

Last synced: 19 Nov 2024

https://github.com/openucx/sparkucx

A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer

apache-spark big-data hadoop hpc rdma spark

Last synced: 10 Nov 2024

https://github.com/agile-lab-dev/wasp

WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

akka elasticsearch hadoop hbase hdfs jdbc kafka parquet scala solr spark spark-streaming yarn

Last synced: 01 Jan 2025

https://github.com/endymecy/algorithmsonspark

Some popular algorithms (DBSCAN, KNN, FM, etc.) on Spark

dbscan factorization-machines knn spark

Last synced: 25 Nov 2024

https://github.com/souvik-databricks/dlt-with-debug

A lightweight helper utility that lets developers do interactive pipeline development with a unified source code for both DLT runs and non-DLT interactive notebook runs.

big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark

Last synced: 01 Nov 2024

https://github.com/cretueusebiu/laravel-spark-camera

Profile Photo Camera support for Laravel Spark

camera laravel laravel-spark php spark

Last synced: 17 Nov 2024

https://github.com/laravel/spark-aurelius-mollie

Laravel Spark, Mollie edition

laravel mollie saas spark subscription-billing

Last synced: 07 Oct 2024

https://github.com/kairen/learning-spark

Tidy up Spark and Hadoop tutorials.

bigdata data-science hadoop spark

Last synced: 30 Oct 2024

https://github.com/debussy-labs/debussy_concert

Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.

airflow airflow-operators airflow-plugin big-data-platform bigquery data-architecture data-engineering data-pipeline dataform dataproc dbt gcp google-cloud mssql mysql postgresql spark sql workflow

Last synced: 10 Jan 2025

https://github.com/mu-sigma/analysis-pipelines

Enables data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. Data scientists can use tools of their choice through an R interface, and compose interoperable pipelines between R, Spark, and Python.

analysis-pipeline interoperable-pipelines python r spark

Last synced: 09 Nov 2024

https://github.com/oracle/spark-oracle

On-the-fly translation of Spark programs to run natively on your Oracle DB. Your Spark programs require no changes.

oracle spark sql

Last synced: 06 Nov 2024

https://github.com/o-gs/dji-packet-dumps

DUPC Packet communication dumps

dji duml dupc inspire mavic phantom spark uart wireshark

Last synced: 07 Nov 2024

https://github.com/indix/sparkplug

Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌

datapipeline spark spark-sql

Last synced: 07 Nov 2024

https://github.com/projectnessie/nessie-demos

Demos for Nessie. Nessie provides Git-like capabilities for your Data Lake.

binder iceberg jupyter-notebooks nessie spark

Last synced: 12 Nov 2024

https://github.com/kpolley/relk

RELK -- The Research Elastic Stack (Kafka, Beats, Zookeeper, Logstash, ElasticSearch, Kibana, Spark, & Jupyter -- All in Docker)

beats docker elastic elasticsearch elk elk-stack es filebeats jupyter jupyter-lab jupyter-notebook kafka kibana logstash pyspark python spark zookeeper

Last synced: 11 Oct 2024

https://github.com/bbenzikry/spark-eks

Examples and custom Spark images for working with the spark-on-k8s operator on AWS

aws docker dockerfile eks eks-cluster glue-catalog kubernetes kubernetes-operator metastore spark

Last synced: 27 Oct 2024

https://github.com/fsanaulla/chronicler-spark

InfluxDB connector to Apache Spark on top of Chronicler

chronicler dataframe influxdb rdd scala spark streaming

Last synced: 31 Oct 2024

https://github.com/vesoft-inc/nebula-exchange

NebulaGraph Exchange is an Apache Spark application to parse data from different sources to NebulaGraph in a distributed environment. It supports both batch and streaming data in various formats and sources including other Graph Databases, RDBMS, Data warehouses, NoSQL, Message Bus, File systems, etc.

data-import data-pipeline etl graph-database hacktoberfest nebulagraph spark

Last synced: 07 Nov 2024

https://github.com/ing-bank/spark-matcher

Record matching and entity resolution at scale in Spark

deduplication entity-resolution record-linkage spark

Last synced: 08 Nov 2024

https://github.com/faviovazquez/odsc_india_2018

My presentation at ODSC India 2018 about Deep Learning with Apache Spark

data datascience deeplearning optimus pyspark spark

Last synced: 09 Nov 2024

https://github.com/uniai-lab/uniai-maas

An open source AI and model-as-a-service platform.

ai chatglm chatgpt gpt kimichat midjourney moonshot spark stability-ai uniai

Last synced: 10 Nov 2024

https://github.com/datitran/spark-tdd-example

A simple Spark TDD example

pyspark python spark tdd

Last synced: 22 Oct 2024

https://github.com/1ambda/lakehouse

Playground for Lakehouse (Iceberg, Hudi, Spark, Flink, Trino, DBT, Airflow, Kafka, Debezium CDC)

airflow cdc dbt debezium docker flink hudi iceberg kafka spark trino

Last synced: 18 Nov 2024

https://github.com/bnosac/spark.sas7bdat

Read in SAS data in parallel into Apache Spark

r sas7bdat spark sparklyr

Last synced: 11 Nov 2024

https://github.com/geotrellis/geotrellis-pointcloud

GeoTrellis PointCloud library to work with any pointcloud data on Spark

geotrellis gis hacktoberfest pdal pointcloud scala spark

Last synced: 11 Nov 2024

https://github.com/propelledanalytics/sparksql.jl

SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.

apachespark julia-language julialang spark

Last synced: 11 Oct 2024

https://github.com/leehuwuj/olh

Open source stack lakehouse

bigdata dataplatform deltalake kubernetes lakehouse spark

Last synced: 22 Jan 2025

https://github.com/drkostas/hgn

Hybrid Girvan Newman. Code for the "A Distributed Hybrid Community Detection Methodology for Social Networks" paper.

apache-spark community-detection distributed girvan-newman graphframes paper-implementations papers-with-code social-networks spark

Last synced: 28 Oct 2024

https://github.com/semyonsinchenko/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

data-quality deequ pyspark spark

Last synced: 10 Oct 2024

https://github.com/timgent/data-flare

Data quality control tool built on Spark and Deequ

big-data data-quality spark

Last synced: 16 Nov 2024

https://github.com/wittline/pyspark-on-aws-emr

The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.

aws aws-emr big-data big-data-analytics dataengineering ec2-spot ec2-spot-instances emr-cluster pyspark python spark wordcloud-generator

Last synced: 14 Oct 2024

https://github.com/rstudio/mleap

R Interface to MLeap

jvm mleap pipelines spark sparklyr

Last synced: 10 Nov 2024

https://github.com/alonsodomin/sbt-spark

Simple SBT plugin to configure Spark applications

boilerplate sbt scala spark

Last synced: 09 Nov 2024

https://github.com/zongxr/bigdata-competition

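Third-prize solution for the national big data competition and second-prize solution for the provincial round. A one-click script for installing a big data environment that automatically deploys a cluster, including ZooKeeper, Hadoop, MySQL, Hive, Spark, and some basic environment components. It has been tested on real servers with excellent results and needs only minimal manual intervention, such as entering passwords, which saves the effort otherwise spent on installation, deployment, and configuration. Several Scala examples are also included, used together with Spark for data preparation.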

bigdata hadoop hdfs hive mysql scala shell spark wordcount zookeeper

Last synced: 15 Nov 2024

https://github.com/hibayesian/spark-word2vec

A parallel implementation of word2vec based on Spark

machine-learning spark word2vec

Last synced: 23 Nov 2024

https://github.com/absaoss/pramen

Resilient data pipeline framework running on Apache Spark

big-data data-pipeline etl hacktoberfest scala spark

Last synced: 19 Dec 2024