Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-25 00:28:13 UTC
https://github.com/jgperrin/net.jgp.books.spark.ch12
Spark in Action, 2nd edition - chapter 12 - Transforming your data
apache-spark java java8 manning spark sparkwithjava transformation
Last synced: 09 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch99
Spark in Action, 2nd edition - chapter 99
apache-spark java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/iaja/scalaLDAvis
Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation
apache lda machine-learning scala spark visulization
Last synced: 13 Nov 2024
https://github.com/leozqin/etl-markup-toolkit
ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration
Last synced: 29 Nov 2024
https://github.com/keks51/spark-salesforce
Spark Salesforce connector
salesforce soap spark sparkstreaming
Last synced: 31 Dec 2024
https://github.com/geotrellis/geotrellis-streaming-demo
A demo project that shows a GeoTrellis streaming application example
geotrellis gis kafka spark streaming
Last synced: 11 Nov 2024
https://github.com/flaviostutz/spark-scala-jupyter
Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master
hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql
Last synced: 24 Oct 2024
https://github.com/kimtth/pyspark-tika-text-extraction
🚴♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.
apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python
Last synced: 25 Dec 2024
https://github.com/piotr-kalanski/big-data-dev-environment-docker
Big Data Development environment based on Docker
big-data docker elasticsearch hadoop kafka kibana spark
Last synced: 27 Oct 2024
https://github.com/ren294/covid-data-process
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql
Last synced: 11 Oct 2024
https://github.com/kanchishimono/scopt
Calculate optimized properties of Spark configuration
Last synced: 28 Nov 2024
https://github.com/jofaval/tfm-iabd
Master's Final Degree Project on Artificial Intelligence and Big Data
ai-engineering big-data big-data-analytics data-analysis data-architecture data-engineering data-science data-science-project fastapi kafka mongo-db mongodb nlp node-red nodered python sentiment-analysis spark spark-streaming transformers
Last synced: 10 Oct 2024
https://github.com/nhsdigital/rap_example_pipeline_python
An example pipeline made in a RAP-friendly way, using Python
aggregation artificial hospital-episode-statistics pyspark python spark
Last synced: 23 Dec 2024
https://github.com/spratiher9/exelog
Enabling meticulous logging for your Spark Applications
analytics apache apache-spark aws azure bigdata databricks gcp logging pyspark python spark spark-utils
Last synced: 12 Oct 2024
https://github.com/vitalibo/spark-aws-orchestration
Deployment/Orchestration of Apache Spark applications on Amazon EMR.
aws cloudformation emr spark step-functions
Last synced: 07 Nov 2024
https://github.com/radanalyticsio/workshop-notebook
Basic Jupyter notebook for learning Spark and OpenShift
containers data-science jupyter openshift spark
Last synced: 05 Nov 2024
https://github.com/renardeinside/wikiflow
Wikipedia updates streaming, transformation and visualisation
akka-http apache-spark kafka spark spark-streaming visualization wikipedia
Last synced: 12 Dec 2024
https://github.com/mahmoud-nfz/football-big-data
This is a comprehensive solution for real-time football analytics, leveraging Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, a custom-built search engine, and Next.js for data visualization.
hadoop hadoop-hdfs kafka nextjs rethinkdb search-engine spark spark-streaming t3-stack
Last synced: 10 Oct 2024
https://github.com/sircamp/mushrooms-ml-classfier-scala-spark
Different machine learning algorithms used to classify mushrooms and predict whether they are poisonous
classification machine-learning machine-learning-algorithms scala spark
Last synced: 19 Nov 2024
https://github.com/myxof/sparknotes
Spark 2.0 study notes
distributed-computing spark spark-sql
Last synced: 15 Oct 2024
https://github.com/aessing/demo-azuresynapse
This repository includes the demos and code I use to play around with Azure Synapse Analytics
analytics azure azure-sql-datawarehouse azure-synapse-analytics azure-synapse-dwh data-engineering data-warehousing datawarehouse machine-learning mdwh microsoft powerbi python scala spark spark-dotnet spark-sql sql-data-warehouse synapse synapse-analytics
Last synced: 14 Dec 2024
https://github.com/mgarralda/hadoop-spark-cluster
Repository containing Docker images for creating a Spark cluster on Hadoop YARN.
hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker
Last synced: 11 Nov 2024
https://github.com/garystafford/dataproc-workflow-templates
Demonstration of Google Cloud Dataproc Workflow Templates
dataproc gcp google-cloud-platform hadoop pyspark spark
Last synced: 06 Dec 2024
https://github.com/aureliusivan/spotify-recommender-system-with-word2vec
This project is a recommender system for Spotify songs. The system uses the Word2Vec model to find similar songs based on the song's lyrics.
Last synced: 27 Oct 2024
https://github.com/zero323/dlt
Mirror of https://gitlab.com/zero323/dlt
apache-spark delta delta-io delta-lake r rstats spark sparkr
Last synced: 27 Oct 2024
https://github.com/harpin-ai/toolkit-examples
Examples for trying out the harpin AI identity resolution and data quality toolkit
data-engineering data-quality dedupe deduplication entity-resolution identity identity-resolution spark
Last synced: 01 Nov 2024
https://github.com/sneaksanddata/hadoop-fs-wrapper
Python Wrappers for Hadoop FileSystem
distributed-computing hadoop spark
Last synced: 11 Nov 2024
https://github.com/abronte/pysparkproxy
Seamlessly execute pyspark code on remote clusters
Last synced: 28 Oct 2024
https://github.com/wazzabeee/twitter-sentiment-analysis-pyspark
Comparative study of classification algorithms implemented in PySpark on the Sentiment 140 dataset.
apache-spark data data-science gcp google-cloud logistic-regression naive-bayes-classifier natural-language-processing nlp nlp-machine-learning pyspark python python3 sentiment-analysis sentiment-classification sentiment140-dataset sentimental-analysis spark tweet twitter
Last synced: 13 Nov 2024
https://github.com/abhirockzz/synapse-azure-data-explorer-101
Getting started with Azure Synapse and Azure Data Explorer
azure-data-explorer azure-synapse-analytics pyspark python spark
Last synced: 21 Dec 2024
https://github.com/smola/spark-glusterfs-example
An example of Apache Spark integration with GlusterFS.
example-project glusterfs maven scala spark
Last synced: 16 Dec 2024
https://github.com/joristruong/youtube-setl
Youtube SETL is a project that aims at providing a project exercise to practice the SETL Framework: https://github.com/JCDecaux/setl
etl exercise practice scala setl-framework spark
Last synced: 18 Dec 2024
https://github.com/walshydev/spark-ratelimiter
Easy rate-limit implementation for SparkJava.
java rate-limiter rate-limiting ratelimit spark sparkjava
Last synced: 08 Nov 2024
https://github.com/rootsongjc/spark-on-k8s
Spark on Kubernetes documentation in Chinese - https://jimmysong.io/spark-on-k8s
Last synced: 27 Oct 2024
https://github.com/amadeusitgroup/elastic-scaling
Elastic scaling is a library that controls the number of resources (executors or workers) instantiated by a Spark Structured Streaming job in order to optimize the effective micro-batch duration.
spark spark-structured-streaming
Last synced: 10 Nov 2024
https://github.com/mliarakos/spark-typed-ops
Lightweight type-safe operations for Spark
scala scala-macros shapeless spark spark-scala spark-sql
Last synced: 05 Dec 2024
https://github.com/manuelgil/vscode-codeigniter4-spark
CodeIgniter 4 Spark is a Visual Studio Code extension that provides a set of useful commands and shortcuts for CodeIgniter 4 framework.
codeigniter commands spark vscode vscode-extension
Last synced: 19 Nov 2024
https://github.com/tjc-lp/spark-instructor
A library for building structured LLM responses with Spark
databricks llm pydantic pydantic-v2 spark
Last synced: 12 Oct 2024
https://github.com/anant/example-cassandra-etl-with-airflow-and-spark
airflow cassandra datastax datastax-astra etl gitpod spark
Last synced: 18 Nov 2024
https://github.com/absaoss/spark-data-standardization
A library for Spark that helps to standardize any input data (DataFrame) to adhere to the provided schema.
data-quality data-structures scala schema spark
Last synced: 07 Nov 2024
https://github.com/zejnilovic/scala-spark-template.g8
Scala + Spark template using Giter8
giter8-template sbt scala spark template
Last synced: 07 Nov 2024
https://github.com/nmarus/node-red-contrib-spark
Node-RED Nodes to integrate with the Cisco Webex Teams API
Last synced: 25 Oct 2024
https://github.com/aessing/demo-mdwh
Modern Data Warehouse demos with Azure Databricks, Azure Data Factory & Azure Dedicated SQL pool (formerly SQL DW)
azure azure-data-factory azure-databricks data data-engineering data-science databricks databricks-notebooks datafactory datalake datawarehouse datawarehousing delta-lake demos etl machine-learning mdwh ml modern-data-warehouse spark
Last synced: 14 Dec 2024
https://github.com/ahmetfurkandemir/airflow-spark-kafka-example
Airflow, Spark and Kafka example
airflow airflow-dags airflow-docker docker kafka kafka-producer kafka-ui python3 spark spark-jobs
Last synced: 16 Nov 2024
https://github.com/brooksian/sbir_tfidf_kmeans
Document clustering using KMeans on TF/IDF features on Small Business Innovation Research (SBIR) data
machine-learning spark sparksql zeppelin-notebook
Last synced: 18 Nov 2024
https://github.com/dineshkarthik/n-gram_processor
Uses n-grams to find the set of words, and their frequency of occurrence, that appear in a specific order at a specific distance from a given word in a directory, subdirectory, or text file.
Last synced: 17 Nov 2024
https://github.com/cclient/spark-streaming-kafka-offset-mysql
Maintains Kafka offsets in MySQL; supports tracking and rolling back to a given 'abnormal' point in time to re-consume messages
mysql offset spark spark-streaming
Last synced: 16 Nov 2024
https://github.com/cclient/spark-java-mongo-demo
hadoop-on-mongo demo migrated to spark-on-hadoop-mongo, then migrated to mongo-spark-connector
Last synced: 16 Nov 2024
https://github.com/aamend/ml-registry
Enabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance
datascience devops machinelearning maven ml nexus spark
Last synced: 08 Nov 2024
https://github.com/konradmalik/spark-kafka-cassandra
This is an example/demo of Kafka - Spark Streaming - Cassandra/Kafka interoperability, with Spark Streaming as a focal point.
cassandra kafka spark spark-streaming streaming
Last synced: 16 Nov 2024
https://github.com/brooksian/censusecon
Data Mining Census ECON using Apache Spark
spark sparksql zeppelin-notebook
Last synced: 18 Nov 2024
https://github.com/canerturkseven/forecastflowml
🧙 Scalable machine learning forecasting framework with PySpark
forecasting lightgbm machine-learning python spark time-series xgboost
Last synced: 27 Oct 2024
https://github.com/rberenguel/identity-graphs
Presentation about GraphFrames and how we handle graphs with more than 2 billion nodes at Hybrid Theory
Last synced: 06 Dec 2024
https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
Collection of notebooks to get started with Neo4j Connector for Apache Spark
Last synced: 14 Dec 2024
https://github.com/fpopic/wt-interview-challenge
(Interview) WT Data Engineer Interview Challenge
csv mysql scala spark spark-dataset sparksql
Last synced: 10 Jan 2025
https://github.com/jaehyeon-kim/iceberg-etl-demo
Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment
datawarehousing emr etl iceberg scd spark
Last synced: 17 Dec 2024
https://github.com/chezou/sparklytd
sparklyr plugin for td-spark to connect to TD from R
dplyr spark sparklyr treasuredata
Last synced: 15 Oct 2024
https://github.com/9bow/komoranrestapiserver
Simple RESTful API Server for KOMORAN
java-8 komoran nlp restful-api spark sparkjava sparkjava-framework
Last synced: 18 Dec 2024
https://github.com/americanexpress/bloom
BLooM is a configuration-driven big data framework to load massive data into MemSQL
Last synced: 10 Nov 2024
https://github.com/samueleresca/deequ.net
deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
apache-spark bigdata deequ dotnet spark
Last synced: 11 Nov 2024
https://github.com/sankamuk/spark-kubernetes
Production run of Apache Spark on Kubernetes
airflow apache-spark iac kubernetes spark
Last synced: 13 Nov 2024
https://github.com/erikerlandson/spark-tekton-demo
Demo of running Apache Spark jobs using Tekton and s2i workflows
apache-spark kubernetes openshift openshift-pipelines s2i spark tekton tekton-pipelines tektoncd
Last synced: 06 Jan 2025
https://github.com/us8945/aws_emr_pysparkling
Set up a Python environment on an AWS EMR cluster with H2O Sparkling Water (PySparkling)
aws emr h2o jupyter-notebook pyspark pysparkling spark sparkling-water
Last synced: 21 Jan 2025
https://github.com/hafen/strata2017
Repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial at Strata Hadoop World 2017
Last synced: 27 Oct 2024
https://github.com/salva/fastdbfs
fastdbfs - An interactive command line client for Databricks DBFS.
Last synced: 16 Nov 2024
https://github.com/wangshibiaoflytiger/springmvc
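Java Spring project scaffold, mainly for learning and technology research. Technologies covered (spring + springboot + gradle project build + mybatisplus + redis + HikariCP data source + scheduled tasks + AOP aspects + custom filters + custom interceptors + Alibaba Cloud Object Storage (OSS) + kafka message queue + shiro authentication and authorization + mixed scala and java programming + big data with spark + orm springdatajpa + orm jooq + jacoco test report generation + sonar project analysis report generation)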
aop cron filter hikaricp interceptor jacoco java jooq jpa kafka mybatis orm oss redis scala shiro sonar spark spring springboot
Last synced: 10 Nov 2024
https://github.com/cbozan/graduation-project
Graduation project categorizes popular search phrases using Python and Spark and presents them on a website to inspire creators.
crisp-dm data-cleaning data-science machine-learning nlp nlp-machine-learning spark spark-mllib
Last synced: 23 Nov 2024
https://github.com/comcast/pxscene-ui
A declarative JavaScript library for building React-ish UI components for pxScene (aka Spark) apps
component frontend javascript library px2react pxscene react spark ui
Last synced: 14 Nov 2024
https://github.com/cevheri/spark-tutorial
Apache Spark Tutorial - Scala, Java, Python code samples
cassandra java kafka kafka-consumer kafka-producer mongodb python scala spark spark-streaming
Last synced: 09 Nov 2024
https://github.com/nickjer/singularity-rstudio-spark
Apache Spark with RStudio and the sparklyr package in a Singularity container
rstudio-server singularity-image spark
Last synced: 14 Nov 2024
https://github.com/izhangzhihao/spark-security
ranger ranger-plugin security spark spark-sql sql
Last synced: 16 Dec 2024
https://github.com/surajiyer/python-data-utils
🚀 Utility classes and functions for common data science libraries
clustering etc matplotlib multiview-clustering nlp pandas sklearn spark statsmodels utilities
Last synced: 10 Dec 2024
https://github.com/mdh266/twittersentimentanalysis
Twitter Sentiment Analysis using Spark, MongoDB, and Google Cloud
data-science etl google-cloud machine-learning mongodb natural-language-processing nlp pyspark sentiment-analysis spark sparkml twitter twitter-sentiment-analysis
Last synced: 04 Dec 2024
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/s8sg/spark-py-submit
A Python library to submit Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 05 Dec 2024
https://github.com/mobiletelesystems/spark-dialect-extension
Package extending the default dialect capabilities for Spark.
etl etl-components plugin-system spark
Last synced: 11 Oct 2024
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables an easier BigData analysis over context integrated with some of the most popular BigData platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 11 Oct 2024
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/gabfr/truck-data-wrangler
ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB
data-classification spark stream timescaledb
Last synced: 07 Dec 2024
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 18 Dec 2024
https://github.com/imlegend19/vidspark
VidSpark is a prototype video CMS backend system powered by Spark and Elasticsearch
celery elasticsearch python redis scala spark
Last synced: 14 Jan 2025
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of the Kaggle Titanic competition using Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025