Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-07 00:28:28 UTC
https://github.com/retkowsky/azure-databricks-workshop
Azure Databricks workshop
Last synced: 08 Feb 2025
https://github.com/udao-moo/udao-spark-optimizer
A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning
knobs-tuning modeling multi-objective-optimization optimization spark sparksql
Last synced: 11 Oct 2024
https://github.com/garystafford/dataproc-workflow-templates
Demonstration of Google Cloud Dataproc Workflow Templates
dataproc gcp google-cloud-platform hadoop pyspark spark
Last synced: 06 Dec 2024
https://github.com/kimtth/pyspark-tika-text-extraction
🚴‍♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.
apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python
Last synced: 25 Dec 2024
https://github.com/ren294/covid-data-process
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql
Last synced: 11 Oct 2024
https://github.com/build-on-aws/ci-cd-serverless-spark
Sample CI/CD pipeline for using GitHub Actions with Amazon EMR Serverless Spark.
amazon-emr apache-spark aws github-actions serverless spark
Last synced: 26 Dec 2024
https://github.com/jofaval/tfm-iabd
Master's Final Degree Project on Artificial Intelligence and Big Data
ai-engineering big-data big-data-analytics data-analysis data-architecture data-engineering data-science data-science-project fastapi kafka mongo-db mongodb nlp node-red nodered python sentiment-analysis spark spark-streaming transformers
Last synced: 10 Oct 2024
https://github.com/radanalyticsio/workshop-notebook
Basic Jupyter notebook for learning Spark and OpenShift
containers data-science jupyter openshift spark
Last synced: 05 Nov 2024
https://github.com/angadsingh/airflow-ditto
An airflow DAG transformation framework
airflow airflow-dag aws azure dataflow emr extensible framework graph-algorithms graph-manipulation hdinsight isomorphism livy networkx spark yarn
Last synced: 10 Nov 2024
https://github.com/iaja/scalaLDAvis
Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation
apache lda machine-learning scala spark visulization
Last synced: 13 Nov 2024
https://github.com/vitalibo/spark-aws-orchestration
Deployment/Orchestration of Apache Spark applications on Amazon EMR.
aws cloudformation emr spark step-functions
Last synced: 07 Nov 2024
https://github.com/aureliusivan/spotify-recommender-system-with-word2vec
This project is a recommender system for Spotify songs. The system uses the Word2Vec model to find similar songs based on the song's lyrics.
Last synced: 27 Oct 2024
https://github.com/surajiyer/python-data-utils
🚀 Utility classes and functions for common data science libraries
clustering etc matplotlib multiview-clustering nlp pandas sklearn spark statsmodels utilities
Last synced: 04 Feb 2025
https://github.com/myxof/sparknotes
Spark 2.0 study notes
distributed-computing spark spark-sql
Last synced: 15 Oct 2024
https://github.com/aessing/demo-azuresynapse
This repository includes the demos and code I use to play around with Azure Synapse Analytics
analytics azure azure-sql-datawarehouse azure-synapse-analytics azure-synapse-dwh data-engineering data-warehousing datawarehouse machine-learning mdwh microsoft powerbi python scala spark spark-dotnet spark-sql sql-data-warehouse synapse synapse-analytics
Last synced: 14 Dec 2024
https://github.com/renardeinside/wikiflow
Wikipedia updates streaming, transformation and visualisation
akka-http apache-spark kafka spark spark-streaming visualization wikipedia
Last synced: 12 Dec 2024
https://github.com/djamelinfo/randomwalk-timeseriesgenerator-on-spark
A random-walk time series generator: an initial value is drawn from a Gaussian distribution N(0,1), and at each subsequent time point a new number is drawn from the same distribution and added to the previous value.
data-mining indexing java random randomwalk scala spark time-series
Last synced: 17 Dec 2024
https://github.com/wittline/moving-average-spark
How to Compute Moving Average with Spark
databricks hadoop moving-average spark
Last synced: 14 Oct 2024
https://github.com/mliarakos/spark-typed-ops
Lightweight type-safe operations for Spark
scala scala-macros shapeless spark spark-scala spark-sql
Last synced: 01 Feb 2025
https://github.com/geotrellis/geotrellis-streaming-demo
A demo project that shows a GeoTrellis streaming application example
geotrellis gis kafka spark streaming
Last synced: 11 Nov 2024
https://github.com/sircamp/mushrooms-ml-classfier-scala-spark
Several machine learning algorithms used to classify mushrooms and predict whether they are poisonous
classification machine-learning machine-learning-algorithms scala spark
Last synced: 19 Nov 2024
https://github.com/fpoli/view-spark-timeline
Visualize in an SVG the timeline of an Apache Spark execution.
Last synced: 15 Oct 2024
https://github.com/mgarralda/hadoop-spark-cluster
Repository containing Docker images for creating a Spark cluster on Hadoop YARN.
hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker
Last synced: 11 Nov 2024
https://github.com/minhthong582000/my-data-stack
A simple Big data stack with Docker
docker docker-compose hadoop spark
Last synced: 13 Jan 2025
https://github.com/flaviostutz/spark-scala-jupyter
Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master
hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql
Last synced: 24 Oct 2024
https://github.com/fancellu/spark-streaming-examples
A few Spark Streaming examples
Last synced: 10 Nov 2024
https://github.com/leozqin/etl-markup-toolkit
ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration
Last synced: 29 Nov 2024
https://github.com/aixhunter/spark-k8s-pod-template
Steps to deploy a Spark app to Kubernetes cluster using spark-submit or a pod template
k8s kubernetes pod spark spark-cluster spark-submit
Last synced: 25 Jan 2025
https://github.com/dineshkarthik/n-gram_processor
Uses n-grams to find the set of words, and their frequency of occurrence, in a given directory, sub-directory, or text file, where the words appear in a specific order at a specific distance from a given word.
Last synced: 17 Nov 2024
https://github.com/americanexpress/bloom
BLooM is a configuration driven bigdata framework to load massive data into MemSQL
Last synced: 10 Nov 2024
https://github.com/husnusensoy/tt-bootcamp
Turk Telekom Data Bootcamp Repository
data-engineering data-governance data-quality data-science mlops python spark sql
Last synced: 08 Feb 2025
https://github.com/manuelgil/vscode-codeigniter4-spark
CodeIgniter 4 Spark is a Visual Studio Code extension that provides a set of useful commands and shortcuts for CodeIgniter 4 framework.
codeigniter commands spark vscode vscode-extension
Last synced: 19 Nov 2024
https://github.com/abronte/pysparkproxy
Seamlessly execute pyspark code on remote clusters
Last synced: 28 Oct 2024
https://github.com/aamend/ml-registry
Enabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance
datascience devops machinelearning maven ml nexus spark
Last synced: 08 Nov 2024
https://github.com/cclient/spark-java-mongo-demo
hadoop-on-mongo demo migrated to spark-on-hadoop-mongo, then migrated again to mongo-spark-connector
Last synced: 16 Nov 2024
https://github.com/cclient/spark-streaming-kafka-offset-mysql
Maintains Kafka offsets in MySQL, with support for tracing and rolling back to a given 'abnormal' point in time and re-consuming from there
mysql offset spark spark-streaming
Last synced: 16 Nov 2024
https://github.com/zejnilovic/scala-spark-template.g8
Scala + Spark template using Giter8
giter8-template sbt scala spark template
Last synced: 07 Nov 2024
https://github.com/absaoss/spark-data-standardization
A library for Spark that helps to standardize any input data (DataFrame) to adhere to the provided schema.
data-quality data-structures scala schema spark
Last synced: 07 Nov 2024
https://github.com/fpopic/wt-interview-challenge
(Interview) WT Data Engineer Interview Challenge
csv mysql scala spark spark-dataset sparksql
Last synced: 10 Jan 2025
https://github.com/sankamuk/spark-kubernetes
Production run of Apache Spark on Kubernetes
airflow apache-spark iac kubernetes spark
Last synced: 13 Nov 2024
https://github.com/salva/fastdbfs
fastdbfs - An interactive command line client for Databricks DBFS.
Last synced: 16 Nov 2024
https://github.com/comcast/pxscene-ui
A declarative JavaScript library for building React-ish UI components for pxScene (aka Spark) apps
component frontend javascript library px2react pxscene react spark ui
Last synced: 14 Nov 2024
https://github.com/tjc-lp/spark-instructor
A library for building structured LLM responses with Spark
databricks llm pydantic pydantic-v2 spark
Last synced: 12 Oct 2024
https://github.com/cumberlandgroup/node-red-contrib-spark
Node-RED Nodes to integrate with the Cisco Webex Teams API
Last synced: 02 Feb 2025
https://github.com/rootsongjc/spark-on-k8s
Spark on Kubernetes documentation in Chinese - https://jimmysong.io/spark-on-k8s
Last synced: 27 Oct 2024
https://github.com/erikerlandson/spark-tekton-demo
Demo of running Apache Spark jobs using Tekton and s2i workflows
apache-spark kubernetes openshift openshift-pipelines s2i spark tekton tekton-pipelines tektoncd
Last synced: 06 Jan 2025
https://github.com/rberenguel/identity-graphs
Presentation about Graphframes and how we handle graphs with more than 2 billion nodes at Hybrid Theory
Last synced: 02 Feb 2025
https://github.com/jaehyeon-kim/iceberg-etl-demo
Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment
datawarehousing emr etl iceberg scd spark
Last synced: 17 Dec 2024
https://github.com/ahmetfurkandemir/airflow-spark-kafka-example
Airflow, Spark and Kafka example
airflow airflow-dags airflow-docker docker kafka kafka-producer kafka-ui python3 spark spark-jobs
Last synced: 16 Nov 2024
https://github.com/canerturkseven/forecastflowml
🧙 Scalable machine learning forecasting framework with Pyspark
forecasting lightgbm machine-learning python spark time-series xgboost
Last synced: 27 Oct 2024
https://github.com/zero323/dlt
Mirror of https://gitlab.com/zero323/dlt
apache-spark delta delta-io delta-lake r rstats spark sparkr
Last synced: 27 Oct 2024
https://github.com/joristruong/youtube-setl
Youtube SETL is a project that aims at providing a project exercise to practice the SETL Framework: https://github.com/JCDecaux/setl
etl exercise practice scala setl-framework spark
Last synced: 18 Dec 2024
https://github.com/pedro-manoel/iot-analytics-solution-tcc
🎓 Repository with the IoT Analytics solution developed as part of the final-year thesis (Trabalho de Conclusão de Curso, TCC) for the Computer Science degree at the Universidade Federal de Campina Grande (UFCG)
analytics business-intelligence druid iot iot-analytics kafka nifi real-time spark spark-structured-streaming superset tcc ufcg
Last synced: 28 Jan 2025
https://github.com/wangshibiaoflytiger/springmvc
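Java Spring project scaffold, mainly for learning and technical research. Technologies covered (spring + springboot + gradle project build + mybatisplus + redis + HikariCP data source + scheduled tasks + aop aspects + custom filter + custom interceptor + Alibaba Cloud object storage oss + kafka message queue + shiro authentication and authorization + mixed scala and java programming + big data with spark + orm springdatajpa + orm jooq + jacoco test report generation + sonar project analysis report generation)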
aop cron filter hikaricp interceptor jacoco java jooq jpa kafka mybatis orm oss redis scala shiro sonar spark spring springboot
Last synced: 10 Nov 2024
https://github.com/aessing/demo-mdwh
Modern Data Warehouse demos with Azure Databricks, Azure Data Factory & Azure Dedicated SQL pool (formerly SQL DW)
azure azure-data-factory azure-databricks data data-engineering data-science databricks databricks-notebooks datafactory datalake datawarehouse datawarehousing delta-lake demos etl machine-learning mdwh ml modern-data-warehouse spark
Last synced: 14 Dec 2024
https://github.com/walshydev/spark-ratelimiter
Easy rate-limit implementation for SparkJava.
java rate-limiter rate-limiting ratelimit spark sparkjava
Last synced: 08 Nov 2024
https://github.com/konradmalik/spark-kafka-cassandra
This is an example/demo of Kafka - Spark Streaming - Cassandra/Kafka interoperability, with Spark Streaming as a focal point.
cassandra kafka spark spark-streaming streaming
Last synced: 16 Nov 2024
https://github.com/abhirockzz/synapse-azure-data-explorer-101
Getting started with Azure Synapse and Azure Data Explorer
azure-data-explorer azure-synapse-analytics pyspark python spark
Last synced: 21 Dec 2024
https://github.com/wazzabeee/twitter-sentiment-analysis-pyspark
Comparative study of classification algorithms implemented in PySpark on the Sentiment 140 dataset.
apache-spark data data-science gcp google-cloud logistic-regression naive-bayes-classifier natural-language-processing nlp nlp-machine-learning pyspark python python3 sentiment-analysis sentiment-classification sentiment140-dataset sentimental-analysis spark tweet twitter
Last synced: 13 Nov 2024
https://github.com/harpin-ai/toolkit-examples
Examples for trying out the harpin AI identity resolution and data quality toolkit
data-engineering data-quality dedupe deduplication entity-resolution identity identity-resolution spark
Last synced: 01 Nov 2024
https://github.com/izhangzhihao/spark-security
ranger ranger-plugin security spark spark-sql sql
Last synced: 16 Dec 2024
https://github.com/cbozan/graduation-project
A graduation project that categorizes popular search phrases using Python and Spark and presents them on a website to inspire creators.
crisp-dm data-cleaning data-science machine-learning nlp nlp-machine-learning spark spark-mllib
Last synced: 23 Nov 2024
https://github.com/brooksian/censusecon
Data Mining Census ECON using Apache Spark
spark sparksql zeppelin-notebook
Last synced: 18 Nov 2024
https://github.com/sneaksanddata/hadoop-fs-wrapper
Python Wrappers for Hadoop FileSystem
distributed-computing hadoop spark
Last synced: 11 Nov 2024
https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
Collection of notebooks to get started with Neo4j Connector for Apache Spark
Last synced: 14 Dec 2024
https://github.com/brooksian/sbir_tfidf_kmeans
Document clustering using KMeans on TF/IDF features on Small Business Innovation Research (SBIR) data
machine-learning spark sparksql zeppelin-notebook
Last synced: 18 Nov 2024
https://github.com/chezou/sparklytd
sparklyr plugin for td-spark to connect to TD from R
dplyr spark sparklyr treasuredata
Last synced: 15 Oct 2024
https://github.com/smola/spark-glusterfs-example
An example of Apache Spark integration with GlusterFS.
example-project glusterfs maven scala spark
Last synced: 09 Feb 2025
https://github.com/mdh266/twittersentimentanalysis
Twitter Sentiment Analysis using Spark, MongoDB, and Google Cloud
data-science etl google-cloud machine-learning mongodb natural-language-processing nlp pyspark sentiment-analysis spark sparkml twitter twitter-sentiment-analysis
Last synced: 04 Dec 2024
https://github.com/9bow/komoranrestapiserver
Simple RESTful API Server for KOMORAN
java-8 komoran nlp restful-api spark sparkjava sparkjava-framework
Last synced: 18 Dec 2024
https://github.com/samueleresca/deequ.net
deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
apache-spark bigdata deequ dotnet spark
Last synced: 11 Nov 2024
https://github.com/nmarus/node-red-contrib-spark
Node-RED Nodes to integrate with the Cisco Webex Teams API
Last synced: 25 Oct 2024
https://github.com/us8945/aws_emr_pysparkling
Set up a Python environment on an AWS EMR cluster with H2O Sparkling Water (PySparkling)
aws emr h2o jupyter-notebook pyspark pysparkling spark sparkling-water
Last synced: 21 Jan 2025
https://github.com/anant/example-cassandra-etl-with-airflow-and-spark
airflow cassandra datastax datastax-astra etl gitpod spark
Last synced: 18 Nov 2024
https://github.com/hafen/strata2017
Repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial at Strata Hadoop World 2017
Last synced: 27 Oct 2024
https://github.com/nickjer/singularity-rstudio-spark
Apache Spark with RStudio and the sparklyr package in a Singularity container
rstudio-server singularity-image spark
Last synced: 14 Nov 2024
https://github.com/cevheri/spark-tutorial
Apache Spark Tutorial - Scala, Java, Python code samples
cassandra java kafka kafka-consumer kafka-producer mongodb python scala spark spark-streaming
Last synced: 09 Nov 2024
https://github.com/amadeusitgroup/elastic-scaling
Elastic scaling is a library for controlling the number of resources (executors or workers) instantiated by a Spark Structured Streaming job in order to optimize the effective microbatch duration.
spark spark-structured-streaming
Last synced: 10 Nov 2024
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/dllllb/ml-pipelines-tutorial
SciKit-Learn vs Apache Spark pipelines
machine-learning scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
🎥 Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025