Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-11 00:28:31 UTC
https://github.com/badoo/hadoop-xargs
Utility to run heterogeneous applications on Hadoop synchronously
Last synced: 12 Nov 2024
https://github.com/tpvasconcelos/sparkypandy
It's not spark, it's not pandas, it's just awkward...
dataframe pandas pyspark spark
Last synced: 05 Nov 2024
https://github.com/kevinhartman/kafka-to-eventhub
Kafka to EventHub Mirror.
eventhub eventhub-topic kafka mirror spark spark-streaming
Last synced: 20 Dec 2024
https://github.com/vasnake/spark.ml.spatialjointransformer
spark.ml.transformer: join two datasets using spatial relations
geospatial join ml-pipeline python scala spark spark-ml spatial transformer
Last synced: 03 Jan 2025
https://github.com/tranthe170/nyc-taxi-pipeline
Builds a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/jbris/docker-spark-sparklyr
Docker setup for Apache Spark and the R sparklyr package
adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension
Last synced: 12 Jan 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/extwiii/bigdata-uc.san.diego
Unlock Value in Massive Datasets - UC San Diego
big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark
Last synced: 28 Jan 2025
https://github.com/garystafford/dataproc-java-demo
Demonstration of Google Cloud Dataproc for running Spark jobs with Java
big-data-analytics dataproc gcp google java spark
Last synced: 06 Dec 2024
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's try data mining with Spark, an open MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/mtpatter/bilao
Jupyter notebooks for filtering Kafka data with Spark Streaming.
avro docker jupyter-notebook kafka spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/yucl80/avrodemo
Write and append Avro to an HDFS file
avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log
Last synced: 27 Jan 2025
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling pySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/bedrockstreaming/sparktest
A testing tool for Scala and Spark developers
Last synced: 31 Dec 2024
https://github.com/conema/transe-pyspark
TransE implementation in Spark (pyspark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of pyspark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze it using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/gunantos/php-spark
PHP Server Develop
php server serverless-framework spark
Last synced: 13 Jan 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/angelotc/MacroDAG
A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.
Last synced: 06 Nov 2024
https://github.com/mangalaman93/dspark
Run spark in docker containers
big-data containers docker microservices spark
Last synced: 18 Jan 2025
https://github.com/apache/incubator-gluten-site
Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Last synced: 04 Feb 2025
https://github.com/yalishanda42/scala-recsys
Scala(-ble) recommender system architecture using functional programming (PoC)
cats cats-effect functional-programming movielens recommender-system recsys scala spark
Last synced: 28 Dec 2024
https://github.com/tashi-2004/fma-a-dataset-for-music-analysis
🎶 Scripts for music feature analysis, model training, and real-time recommendation using Apache Kafka. Extract features with Librosa 🎹, store them in MongoDB 🗄️, and process the data with Apache Spark ⚡. A 🌐 web interface 💻✨ is also included. Contributors: Tashfeen Abbasi 👤, Laiba Mazhar 👤, and Rafia Khan 👤.
html kafka kafka-consumer kafka-producer kafka-streaming linux mongodb mongodb-compass python3 spark ubuntu web-application
Last synced: 03 Dec 2024
https://github.com/bria222/animal2
heroku-deployment java postgres spark velocity
Last synced: 04 Jan 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to learn data analytics in AWS using Twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, Scikit, Keras. Includes Classification, Regression, Time series, NLP, Deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/jldbc/big-data
Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark
big-data hadoop pagerank recommender-system spark
Last synced: 04 Jan 2025
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 12 Oct 2024
https://github.com/gaelfoppolo/self-service-data-analytics
Data analysis made for business users
aws big-data data-analytics hadoop spark
Last synced: 03 Feb 2025
https://github.com/alexioannides/py-readme-snippets
This repository contains snippets of writing (in Markdown) on various topics relating to various flavours of Python development projects.
Last synced: 17 Jan 2025
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/hupe1980/docker_pyspark_notebook
Docker Compose setup for PySpark
docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber
Last synced: 02 Feb 2025
https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark
Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300
data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming
Last synced: 30 Oct 2024
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/alvarogarcia7/bank-kata-kotlin
Bank pet project in Kotlin. See interests as topics
api-first api-standard bank-kata blackbox-testing etude finite-state-machine gradle gradlew hateoas junit junit5 kata kotlin multimodule paypal-rest-api practice spark sparkjava trikitrok with-client
Last synced: 10 Jan 2025
https://github.com/earthquakesan/twittertrends
Twitter Trends is a Spark Streaming example application
Last synced: 17 Jan 2025
https://github.com/fpopic/gg-interview-challenge
(Interview) GG Interview Challenge in Scala/Spark
apache-spark json logstash parsing regex scala spark sparksql
Last synced: 10 Jan 2025
https://github.com/makohn/lambda-architecture-poc
♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets
cassandra kafka lambda-architecture sbt scala spark
Last synced: 19 Dec 2024
https://github.com/hifly81/1brc_streaming
1brc challenge with streaming solutions for Apache Kafka
1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming
Last synced: 02 Nov 2024
https://github.com/lepetitbloc/sparksd
:sparkler: Sparks wallet Docker container
cryptocurrency dockerfile masternode spark wallet
Last synced: 26 Jan 2025
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://mgrojo.github.io/adasearch/
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 28 Nov 2024
https://github.com/timvw/adobe-analytics-datafeed-datasource
Apache Spark data source for Adobe Analytics Data Feed
adobe-analytics clickstream python scala spark
Last synced: 08 Nov 2024
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 11 Oct 2024
https://github.com/cloudtik/cloudtik
Cloud Scale Platform for Distributed Data, Analytics and AI
ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark
Last synced: 16 Jan 2025
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 04 Jan 2025
https://github.com/mgrojo/adasearch
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 27 Oct 2024
https://github.com/akarce/udacity-data-pipeline-with-airflow
Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.
airflow minio postgresql pyspark spark
Last synced: 12 Oct 2024
https://github.com/wittline/sparksql-with-python
This repository has some examples of using Spark and SparkSQL with Python through PySpark
flask-api python spark sparksql
Last synced: 29 Jan 2025
https://github.com/joyceannie/us-immigrations-data-warehouse
A data warehouse to perform analytics on the immigration trends in the US.
airflow data-engineering etl pyspark redshift s3 spark
Last synced: 29 Jan 2025
https://github.com/pankajsingh09/data_engineering_using_aws
This Repository contains the contents related to Data Engineering Using AWS
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 19 Dec 2024
https://github.com/brooksian/sparkpipelinesparknlp
Build & Convert a Spark NLP Pipeline to PMML
corenlp nlp pmml spark zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/rpytel1/supercomputing-labs
Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to perform processing and streaming of GDELT data segments.
big-data gdelt-data kafka scala spark
Last synced: 18 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/open-datastudio/hive-metastore
Hive metastore on Staroid
hadoop hive hive-metastore kubernetes spark staroid
Last synced: 18 Nov 2024
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/cclient/elasticsearch-spark-upsert-from-kafka
elasticsearch-hadoop does not officially support upsert doc; this implements it by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }
elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc
Last synced: 16 Jan 2025
https://github.com/thanaraklee/dataflow-with-gcp
This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.
airflow apache-spark data-engineering etl pandas spark
Last synced: 25 Dec 2024
https://github.com/angelcervera/poc-drivingdistance
Proof of concept to implement a service to calculate the driving distance using osm network
akka openstreetmap osm osm4scala scala spark
Last synced: 10 Feb 2025
https://github.com/adityajn105/apache-spark-tutorials
Apache Spark is a big data analysis framework.
bigdata pyspark spark spark-ml spark-rdd spark-tutorials
Last synced: 16 Jan 2025
https://github.com/pedropark99/introd-pyspark
An open and introductory book for the Python API of Apache Spark (pyspark)
Last synced: 14 Oct 2024
https://github.com/lmouhib/auto-register-spark-ui-k8s
A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes
spark spark-kubernetes spark-sql spark-streaming spark-ui
Last synced: 10 Feb 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/piotr-kalanski/spark-local
API enabling switching between Spark execution engine and local fast implementation based on Scala collections.
Last synced: 21 Dec 2024
https://github.com/bluejoe2008/hippo-rpc
Hippo Transport Library enhances spark-commons with easy stream management & handling
Last synced: 10 Feb 2025
https://github.com/jabhij/crimerate_classification
Developing a system that could classify crime descriptions into different categories which would help the authorities to assign officers to crimes based on the report.
classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow
Last synced: 17 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/brooksian/solrtosparknotebook
Connecting Solr and Spark In An Apache Zeppelin Notebook
Last synced: 19 Jan 2025
https://github.com/brooksian/ds_gtdb
KMeans Clustering on Global Terrorism Database
global-terrorism-database machine-learning spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/jatin-8898/sparkwebsite
A clean and very interesting looking website. :sparkles:
bootstrap4 css html javascript spark typescript
Last synced: 17 Jan 2025
https://github.com/brooksian/epaairnow
Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin
spark sparksql time-series zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/pprzetacznik/datalake
Simple datalake
avro data-engineering kafka parquet schema-registry spark spark-structured-streaming
Last synced: 03 Feb 2025
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/brooksian/sparkpipeline2mleapbundle
Convert Spark Pipeline Models to MLeap Bundles
Last synced: 19 Jan 2025
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 02 Nov 2024
https://github.com/pomadchin/vlm-performance
GeoTrellis RasterSources Ingest benchmark
aws emr geotrellis gis raster spark
Last synced: 17 Jan 2025
https://github.com/stefen-taime/investissement
Jenkins Delta pipeline
delta-lake jenkins-pipeline minio spark
Last synced: 23 Jan 2025
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
Last synced: 23 Jan 2025
https://github.com/wtanaka/ansible-role-apache-spark
Ansible role to install Apache Spark
ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming
Last synced: 23 Jan 2025
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025