Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-13 00:28:14 UTC
https://github.com/bria222/animal2
heroku-deployment java postgres spark velocity
Last synced: 04 Jan 2025
https://github.com/kanchishimono/spark-on-k8s-images
Docker images for spark on kubernetes
docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes
Last synced: 28 Nov 2024
https://github.com/apache/incubator-gluten-site
Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Last synced: 04 Feb 2025
https://github.com/mangalaman93/dspark
Run spark in docker containers
big-data containers docker microservices spark
Last synced: 18 Jan 2025
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/oracle-quickstart/oci-spark
Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)
cloud oci oracle oracle-led spark terraform
Last synced: 07 Nov 2024
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to learn data analytics on AWS using Twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/renardeinside/dbx-kafka-protobuf-example
Sample code for working with Kafka & Protobuf in Databricks
databricks kafka protobuf scala spark spark-streaming
Last synced: 06 Feb 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of pyspark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/mtpatter/bilao
Jupyter notebooks for filtering Kafka data with Spark Streaming.
avro docker jupyter-notebook kafka spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/jbris/docker-spark-sparklyr
Docker setup for Apache Spark and the R sparklyr package
adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-fakenews
Detecting users and communities which propagate fake news on Twitter by Apache Spark
deep-learning fakenews machine-learning spark twitter
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-wikipedia
Wonderful reusable code, libraries and scripts to process Wikipedia page views using Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 02 Nov 2024
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/cloudtik/cloudtik
Cloud Scale Platform for Distributed Data, Analytics and AI
ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark
Last synced: 16 Jan 2025
https://github.com/joyceannie/us-immigrations-data-warehouse
A data warehouse to perform analytics on the immigration trends in the US.
airflow data-engineering etl pyspark redshift s3 spark
Last synced: 29 Jan 2025
https://github.com/yalishanda42/scala-recsys
Scala(-ble) recommender system architecture using functional programming (PoC)
cats cats-effect functional-programming movielens recommender-system recsys scala spark
Last synced: 28 Dec 2024
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/piotr-kalanski/spark-local
API enabling switching between Spark execution engine and local fast implementation based on Scala collections.
Last synced: 21 Dec 2024
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/bedrockstreaming/sparktest
A testing tool for Scala and Spark developers
Last synced: 31 Dec 2024
https://github.com/yucl80/avrodemo
Write and append Avro to an HDFS file
avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log
Last synced: 27 Jan 2025
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/tashi-2004/fma-a-dataset-for-music-analysis
🎶 Scripts for music feature analysis, model training, and real-time recommendation using Apache Kafka. Extract features with Librosa 🎹, store them in MongoDB 🗄️, and process the data with Apache Spark ⚡. A 🌐 web interface 💻✨ is also included. Contributors: Tashfeen Abbasi 👤, Laiba Mazhar 👤, and Rafia Khan 👤.
html kafka kafka-consumer kafka-producer kafka-streaming linux mongodb mongodb-compass python3 spark ubuntu web-application
Last synced: 03 Dec 2024
https://github.com/dustin-decker/elasticsearchsql
A simple example of using Apache Spark SQL against Elasticsearch 5
Last synced: 29 Jan 2025
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/maxinexiong/item-based-collaborative-filtering
This project utilizes PySpark DataFrames and PySpark RDD to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 similar movies for a given target movie that align with users’ preferences.
apache-spark collaborative-filtering movie-recommendation pyspark python spark spark-dataframes spark-rdd
Last synced: 21 Dec 2024
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 12 Feb 2025
https://github.com/conema/transe-pyspark
TransE implementation in Spark (pyspark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/kevinhartman/kafka-to-eventhub
Kafka to EventHub Mirror.
eventhub eventhub-topic kafka mirror spark spark-streaming
Last synced: 13 Feb 2025
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R from Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024
https://github.com/omar-besbes/football-big-data
This is a comprehensive solution for real-time football analytics, leveraging Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: Data in a record are often duplicated. How do we find the duplicate probability? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: 02 Feb 2025
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/tpvasconcelos/sparkypandy
It's not spark, it's not pandas, it's just awkward...
dataframe pandas pyspark spark
Last synced: 05 Nov 2024
https://github.com/chimera-suite/pysparql
This is a simple module that allows developers to query SPARQL endpoints and analyze the results with Apache Spark.
apache apache-spark construct-query dataframe graphframe jena-fuseki spark sparql
Last synced: 01 Dec 2024
https://github.com/adityajn105/apache-spark-tutorials
Apache Spark is a big data analysis framework.
bigdata pyspark spark spark-ml spark-rdd spark-tutorials
Last synced: 16 Jan 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's do data mining using Spark, an open MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling pySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze the data using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search
This project utilizes PySpark RDD and the Breadth-first Search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes based on their appearances together in the same comic books, empowering users to discover connections between their favourite superheroes in the Marvel universe.
apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd
Last synced: 21 Dec 2024
https://github.com/gunantos/php-spark
PHP Server Develop
php server serverless-framework spark
Last synced: 13 Jan 2025
https://github.com/jabhij/crimerate_classification
Developing a system that could classify crime descriptions into different categories which would help the authorities to assign officers to crimes based on the report.
classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow
Last synced: 17 Jan 2025
https://github.com/superruzafa/scala-spark-big-data
My solutions to Coursera's Big Data Analysis with Scala and Spark course
Last synced: 30 Dec 2024
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, Tensorflow, Scikit, Keras. Includes Classification, Regression, Time series, NLP, Deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/jatin-8898/sparkwebsite
A clean and very interesting looking website. :sparkles:
bootstrap4 css html javascript spark typescript
Last synced: 17 Jan 2025
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 12 Oct 2024
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/univalence/spark-plumbus
Collection of tools for Scala Spark
functional-programming scala spark
Last synced: 20 Jan 2025
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/lepetitbloc/sparksd
:sparkler: Sparks wallet Docker container
cryptocurrency dockerfile masternode spark wallet
Last synced: 26 Jan 2025
https://mgrojo.github.io/adasearch/
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 28 Nov 2024
https://github.com/mgrojo/adasearch
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 27 Oct 2024
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 04 Jan 2025
https://github.com/open-datastudio/hive-metastore
Hive metastore on Staroid
hadoop hive hive-metastore kubernetes spark staroid
Last synced: 18 Nov 2024
https://github.com/pomadchin/vlm-performance
GeoTrellis RasterSources Ingest benchmark
aws emr geotrellis gis raster spark
Last synced: 17 Jan 2025
https://github.com/wittline/sparksql-with-python
This repository has some examples of using Spark and SparkSQL with Python through PySpark
flask-api python spark sparksql
Last synced: 29 Jan 2025
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025
https://github.com/hupe1980/docker_pyspark_notebook
Docker Compose setup for PySpark
docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber
Last synced: 02 Feb 2025
https://github.com/hifly81/1brc_streaming
1brc challenge with streaming solutions for Apache Kafka
1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming
Last synced: 02 Nov 2024
https://github.com/makohn/lambda-architecture-poc
♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets
cassandra kafka lambda-architecture sbt scala spark
Last synced: 12 Feb 2025
https://github.com/brooksian/solrtosparknotebook
Connecting Solr and Spark In An Apache Zeppelin Notebook
Last synced: 19 Jan 2025
https://github.com/brooksian/ds_gtdb
KMeans Clustering on Global Terrorism Database
global-terrorism-database machine-learning spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark
Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300
data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming
Last synced: 30 Oct 2024
https://github.com/exacaster/delta-fetch
HTTP API on Delta Lake tables
big-data delta-lake parquet s3 spark
Last synced: 11 Nov 2024
https://github.com/rpytel1/supercomputing-labs
Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to perform processing and streaming of GDelt data segments.
big-data gdelt-data kafka scala spark
Last synced: 18 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/brooksian/epaairnow
Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin
spark sparksql time-series zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/angelotc/MacroDAG
A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.
Last synced: 06 Nov 2024
https://github.com/emso-exe/comercio_eletronico_brasileiro
Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.
analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql
Last synced: 16 Jan 2025
https://github.com/brooksian/sparkpipeline2mleapbundle
Convert Spark Pipeline Models to MLeap Bundles
Last synced: 19 Jan 2025
https://github.com/cclient/elasticsearch-spark-upsert-from-kafka
The official elasticsearch-hadoop does not support upsert doc; this is implemented by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }
elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc
Last synced: 16 Jan 2025
https://github.com/brooksian/sparkpipelinesparknlp
Build & Convert a Spark NLP Pipeline to PMML
corenlp nlp pmml spark zeppelin-notebook
Last synced: 19 Jan 2025