Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-14 00:24:01 UTC
https://github.com/bria222/animal2
heroku-deployment java postgres spark velocity
Last synced: 04 Jan 2025
https://github.com/gacwr/openuba-model-hub
Frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/simplexspatial/osm-facts
Proofs and checks of facts about the OSM PBF format and its data content
Last synced: 15 Jan 2025
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to learn data analytics on AWS using Twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/hupe1980/docker_pyspark_notebook
Docker Compose setup for PySpark
docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber
Last synced: 02 Feb 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/mangalaman93/dspark
Run Spark in Docker containers
big-data containers docker microservices spark
Last synced: 18 Jan 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of PySpark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/mtpatter/bilao
Jupyter notebooks for filtering Kafka data with Spark Streaming.
avro docker jupyter-notebook kafka spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: an interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, PySpark, Elasticsearch, Kibana, Express (Node.js), MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/jbris/docker-spark-sparklyr
Docker setup for Apache Spark and the R sparklyr package
adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-fakenews
Detecting users and communities that propagate fake news on Twitter, using Apache Spark
deep-learning fakenews machine-learning spark twitter
Last synced: 12 Jan 2025
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/multivacplatform/multivac-wikipedia
Reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/adityajn105/apache-spark-tutorials
Apache Spark is a big data analysis framework.
bigdata pyspark spark spark-ml spark-rdd spark-tutorials
Last synced: 16 Jan 2025
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/yucl80/avrodemo
Write and append Avro data to an HDFS file
avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log
Last synced: 27 Jan 2025
https://github.com/jabhij/crimerate_classification
A system that classifies crime descriptions into categories, helping the authorities assign officers to crimes based on the report.
classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow
Last synced: 17 Jan 2025
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparatively big) data from Kafka and write it to Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/bedrockstreaming/sparktest
A testing tool for Scala and Spark developers
Last synced: 31 Dec 2024
https://github.com/rpytel1/supercomputing-labs
Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark, and Kafka were used to process and stream GDELT data segments.
big-data gdelt-data kafka scala spark
Last synced: 18 Jan 2025
https://github.com/jatin-8898/sparkwebsite
A clean and very interesting looking website. :sparkles:
bootstrap4 css html javascript spark typescript
Last synced: 17 Jan 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
Last synced: 23 Jan 2025
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 12 Feb 2025
https://github.com/garystafford/dataproc-java-demo
Demonstration of Google Cloud Dataproc for running Spark jobs with Java
big-data-analytics dataproc gcp google java spark
Last synced: 06 Dec 2024
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R on Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/tranthe170/nyc-taxi-pipeline
Building a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024
https://github.com/omar-besbes/football-big-data
A comprehensive solution for real-time football analytics, using Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: data in records are often duplicated. How do we find the probability that a record is a duplicate? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: 02 Feb 2025
https://github.com/vasnake/spark.ml.spatialjointransformer
spark.ml.transformer: join two datasets using spatial relations
geospatial join ml-pipeline python scala spark spark-ml spatial transformer
Last synced: 03 Jan 2025
https://github.com/engineering-research-and-development/fiware-orion-pyspark-connector
Bidirectional Orion/Orion-LD <--> PySpark Connector
cognitive fiware ngsi ngsi-ld ngsi-v2 orion orion-context-broker orion-ld processing pyspark python spark
Last synced: 17 Jan 2025
https://github.com/exacaster/delta-fetch
HTTP API on Delta Lake tables
big-data delta-lake parquet s3 spark
Last synced: 11 Nov 2024
https://github.com/emso-exe/comercio_eletronico_brasileiro
Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.
analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql
Last synced: 16 Jan 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A PySpark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's try data mining with Spark, an open MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/brooksian/sparkpipelinesparknlp
Build & Convert a Spark NLP Pipeline to PMML
corenlp nlp pmml spark zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/extwiii/bigdata-uc.san.diego
Unlock Value in Massive Datasets - UC San Diego
big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark
Last synced: 28 Jan 2025
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling PySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/aveek-saha/cricket-score-predictor
A big data application to predict the outcome of a T20 cricket match.
big-data big-data-analytics clustering pyspark spark spark-mllib
Last synced: 24 Dec 2024
https://github.com/alexioannides/py-readme-snippets
This repository contains snippets of writing (in Markdown) on various topics relating to different flavours of Python development projects.
Last synced: 17 Jan 2025
https://github.com/gunantos/php-spark
PHP Server Develop
php server serverless-framework spark
Last synced: 13 Jan 2025
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/earthquakesan/twittertrends
Twitter Trends is a Spark Streaming example application
Last synced: 17 Jan 2025
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, Scikit, Keras. Includes classification, regression, time series, NLP, and deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 14 Feb 2025
https://github.com/cloudtik/cloudtik
Cloud Scale Platform for Distributed Data, Analytics and AI
ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark
Last synced: 16 Jan 2025
https://github.com/yalishanda42/scala-recsys
Scala(-ble) recommender system architecture using functional programming (PoC)
cats cats-effect functional-programming movielens recommender-system recsys scala spark
Last synced: 28 Dec 2024
https://github.com/oracle-quickstart/oci-spark
Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)
cloud oci oracle oracle-led spark terraform
Last synced: 07 Nov 2024
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/angelotc/MacroDAG
A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.
Last synced: 06 Nov 2024
https://github.com/lepetitbloc/sparksd
:sparkler: Sparks wallet Docker container
cryptocurrency dockerfile masternode spark wallet
Last synced: 26 Jan 2025
https://mgrojo.github.io/adasearch/
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 28 Nov 2024
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 04 Jan 2025
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze the data using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/chimera-suite/pysparql
A simple module that allows developers to query SPARQL endpoints and analyze the results with Apache Spark.
apache apache-spark construct-query dataframe graphframe jena-fuseki spark sparql
Last synced: 01 Dec 2024
https://github.com/dustin-decker/elasticsearchsql
A simple example of using Apache Spark SQL against Elasticsearch 5
Last synced: 29 Jan 2025
https://github.com/joyceannie/us-immigrations-data-warehouse
A data warehouse to perform analytics on the immigration trends in the US.
airflow data-engineering etl pyspark redshift s3 spark
Last synced: 29 Jan 2025
https://github.com/conema/transe-pyspark
TransE implementation in Spark (PySpark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/hussaintaj-w/spark_submit_project
An easy-to-use script that automatically adds files to the spark-submit command.
Last synced: 23 Jan 2025
https://github.com/univalence/spark-plumbus
Collection of tools for Scala Spark
functional-programming scala spark
Last synced: 20 Jan 2025
https://github.com/ichowdhury01/match
A social networking platform that allows users to find friends with similar interests in their area.
geolocation-api jdbc maven mysql pbkdf2 spark
Last synced: 06 Feb 2025
https://github.com/wtanaka/ansible-role-apache-spark
Ansible role to install Apache Spark
ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming
Last synced: 23 Jan 2025
https://github.com/maxinexiong/item-based-collaborative-filtering
This project uses PySpark DataFrames and PySpark RDDs to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 movies similar to a given target movie that align with users’ preferences.
apache-spark collaborative-filtering movie-recommendation pyspark python spark spark-dataframes spark-rdd
Last synced: 21 Dec 2024
https://github.com/stefen-taime/investissement
Jenkins Delta pipeline
delta-lake jenkins-pipeline minio spark
Last synced: 23 Jan 2025
https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search
This project uses PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes, based on their appearances together in the same comic books, so that users can discover connections between their favourite superheroes in the Marvel universe.
apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd
Last synced: 21 Dec 2024
https://github.com/kanchishimono/spark-on-k8s-images
Docker images for Spark on Kubernetes
docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes
Last synced: 28 Nov 2024
https://github.com/pankajsingh09/data_engineering_using_aws
This repository contains content related to data engineering using AWS
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 12 Feb 2025
https://github.com/brooksian/sparkpipeline2mleapbundle
Convert Spark Pipeline Models to MLeap Bundles
Last synced: 19 Jan 2025
https://github.com/brooksian/epaairnow
Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin
spark sparksql time-series zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/superruzafa/scala-spark-big-data
My solutions to Coursera's Big Data Analysis with Scala and Spark course
Last synced: 30 Dec 2024
https://github.com/brooksian/ds_gtdb
KMeans Clustering on Global Terrorism Database
global-terrorism-database machine-learning spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/brooksian/solrtosparknotebook
Connecting Solr and Spark In An Apache Zeppelin Notebook
Last synced: 19 Jan 2025
https://github.com/renardeinside/dbx-kafka-protobuf-example
Sample code for working with Kafka & Protobuf in Databricks
databricks kafka protobuf scala spark spark-streaming
Last synced: 06 Feb 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025