Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-10 00:27:59 UTC
https://github.com/aveek-saha/cricket-score-predictor
A big data application to predict the outcome of a T20 cricket match.
big-data big-data-analytics clustering pyspark spark spark-mllib
Last synced: 24 Dec 2024
https://github.com/bluejoe2008/hippo-rpc
Hippo Transport Library enhances spark-commons with easy stream management & handling
Last synced: 18 Dec 2024
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
Last synced: 23 Jan 2025
https://github.com/afsalthaj/supaku-sukara
Functional Programming, Functional Programming Exercise Solutions in Scala & Spark
functional-programming functor language monad parallelism scala shapeless spark typeclasses
Last synced: 08 Jan 2025
https://github.com/engineering-research-and-development/fiware-orion-pyspark-connector
Bidirectional Orion/Orion-LD <--> PySpark Connector
cognitive fiware ngsi ngsi-ld ngsi-v2 orion orion-context-broker orion-ld processing pyspark python spark
Last synced: 17 Jan 2025
https://github.com/ugurcanerdogan/machine-learning-with-spark
BBM469*ASG3 - Machine Learning with Spark
apache-spark data-science machine-learning spark
Last synced: 19 Dec 2024
https://github.com/pedropark99/introd-pyspark
An open and introductory book for the Python API of Apache Spark (pyspark)
Last synced: 14 Oct 2024
https://github.com/thanaraklee/dataflow-with-gcp
This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.
airflow apache-spark data-engineering etl pandas spark
Last synced: 25 Dec 2024
https://github.com/pankajsingh09/data_engineering_using_aws
This repository contains content related to data engineering using AWS.
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 19 Dec 2024
https://github.com/hb-chen/spark-elasticsearch-recommender
Zeppelin v0.8.0 notebook demonstrating how to build a recommender system with Spark v2.3.2 and Elasticsearch v6.3.2
elasticsearch recommender spark zeppelin
Last synced: 08 Jan 2025
https://github.com/akarce/udacity-data-pipeline-with-airflow
Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.
airflow minio postgresql pyspark spark
Last synced: 12 Oct 2024
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 11 Oct 2024
https://github.com/badoo/hadoop-xargs
Utility to run heterogeneous applications on Hadoop synchronously
Last synced: 12 Nov 2024
https://github.com/hussaintaj-w/spark_submit_project
An easy-to-use script that automatically adds files to the spark-submit command.
Last synced: 23 Jan 2025
https://github.com/vasnake/spark.ml.spatialjointransformer
spark.ml.transformer: join two datasets using spatial relations
geospatial join ml-pipeline python scala spark spark-ml spatial transformer
Last synced: 03 Jan 2025
https://github.com/dimajix/docker-spark
Repository for building Docker containers for Spark
Last synced: 05 Jan 2025
https://github.com/fpopic/gg-interview-challenge
(Interview) GG Interview Challenge in Scala/Spark
apache-spark json logstash parsing regex scala spark sparksql
Last synced: 10 Jan 2025
https://github.com/alvarogarcia7/bank-kata-kotlin
Bank pet project, in Kotlin. See interests as topics
api-first api-standard bank-kata blackbox-testing etude finite-state-machine gradle gradlew hateoas junit junit5 kata kotlin multimodule paypal-rest-api practice spark sparkjava trikitrok with-client
Last synced: 10 Jan 2025
https://github.com/jldbc/big-data
Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark
big-data hadoop pagerank recommender-system spark
Last synced: 04 Jan 2025
https://github.com/kanchishimono/spark-on-k8s-images
Docker images for Spark on Kubernetes
docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes
Last synced: 28 Nov 2024
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to learn data analytics in AWS using Twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/bria222/animal2
heroku-deployment java postgres spark velocity
Last synced: 04 Jan 2025
https://github.com/apache/incubator-gluten-site
Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Last synced: 04 Feb 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/renardeinside/dbx-kafka-protobuf-example
Sample code for working with Kafka & Protobuf in Databricks
databricks kafka protobuf scala spark spark-streaming
Last synced: 06 Feb 2025
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of pyspark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/mangalaman93/dspark
Run Spark in Docker containers
big-data containers docker microservices spark
Last synced: 18 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/mtpatter/bilao
Jupyter notebooks for filtering Kafka data with Spark Streaming.
avro docker jupyter-notebook kafka spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/kevinhartman/kafka-to-eventhub
Kafka to EventHub Mirror.
eventhub eventhub-topic kafka mirror spark spark-streaming
Last synced: 20 Dec 2024
https://github.com/jbris/docker-spark-sparklyr
Docker setup for Apache Spark and the R sparklyr package
adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-fakenews
Detecting users and communities that propagate fake news on Twitter, using Apache Spark
deep-learning fakenews machine-learning spark twitter
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-wikipedia
Wonderful reusable code, libraries, and scripts to process Wikipedia page views using Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparatively) data from Kafka and write it to Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/tashi-2004/fma-a-dataset-for-music-analysis
🎶 Scripts for music feature analysis, model training, and real-time recommendation using Apache Kafka. Extract features with Librosa 🎹, store them in MongoDB 🗄️, and process the data with Apache Spark ⚡. A 🌐 web interface 💻✨ is also included. Contributors: Tashfeen Abbasi 👤, Laiba Mazhar 👤, and Rafia Khan 👤.
html kafka kafka-consumer kafka-producer kafka-streaming linux mongodb mongodb-compass python3 spark ubuntu web-application
Last synced: 03 Dec 2024
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 20 Dec 2024
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R from Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024
https://github.com/omar-besbes/football-big-data
A comprehensive solution for real-time football analytics, using Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, plus a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: data in a record is often duplicated. How do we find the duplicate probability? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: 02 Feb 2025
https://github.com/tpvasconcelos/sparkypandy
It's not spark, it's not pandas, it's just awkward...
dataframe pandas pyspark spark
Last synced: 05 Nov 2024
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's try data mining with Spark, an open-source MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling PySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/gunantos/php-spark
PHP Server Develop
php server serverless-framework spark
Last synced: 13 Jan 2025
https://github.com/adityajn105/apache-spark-tutorials
Apache Spark is a big data analysis framework.
bigdata pyspark spark spark-ml spark-rdd spark-tutorials
Last synced: 16 Jan 2025
https://github.com/gmartinezramirez-old/data-science-portafolio
📓 [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, Scikit, Keras. Includes classification, regression, time series, NLP, and deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/jabhij/crimerate_classification
Developing a system that could classify crime descriptions into different categories which would help the authorities to assign officers to crimes based on the report.
classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow
Last synced: 17 Jan 2025
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 12 Oct 2024
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/jatin-8898/sparkwebsite
A clean and very interesting-looking website. ✨
bootstrap4 css html javascript spark typescript
Last synced: 17 Jan 2025
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/makohn/lambda-architecture-poc
♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets
cassandra kafka lambda-architecture sbt scala spark
Last synced: 19 Dec 2024
https://github.com/lepetitbloc/sparksd
🎇 Sparks wallet Docker container
cryptocurrency dockerfile masternode spark wallet
Last synced: 26 Jan 2025
https://mgrojo.github.io/adasearch/
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 28 Nov 2024
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 04 Jan 2025
https://github.com/wittline/sparksql-with-python
This repository has some examples of using Spark and SparkSQL with Python through PySpark
flask-api python spark sparksql
Last synced: 29 Jan 2025
https://github.com/open-datastudio/hive-metastore
Hive metastore on Staroid
hadoop hive hive-metastore kubernetes spark staroid
Last synced: 18 Nov 2024
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025
https://github.com/brooksian/solrtosparknotebook
Connecting Solr and Spark In An Apache Zeppelin Notebook
Last synced: 19 Jan 2025
https://github.com/rpytel1/supercomputing-labs
Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to perform processing and streaming of GDELT data segments.
big-data gdelt-data kafka scala spark
Last synced: 18 Jan 2025
https://github.com/brooksian/ds_gtdb
KMeans Clustering on Global Terrorism Database
global-terrorism-database machine-learning spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/brooksian/epaairnow
Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin
spark sparksql time-series zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/exacaster/delta-fetch
HTTP API on Delta Lake tables
big-data delta-lake parquet s3 spark
Last synced: 11 Nov 2024
https://github.com/brooksian/sparkpipeline2mleapbundle
Convert Spark Pipeline Models to MLeap Bundles
Last synced: 19 Jan 2025
https://github.com/cclient/elasticsearch-spark-upsert-from-kafka
elasticsearch-hadoop does not officially support upsert doc; this modifies the source code to implement it. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }
elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc
Last synced: 16 Jan 2025
https://github.com/emso-exe/comercio_eletronico_brasileiro
Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.
analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql
Last synced: 16 Jan 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/stefen-taime/investissement
Jenkins Delta pipeline
delta-lake jenkins-pipeline minio spark
Last synced: 23 Jan 2025
https://github.com/brooksian/sparkpipelinesparknlp
Build & Convert a Spark NLP Pipeline to PMML
corenlp nlp pmml spark zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/wtanaka/ansible-role-apache-spark
Ansible role to install Apache Spark
ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming
Last synced: 23 Jan 2025
https://github.com/extwiii/bigdata-uc.san.diego
Unlock Value in Massive Datasets - UC San Diego
big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark
Last synced: 28 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/alexioannides/py-readme-snippets
This repository contains snippets of writing (in Markdown) on topics relating to various flavours of Python development projects.
Last synced: 17 Jan 2025
https://github.com/hibuz/hadoop-docker
🐳 Hadoop ecosystem Docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/earthquakesan/twittertrends
Twitter Trends is a Spark Streaming example application
Last synced: 17 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/joyceannie/us-immigrations-data-warehouse
A data warehouse to perform analytics on the immigration trends in the US.
airflow data-engineering etl pyspark redshift s3 spark
Last synced: 29 Jan 2025
https://github.com/cloudtik/cloudtik
Cloud Scale Platform for Distributed Data, Analytics and AI
ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark
Last synced: 16 Jan 2025