Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-14 00:24:01 UTC
https://github.com/vicnesterenko/apache-spark-labs
Basic programs with datasets
apache-spark kpi-fict kpi-ua spark
Last synced: 10 Jan 2025
https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer
Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame
kafka kafka-consumer scala spark spark-streaming
Last synced: 07 Feb 2025
https://github.com/vietdoo/sg-property-hub
SG Property Hub is a comprehensive platform for managing and analyzing property data.
airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping
Last synced: 07 Feb 2025
https://github.com/ronaldkanyepi/log-realtime-analysis
A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.
dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization
Last synced: 25 Dec 2024
https://github.com/fsanaulla/spark-http-rdd
RDD primitive for fetching data from an HTTP source
Last synced: 14 Feb 2025
https://github.com/ferranbt/sparkanywhere
Run Apache Spark multicloud and serverless
Last synced: 01 Jan 2025
https://github.com/vermicida/data-lake
Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program
aws-s3 data-engineering data-lake etl-pipeline python spark
Last synced: 26 Dec 2024
https://github.com/vs4vijay/proof-of-concepts
A set of PoCs that I have worked on
apache apache-kafka apache-spark authentication bitcoin blockchain blockchain-technology chirp chirp-sdk flask gui kafka poc proof-of-concept proof-of-work pykafka pyspark python python3 spark
Last synced: 10 Jan 2025
https://github.com/arun-george-zachariah/twitteranalytics
Web application to visualize interesting analytic Spark SQL queries executed on tweets about five famous brands: Adidas, Nike, Puma, Skechers, and Reebok.
analytics distributed-computing docker spark twitter
Last synced: 26 Dec 2024
https://github.com/tuancamtbtx/etl-spark-k8s
ETL With Apache Spark Deployed on K8s
apache k8s spark spark-sql spark-streaming
Last synced: 02 Jan 2025
https://github.com/tuancamtbtx/python-spark-example
Spark template to submit to cluster
Last synced: 02 Jan 2025
https://github.com/tuancamtbtx/bigdata-spark-processing
Spark Batch Process
Last synced: 02 Jan 2025
https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database, using Apache Spark RDDs to implement the queries.
computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student
Last synced: 18 Jan 2025
https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs to implement the queries.
computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student
Last synced: 18 Jan 2025
https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark
Sentiment analysis of an Arabic Twitter dataset using Apache Spark.
apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api
Last synced: 03 Jan 2025
https://github.com/fpopic/hf-interview-challenge
(Interview) Mixin Data Engineering & Data Science with PySpark
data-engineering data-science pyspark python recipes spark
Last synced: 10 Jan 2025
https://github.com/scrapcodes/kafkaproducer
Benchmarks to measure latency using Spark and Kafka.
Last synced: 03 Jan 2025
https://github.com/scrapcodes/spark-templates
One-stop shop for Apache Spark starter samples.
Last synced: 03 Jan 2025
https://github.com/chimera-suite/spark-sidecar-setup
The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.
docker setup sidecar-container spark sparksql
Last synced: 03 Jan 2025
https://github.com/chimera-suite/use-case
A step-by-step tutorial that showcases the capabilities of Chimera
chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query
Last synced: 03 Jan 2025
https://github.com/nikoshet/pyspark-movie-similarities
Using Spark In Python For Movie Similarities With Jaccard Index
jaccard-index movie-similarities pyspark spark
Last synced: 03 Jan 2025
https://github.com/nikoshet/spark-mlp
Multilayer Perceptron Implementation Using Spark
hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark
Last synced: 03 Jan 2025
https://github.com/palutz/functionalscala_coursera
Functional programming in Scala Certification path (EPFL)
big-data coursera distributed-computing functional-programming parallel-computing scala spark
Last synced: 17 Jan 2025
https://github.com/akaliutau/spark-recipes
A collection of data processing solutions built on top of Spark
Last synced: 11 Jan 2025
https://github.com/azurespheredev/microsoftfabric-exploratorium
A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics
analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse
Last synced: 11 Jan 2025
https://github.com/stabrise/scaledp-tutorials
Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.
ner nlp ocr ocr-python pdf spark
Last synced: 30 Jan 2025
https://github.com/aldantanneo/bigints
WIP constant-time bigint implementation in SPARK
ada bigint cryptography formal-verification spark
Last synced: 30 Jan 2025
https://github.com/abrahamkoloboe27/random-user-streaming-pipeline
Data Engineering Project - Data Pipeline
airflow airflow-dags api docker docker-compose etl etl-pipeline extract-transform-load kafka kafka-consumer kafka-producer makefile orchestration postgresql python schema-registry spark spark-streaming zookeeper
Last synced: 30 Jan 2025
https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course
Course examples for integrating Spark with queue systems
apache-spark aws-sqs kafka rabbitmq spark
Last synced: 30 Jan 2025
https://github.com/alexott/cyber-spark-data-connectors
Cybersecurity-related custom data connectors for Spark
cybersecurity databricks pyspark spark
Last synced: 30 Jan 2025
https://github.com/hungreeee/reddit-realtime-streaming-pipeline
End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.
cassandra docker-compose kafka praw-reddit real-time reddit-api spark
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/awsiot
AWS IoT Integration Using EMR, Spark, and Kinesis
aws aws-emr dynamodb iot kinesis spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/aws-emr
Spark Job on Amazon EMR cluster
aws cluster emr-cluster mapreduce mapredue scala spark
Last synced: 12 Jan 2025
https://github.com/jbris/time-series-airflow-kafka-spark
A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.
airflow airflow-dags bentoml bentoml-service kafka kafka-consumer kafka-producer kafka-streams minio mlflow mlflow-tracking-server mlops mlops-workflow online-learning spark spark-sql spark-streaming time-series time-series-analysis time-series-forecasting
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-elasticsearch
Demoing Spark 2.2 and Elasticsearch Hadoop connector
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-nlp
Testing and benchmarking some of the existing NLP libraries in Apache Spark
nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec
Last synced: 12 Jan 2025
https://github.com/dohabanoui/spark-structured-streaming
Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.
docker spark spark-streaming spark-structured-streaming
Last synced: 19 Jan 2025
https://github.com/ewertondrigues02/engenharia-de-dados
Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI
airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark
Last synced: 19 Jan 2025
https://github.com/anant/example-cassandra-spark-sql
Cassandra Data Operations with Spark SQL
cassandra data-operations docker etl spark spark-sql
Last synced: 19 Jan 2025
https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks
Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)
cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql
Last synced: 19 Jan 2025
https://github.com/brooksian/censussipp
Reproducing Census SIPP Reports Using Apache Spark
spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/bluegranite/azure-synapse-vcf-analysis
Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.
azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf
Last synced: 19 Jan 2025
https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency
Use Spark with Livy and Application Insights. Learn to host your external dependencies in a data lake.
application-insights azure azure-data-lake hdinsight java livy spark spark2
Last synced: 31 Jan 2025
https://github.com/shreyas-gopalakrishna/datacenter-scale-computing
big-data docker flask hadoop kubernetes rabbitmq redis spark
Last synced: 20 Jan 2025
https://github.com/aragonski97/fenrir-infra
Complete data infrastructure on Docker Swarm exposed on Tailscale network.
airflow data-engineering debezium-connector docker docker-swarm infrastructure-data kafdrop kafka kafka-connect kafka-registry metabase portainer postgresql pyspark scrappy spark tailscale zookeeper zoonavigator
Last synced: 14 Feb 2025
https://github.com/spycsh/movierecommendersystem
A movie recommender system
collaborative-filtering kafka-streams mongodb recommender-system redis spark tf-idf
Last synced: 01 Feb 2025
https://github.com/riversun/ml-fake-data-maker
Generate fake data for machine learning like regression analysis
arff arff-generator dummy-data fake-data generator machine-learning prediction regression spark weka
Last synced: 01 Feb 2025
https://github.com/morgan-sell/usa-tourism-etl
Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.
aws data-engineering data-lake emr-cluster etl-pipeline python spark
Last synced: 08 Feb 2025
https://github.com/mounirbs/spark-livy
Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint
apache apache-spark docker docker-compose livy pyspark python spark
Last synced: 08 Feb 2025
https://github.com/viyadb/viyadb-spark
Data processing and ingestion backend for ViyaDB based on Spark Streaming
spark spark-streaming spark-streaming-kafka viyadb
Last synced: 08 Feb 2025
https://github.com/chen0040/vagrant-big-data
Vagrantfiles for development in big data
cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper
Last synced: 09 Feb 2025
https://github.com/damianmarti/7506-spark
Notebook for the classes of 75-06 Organización de Datos (Data Organization) - FIUBA
Last synced: 09 Feb 2025
https://github.com/duyet/spark-docker
Spark image for running on Kubernetes
docker docker-image hacktoberfest spark
Last synced: 05 Feb 2025
https://github.com/leo-the-nardo/combopurifier
Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse
airflow argocd aws-glue-catalog aws-lake-formation aws-s3 aws-sqs data-lake delta-lake eks minio spark terraform
Last synced: 05 Jan 2025
https://github.com/renardeinside/nocturne
Useful elements and building blocks for scalable Deep Learning applications on Databricks.
databricks deep-learning gpu horovod petastorm spark
Last synced: 06 Feb 2025
https://github.com/gabrielenizzoli/spark_engine
Build a complex Spark execution plan by composing many different Spark operations.
Last synced: 12 Feb 2025
https://github.com/karimosman89/iot-predictive-maintenance
This repository will simulate an IoT-based predictive maintenance system designed to monitor industrial equipment through sensors. It will include data ingestion, processing, and machine learning components to predict potential failures, optimizing maintenance schedules and reducing downtime.
api cloud-platform dashboard data-collection data-processing deployment iot-platform predictive-analytics pressure-sensor real-time-sensor sensors spark temperature-sensor vibration
Last synced: 05 Jan 2025
https://github.com/ahmedennaifer/iot-streaming-platform
WIP
docker docker-compose iot kafka postgresql python scala spark streaming
Last synced: 09 Feb 2025
https://github.com/elfn/data-engineering-machine-learning-predictiveai
[SUPINFO PROJECT] Data Science and Big Data (Spark, Python, R, ...)
ai jupyter-notebook machine-learning mongodb prediction-ai python r rstudio spark
Last synced: 07 Feb 2025
https://github.com/nagpritam/identification-of-trucks-and-potential-risky-driver-using-databricks-spark-api-
The project aims to identify trucks based on their model, fuel consumption, driving behavior, and past records of violations or accidents
databricks hadoop hive powerbi python3 spark
Last synced: 13 Feb 2025
https://github.com/curusarn/spark-context-with
Python guard/wrapper for SparkContext from pyspark; lets you use the Python `with` statement with SparkContext
guard python-operator spark sparkcontext
Last synced: 30 Jan 2025
https://github.com/omalperera/midget-sparkapps
Independent Spark applications and jobs for exploring various Spark features and Kafka integrations
kafka-client spark spark-streaming
Last synced: 31 Jan 2025
https://github.com/senior-sigan/coursera_scala_specialization
coursera coursera-data-science scala spark
Last synced: 17 Jan 2025
https://github.com/vitalibo/distributed-alarm-system
Simple distributed alarm system on top of Apache Spark
Last synced: 27 Dec 2024
https://github.com/okdp/jupyterlab-docker
OKDP JupyterLab Docker images
datascience docker jupyter jupyter-notebook jupyterhub jupyterlab k8s-spark python spark spark-kubernetes spark-python
Last synced: 13 Nov 2024
https://github.com/asmrcodez-yt/realtime-voting-dataengineer-spark-kafka
kafka postgres python spark spark-streaming streamlit
Last synced: 24 Jan 2025
https://github.com/flaviostutz/spark-submit-scala
Spark submit extension from bde2020/spark-submit for Scala with SBT
bigdata sbt scala spark spark-cluster spark-submit
Last synced: 06 Feb 2025
https://github.com/flaviostutz/spark-scala-hdfs-docker-example
Spark with Scala reading/writing files to HDFS with automatic additions of new Spark workers using Docker "scale"
datanode docker example hdfs namenodes scala scale spark spark-workers
Last synced: 06 Feb 2025
https://github.com/billxsheng/436
MSCI436 Term Project
cassandra etl-pipeline java kafka sentiment-analysis spark twitter-api
Last synced: 17 Jan 2025
https://github.com/vjcitn/biocpyinterop
Material for Bioconductor 2023 workshop on interoperation with python
basilisk bioconductor cite-seq genetics hail reticulate scvi-tools single-cell-omics spark
Last synced: 09 Jan 2025
https://github.com/ndleah/stedi
Data Lakehouse solution for machine learning data
aws-athena aws-glue s3-bucket spark
Last synced: 12 Jan 2025
https://github.com/dimitrov-s-dev/pyspark
PySpark
pyspark python3 spark spark-sql
Last synced: 16 Jan 2025
https://github.com/vitalibo/aws-glue-java
Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.
Last synced: 27 Dec 2024
https://github.com/colinkiama/snippets
Code snippets used by the Spark Community
code-snippets snippets snippets-collection snippets-library spark uwp
Last synced: 14 Jan 2025
https://github.com/samlet/sagas-spark
Use Structured Streaming as an OLAP engine
Last synced: 22 Dec 2024
https://github.com/smaddanki/data-science
Code blocks, algorithms, and research snippets in Data Science, Machine Learning, AI & Quant Finance.
deep-learning machine-learning pytorch scikit-learn spark
Last synced: 08 Feb 2025
https://github.com/fsanaulla/terling
Linguistic text analysis for detecting terrorist threats.
Last synced: 17 Jan 2025
https://github.com/cn-docker/spark-worker
Spark Worker Docker Image
docker-image spark spark-worker
Last synced: 27 Jan 2025
https://github.com/georgeerol/georgeerol.github.io
George Fouche Portfolio
airflow android-application aws cassandra deploy full-stack-web-development jpa-hibernate postgresql react robotics robotics-simulation spark spring-boot spring-mvc spring-security
Last synced: 13 Jan 2025
https://github.com/ericlondon/spark-csv-to-elasticsearch
Spark CSV to Elasticsearch
apache csv docker elasticsearch export hadoop spark
Last synced: 12 Jan 2025
https://github.com/teo-sl/us_flights_analysis
This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data
big-data classification dash dashboard flights machine-learning plotly regression spark usa
Last synced: 16 Jan 2025
https://github.com/sephiroth7712/k-nearest-neigbours
Implementation of K-Nearest Neighbors algorithm using multiple parallel computing approaches: CUDA (GPU), Hadoop, Spark, MPI, OpenMP, and PThreads. Demonstrates scalable machine learning across different parallel computing paradigms from GPU to distributed frameworks.
cuda cuda-programming hadoop-mapreduce java mpi multiprocessing multithreading openmp pthreads scala spark
Last synced: 06 Feb 2025
https://github.com/higorcazuza81/courses
Repository showcasing my educational journey in Quantitative Analysis, including projects and coursework in SQL, Python, Data Science, Machine Learning, and financial modeling. Focused on real-world applications in quantitative finance, data analysis, and statistical modeling.
airflow automation database dataengineering python shell-script spark sql
Last synced: 06 Feb 2025
https://github.com/eolecvk/intro_spark_twitter
Introduction to text mining with Spark
pyspark spark text-analysis text-mining
Last synced: 07 Feb 2025
https://github.com/kolia1985/kolia1985
Mykola Melnyk profile
data-engineering data-science spark
Last synced: 06 Feb 2025
https://github.com/jvm-operators/ansible-openshift-spark-operator
Ansible role for spark-operator
ansible kubernetes openshift spark spark-operator
Last synced: 30 Jan 2025
https://github.com/konradmalik/scala-seed
Seed project for dockerized Scala with included Spark and Cassandra.
cassandra docker makefile multimodule sbt scala seed spark template typesafe-config
Last synced: 17 Jan 2025