Projects in Awesome Lists tagged with spark-sql
A curated list of projects in awesome lists tagged with spark-sql.
https://github.com/getredash/redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
analytics athena bi bigquery business-intelligence dashboard databricks hacktoberfest javascript mysql postgresql python redash redshift spark spark-sql visualization
Last synced: 16 Dec 2025
https://github.com/apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 13 May 2025
https://github.com/dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch
Last synced: 11 May 2025
https://github.com/almond-sh/almond
A Scala kernel for Jupyter
jupyter jupyter-kernels jupyter-notebook repl scala spark spark-sql
Last synced: 10 Apr 2025
https://github.com/apache/incubator-gluten
Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
arrow clickhouse simd spark-sql vectorization velox
Last synced: 14 May 2025
https://github.com/databricks/learningsparkv2
This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming
Last synced: 14 May 2025
https://github.com/oeljeklaus-you/useractionanalyzeplatform
Big data platform for e-commerce user behavior analysis
accumulator hadoop java kyro spark spark-sql sparkjava
Last synced: 16 May 2025
https://github.com/ploomber/jupysql
Better SQL in Jupyter. 📊
bigquery clickhouse data-engineering data-science duckdb hive jupyter mysql polars postgres presto python redshift snowflake spark-sql sql sqlite trino tsql
Last synced: 04 Oct 2025
https://github.com/qubole/sparklens
Qubole Sparklens tool for performance tuning Apache Spark
cluster performance performance-analysis performance-metrics performance-tuning performance-visualization scala scheduler scheduling simulation spark spark-applications spark-job spark-ml spark-mllib spark-sql sparkjava
Last synced: 12 Apr 2025
https://github.com/kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
cheat cheatsheet cheatsheets data data-science docs documentation guide guides pyspark pyspark-tutorial quickstart reference references spark spark-sql
Last synced: 10 Apr 2025
https://github.com/japila-books/spark-sql-internals
The Internals of Spark SQL
apache-spark book internals mkdocs-material spark spark-sql
Last synced: 23 Jan 2026
https://github.com/microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data
Last synced: 15 May 2025
https://github.com/cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook
Last synced: 07 Apr 2025
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 05 Apr 2025
https://github.com/qbeast-io/qbeast-spark
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
big-data data-lakehouse datasource sampling scala spark spark-sql
Last synced: 12 Aug 2025
https://github.com/chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 13 Apr 2025
https://github.com/polomarcus/spark-structured-streaming-examples
Spark Structured Streaming / Kafka / Cassandra / Elastic
cassandra kafka spark spark-sql structured-streaming
Last synced: 10 Apr 2025
https://github.com/mc2-project/opaque-sql
An encrypted data analytics platform
analytics enclave machine-learning privacy security spark spark-sql
Last synced: 17 Jan 2026
https://github.com/learningjournal/spark-streaming-in-python
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata data-lake pyspark python spark-sql spark-streaming
Last synced: 04 Sep 2025
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 06 Feb 2026
https://github.com/izhangzhihao/real-time-data-warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
cdc change-data-capture data-warehouse data-warehousing datalake debezium delta delta-lake deltalake elasticsearch flink flink-sql hoodie hudi iceberg kafka real-time-data-warehouse spark spark-sql sql
Last synced: 07 Sep 2025
https://github.com/sjrusso8/spark-connect-rs
Apache Spark Connect Client for Rust
grpc-client spark spark-connect spark-sql
Last synced: 16 May 2025
https://github.com/minio/spark-select
A library for Spark DataFrame using MinIO Select API
amazon-s3 bigdata minio parquet-files pyspark sbt select spark spark-sql
Last synced: 20 Jun 2025
https://github.com/learningjournal/sparkprogramminginscala
Apache Spark Course Material
apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql
Last synced: 17 Mar 2025
https://github.com/huangyueranbbc/SparkDemo
Comprehensive Spark example code and demos (Java, Scala)
bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp
Last synced: 27 Mar 2025
https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark. Explore a variety of tutorials and demonstrations on Big Data technologies, primarily in the form of Jupyter notebooks. Most notebooks are self-contained and live—ready to run with a click.
apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio
Last synced: 06 Apr 2025
https://github.com/harryprince/geospark
Bring sf to Spark in production
apache-spark gis large-scale-spatial-analysis r spark-sql sparklyr-extension spatial-analysis spatial-queries
Last synced: 16 Mar 2025
https://github.com/hablapps/sparkoptics
Optics for Spark DataFrames
dataframe dataframes optics scala spark spark-sql
Last synced: 30 Jun 2025
https://github.com/learningjournal/spark-streaming-in-scala
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming
Last synced: 16 May 2025
https://github.com/airbnb/airbnb-spark-thrift
A library for loading Thrift data into Spark SQL
spark spark-sql spark-streaming thrift
Last synced: 04 Sep 2025
https://github.com/wh1isper/sparglim
Sparglim ✨ makes PySpark apps configurable and Spark Connect Server deployment easier!
jupyter-magic pyspark spark spark-connect spark-connect-server spark-on-kubernetes spark-sql
Last synced: 10 Apr 2025
https://github.com/indix/sparkplug
Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
Last synced: 11 Apr 2025
https://github.com/thanaraklee/real-time-pyspark
This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.
pyspark python spark-sql training
Last synced: 05 Jul 2025
https://github.com/syedhassaanahmed/databricks-notebooks
Collection of Databricks and Jupyter Notebooks
azure-data-lake azure-databricks azure-event-hubs azure-iothub azure-sql-database azure-storage cosmos-db graphframes hive-udf jupyter-notebooks kafka matplotlib mongodb pandas-dataframe parquet power-bi pyspark spark spark-sql spark-udf
Last synced: 16 Sep 2025
https://github.com/astrolabsoftware/spark-fits
FITS data source for Spark SQL and DataFrames
apache-spark fits fitsio hdfs pyspark scala spark-sql
Last synced: 11 Jan 2026
https://github.com/fabiogouw/spark-aws-messaging
A custom sink provider for Apache Spark that sends the content of a DataFrame to an AWS SQS queue
Last synced: 08 May 2025
https://github.com/zekeriyyaa/pyspark-structured-streaming-ros-kafka-apachespark-cassandra
Structured streaming is applied to robot data from a ROS-Gazebo simulation environment using Apache Spark. Data is collected in Kafka, analyzed by Apache Spark, and stored in Cassandra.
apache-cassandra apache-kafka apache-spark cqlsh data-analysis kafka-consumer kafka-producer pyspark python python3 ros ros-noetic spark-cassandra spark-cassandra-connector spark-kafka-connector spark-kafka-integration spark-sql spark-streaming structured-streaming
Last synced: 30 Jun 2025
https://github.com/luckyzxl2016/spark-example
Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrame
dataframe elasticsearch jedis kafka mysql spark spark-example spark-sql spark-streaming spark-structured-streaming
Last synced: 10 Aug 2025
https://github.com/jgperrin/net.jgp.books.spark.ch11
Spark in Action, 2nd edition - chapter 11 - Working with SQL
apache-spark java java8 manning spark spark-sql sparkwithjava sql
Last synced: 28 Aug 2025
https://github.com/asuiu/sparkorm
ORM for Apache Spark and DataFrames schema manager
orm pyspark pyspark-python python python3 spark spark-orm spark-sql sparkql sqlalchemy sqlalchemy-orm
Last synced: 07 May 2025
https://github.com/apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 19 Oct 2025
https://github.com/dirkster99/pynotes
My notebook on using Python with Jupyter Notebook, PySpark etc
dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp
Last synced: 27 Jan 2026
https://github.com/huemulsolutions/huemul-bigdatagovernance
Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It supports a corporate single-source-of-data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key checks when inserting and updating data through the library, along with validation of nulls, text lengths, minimum and maximum values for numbers and dates, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights to ease compliance with GDPR-style data protection laws, and identify security levels and whether any encryption is applied. In addition, it lets you add more complex validation rules on the same table.
bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark
Last synced: 26 Apr 2025
https://github.com/selimhorri/spark-application
Java Application, uses Apache Spark, handles batch as well as streaming processing
dataframes-api java mysql spark spark-batch spark-sql spark-streaming
Last synced: 12 Apr 2025
https://github.com/imsanjoykb/pyspark-bootcamp
My Practice and project on PySpark
hadoop hadoop-mapreduce pyspark pyspark-machine-learning pyspark-ml pyspark-mllib pyspark-notebook spark-sql spark-streaming sparkjava transformation
Last synced: 09 Jul 2025
https://github.com/lucasbotang/coursera_big_data_for_data_engineers
Assignments for Big Data for Data Engineers specialization on Coursera by Yandex.
Last synced: 12 Apr 2025
https://github.com/nashtech-labs/sparkathon
A library of Java and Scala examples for Spark 2.x
apache-spark java-8 knoldus rdd scala spark spark-dataframes spark-dataset spark-ml spark-mllib spark-sql spark-streaming spark-structured-streaming
Last synced: 31 Aug 2025
https://github.com/maziyarpanahi/spark2-template
IntelliJ template to develop Apache Spark 2.x applications
spark-ml spark-sql spark-streaming spark2
Last synced: 17 Sep 2025
https://github.com/varunu28/aadhar-dataset-analysis
Data analysis of AADHAR dataset using Apache Spark
analysis scala spark spark-sql
Last synced: 23 Apr 2025
https://github.com/anaregdesign/vectorize-openai
Tabular calculation with LLM, Spark UDF Builder
apache-spark data-engineering data-science llm machinelearning openai openai-api pandas python spark-sql spark-udf
Last synced: 14 Mar 2025
https://github.com/aessing/demo-azuresynapse
This repository includes the demos and code I use to play around with Azure Synapse Analytics
analytics azure azure-sql-datawarehouse azure-synapse-analytics azure-synapse-dwh data-engineering data-warehousing datawarehouse machine-learning mdwh microsoft powerbi python scala spark spark-dotnet spark-sql sql-data-warehouse synapse synapse-analytics
Last synced: 08 May 2025
https://github.com/ren294/log-analysis-project
This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.
apache-kafka apache-nifi apache-spark big-data big-data-analytics cassandra cassandra-driver data-engineering data-science grafana hadoop hadoop-hdfs hive powerbi spark-rdd spark-sql spark-streaming
Last synced: 08 Jul 2025
https://github.com/myxof/sparknotes
Spark 2.0 study notes
distributed-computing spark spark-sql
Last synced: 15 Apr 2025
https://github.com/mliarakos/spark-typed-ops
Lightweight type-safe operations for Spark
scala scala-macros shapeless spark spark-scala spark-sql
Last synced: 15 Oct 2025
https://github.com/tirth27/real-time-analytics-with-spark-streaming
This project aims to build a streaming application to perform real-time analytics of Covid-19 related tweets and deploy an ML model for real-time sentiment predictions.
apache apache-avro apache-kafka apache-spark confluent docker docker-compose ksql spark-sql spark-streaming twitter-api
Last synced: 05 Mar 2025
https://github.com/bayoadejare/lightning-streams
Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
clustering csv data-engineering data-pipeline data-warehousing database etl-pipeline jupyter-notebook k-means-clustering machine-learning noaa-data orchestration parquet pyspark python spark-sql spark-streaming sql streaming
Last synced: 14 Aug 2025
https://github.com/flaviostutz/spark-scala-jupyter
Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master
hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql
Last synced: 14 Oct 2025
https://github.com/ren294/covid-data-process
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql
Last synced: 29 Oct 2025
https://github.com/izhangzhihao/spark-security
ranger ranger-plugin security spark spark-sql sql
Last synced: 15 Aug 2025
https://github.com/lmouhib/auto-register-spark-ui-k8s
A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes
spark spark-kubernetes spark-sql spark-streaming spark-ui
Last synced: 11 Feb 2026
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 04 Nov 2025
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 15 Apr 2025
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 21 Sep 2025
https://github.com/talegari/tidier
dplyr friendly spark style window aggregation for R dataframes and remote dbplyr tbls
dbplyr dplyr mutate rstats rstats-package spark-sql tidyverse
Last synced: 22 Oct 2025
https://github.com/san089/sf-crime-statistics
A Kafka and Spark Streaming integration project: SF Crime Statistics with Spark Streaming
kafka kafka-consumer kafka-producer kafka-python spark-sql spark-streaming
Last synced: 19 Jun 2025
https://github.com/emso-exe/comercio_eletronico_brasileiro
Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.
analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql
Last synced: 30 Dec 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 02 Jul 2025
https://github.com/kayvansol/pysparkjupyteronkubernetes
PySpark & Jupyter Notebooks Deployed On Kubernetes
apache-spark bitnami-charts docker-compose jupyter jupyter-notebook kubernetes-cluster pyspark python3 spark-sql
Last synced: 20 Jun 2025
https://github.com/ashirwadpradhan/tpsql
Asynchronous execution of SQL queries running in parallel
asynchronous-tasks asyncio parallel-processing query-optimization spark-sql sql
Last synced: 06 Mar 2025
https://github.com/hcvazquez/machine-learning-in-spark-with-pyspark
[A repo to store some code and experiences about machine learning in spark with pyspark] Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
apache-spark classification machine-learning pyspark python regression spark-sql
Last synced: 15 Oct 2025
https://github.com/kriss024/Spark
Spark for Data Science and ETL process.
data-science jupiter-notebook machine-learning mllib pyspark spark spark-sql
Last synced: 17 Jul 2025
https://github.com/okdp/spark-images
Collection of Apache Spark docker images for OKDP
apache-spark big-data docker k8s-spark kubernetes spark-kubernetes spark-python spark-r spark-sql
Last synced: 05 Feb 2026
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 13 Nov 2025
https://github.com/multivacplatform/multivac-wikipedia
Wonderful reusable code, libraries, and scripts to process Wikipedia page views using Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 02 Mar 2025
https://github.com/debanjansarkar/pyspark-maestro
This repo contains PySpark implementations of real-world use cases: batch data processing, streaming data processing sourced from Kafka, sockets, etc., Spark optimizations, solutions to business-specific big data processing scenarios, and machine learning use cases.
json kafka kafka-python kafka-streams pyspark pyspark-api pyspark-machine-learning pyspark-mllib pyspark-streaming python3 spark spark-mllib spark-sql spark-streaming
Last synced: 04 Feb 2026
https://github.com/tuanai-vireox/etl-spark-k8s
ETL With Apache Spark Deployed on K8s
apache k8s spark spark-sql spark-streaming
Last synced: 22 Aug 2025
https://github.com/multivacplatform/multivac-pubmed
Update PubMed articles daily on HDFS by using Spark Cluster
apache-spark dataframe hadoop hdfs pubmed pubmed-parser spark-sql yarn
Last synced: 02 Mar 2025
https://github.com/sayamalt/amazon-products-api-etl-and-ml-pipeline
In this project, I've created an end-to-end ETL pipeline and subsequently developed a machine learning model to predict the price of Amazon products based on several product-related features.
apache-spark azure-data-factory azure-data-lake-storage-gen2 azure-databricks data-ingestion delta-lake etl-pipeline extract-transform-load feature-engineering linear-regression machine-learning model-training-and-evaluation regression-models spark-mllib spark-sql
Last synced: 20 Mar 2025
https://github.com/manojpawar94/spark-scala-examples
Sample programs implemented with Apache Spark. The programs are built on the concepts of Spark RDD and Spark SQL DataFrame.
apache-spark spark spark-rdd spark-sql
Last synced: 02 Mar 2025
https://github.com/angeligareta/spark-hadoop-hbase-overview
First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.
apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql
Last synced: 29 Oct 2025
https://github.com/adnanrahin/spark-flights-data-analysis
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.
apache-spark big-data-analytics docker docker-compose docker-container java maven spark spark-sql spark-streaming
Last synced: 21 Jun 2025
https://github.com/adnanrahin/nfl-big-data-bowl-2022
The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams play. Here, you'll find a summary of each data set in the 2022 Data Bowl, a list of key variables to join on, and a description of each variable.
big-data big-data-processing rdd scala spark spark-sql
Last synced: 30 Oct 2025
https://github.com/hcvazquez/natural-language-processing-in-spark-with-pyspark
Natural language processing in spark with pyspark
apache-spark machine-learning natural-language-processing nlp pyspark python spark-sql
Last synced: 04 Mar 2025
https://github.com/pprattis/road-safety-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.
computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student
Last synced: 29 Mar 2025
https://github.com/pprattis/insurance-company-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.
computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student
Last synced: 27 Aug 2025
https://github.com/jbris/time-series-airflow-kafka-spark
A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.
airflow airflow-dags bentoml bentoml-service kafka kafka-consumer kafka-producer kafka-streams minio mlflow mlflow-tracking-server mlops mlops-workflow online-learning spark spark-sql spark-streaming time-series time-series-analysis time-series-forecasting
Last synced: 30 Oct 2025
https://github.com/goamegah/flowstate
End-To-End Real-time Road Traffic Monitoring Spark Structured Streaming solution
airflow aws aws-kinesis dags kafka postgresql realtime scala spark-core spark-sql spark-streaming streaming
Last synced: 23 Jul 2025
https://github.com/miltiadiss/ceid_ne4348-big-data-management-systems
This project implements a real-time data pipeline with Kafka, Spark, and MongoDB. It generates vehicle data using UXSIM, streams it to a Kafka broker, processes it with Spark, and stores raw and processed data in MongoDB. Queries analyze vehicle counts, speeds, and routes over specified periods.
kafka-consumer kafka-producer pymongo pysaprk spark-sql uxsim
Last synced: 30 Mar 2025
https://github.com/wesslen/code-tutorials-for-sophi
Tutorials and templates for running Spark on UNCC's SOPHI platform
Last synced: 06 Apr 2025
https://github.com/rakibhhridoy/bigdataanalysiswithapachespark-stockprice
We often have to deal with large datasets, and handling them with traditional methods is tedious and time-consuming. This is where distributed tools like Apache Spark come in. This repo contains a distributed analysis of stock prices, which form a fairly large dataset.
apache-spark big-data pandas pyspark python spark-sql sprk-api stock stock-price-forecasting
Last synced: 14 May 2025
https://github.com/nonppk/pyspark-etl-automation
A containerized automated ETL pipeline built with PySpark, PostgreSQL, and Docker.
automation data-engineering data-pipeline databasedesign docker etl poetry postgresql pyspark spark spark-sql
Last synced: 05 Nov 2025
https://github.com/vitalibo/distributed-heatmap-service
Simple distributed heatmap service on top of Apache HBase
aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot
Last synced: 07 Nov 2025