An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with spark-sql

A curated list of projects in awesome lists tagged with spark-sql.

https://github.com/getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

analytics athena bi bigquery business-intelligence dashboard databricks hacktoberfest javascript mysql postgresql python redash redshift spark spark-sql visualization

Last synced: 16 Dec 2025

https://github.com/apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 13 May 2025

https://github.com/apache/incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

arrow clickhouse simd spark-sql vectorization velox

Last synced: 14 May 2025

https://github.com/databricks/learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming

Last synced: 14 May 2025

https://github.com/oeljeklaus-you/useractionanalyzeplatform

Big data platform for e-commerce user behavior analysis

accumulator hadoop java kyro spark spark-sql sparkjava

Last synced: 16 May 2025

https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 05 Apr 2025

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 15 May 2025

https://github.com/qbeast-io/qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

big-data data-lakehouse datasource sampling scala spark spark-sql

Last synced: 12 Aug 2025

https://github.com/chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 13 Apr 2025

https://github.com/polomarcus/spark-structured-streaming-examples

Spark Structured Streaming / Kafka / Cassandra / Elastic

cassandra kafka spark spark-sql structured-streaming

Last synced: 10 Apr 2025

https://github.com/mc2-project/opaque-sql

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 17 Jan 2026

https://github.com/sjrusso8/spark-connect-rs

Apache Spark Connect Client for Rust

grpc-client spark spark-connect spark-sql

Last synced: 16 May 2025

https://github.com/minio/spark-select

A library for Spark DataFrame using MinIO Select API

amazon-s3 bigdata minio parquet-files pyspark sbt select spark spark-sql

Last synced: 20 Jun 2025

https://github.com/huangyueranbbc/SparkDemo

The most complete set of Spark example code (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 27 Mar 2025

https://github.com/groda/big_data

Tutorials on Big Data essentials: Hadoop, MapReduce, Spark. Explore a variety of tutorials and demonstrations on Big Data technologies, primarily in the form of Jupyter notebooks. Most notebooks are self-contained and live—ready to run with a click.

apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio

Last synced: 06 Apr 2025

https://github.com/hablapps/sparkoptics

Optics for Spark DataFrames

dataframe dataframes optics scala spark spark-sql

Last synced: 30 Jun 2025

https://github.com/learningjournal/spark-streaming-in-scala

Apache Spark 3 - Structured Streaming Course Material

apache-spark big-data bigdata datalake scala spark spark-sql spark-streaming

Last synced: 16 May 2025

https://github.com/airbnb/airbnb-spark-thrift

A library for loading Thrift data into Spark SQL

spark spark-sql spark-streaming thrift

Last synced: 04 Sep 2025

https://github.com/dbiir/paraflow

A real-time analytical system for ID-associated data

hadoop kafka orc parquet presto spark-sql

Last synced: 10 Jul 2025

https://github.com/wh1isper/sparglim

Sparglim✨ makes PySpark apps configurable and Spark Connect Server deployment easier!

jupyter-magic pyspark spark spark-connect spark-connect-server spark-on-kubernetes spark-sql

Last synced: 10 Apr 2025

https://github.com/indix/sparkplug

Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌

datapipeline spark spark-sql

Last synced: 11 Apr 2025

https://github.com/thanaraklee/real-time-pyspark

This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.

pyspark python spark-sql training

Last synced: 05 Jul 2025

https://github.com/astrolabsoftware/spark-fits

FITS data source for Spark SQL and DataFrames

apache-spark fits fitsio hdfs pyspark scala spark-sql

Last synced: 11 Jan 2026

https://github.com/fabiogouw/spark-aws-messaging

A custom sink provider for Apache Spark that sends the content of a DataFrame to AWS SQS

aws-sqs spark spark-sql

Last synced: 08 May 2025

https://github.com/zekeriyyaa/pyspark-structured-streaming-ros-kafka-apachespark-cassandra

Structured Streaming is applied to robot data from a ROS-Gazebo simulation environment using Apache Spark. Data is collected in Kafka, analyzed by Apache Spark, and stored in Cassandra.

apache-cassandra apache-kafka apache-spark cqlsh data-analysis kafka-consumer kafka-producer pyspark python python3 ros ros-noetic spark-cassandra spark-cassandra-connector spark-kafka-connector spark-kafka-integration spark-sql spark-streaming structured-streaming

Last synced: 30 Jun 2025

https://github.com/luckyzxl2016/spark-example

Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrame

dataframe elasticsearch jedis kafka mysql spark spark-example spark-sql spark-streaming spark-structured-streaming

Last synced: 10 Aug 2025

https://github.com/jgperrin/net.jgp.books.spark.ch11

Spark in Action, 2nd edition - chapter 11 - Working with SQL

apache-spark java java8 manning spark spark-sql sparkwithjava sql

Last synced: 28 Aug 2025

https://github.com/asuiu/sparkorm

ORM and DataFrame schema manager for Apache Spark

orm pyspark pyspark-python python python3 spark spark-orm spark-sql sparkql sqlalchemy sqlalchemy-orm

Last synced: 07 May 2025

https://github.com/lifeomic/spark-vcf

Spark VCF data source implementation for DataFrames

dataframe genomics genotype lifeomic spark spark-sql team-clinical-intelligence variants vcf vcf-files

Last synced: 13 Sep 2025

https://github.com/apache/kyuubi-docker

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 19 Oct 2025

https://github.com/dirkster99/pynotes

My notebook on using Python with Jupyter Notebook, PySpark, etc.

dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp

Last synced: 27 Jan 2026

https://github.com/huemulsolutions/huemul-bigdatagovernance

Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It supports implementing a corporate single-source-of-data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key checks when inserting and updating data through the library, along with validation of nulls, text lengths, minimum/maximum values for numbers and dates, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights to ease compliance with GDPR-style data protection laws, and identify security levels and whether any kind of encryption is applied. In addition, it allows more complex validation rules to be added on the same table.

bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark

Last synced: 26 Apr 2025

https://github.com/chaokunyang/bigdata-examples

Big data examples for Spark and Flink

bigdata flink hadoop monitor python samples spark spark-sql sparkml

Last synced: 15 May 2025

https://github.com/selimhorri/spark-application

Java application that uses Apache Spark to handle both batch and stream processing

dataframes-api java mysql spark spark-batch spark-sql spark-streaming

Last synced: 12 Apr 2025

https://github.com/lucasbotang/coursera_big_data_for_data_engineers

Assignments for Big Data for Data Engineers specialization on Coursera by Yandex.

hadoop hive spark spark-sql

Last synced: 12 Apr 2025

https://github.com/maziyarpanahi/spark2-template

IntelliJ template to develop Apache Spark 2.x applications

spark-ml spark-sql spark-streaming spark2

Last synced: 17 Sep 2025

https://github.com/varunu28/aadhar-dataset-analysis

Data analysis of AADHAR dataset using Apache Spark

analysis scala spark spark-sql

Last synced: 23 Apr 2025

https://github.com/ren294/log-analysis-project

This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.

apache-kafka apache-nifi apache-spark big-data big-data-analytics cassandra cassandra-driver data-engineering data-science grafana hadoop hadoop-hdfs hive powerbi spark-rdd spark-sql spark-streaming

Last synced: 08 Jul 2025

https://github.com/myxof/sparknotes

Spark 2.0 study notes

distributed-computing spark spark-sql

Last synced: 15 Apr 2025

https://github.com/mliarakos/spark-typed-ops

Lightweight type-safe operations for Spark

scala scala-macros shapeless spark spark-scala spark-sql

Last synced: 15 Oct 2025

https://github.com/tirth27/real-time-analytics-with-spark-streaming

This project aims to build a streaming application to perform real-time analytics of Covid-19 related tweets and deploy an ML model for real-time sentiment predictions.

apache apache-avro apache-kafka apache-spark confluent docker docker-compose ksql spark-sql spark-streaming twitter-api

Last synced: 05 Mar 2025

https://github.com/flaviostutz/spark-scala-jupyter

Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master

hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql

Last synced: 14 Oct 2025

https://github.com/ren294/covid-data-process

This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.

airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql

Last synced: 29 Oct 2025

https://github.com/lmouhib/auto-register-spark-ui-k8s

A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes

spark spark-kubernetes spark-sql spark-streaming spark-ui

Last synced: 11 Feb 2026

https://github.com/harshoza36/movielens_pyspark

MovieLens dataset analysis using Hadoop and PySpark

big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql

Last synced: 21 Sep 2025

https://github.com/talegari/tidier

dplyr-friendly, Spark-style window aggregation for R data frames and remote dbplyr tbls

dbplyr dplyr mutate rstats rstats-package spark-sql tidyverse

Last synced: 22 Oct 2025

https://github.com/san089/sf-crime-statistics

A Kafka and Spark Streaming integration project: SF Crime Statistics with Spark Streaming

kafka kafka-consumer kafka-producer kafka-python spark-sql spark-streaming

Last synced: 19 Jun 2025

https://github.com/emso-exe/comercio_eletronico_brasileiro

Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.

analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql

Last synced: 30 Dec 2025

https://github.com/ashirwadpradhan/tpsql

Asynchronous execution of SQL queries running in parallel

asynchronous-tasks asyncio parallel-processing query-optimization spark-sql sql

Last synced: 06 Mar 2025

https://github.com/hcvazquez/machine-learning-in-spark-with-pyspark

[A repo to store some code and experiences about machine learning in Spark with PySpark] Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.

apache-spark classification machine-learning pyspark python regression spark-sql

Last synced: 15 Oct 2025

https://github.com/kriss024/Spark

Spark for data science and ETL processes.

data-science jupiter-notebook machine-learning mllib pyspark spark spark-sql

Last synced: 17 Jul 2025

https://github.com/okdp/spark-images

Collection of Apache Spark docker images for OKDP

apache-spark big-data docker k8s-spark kubernetes spark-kubernetes spark-python spark-r spark-sql

Last synced: 05 Feb 2026

https://github.com/burhanahmed1/big-data-analytics

Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.

apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql

Last synced: 13 Nov 2025

https://github.com/multivacplatform/multivac-wikipedia

Wonderful reusable code, libraries, and scripts to process Wikipedia page views using Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 02 Mar 2025

https://github.com/debanjansarkar/pyspark-maestro

This repo contains PySpark implementations of real-world use cases: batch data processing, streaming data processing sourced from Kafka, sockets, etc., Spark optimizations, solutions to business-specific big data processing scenarios, and machine learning use cases.

json kafka kafka-python kafka-streams pyspark pyspark-api pyspark-machine-learning pyspark-mllib pyspark-streaming python3 spark spark-mllib spark-sql spark-streaming

Last synced: 04 Feb 2026

https://github.com/tuanai-vireox/etl-spark-k8s

ETL With Apache Spark Deployed on K8s

apache k8s spark spark-sql spark-streaming

Last synced: 22 Aug 2025

https://github.com/abdelmajidlh/spark_ml_weather

Scala and Spark learning project: predicting tomorrow's rain from historical data

pom scala spark spark-ml spark-sql

Last synced: 22 Mar 2025

https://github.com/multivacplatform/multivac-pubmed

Updates PubMed articles daily on HDFS using a Spark cluster

apache-spark dataframe hadoop hdfs pubmed pubmed-parser spark-sql yarn

Last synced: 02 Mar 2025

https://github.com/sayamalt/amazon-products-api-etl-and-ml-pipeline

In this project, I've created an end-to-end ETL pipeline and subsequently developed a machine learning model to predict the price of Amazon products based on several product-related features.

apache-spark azure-data-factory azure-data-lake-storage-gen2 azure-databricks data-ingestion delta-lake etl-pipeline extract-transform-load feature-engineering linear-regression machine-learning model-training-and-evaluation regression-models spark-mllib spark-sql

Last synced: 20 Mar 2025

https://github.com/manojpawar94/spark-scala-examples

I have implemented sample programs using Apache Spark. The programs are built on the concepts of Spark RDDs and Spark SQL DataFrames.

apache-spark spark spark-rdd spark-sql

Last synced: 02 Mar 2025

https://github.com/angeligareta/spark-hadoop-hbase-overview

First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.

apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql

Last synced: 29 Oct 2025

https://github.com/adnanrahin/spark-flights-data-analysis

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.

apache-spark big-data-analytics docker docker-compose docker-container java maven spark spark-sql spark-streaming

Last synced: 21 Jun 2025

https://github.com/adnanrahin/nfl-big-data-bowl-2022

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams play. Here, you'll find a summary of each data set in the 2022 Data Bowl, a list of key variables to join on, and a description of each variable.

big-data big-data-processing rdd scala spark spark-sql

Last synced: 30 Oct 2025

https://github.com/pprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 29 Mar 2025

https://github.com/pprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 27 Aug 2025

https://github.com/goamegah/flowstate

End-To-End Real-time Road Traffic Monitoring Spark Structured Streaming solution

airflow aws aws-kinesis dags kafka postgresql realtime scala spark-core spark-sql spark-streaming streaming

Last synced: 23 Jul 2025

https://github.com/miltiadiss/ceid_ne4348-big-data-management-systems

This project implements a real-time data pipeline with Kafka, Spark, and MongoDB. It generates vehicle data using UXSIM, streams it to a Kafka broker, processes it with Spark, and stores raw and processed data in MongoDB. Queries analyze vehicle counts, speeds, and routes over specified periods.

kafka-consumer kafka-producer pymongo pysaprk spark-sql uxsim

Last synced: 30 Mar 2025

https://github.com/wesslen/code-tutorials-for-sophi

Tutorials and templates for running Spark on UNCC's SOPHI platform

pyspark scala spark-sql

Last synced: 06 Apr 2025

https://github.com/rakibhhridoy/bigdataanalysiswithapachespark-stockprice

We often have to deal with large datasets, and handling them with traditional methods is tedious and time-consuming. This is where distributed methods such as Apache Spark come in. This repo contains a distributed analysis of stock prices, which form a fairly large dataset.

apache-spark big-data pandas pyspark python spark-sql sprk-api stock stock-price-forecasting

Last synced: 14 May 2025

https://github.com/nonppk/pyspark-etl-automation

A containerized automated ETL pipeline built with PySpark, PostgreSQL, and Docker.

automation data-engineering data-pipeline databasedesign docker etl poetry postgresql pyspark spark spark-sql

Last synced: 05 Nov 2025

https://github.com/vitalibo/distributed-heatmap-service

Simple distributed heatmap service on top of Apache HBase

aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot

Last synced: 07 Nov 2025