Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/ferranbt/sparkanywhere

Run Apache Spark multicloud and serverless

kubernetes serverless spark

Last synced: 01 Jan 2025

https://github.com/tadod12/airflow-spark-job

A workspace to experiment with Apache Spark and Airflow in a Docker environment

airflow docker rdbms spark

Last synced: 13 Jan 2025

https://github.com/pranavshashidhara/movie-recommendation-system

This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).

big-data recommendation-system spark

Last synced: 13 Jan 2025

https://github.com/vermicida/data-lake

Data Lake: the code corresponding to project #4 of Udacity's Data Engineer Nanodegree Program

aws-s3 data-engineering data-lake etl-pipeline python spark

Last synced: 26 Dec 2024

https://github.com/tuancamtbtx/etl-spark-k8s

ETL With Apache Spark Deployed on K8s

apache k8s spark spark-sql spark-streaming

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/python-spark-example

Spark template to submit to cluster

python spark

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/bigdata-spark-processing

Spark Batch Process

spark

Last synced: 02 Jan 2025

https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

This repo is for sentiment analysis of an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixin Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/declaredata/fuse_python

PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.

data-processing pyspark rust-lang spark

Last synced: 13 Jan 2025

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/aldantanneo/bigints

WIP constant time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/worst001/note_bigdata

A collection of materials, notes, and manuals on big data

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/morinian/pyspark

Studies with PySpark

spark

Last synced: 18 Jan 2025

https://github.com/dohabanoui/spark-structured-streaming

Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.

docker spark spark-streaming spark-structured-streaming

Last synced: 19 Jan 2025

https://github.com/ewertondrigues02/engenharia-de-dados

Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI

airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark

Last synced: 19 Jan 2025

https://github.com/anant/playbook

Anant Platform Playbook - Consists of principles, patterns, tools, a framework, and an approach to designing, building, and managing platforms.

approach cassandra confluent datastax framework kafka platform playbook spark

Last synced: 19 Jan 2025

https://github.com/anant/example-cassandra-spark-sql

Cassandra Data Operations with Spark SQL

cassandra data-operations docker etl spark spark-sql

Last synced: 19 Jan 2025

https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks

Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)

cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql

Last synced: 19 Jan 2025

https://github.com/brooksian/censussipp

Reproducing Census SIPP Reports Using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 19 Jan 2025

https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency

Use Spark with Livy along with Application Insights. Learn to host your external dependencies in a data lake.

application-insights azure azure-data-lake hdinsight java livy spark spark2

Last synced: 31 Jan 2025

https://github.com/michelderu/cassandra-csv-analytics

How to leverage Astra, DSE and Spark for analytics on large CSV files.

astra cassandra spark

Last synced: 20 Jan 2025

https://github.com/thom-x/graalvm-java-docker

Example of creating a native image of a Java app with GraalVM, Gradle, and Docker

docker graalvm gradle java spark

Last synced: 25 Jan 2025

https://github.com/imransilvake/semantic-partitioning

RDF Data (N-Triples) Partition and SPARQL Query Layer for SANSA-Stack using Scala and Spark.

big-data n-triples scala spark sparql

Last synced: 31 Jan 2025

https://github.com/chucheng92/sparkstreamingkafka

Spark Streaming logs to Kafka.

kafka spark spark-streaming streaming

Last synced: 01 Feb 2025

https://github.com/riversun/ml-fake-data-maker

Generate fake data for machine learning like regression analysis

arff arff-generator dummy-data fake-data generator machine-learning prediction regression spark weka

Last synced: 01 Feb 2025

https://github.com/ronaldkanyepi/log-realtime-analysis

A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.

dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization

Last synced: 16 Feb 2025

https://github.com/morgan-sell/usa-tourism-etl

Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.

aws data-engineering data-lake emr-cluster etl-pipeline python spark

Last synced: 08 Feb 2025

https://github.com/mounirbs/spark-livy

Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint

apache apache-spark docker docker-compose livy pyspark python spark

Last synced: 08 Feb 2025

https://github.com/viyadb/viyadb-spark

Data processing and ingestion backend for ViyaDB based on Spark Streaming

spark spark-streaming spark-streaming-kafka viyadb

Last synced: 08 Feb 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/damianmarti/7506-spark

Class notebook for 75-06 Organización de Datos (Data Organization) - FIUBA

apache-spark pyspark spark

Last synced: 09 Feb 2025

https://github.com/gnaneshkunal/scala-hadoop

Hadoop programming using Scala

big-data bigdata hadoop scala spark sql

Last synced: 09 Feb 2025

https://github.com/neo4j-field/end-to-end-fraud-demo

An example of how to load the data backing Zach's awesome Fraud Demo

graph-algorithms neo4j spark

Last synced: 15 Feb 2025

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 15 Feb 2025

https://github.com/ev2900/glue_spark_history_server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws glue spark spark-history-server spark-ui

Last synced: 15 Feb 2025

https://github.com/782e616c6d/covid-d.a

Academic project, using Apache Spark for ETL and Data Studio for data analysis.

academic analytics automation cluster covid-19 data database etl python spark sql

Last synced: 26 Jan 2025

https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks

San Francisco Fire Calls. Creating a Spark application on the Databricks using PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.

databricks pyspark spark sql

Last synced: 16 Feb 2025

https://github.com/abdelmajidlh/spark_ml_weather

Scala and Spark learning project: predicting tomorrow's rain from historical data

pom scala spark spark-ml spark-sql

Last synced: 27 Jan 2025

https://github.com/cwienberg/spark-async-map

Helper library for running blocking IO operations in Spark jobs more efficiently

scala spark

Last synced: 02 Feb 2025

https://github.com/yeisson8a/dataframespyspark

Example of interacting with DataFrames (from a list, a CSV, a JSON, and a Parquet file) in Spark using both PySpark and Spark SQL

pyspark spark spark-sql

Last synced: 17 Feb 2025

https://github.com/hadarsharon/compars

DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)

data-engineering data-profiling data-quality dataframe dataframes koalas pandas polars pyspark python rust spark

Last synced: 22 Jan 2025

https://github.com/flaviostutz/spark-submit-scala

Spark submit extension from bde2020/spark-submit for Scala with SBT

bigdata sbt scala spark spark-cluster spark-submit

Last synced: 06 Feb 2025

https://github.com/flaviostutz/spark-scala-hdfs-docker-example

Spark with Scala reading/writing files to HDFS with automatic additions of new Spark workers using Docker "scale"

datanode docker example hdfs namenodes scala scale spark spark-workers

Last synced: 06 Feb 2025

https://github.com/vjcitn/biocpyinterop

Material for Bioconductor 2023 workshop on interoperation with python

basilisk bioconductor cite-seq genetics hail reticulate scvi-tools single-cell-omics spark

Last synced: 09 Jan 2025

https://github.com/kelvynamaral/data_manipulation_spark

Training repository for data manipulation in Apache Spark. Contains practical examples of reading, writing, transformations, filters, aggregations, joins, and using SQL on DataFrames.

pyspark spark spark-sql

Last synced: 17 Feb 2025

https://github.com/shixi99/spark-multinode

Standalone Spark cluster on Docker

docker spark

Last synced: 05 Jan 2025

https://github.com/knands42/dataengineering-1billion-rows-per-hour

A project that simulates how to build a complete workflow to persist 1 billion rows per hour

data-engineering graphana java java21 kafka makefile posgr prometheus python python3 spark sql

Last synced: 20 Jan 2025

https://github.com/gabrielenizzoli/spark_engine

Build a complex spark execution plan by composing many different spark operations.

spark sql yaml

Last synced: 12 Feb 2025

https://github.com/samlet/sagas-spark

Use Structured Streaming as an OLAP engine

olap spark

Last synced: 15 Feb 2025

https://github.com/smaddanki/data-science

Code blocks, algorithms, and research snippets in Data Science, Machine Learning, AI & Quant Finance.

deep-learning machine-learning pytorch scikit-learn spark

Last synced: 08 Feb 2025

https://github.com/cn-docker/spark-worker

Spark Worker Docker Image

docker-image spark spark-worker

Last synced: 27 Jan 2025

https://github.com/ikajdan/spark_jupyter_docker

A Docker Compose setup for running PySpark with JupyterLab

docker notebook pyspark python spark

Last synced: 15 Feb 2025

https://github.com/juanpablo70/arep-taller03

Web Microframeworks

java spark webserver

Last synced: 22 Jan 2025

https://github.com/sephiroth7712/k-nearest-neigbours

Implementation of K-Nearest Neighbors algorithm using multiple parallel computing approaches: CUDA (GPU), Hadoop, Spark, MPI, OpenMP, and PThreads. Demonstrates scalable machine learning across different parallel computing paradigms from GPU to distributed frameworks.

cuda cuda-programming hadoop-mapreduce java mpi multiprocessing multithreading openmp pthreads scala spark

Last synced: 06 Feb 2025

https://github.com/higorcazuza81/courses

Repository showcasing my educational journey in Quantitative Analysis, including projects and coursework in SQL, Python, Data Science, Machine Learning, and financial modeling. Focused on real-world applications in quantitative finance, data analysis, and statistical modeling.

airflow automation database dataengineering python shell-script spark sql

Last synced: 06 Feb 2025

https://github.com/kolia1985/kolia1985

Mykola Melnyk profile

data-engineering data-science spark

Last synced: 06 Feb 2025

https://github.com/konradmalik/scala-seed

Seed project for dockerized Scala with included Spark and Cassandra.

cassandra docker makefile multimodule sbt scala seed spark template typesafe-config

Last synced: 17 Jan 2025

https://github.com/konradmalik/spark

Dockerized Spark with tools

docker hadoop scala spark

Last synced: 17 Jan 2025

https://github.com/ikajdan/spark_docker

A Docker Compose setup for running PySpark with JupyterLab

docker notebook pyspark python spark

Last synced: 03 Dec 2024