Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra

This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.

cassandra ml-100k mongodb python spark

Last synced: 17 Jan 2025

https://github.com/divithraju/divith-raju-data-mining

This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.

algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark

Last synced: 17 Jan 2025

https://github.com/vasnake/artefacts-2019_2023

Collection of some interesting pieces of my projects. Spark, Scala, Python, sh

catalyst etl ml scala spark udaf udf

Last synced: 17 Jan 2025

https://github.com/mounirbs/spark-connect

Spark Connect: a docker-compose setup that runs a Spark cluster with the Spark Connect feature enabled. It can be used for local development.

apache apache-spark docker docker-compose pyspark python spark

Last synced: 15 Feb 2025

https://github.com/nkdwon/crud-spark

A CRUD application written in Java, integrating PostgreSQL and the Spark framework, developed in the Eclipse environment

eclipse-ide git java maven pgadmin4 postgresql spark

Last synced: 06 Jan 2025

https://github.com/inf0rmatiker/model-service

A service providing federated model training for spatially-segregated data.

python spark

Last synced: 08 Jan 2025

https://github.com/sebastianruizm/pyspark-graphframes

Data analysis with GraphFrames and PySpark

python spark sql

Last synced: 08 Jan 2025

https://github.com/binwenwu/oge-computation-ogc

A computing project corresponding to an OGC style API

geotrellis scala spark

Last synced: 13 Feb 2025

https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency

Use Spark with Livy along with Application Insights. Learn to host your external dependencies in a data lake.

application-insights azure azure-data-lake hdinsight java livy spark spark2

Last synced: 31 Jan 2025

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 19 Jan 2025

https://github.com/brooksian/censussipp

Reproducing Census SIPP Reports Using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/s8sg/spark-standalone-cluster

Spark Standalone Cluster With Zookeeper

docker docker-compose spark zookeeper

Last synced: 01 Feb 2025

https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks

Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)

cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql

Last synced: 19 Jan 2025

https://github.com/anant/example-cassandra-spark-sql

Cassandra Data Operations with Spark SQL

cassandra data-operations docker etl spark spark-sql

Last synced: 19 Jan 2025

https://github.com/anant/playbook

Anant Platform Playbook - Consists of principles, patterns, tools, a framework, and an approach to designing, building, and managing platforms.

approach cassandra confluent datastax framework kafka platform playbook spark

Last synced: 19 Jan 2025

https://github.com/ev2900/iceberg_emr_athena

Resources from a virtual tech talk / workshop - Set Up and Use Apache Iceberg Tables on Your Data Lake

apache-iceberg athena aws emr spark

Last synced: 05 Nov 2024

https://github.com/imransilvake/semantic-partitioning

RDF Data (N-Triples) Partition and SPARQL Query Layer for SANSA-Stack using Scala and Spark.

big-data n-triples scala spark sparql

Last synced: 31 Jan 2025

https://github.com/imvision12/real-time-tracking

Real time bus tracking using MTA bus API

flask hadoop javascript leaflet python spark

Last synced: 08 Feb 2025

https://github.com/michelderu/cassandra-csv-analytics

How to leverage Astra, DSE and Spark for analytics on large CSV files.

astra cassandra spark

Last synced: 20 Jan 2025

https://github.com/zoltan-nz/learning-spark

Playing with Apache Spark

apache-spark java map-reduce spark

Last synced: 22 Jan 2025

https://github.com/ewertondrigues02/engenharia-de-dados

Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI

airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark

Last synced: 19 Jan 2025

https://github.com/thdaraujo/cheat

A handful of cheatsheets and programming tips.

bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop

Last synced: 24 Jan 2025

https://github.com/dohabanoui/spark-structured-streaming

Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.

docker spark spark-streaming spark-structured-streaming

Last synced: 19 Jan 2025

https://github.com/riversun/ml-fake-data-maker

Generate fake data for machine learning tasks such as regression analysis

arff arff-generator dummy-data fake-data generator machine-learning prediction regression spark weka

Last synced: 01 Feb 2025

https://github.com/morinian/pyspark

Studies with PySpark

spark

Last synced: 18 Jan 2025

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 18 Feb 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/morgan-sell/usa-tourism-etl

Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.

aws data-engineering data-lake emr-cluster etl-pipeline python spark

Last synced: 08 Feb 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/lajwithsingh/magelocaldatapipeline

A compact project showcasing local data lake setup using Docker, Mage, Spark, MinIO, Iceberg, and StarRocks. Ideal for learning modern data engineering practices.

docker iceberg mage minio spark starrocks

Last synced: 29 Dec 2024

https://github.com/mounirbs/spark-livy

Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint

apache apache-spark docker docker-compose livy pyspark python spark

Last synced: 08 Feb 2025

https://github.com/worst001/note_bigdata

A collection of assorted big data materials, notes, and manuals

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/hwywl/bigdata

Big data learning code: Spark, Hive, Storm, HBase

big-data flume hbase hdfs hive mr spark storm zook

Last synced: 08 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/nicklitwinow/hse-python-capstone-project

This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.

airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming

Last synced: 03 Feb 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/aldantanneo/bigints

WIP constant time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/viyadb/viyadb-spark

Data processing and ingestion backend for ViyaDB based on Spark streaming

spark spark-streaming spark-streaming-kafka viyadb

Last synced: 08 Feb 2025

https://github.com/adrianmarino/spark-examples

Spark install & examples

notebooks python spark

Last synced: 24 Jan 2025

https://github.com/williamliu52/twitter-sc

Trending sports highlights from Twitter

nodejs python react reactjs scala spark twitter

Last synced: 23 Oct 2024

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/mohnoor94/learningspark

My journey to learn Spark using Scala <3

learning learning-by-doing scala spark sparkscala

Last synced: 22 Jan 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/briansterle/cluster-fastcopy

Copy data between HDFS clusters blazingly fast

bigdata distcp hadoop hdfs spark yarn

Last synced: 13 Feb 2025

https://github.com/pedropark99/spark_map

Easily apply a function over multiple columns of a Spark DataFrame

pyspark python spark

Last synced: 28 Nov 2024

https://github.com/ejw-data/google-colab-etl-amazon-reviews

Using Spark and Amazon RDS to clean and summarize Amazon reviews to determine the usefulness of product feedback

amazon-rds spark

Last synced: 22 Jan 2025

https://github.com/sandeepkundalwal/network-load-analysis-using-apache-spark

[CS561: MapReduce & BigData] Streaming Service using Apache Spark

big-data css html java javascript mapreduce spark

Last synced: 02 Feb 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/tupol/spark-utils-demos

Demos for the tupol/spark-utils project together with a storyline

configuration demo framework scala spark

Last synced: 17 Jan 2025

https://github.com/tupol/spark-apps.seed.g8

Create Spark applications projects based on the spark-utils library.

application scala spark template

Last synced: 17 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/najuzilu/dl-spark

Building a Data Lake with Spark

aws-emr aws-s3 data-engineering data-lake etl-pipeline spark

Last synced: 26 Jan 2025

https://github.com/mjngxwnj/olympics_data_project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.

airflow docker hadoop python snowflake spark superset

Last synced: 09 Feb 2025

https://github.com/ltossian/bike-sales-data-metrics

Processing, storage, analysis, and visualization of a large CSV file and of real-time bike sales data.

fastapi grafana hadoop kafka postgresql python spark

Last synced: 11 Feb 2025

https://github.com/vitalibo/distributed-heatmap-service

Simple distributed heatmap service on top of Apache HBase

aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot

Last synced: 18 Feb 2025

https://github.com/amthorn/qutex

A basic Queue Management System, interactable via several mediums, that resembles a mutex.

ava bot bots cisco cisco-spark cisco-spark-bot mutex queue queuebot queues qutex spark thorn webex webex-teams

Last synced: 13 Nov 2024

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using spark and kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixing Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/damianmarti/7506-spark

Class notebooks for 75-06 Organización de Datos (Data Organization) - FIUBA

apache-spark pyspark spark

Last synced: 09 Feb 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

This repo contains a Twitter Arabic dataset for sentiment analysis using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/samuele-lolli/steam-recommendation-system

A basic recommendation system built with Scala and Spark

mapreduce scala spark

Last synced: 04 Feb 2025

https://github.com/pprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 04 Feb 2025

https://github.com/pprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 04 Feb 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/gnaneshkunal/scala-hadoop

Hadoop programming using Scala

big-data bigdata hadoop scala spark sql

Last synced: 09 Feb 2025

https://github.com/tuancamtbtx/bigdata-spark-processing

Spark Batch Process

spark

Last synced: 02 Jan 2025

https://github.com/ishaansathaye/csc369-introdistributedcomputing

Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing

distributed-computing hadoop java map-reduce scala spark

Last synced: 09 Feb 2025

https://github.com/tuancamtbtx/python-spark-example

Spark template to submit to cluster

python spark

Last synced: 02 Jan 2025

https://github.com/neo4j-field/end-to-end-fraud-demo

An example of how to load the data backing Zach's awesome Fraud Demo

graph-algorithms neo4j spark

Last synced: 15 Feb 2025

https://github.com/tuancamtbtx/etl-spark-k8s

ETL With Apache Spark Deployed on K8s

apache k8s spark spark-sql spark-streaming

Last synced: 02 Jan 2025

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 15 Feb 2025

https://github.com/rockfordwei/anagram

Anagram Solution Servers in Different Languages/Frameworks

anagram hdfs java javascript php python server spark swift

Last synced: 12 Jan 2025

https://github.com/vermicida/data-lake

Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program

aws-s3 data-engineering data-lake etl-pipeline python spark

Last synced: 26 Dec 2024