Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/multivacplatform/multivac-wikipedia

Wonderful reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark

machine-learning nlp spark spark-ml

Last synced: 12 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/kruglov-dmitry/yelp_data

End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.

cassandra kafka spark streaming yelp-dataset

Last synced: 19 Jan 2025

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 08 Feb 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

This repo is for sentiment analysis of an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/beiyuouo/mi-store-log-analysis

👨‍🦽 Pseudo Xiaomi Mall: big data e-commerce log analysis

flask full-stack java kafka python spark

Last synced: 02 Feb 2025

https://github.com/amthorn/qutex

A basic Queue Management System, interactable via several mediums, that resembles a mutex.

ava bot bots cisco cisco-spark cisco-spark-bot mutex queue queuebot queues qutex spark thorn webex webex-teams

Last synced: 13 Nov 2024

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixin Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/f-lab-edu/league-of-legends-data-solution

Benchmarked on 'League of Legends', this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.

airflow dataengineering spark

Last synced: 11 Oct 2024

https://github.com/rishav273/spark-cluster-multi-node-setup

Quickly set up and simulate a multi-node Spark cluster using docker and docker-compose.

docker docker-compose pyspark python3 spark

Last synced: 11 Oct 2024

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/tupol/spark-apps.seed.g8

Create Spark applications projects based on the spark-utils library.

application scala spark template

Last synced: 17 Jan 2025

https://github.com/hwywl/bigdata

Big data learning code: Spark, Hive, Storm, HBase

big-data flume hbase hdfs hive mr spark storm zook

Last synced: 08 Jan 2025

https://github.com/jimthompson5802/datascience_containers

Personal docker images for various data science software stacks

data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark

Last synced: 29 Dec 2024

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/izeigerman/twinkle

The collection of helpers and utils for Apache Spark

apache-spark scala spark

Last synced: 08 Feb 2025

https://github.com/soumyadipta2020/sparkr_test

Sample code for Spark using R programming

r r-coding r-programming r-programming-language spark sparkr

Last synced: 05 Jan 2025

https://github.com/antonio-f/big-data-analysis-with-scala-and-spark

Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).

big-data bigdata coursera data-analysis scala spark

Last synced: 06 Feb 2025

https://github.com/s8sg/spark-standalone-cluster

Spark Standalone Cluster With Zookeeper

docker docker-compose spark zookeeper

Last synced: 01 Feb 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/kampi/particle-mqtt

MQTT client implementation for TCP-capable devices (e.g. Argon, Photon) from Particle IoT.

cpp mqtt particle-argon particle-iot particle-swarm-optimization spark

Last synced: 21 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/snexus/streaming-playground

Exploring streaming design patterns with Kafka and Spark Structured Streaming

kafka kafka-producer python spark spark-streaming

Last synced: 23 Jan 2025

https://github.com/talmago/pyspark-loglikelihood

PySpark Loglikelihood Similarity Examples

mahout pyspark recommendation-engine spark

Last synced: 03 Feb 2025

https://github.com/andreoss/spark-tabs

Custom tabs for Spark UI

scala spark

Last synced: 10 Feb 2025

https://github.com/zkan/machine-learning-with-spark-and-zeppelin

Machine Learning with Apache Spark & Zeppelin

pyspark python spark zeppelin

Last synced: 12 Feb 2025

https://github.com/782e616c6d/covid-d.a

Academic project, using Apache Spark for ETL and Data Studio for data analysis.

academic analytics automation cluster covid-19 data database etl python spark sql

Last synced: 26 Jan 2025

https://github.com/marcorfilacarreras/matemaquest

A simple API to get information about the "Pruebas Canguro" exams

api docker github-actions java math mathematics spark

Last synced: 13 Jan 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/ishaansathaye/csc369-introdistributedcomputing

Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing

distributed-computing hadoop java map-reduce scala spark

Last synced: 09 Feb 2025

https://github.com/fiware/tutorials.big-data-spark

📘 FIWARE 306: Real-time Processing of Context Data using Apache Spark

apache-spark big-data-analytics fiware fiware-cosmos orion-spark-connector spark tutorial

Last synced: 17 Nov 2024

https://github.com/drsnowbird/nlp-deeplearning-projects

NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)

chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow

Last synced: 13 Jan 2025

https://github.com/renardeinside/databricks-jobs-jsonnet

Example project with Databricks jobs and configuration management via jsonnet

databricks jsonnet spark

Last synced: 06 Feb 2025

https://github.com/shayartt/streaming-orders

Project to stream real-time orders and apply some ETL pipelines & analytics using DataBricks, Kafka, AWS

databricks etl kafka python spark spark-streaming

Last synced: 12 Oct 2024

https://github.com/kirbs-/svm

Apache Spark Version Manager

apache-spark spark

Last synced: 27 Jan 2025

https://github.com/bishalpaudel/sparkhbaseloganalyzer

Spark- and HBase-based Apache Access Log Analyzer

big-data cloudera hbase scala spark

Last synced: 06 Jan 2025

https://github.com/damianmarti/7506-spark

Notebook for the 75-06 Organización de Datos (Data Organization) classes - FIUBA

apache-spark pyspark spark

Last synced: 09 Feb 2025

https://github.com/vitalibo/distributed-heatmap-service

Simple distributed heatmap service on top of Apache HBase

aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot

Last synced: 27 Dec 2024

https://github.com/williamliu52/twitter-sc

Trending sports highlights from Twitter

nodejs python react reactjs scala spark twitter

Last synced: 23 Oct 2024

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 27 Dec 2024

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/giuliosmall/twitter-trending-topics-pipeline

This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.

docker minio nlp pyspark pytest spacy spark streamlit

Last synced: 05 Feb 2025

https://github.com/librity/rtjvm_spark_essentials

Rock The JVM - Apache Spark Essentials

apache-spark big-data docker scala spark spark-sql

Last synced: 08 Jan 2025

https://github.com/wgierke/distributed_data_analytics

Solutions for the hands-on sessions of the course "Distributed Data Analytics" at Hasso-Plattner-Institute using Akka and Spark.

akka data-analytics distributed inclusion-dependency spark

Last synced: 09 Feb 2025

https://github.com/tallamjr/jetspark

Spark cluster on Jetson TX2 mini-project

gpu nvidia spark tx2-jetpack

Last synced: 10 Feb 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/mounirbs/spark-livy

Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint

apache apache-spark docker docker-compose livy pyspark python spark

Last synced: 08 Feb 2025

https://github.com/aldantanneo/bigints

WIP constant time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/imvision12/real-time-tracking

Real time bus tracking using MTA bus API

flask hadoop javascript leaflet python spark

Last synced: 08 Feb 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/nkdwon/crud-spark

A CRUD application written in Java, integrating PostgreSQL and the Spark framework, developed in the Eclipse environment

eclipse-ide git java maven pgadmin4 postgresql spark

Last synced: 06 Jan 2025

https://github.com/darule0/yarndiff

A rudimentary command line utility for contrasting Apache Yarn container logs.

diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2

Last synced: 23 Dec 2024

https://github.com/worst001/note_bigdata

A collection of big data related materials, notes, and manuals

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/tomfran/lastfm-users-analysis

Last.fm user data collection and analysis using Spark

gcp lastfm spark

Last synced: 06 Jan 2025

https://github.com/samuele-lolli/steam-recommendation-system

A basic recommendation system built with Scala and Spark

mapreduce scala spark

Last synced: 04 Feb 2025

https://github.com/thdaraujo/cheat

A handful of cheatsheets and programming tips.

bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop

Last synced: 24 Jan 2025

https://github.com/stefanofioravanzo/evolving-wikipedia-graph

Distributed processing of Wikipedia history files using Hadoop and Spark

distributed-processing hadoop-hdfs spark wikipedia

Last synced: 19 Jan 2025

https://github.com/casassg/thesis

Undergraduate final thesis: Big Data Analytics on Container Orchestrated Systems

casassg-thesis cassandra docker kubernetes latex spark thesis zeppelin

Last synced: 17 Dec 2024

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/rdalmarco/datascience

Studies on data science, big data, and machine learning

estatistica pandas python r spark sql

Last synced: 03 Jan 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/darule0/sparkdiff

A rudimentary command line utility for contrasting Apache Spark event logs.

apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql

Last synced: 06 Feb 2025

https://github.com/tallamjr/epfl-functional-scala

Materials and worked assignments for Functional Programming with Scala Specialization on Coursera

big-data scala spark

Last synced: 10 Feb 2025

https://github.com/opt-nc/opt-temps-attente-agences-camel

Pull data from opt-temps-attente-agences-api and store it in various systems

camel datascience dataviz glia innovation kafka opensearch relation-client spark

Last synced: 12 Dec 2024

https://github.com/tuancamtbtx/java-spark-example

Spark ETL Generic Processor

etl spark

Last synced: 02 Jan 2025

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/wtsi-hgi/hgi-cloud

Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger

ansible hail iac openstack packer spark terraform

Last synced: 28 Nov 2024

https://github.com/luisfalva/ophelia

Ophelian On Mars! More than a simple framework.

dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming

Last synced: 17 Dec 2024

https://github.com/mukjepscarlet/bilibili-predict-recommend

[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction

bilibili flask hadoop html javascript prediction pyspark python recommendation spark

Last synced: 18 Jan 2025

https://github.com/bomada/sparkify

This project is the final Capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark to engineer relevant features for predicting churn. Input data is related to the fictive music streaming service Sparkify (similar to Spotify and Pandora).

churn ml music portfolio python spark streaming

Last synced: 09 Feb 2025

https://github.com/starhe/balm

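A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j in one step, operates through standard REST interfaces, is simple to use, and is convenient for further development and integration.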

clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks

Last synced: 21 Dec 2024

https://github.com/mxagar/spark_big_data_guide

This repository contains my personal guide on Spark and topics related to Big Data.

big-data hadoop machine-learning spark

Last synced: 23 Dec 2024