Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/jldbc/big-data

Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark

big-data hadoop pagerank recommender-system spark

Last synced: 04 Jan 2025

https://github.com/mangalaman93/dspark

Run Spark in Docker containers

big-data containers docker microservices spark

Last synced: 18 Jan 2025

https://github.com/mtpatter/bilao

Jupyter notebooks for filtering Kafka data with Spark Streaming.

avro docker jupyter-notebook kafka spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-fakenews

Detecting users and communities that propagate fake news on Twitter using Apache Spark

deep-learning fakenews machine-learning spark twitter

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-wikipedia

Wonderful reusable code, libraries, and scripts to process Wikipedia page views using Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark

machine-learning nlp spark spark-ml

Last synced: 12 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/kruglov-dmitry/yelp_data

End-to-end example of how to read big (well, comparatively) data from Kafka and write it to Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.

cassandra kafka spark streaming yelp-dataset

Last synced: 19 Jan 2025

https://github.com/dimajix/docker-spark

Repository for building Docker containers for Spark

cluster docker hadoop spark

Last synced: 05 Jan 2025

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 08 Feb 2025

https://github.com/aveek-saha/cricket-score-predictor

A big data application to predict the outcome of a T20 cricket match.

big-data big-data-analytics clustering pyspark spark spark-mllib

Last synced: 15 Feb 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

Sentiment analysis of an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/binwenwu/oge-computation-ogc

A computing project corresponding to an OGC-style API

geotrellis scala spark

Last synced: 13 Feb 2025

https://github.com/fbraza/data-processing-scala-spark

Scala code that uses Spark to process a log data file. The full procedure for running the application is in the README.md file.

scala spark

Last synced: 26 Jan 2025

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixing Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/chucheng92/sparkstreamingkafka

Spark Streaming logs to Kafka.

kafka spark spark-streaming streaming

Last synced: 01 Feb 2025

https://github.com/azlinrusnan/iris_pyspark_analysis

Iris Classification using PySpark

apache pyspark-mllib python r spark

Last synced: 31 Dec 2024

https://github.com/wtsi-hgi/hgi-cloud

Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger

ansible hail iac openstack packer spark terraform

Last synced: 28 Nov 2024

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/angeligareta/spark-kafka-cassandra-overview

Second lab for Data-Intensive Computing course at KTH where we use Apache Kafka, Spark, and Cassandra to practice stream processing.

apache-kafka apache-spark cassandra cassandra-server data-intensive id2221 kafka kafka-topic kth scala spark stream-processing

Last synced: 22 Jan 2025

https://github.com/omr5221/kafka-account-fraud-detector

Learning about Kafka and Spark with a project built on an existing project

kafka python spark superset

Last synced: 27 Jan 2025

https://github.com/manojpawar94/spark-scala-examples

Sample programs implemented using Apache Spark. The programs are built on the concepts of Spark RDD and Spark SQL DataFrame.

apache-spark spark spark-rdd spark-sql

Last synced: 13 Jan 2025

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 15 Feb 2025

https://github.com/ev2900/emr_studio_deployment

Example Jupyter notebook for EMR Studio

aws emr emr-studio spark

Last synced: 05 Nov 2024

https://github.com/tallamjr/jetspark

Spark cluster on Jetson TX2 mini-project

gpu nvidia spark tx2-jetpack

Last synced: 10 Feb 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/cn-docker/spark-master

Spark Master Docker Image

docker-image spark spark-master

Last synced: 27 Jan 2025

https://github.com/oracle-quickstart/oci-hortonworks

Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)

cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/casassg/thesis

Undergraduate final thesis: Big Data Analytics on Container Orchestrated Systems

casassg-thesis cassandra docker kubernetes latex spark thesis zeppelin

Last synced: 17 Dec 2024

https://github.com/bnvulpe/paperslab

The project aims to automate content classification and knowledge retrieval, and to analyze the temporal and thematic impact of research over a time period. It also contemplates letting users perform network analysis of communication within the community.

api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis

Last synced: 09 Feb 2025

https://github.com/fanqingsong/machine_learning_system_on_spark

A simple machine learning system demo (clustering and prediction on Iris data) for ML study. Based on the machine_learning_system repo, it adds a new process for an ML model service with Celery and Spark.

celery django machine-learning reactjs spark

Last synced: 14 Feb 2025

https://github.com/stefanofioravanzo/evolving-wikipedia-graph

Distributed processing of Wikipedia history files using Hadoop and Spark

distributed-processing hadoop-hdfs spark wikipedia

Last synced: 19 Jan 2025

https://github.com/briansterle/cluster-fastcopy

Copy data between HDFS clusters blazingly fast

bigdata distcp hadoop hdfs spark yarn

Last synced: 13 Feb 2025

https://github.com/ishaansathaye/csc369-introdistributedcomputing

Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing

distributed-computing hadoop java map-reduce scala spark

Last synced: 09 Feb 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/rdalmarco/datascience

Studies on data science, big data, and machine learning

estatistica pandas python r spark sql

Last synced: 03 Jan 2025

https://github.com/snexus/streaming-playground

Exploring streaming design patterns with Kafka and Spark Structured Streaming

kafka kafka-producer python spark spark-streaming

Last synced: 23 Jan 2025

https://github.com/darule0/sparkdiff

A rudimentary command line utility for contrasting Apache Spark event logs.

apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql

Last synced: 06 Feb 2025

https://github.com/kirbs-/svm

Apache Spark Version Manager

apache-spark spark

Last synced: 27 Jan 2025

https://github.com/opt-nc/opt-temps-attente-agences-camel

Pull data from opt-temps-attente-agences-api and store it in various systems

camel datascience dataviz glia innovation kafka opensearch relation-client spark

Last synced: 12 Dec 2024

https://github.com/angeligareta/machine-learning-spark

Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.

apache-spark machine-learning scala spark spark-classification spark-ml spark-mllib spark-regression spark-scala

Last synced: 22 Jan 2025

https://github.com/luisfalva/ophelia

Ophelian On Mars! More than a simple framework.

dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming

Last synced: 17 Dec 2024

https://github.com/tomwhite/single-cell-spark-demo

Experiments on Single Cell data from 10x Genomics using Apache Spark.

bioinformatics genomics single-cell spark

Last synced: 17 Jan 2025

https://github.com/damianmarti/7506-spark

Notebook for the classes of 75-06 Organización de Datos (Data Organization) - FIUBA

apache-spark pyspark spark

Last synced: 09 Feb 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/bomada/sparkify

This project is the final Capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark to engineer relevant features for predicting churn. Input data is related to the fictional music streaming service Sparkify (similar to Spotify and Pandora).

churn ml music portfolio python spark streaming

Last synced: 09 Feb 2025

https://github.com/mxagar/spark_big_data_guide

This repository contains my personal guide on Spark and topics related to Big Data.

big-data hadoop machine-learning spark

Last synced: 15 Feb 2025

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 18 Feb 2025

https://github.com/20cent16/airflow-spark

If you want to use Airflow with Spark, this is ready to use ;-)

airflow spark

Last synced: 14 Feb 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/librity/rtjvm_spark_essentials

Rock The JVM - Apache Spark Essentials

apache-spark big-data docker scala spark spark-sql

Last synced: 08 Jan 2025

https://github.com/tallamjr/epfl-functional-scala

Materials and worked assignments for Functional Programming with Scala Specialization on Coursera

big-data scala spark

Last synced: 10 Feb 2025

https://github.com/aldantanneo/bigints

WIP constant-time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/mukjepscarlet/bilibili-predict-recommend

[Big data course assignment] Bilibili assistant: video recommendation + trending prediction

bilibili flask hadoop html javascript prediction pyspark python recommendation spark

Last synced: 18 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline that processes comments from any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/tomfran/lastfm-users-analysis

Last.fm user data collection and analysis using Spark

gcp lastfm spark

Last synced: 06 Jan 2025

https://github.com/worst001/note_bigdata

A collection of various big data-related materials, notes, and manuals

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/darenr/spark-pca

Dimensionality reduction; scatter, hexbin, and KDE plots

pca python spark

Last synced: 05 Feb 2025

https://github.com/pedropark99/spark_map

Easily apply a function over multiple columns of a Spark DataFrame

pyspark python spark

Last synced: 28 Nov 2024

https://github.com/adelin-info/tp_datacloud

Architecture and development of large-scale distributed systems

hadoop java map-reduce scala spark yarn zookeeper

Last synced: 30 Jan 2025

https://github.com/melezhik/sparrowdo-spark

Quick Spark Installer for CentOS and Docker

centos spark sparrowdo

Last synced: 15 Feb 2025

https://github.com/najuzilu/dl-spark

Building a Data Lake with Spark

aws-emr aws-s3 data-engineering data-lake etl-pipeline spark

Last synced: 26 Jan 2025

https://github.com/mahi97/internship-elk-loganalysis

Report on the development and deployment of an ELK Stack for MCI BI software and servers to perform real-time log analysis

elasticsearch kafka kibana latex logstash mesos redis spark

Last synced: 05 Feb 2025

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/michaelg-create/bank-branch-footfall

Data engineering project to track simulated footfall in bank branches.

banking data-engineering insights orchestration pipeline real-time spark

Last synced: 17 Feb 2025

https://github.com/williamliu52/twitter-sc

Trending sports highlights from Twitter

nodejs python react reactjs scala spark twitter

Last synced: 23 Oct 2024

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/darule0/yarndiff

A rudimentary command line utility for contrasting Apache YARN container logs.

diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2

Last synced: 15 Feb 2025

https://github.com/mounirbs/spark-connect

Spark Connect: a docker-compose solution that sets up a Spark cluster with the Spark Connect feature enabled. Can be used for local development.

apache apache-spark docker docker-compose pyspark python spark

Last synced: 15 Feb 2025

https://github.com/ralgond/bigdata-example

Examples, details, and caveats for Hadoop, Hive, and Spark

bigdata hadoop hdfs hive map-reduce spark

Last synced: 09 Jan 2025

https://github.com/michelderu/cassandra-csv-analytics

How to leverage Astra, DSE and Spark for analytics on large CSV files.

astra cassandra spark

Last synced: 20 Jan 2025

https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks

San Francisco Fire Calls. Creating a Spark application on Databricks using PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.

databricks pyspark spark sql

Last synced: 16 Feb 2025

https://github.com/ev2900/glue_spark_history_server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws glue spark spark-history-server spark-ui

Last synced: 15 Feb 2025

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/bytemedirk/pyspark3-docker

PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.

aws docker docker-image python spark

Last synced: 13 Jan 2025