Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

GitHub: https://github.com/topics/spark
Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
Repo: https://github.com/apache/spark
Created by: Matei Zaharia
Released: May 26, 2014
Related Topics: scala, hadoop
Aliases: apache-spark
Last updated: 2025-02-11 00:28:31 UTC

https://github.com/fdmsantos/aws-twitter-data-analytics

Project to learn data analytics on AWS using Twitter data

aws data-analytics data-engineering data-science data-visualization flink spark terraform

Last synced: 26 Jan 2025

https://github.com/burhanahmed1/big-data-analytics

Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for big data analytics.

apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql

Last synced: 11 Oct 2024

https://github.com/akarce/udacity-data-pipeline-with-airflow

Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.

airflow minio postgresql pyspark spark

Last synced: 12 Oct 2024

https://github.com/thanaraklee/dataflow-with-gcp

This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.

airflow apache-spark data-engineering etl pandas spark

Last synced: 25 Dec 2024

https://github.com/angelcervera/poc-drivingdistance

Proof of concept of a service that calculates driving distance using the OSM network

akka openstreetmap osm osm4scala scala spark

Last synced: 10 Feb 2025

https://github.com/pedropark99/introd-pyspark

An open and introductory book for the Python API of Apache Spark (pyspark)

book pyspark python spark

Last synced: 14 Oct 2024

https://github.com/lmouhib/auto-register-spark-ui-k8s

A lightweight operator that automatically exposes the Spark UI and manages its ingress when running Spark on Kubernetes

spark spark-kubernetes spark-sql spark-streaming spark-ui

Last synced: 10 Feb 2025

https://github.com/bluejoe2008/hippo-rpc

Hippo Transport Library enhances spark-commons with easy stream management & handling

kraps rpc spark stream

Last synced: 10 Feb 2025

https://github.com/ashishgopalhattimare/parallel-concurrent-and-distributed-programming-in-java

Parallel, Concurrent, and Distributed Programming in Java | Coursera

block-isolation boruvka-algorithm concurrent-programming critical-section distributed-programming java-8 kafka locks mapreduce-java mpi parallel-programming rice-university spark synchronization threads

Last synced: 21 Jan 2025

https://github.com/figuran04/big-data

📃 Big Data practicum

anaconda big data hadoop hive mongodb pig spark

Last synced: 01 Nov 2024

https://github.com/pprzetacznik/datalake

Simple datalake

avro data-engineering kafka parquet schema-registry spark spark-structured-streaming

Last synced: 03 Feb 2025

https://github.com/timvw/adobe-analytics-datafeed-datasource

Apache Spark data source for Adobe Analytics Data Feed

adobe-analytics clickstream python scala spark

Last synced: 08 Nov 2024

https://github.com/gaelfoppolo/self-service-data-analytics

Data analysis made for business users

aws big-data data-analytics hadoop spark

Last synced: 03 Feb 2025

https://github.com/spratiher9/valido

PySpark ⚡ dataframe workflow ⚒ validator

apache apache-spark bigdata databricks decorators pyspark python3 spark testing

Last synced: 01 Feb 2025

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 02 Nov 2024

https://github.com/alfex4936/spark-studies

Apache Spark studies in Python

apache pyspark python spark

Last synced: 27 Jan 2025

https://github.com/piotr-kalanski/spark-local

API enabling switching between Spark execution engine and local fast implementation based on Scala collections.

scala spark unit-testing

Last synced: 21 Dec 2024

https://github.com/mgrojo/adasearch

Custom search engine for the Ada programming language

ada custom-search-google search-engine spark spark-ada

Last synced: 27 Oct 2024

https://github.com/hifly81/1brc_streaming

1brc challenge with streaming solutions for Apache Kafka

1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming

Last synced: 02 Nov 2024

https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark

Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300

data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming

Last synced: 30 Oct 2024

https://github.com/jaehyeon-kim/emr-local-dev

Spark Local Development Environment Using Docker (and vscode)

aws docker emr spark vscode

Last synced: 30 Oct 2024

https://github.com/yalishanda42/scala-recsys

Scala(-ble) recommender system architecture using functional programming (PoC)

cats cats-effect functional-programming movielens recommender-system recsys scala spark

Last synced: 28 Dec 2024

https://github.com/angelotc/MacroDAG

A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.

airflow docker spark

Last synced: 06 Nov 2024

https://github.com/xpcosmos/injestao-dados-enem-sql

This project aims to structure ENEM data in databases and analyze the data using statistical methods.

docker docker-compose postgresql pyspark python spark sql statistics

Last synced: 14 Jan 2025

https://github.com/conema/transe-pyspark

TransE implementation in Spark (pyspark)

aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings

Last synced: 21 Jan 2025

https://github.com/yucl80/learn-spark

java maven scala spark

Last synced: 27 Jan 2025

https://github.com/yucl80/avrodemo

Write and append Avro data to an HDFS file

avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log

Last synced: 27 Jan 2025

https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit

Docker CI/CD Pipeline

apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark

Last synced: 14 Jan 2025

https://github.com/aamend/texata-r2-2017

This project was created in 4 hours for the Texata Big Data World Championship.

bigdata gdelt hackathon spark texata

Last synced: 30 Dec 2024

https://github.com/garystafford/dataproc-java-demo

Demonstration of Google Cloud Dataproc for running Spark jobs with Java

big-data-analytics dataproc gcp google java spark

Last synced: 06 Dec 2024

https://github.com/superruzafa/scala-spark-big-data

My solutions to Coursera's Big Data Analysis with Scala and Spark course

big-data coursera scala spark

Last synced: 30 Dec 2024

https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search

This project uses PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes, based on their appearances together in the same comic books, so users can discover connections between their favourite superheroes in the Marvel universe.

apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd

Last synced: 21 Dec 2024

https://github.com/maxinexiong/item-based-collaborative-filtering

This project uses PySpark DataFrames and PySpark RDDs to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 movies similar to a given target movie that align with users' preferences.

apache-spark collaborative-filtering movie-recommendation pyspark python spark spark-dataframes spark-rdd

Last synced: 21 Dec 2024

https://github.com/oracle-quickstart/oci-spark

Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)

cloud oci oracle oracle-led spark terraform

Last synced: 07 Nov 2024

https://github.com/afsalthaj/supaku-sukara

Functional Programming, Functional Programming Exercise Solutions in Scala & Spark

functional-programming functor language monad parallelism scala shapeless spark typeclasses

Last synced: 08 Jan 2025

https://github.com/tuancamtbtx/reusable-bigdata-stack-on-k8s

Big data stack including Spark and Airflow, running on k8s

airflow bigdata docker k8s spark

Last synced: 02 Jan 2025

https://github.com/hb-chen/spark-elasticsearch-recommender

Zeppelin v0.8.0 notebook demonstrating how to build a recommender system with Spark v2.3.2 + Elasticsearch v6.3.2

elasticsearch recommender spark zeppelin

Last synced: 08 Jan 2025

https://github.com/badoo/hadoop-xargs

Util to run heterogeneous applications on Hadoop synchronously

hadoop java spark

Last synced: 12 Nov 2024

https://github.com/vasnake/spark.ml.spatialjointransformer

spark.ml.transformer: join two datasets using spatial relations

geospatial join ml-pipeline python scala spark spark-ml spatial transformer

Last synced: 03 Jan 2025

https://github.com/apache/incubator-gluten-site

Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

gluten spark sql

Last synced: 04 Feb 2025

https://github.com/gacwr/openuba-model-hub

frontend, model registry, model search, and model marketplace for OpenUBA

analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour

Last synced: 15 Jan 2025

https://github.com/kadnan/vagrant-spark2

Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).

pyspark python3 spark vagrant vagrant-boxes

Last synced: 20 Jan 2025

https://github.com/simplexspatial/osm-facts

Proofs and checks about osm pbf format and data content facts

osm osm4scala scala spark

Last synced: 15 Jan 2025

https://github.com/garciparedes/scala-examples

Set of awesome Scala Examples

breeze functional-programming java scala spark

Last synced: 16 Jan 2025

https://github.com/adityajn105/apache-spark-tutorials

Apache Spark is a big data analysis framework.

bigdata pyspark spark spark-ml spark-rdd spark-tutorials

Last synced: 16 Jan 2025

https://github.com/jabhij/crimerate_classification

Developing a system that could classify crime descriptions into different categories which would help the authorities to assign officers to crimes based on the report.

classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow

Last synced: 17 Jan 2025

https://github.com/jatin-8898/sparkwebsite

A clean and very interesting looking website. ✨

bootstrap4 css html javascript spark typescript

Last synced: 17 Jan 2025

https://github.com/nhviet03/is405_bigdata_mapreduce_knn

A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark

knn-classification mapreduce pyspark spark

Last synced: 17 Jan 2025

https://github.com/recipegrace/biglibrary

electric spark

Last synced: 17 Jan 2025

https://github.com/exacaster/delta-fetch

HTTP API on Delta Lake tables

big-data delta-lake parquet s3 spark

Last synced: 11 Nov 2024

https://github.com/emso-exe/comercio_eletronico_brasileiro

Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.

analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql

Last synced: 16 Jan 2025

https://github.com/brooksian/sparkpipelinesparknlp

Build & Convert a Spark NLP Pipeline to PMML

corenlp nlp pmml spark zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/extwiii/bigdata-uc.san.diego

Unlock Value in Massive Datasets - UC San Diego

big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark

Last synced: 28 Jan 2025

https://github.com/alexioannides/py-readme-snippets

This repository contains snippets of writing (in Markdown) on various topics relating to various flavours of Python development projects.

python python-library spark

Last synced: 17 Jan 2025

https://github.com/earthquakesan/twittertrends

Twitter Trends is a Spark Streaming example application

spark streaming

Last synced: 17 Jan 2025

https://github.com/cloudtik/cloudtik

Cloud Scale Platform for Distributed Data, Analytics and AI

ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark

Last synced: 16 Jan 2025

https://github.com/univalence/spark-plumbus

Collection of tools for Scala Spark

functional-programming scala spark

Last synced: 20 Jan 2025

https://github.com/comcast/spark-util

little spark util library

spark

Last synced: 14 Nov 2024

https://github.com/hibuz/hadoop-docker

🐳 hadoop ecosystems docker image

data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin

Last synced: 15 Nov 2024

https://github.com/kmohamedalie/big-data-hadoop-spark-lab

Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼

big-data coursera data-engineering docker hadoop ibm kubernetes spark

Last synced: 02 Jan 2025

https://github.com/jinsyin/datalink

⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink

batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming

Last synced: 15 Nov 2024

https://github.com/cclient/elasticsearch-spark-upsert-from-kafka

elasticsearch-hadoop does not officially support upsert doc; this is implemented by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }

elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc

Last synced: 16 Jan 2025

https://github.com/rpytel1/supercomputing-labs

Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to process and stream GDelt data segments.

big-data gdelt-data kafka scala spark

Last synced: 18 Jan 2025

https://github.com/wittline/sparksql-with-python

This repository has some examples of using Spark and SparkSQL with Python through PySpark

flask-api python spark sparksql

Last synced: 29 Jan 2025

https://github.com/open-datastudio/hive-metastore

Hive metastore on Staroid

hadoop hive hive-metastore kubernetes spark staroid

Last synced: 18 Nov 2024

https://github.com/brooksian/solrtosparknotebook

Connecting Solr and Spark In An Apache Zeppelin Notebook

solr spark zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/brooksian/ds_gtdb

KMeans Clustering on Global Terrorism Database

global-terrorism-database machine-learning spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/brooksian/epaairnow

Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin

spark sparksql time-series zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/brooksian/sparkpipeline2mleapbundle

Convert Spark Pipeline Models to MLeap Bundles

mleap-bundle spark

Last synced: 19 Jan 2025

https://github.com/stefen-taime/investissement

Jenkins Delta pipeline

delta-lake jenkins-pipeline minio spark

Last synced: 23 Jan 2025

https://github.com/wtanaka/ansible-role-apache-spark

Ansible role to install Apache Spark

ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming

Last synced: 23 Jan 2025

https://github.com/joyceannie/us-immigrations-data-warehouse

A data warehouse to perform analytics on the immigration trends in the US.

airflow data-engineering etl pyspark redshift s3 spark

Last synced: 29 Jan 2025

https://github.com/dustin-decker/elasticsearchsql

A simple example of using Apache Spark SQL against Elasticsearch 5

elasticsearch spark sql

Last synced: 29 Jan 2025

https://github.com/chimera-suite/pysparql

This is a simple module that allows developers to query SPARQL endpoints and analyze the results with Apache Spark.

apache apache-spark construct-query dataframe graphframe jena-fuseki spark sparql

Last synced: 01 Dec 2024

https://github.com/tashi-2004/fma-a-dataset-for-music-analysis

🎶 Scripts for music feature analysis, model training, and real-time recommendation using Apache Kafka. Extract features with Librosa 🎹, store them in MongoDB 🗄️, and process the data with Apache Spark ⚡. A 🌐 web interface 💻✨ is also included. Contributors: Tashfeen Abbasi 👤, Laiba Mazhar 👤, and Rafia Khan 👤.

html kafka kafka-consumer kafka-producer kafka-streaming linux mongodb mongodb-compass python3 spark ubuntu web-application

Last synced: 03 Dec 2024

https://github.com/renardeinside/dbx-kafka-protobuf-example

Sample code for working with Kafka & Protobuf in Databricks

databricks kafka protobuf scala spark spark-streaming

Last synced: 06 Feb 2025

https://github.com/bataeves/isparkcache

Jupyter module for caching Spark DataFrames produced by executing a cell

cache ipython jupyter pyspark spark

Last synced: 06 Feb 2025

https://github.com/kanchishimono/spark-on-k8s-images

Docker images for spark on kubernetes

docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes

Last synced: 28 Nov 2024

https://github.com/hussaintaj-w/spark_submit_project

An easy to use script that automatically adds files to the spark-submit command.

python spark spark-submit

Last synced: 23 Jan 2025

https://github.com/engineering-research-and-development/fiware-orion-pyspark-connector

Bidirectional Orion/Orion-LD <--> PySpark Connector

cognitive fiware ngsi ngsi-ld ngsi-v2 orion orion-context-broker orion-ld processing pyspark python spark

Last synced: 17 Jan 2025

https://github.com/tranthe170/nyc-taxi-pipeline

Building a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.

airflow delta-lake hive lakehouse presto python s3 spark superset

Last synced: 17 Jan 2025

https://github.com/codelytv/spark-best_practices_and_deploy-course

Deploy Spark course examples

apache-spark deploy spark

Last synced: 03 Dec 2024

https://github.com/renardeinside/terrametria

Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.

databricks deckgl python react spark

Last synced: 03 Dec 2024

https://github.com/bedrockstreaming/sparktest

A testing tool for Scala and Spark developers

scala spark

Last synced: 31 Dec 2024

https://github.com/hupe1980/docker_pyspark_notebook

Docker Compose setup for PySpark

docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber

Last synced: 02 Feb 2025

https://github.com/pomadchin/vlm-performance

GeoTrellis RasterSources Ingest benchmark

aws emr geotrellis gis raster spark

Last synced: 17 Jan 2025

https://github.com/ahmetfurkandemir/data-engineering-tools

Data Engineering Tools

adminer airflow datanode flink hadoop hdfs hive hue kafka livy namenode postgresql spark

Last synced: 17 Jan 2025

https://github.com/ichowdhury01/match

A social networking platform that allows users to find friends with similar interests in their area.

geolocation-api jdbc maven mysql pbkdf2 spark

Last synced: 06 Feb 2025

https://github.com/jgperrin/net.jgp.labs.spark.football

Having fun with soccer stats and Spark

java java8 soccer spark sparkjava worldcup

Last synced: 03 Jan 2025

https://github.com/aveek-saha/cricket-score-predictor

A Big data application to predict the outcome of a T20 cricket match.

big-data big-data-analytics clustering pyspark spark spark-mllib

Last synced: 24 Dec 2024

https://github.com/cwienberg/spark-sorting-helpers

Helper library for using secondary sorting in Spark RDD and Dataset operations

scala spark

Last synced: 23 Jan 2025

https://github.com/pankajsingh09/data_engineering_using_aws

This repository contains content related to data engineering using AWS

aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark

Last synced: 19 Dec 2024

https://github.com/fpopic/gg-interview-challenge

(Interview) GG Interview Challenge in Scala/Spark

apache-spark json logstash parsing regex scala spark sparksql

Last synced: 10 Jan 2025

https://github.com/alvarogarcia7/bank-kata-kotlin

Bank pet project, in Kotlin. See interests as topics

api-first api-standard bank-kata blackbox-testing etude finite-state-machine gradle gradlew hateoas junit junit5 kata kotlin multimodule paypal-rest-api practice spark sparkjava trikitrok with-client

Last synced: 10 Jan 2025

https://github.com/jldbc/big-data

Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark

big-data hadoop pagerank recommender-system spark

Last synced: 04 Jan 2025

https://github.com/bria222/animal2

heroku-deployment java postgres spark velocity

Last synced: 04 Jan 2025

https://github.com/mangalaman93/dspark

Run spark in docker containers

big-data containers docker microservices spark

Last synced: 18 Jan 2025

https://github.com/mtpatter/bilao

Jupyter notebooks for filtering Kafka data with Spark Streaming.

avro docker jupyter-notebook kafka spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/jbris/docker-spark-sparklyr

Docker setup for Apache Spark and the R sparklyr package

adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-fakenews

Detecting users and communities that propagate fake news on Twitter, using Apache Spark

deep-learning fakenews machine-learning spark twitter

Last synced: 12 Jan 2025

Apache Spark Awesome Lists

awesome-ada: 415
awesome-AI-kubernetes: 65
awesome-pulsar: 35
Apache-Spark-Guide: 672
awesome-data-pipeline: 80
awesome-spark: 37
awesome-azure-databricks: 62
awesome-datalake: 43

Apache Spark Categories

Libraries: 158
Education: 149
Azure Cosmos DB: 89
Azure ML: 85
Azure Networking: 82
Uncategorized: 72
Components: 53
Frameworks: 50
Reinforcement Learning Learning Resources: 46
SQL/NoSQL Tools and Databases: 44
ML Frameworks, Libraries, and Tools: 37
Table of Contents: 36
Tools: 29
Hardware and Embedded: 28
Applications: 27
NLP Tools, Libraries, and Frameworks: 26
Java Tools, Libraries, and Frameworks: 24
R Tools, Libraries, and Frameworks: 22
Python Frameworks and Tools: 21
Computer Vision Tools, Libraries, and Frameworks: 20