Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/bataeves/isparkcache

Jupyter module for caching Spark DataFrames produced by executing a cell

cache ipython jupyter pyspark spark

Last synced: 06 Feb 2025

https://github.com/apache/incubator-gluten-site

Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

gluten spark sql

Last synced: 04 Feb 2025

https://github.com/mangalaman93/dspark

Run Spark in Docker containers

big-data containers docker microservices spark

Last synced: 18 Jan 2025

https://github.com/oracle-quickstart/oci-spark

Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)

cloud oci oracle oracle-led spark terraform

Last synced: 07 Nov 2024

https://github.com/fdmsantos/aws-twitter-data-analytics

Project to learn data analytics in AWS using Twitter data

aws data-analytics data-engineering data-science data-visualization flink spark terraform

Last synced: 26 Jan 2025

https://github.com/renardeinside/dbx-kafka-protobuf-example

Sample code for working with Kafka & Protobuf in Databricks

databricks kafka protobuf scala spark spark-streaming

Last synced: 06 Feb 2025

https://github.com/longi94/lsde2017-p3-flight-visualization

Animated interactive flight visualization

ads-b big-data d3js spark

Last synced: 06 Jan 2025

https://github.com/brunneis/minebench

Proof-of-Work based benchmark written in Python that works with real Bitcoin data

benchmark bitcoin mining proof-of-work spark

Last synced: 26 Jan 2025

https://github.com/erikerlandson/pyspark-ubi

Minimalist install of pyspark on top of Red Hat UBI

container-image pyspark spark ubi

Last synced: 06 Jan 2025

https://github.com/renardeinside/terrametria

Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.

databricks deckgl python react spark

Last synced: 03 Dec 2024

https://github.com/mtpatter/bilao

Jupyter notebooks for filtering Kafka data with Spark Streaming.

avro docker jupyter-notebook kafka spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/hibadaoud/real-time-flight-data-kibana-visualization

Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.

css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark

Last synced: 10 Feb 2025

https://github.com/multivacplatform/multivac-fakenews

Detecting users and communities that propagate fake news on Twitter using Apache Spark

deep-learning fakenews machine-learning spark twitter

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-wikipedia

Wonderful reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark

machine-learning nlp spark spark-ml

Last synced: 12 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 02 Nov 2024

https://github.com/cloudtik/cloudtik

Cloud Scale Platform for Distributed Data, Analytics and AI

ai alibaba-cloud analytics aws azure cloud data-science deep-learning gcp kubernetes machine-learning microservices spark

Last synced: 16 Jan 2025

https://github.com/alfex4936/spark-studies

Apache Spark studies in Python

apache pyspark python spark

Last synced: 27 Jan 2025

https://github.com/joyceannie/us-immigrations-data-warehouse

A data warehouse to perform analytics on the immigration trends in the US.

airflow data-engineering etl pyspark redshift s3 spark

Last synced: 29 Jan 2025

https://github.com/yalishanda42/scala-recsys

Scala(-ble) recommender system architecture using functional programming (PoC)

cats cats-effect functional-programming movielens recommender-system recsys scala spark

Last synced: 28 Dec 2024

https://github.com/inbravo/spark-movie-lens

Various examples of analytics using Apache Spark

apache-spark scala spark

Last synced: 02 Feb 2025

https://github.com/kruglov-dmitry/yelp_data

End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration.

cassandra kafka spark streaming yelp-dataset

Last synced: 19 Jan 2025

https://github.com/kadnan/vagrant-spark2

Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).

pyspark python3 spark vagrant vagrant-boxes

Last synced: 20 Jan 2025

https://github.com/piotr-kalanski/spark-local

API for switching between the Spark execution engine and a fast local implementation based on Scala collections.

scala spark unit-testing

Last synced: 21 Dec 2024

https://github.com/anskarl/auxlib-spark-nlp

NLP utilities for Apache Spark

nlp opennlp scala spark

Last synced: 12 Feb 2025

https://github.com/bedrockstreaming/sparktest

A testing tool for Scala and Spark developers

scala spark

Last synced: 31 Dec 2024

https://github.com/yucl80/avrodemo

Write and append Avro data to an HDFS file

avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log

Last synced: 27 Jan 2025

https://github.com/simplexspatial/osm-facts

Proofs and checks of facts about the OSM PBF format and its data content

osm osm4scala scala spark

Last synced: 15 Jan 2025

https://github.com/tashi-2004/fma-a-dataset-for-music-analysis

🎶 Scripts for music feature analysis, model training, and real-time recommendation using Apache Kafka. Extract features with Librosa 🎹, store them in MongoDB 🗄️, and process the data with Apache Spark ⚡. A 🌐 web interface 💻✨ is also included. Contributors: Tashfeen Abbasi 👤, Laiba Mazhar 👤, and Rafia Khan 👤.

html kafka kafka-consumer kafka-producer kafka-streaming linux mongodb mongodb-compass python3 spark ubuntu web-application

Last synced: 03 Dec 2024

https://github.com/dustin-decker/elasticsearchsql

A simple example of using Apache Spark SQL against Elasticsearch 5

elasticsearch spark sql

Last synced: 29 Jan 2025

https://github.com/jinsyin/datalink

⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark, and Flink

batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming

Last synced: 15 Nov 2024

https://github.com/maxinexiong/item-based-collaborative-filtering

This project utilizes PySpark DataFrames and PySpark RDDs to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 similar movies for a given target movie that align with users’ preferences.

apache-spark collaborative-filtering movie-recommendation pyspark python spark spark-dataframes spark-rdd

Last synced: 21 Dec 2024

https://github.com/lucivpav/dnbc-scala

Parallel implementation of dynamic naive Bayesian classifier

apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark

Last synced: 12 Feb 2025

https://github.com/aiday-mar/spark-recommendation-engine

Movie recommendation system built using Spark and Scala

recommendation-system scala spark university-project

Last synced: 05 Jan 2025

https://github.com/manuparra/taller-bigdata-con-r

Big Data workshop with Apache Spark + R on Databricks cloud

bigdata cloudcomputing databricks r spark sparkr

Last synced: 27 Dec 2024

https://github.com/triandicAnt/TwitterSentimentAnalytics

Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.

machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics

Last synced: 23 Oct 2024

https://github.com/omar-besbes/football-big-data

A comprehensive solution for real-time football analytics, using Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, plus a custom-built search engine.

batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn

Last synced: 20 Jan 2025

https://github.com/r13i/spark-record-deduplicating

Data cleansing problem statement: data in a record is often duplicated. How do we find the probability that a record is a duplicate? [Work In Progress]

big-data deduplication record-linkage records-management scala spark

Last synced: 02 Feb 2025

https://github.com/garciparedes/scala-examples

Set of awesome Scala Examples

breeze functional-programming java scala spark

Last synced: 16 Jan 2025

https://github.com/tpvasconcelos/sparkypandy

It's not spark, it's not pandas, it's just awkward...

dataframe pandas pyspark spark

Last synced: 05 Nov 2024

https://github.com/chimera-suite/pysparql

This is a simple module that allows developers to query SPARQL endpoints and analyze the results with Apache Spark.

apache apache-spark construct-query dataframe graphframe jena-fuseki spark sparql

Last synced: 01 Dec 2024

https://github.com/adityajn105/apache-spark-tutorials

Apache Spark is a big data analysis framework.

bigdata pyspark spark spark-ml spark-rdd spark-tutorials

Last synced: 16 Jan 2025

https://github.com/viniciusmsousa/pyspark-ds-toolbox

A PySpark companion for data science tasks.

data-science spark

Last synced: 09 Feb 2025

https://github.com/yoongoing/bigdata_pyspark

⚡️ Let's do data mining using Spark, a publicly available MapReduce platform ⚡️

bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark

Last synced: 09 Feb 2025

https://github.com/boazmohar/pysparkutils

A collection of utilities for handling pySpark's SparkContext

pyspark python spark

Last synced: 09 Feb 2025

https://github.com/xpcosmos/injestao-dados-enem-sql

This project aims to structure ENEM data in databases and analyze the data using statistical methods.

docker docker-compose postgresql pyspark python spark sql statistics

Last synced: 14 Jan 2025

https://github.com/mcddhub/mcdd-big-data-study

Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

big-data data-processing docker flink hadoop kafka spark zookeeper

Last synced: 09 Feb 2025

https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search

This project utilizes PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes based on their appearances together in the same comic books, empowering users to discover connections between their favourite superheroes in the Marvel universe.

apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd

Last synced: 21 Dec 2024

https://github.com/gunantos/php-spark

PHP Server Develop

php server serverless-framework spark

Last synced: 13 Jan 2025

https://github.com/jabhij/crimerate_classification

A system that classifies crime descriptions into categories, helping the authorities assign officers to crimes based on the report.

classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow

Last synced: 17 Jan 2025

https://github.com/superruzafa/scala-spark-big-data

My solutions to Coursera's Big Data Analysis with Scala and Spark course

big-data coursera scala spark

Last synced: 30 Dec 2024

https://github.com/gmartinezramirez-old/data-science-portafolio

:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, Tensorflow, Scikit, Keras. Includes Classification, Regression, Time series, NLP, Deep learning, among others.

data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow

Last synced: 05 Dec 2024

https://github.com/jatin-8898/sparkwebsite

A clean and very interesting looking website. :sparkles:

bootstrap4 css html javascript spark typescript

Last synced: 17 Jan 2025

https://github.com/dvelkow/real_time_bulgarian_news_aggregator

An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.

portfolio python spark

Last synced: 12 Oct 2024

https://github.com/jms0522/hadoop_system

✅ Sets up a Hadoop ecosystem and builds a pipeline.

hadoop pipeline spark

Last synced: 11 Oct 2024

https://github.com/univalence/spark-plumbus

Collection of tools for Scala Spark

functional-programming scala spark

Last synced: 20 Jan 2025

https://github.com/aamend/texata-r2-2017

This project was created in 4 hours for the Texata Big Data World Championship.

bigdata gdelt hackathon spark texata

Last synced: 30 Dec 2024

https://github.com/derlin/workshop-data-sciences

Two-day workshop material introducing data science for big data

big-data data-science hdfs hive spark workshop workshop-materials zeppelin

Last synced: 20 Jan 2025

https://github.com/lepetitbloc/sparksd

:sparkler: Sparks wallet Docker container

cryptocurrency dockerfile masternode spark wallet

Last synced: 26 Jan 2025

https://mgrojo.github.io/adasearch/

Custom search engine for the Ada programming language

ada custom-search-google search-engine spark spark-ada

Last synced: 28 Nov 2024

https://github.com/mgrojo/adasearch

Custom search engine for the Ada programming language

ada custom-search-google search-engine spark spark-ada

Last synced: 27 Oct 2024

https://github.com/jaehyeon-kim/emr-local-dev

Spark Local Development Environment Using Docker (and vscode)

aws docker emr spark vscode

Last synced: 30 Oct 2024

https://github.com/pomadchin/vlm-performance

GeoTrellis RasterSources Ingest benchmark

aws emr geotrellis gis raster spark

Last synced: 17 Jan 2025

https://github.com/wittline/sparksql-with-python

This repository has some examples of using Spark and SparkSQL with Python through PySpark

flask-api python spark sparksql

Last synced: 29 Jan 2025

https://github.com/nhviet03/is405_bigdata_mapreduce_knn

A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark

knn-classification mapreduce pyspark spark

Last synced: 17 Jan 2025

https://github.com/hifly81/1brc_streaming

1brc challenge with streaming solutions for Apache Kafka

1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming

Last synced: 02 Nov 2024

https://github.com/makohn/lambda-architecture-poc

♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets

cassandra kafka lambda-architecture sbt scala spark

Last synced: 12 Feb 2025

https://github.com/brooksian/solrtosparknotebook

Connecting Solr and Spark In An Apache Zeppelin Notebook

solr spark zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/brooksian/ds_gtdb

KMeans Clustering on Global Terrorism Database

global-terrorism-database machine-learning spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark

Implementation of the static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach", Information Sciences, Volume 432, March 2018, Pages 278-300

data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming

Last synced: 30 Oct 2024

https://github.com/exacaster/delta-fetch

HTTP API on Delta Lake tables

big-data delta-lake parquet s3 spark

Last synced: 11 Nov 2024

https://github.com/rpytel1/supercomputing-labs

Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to perform processing and streaming of GDelt data segments.

big-data gdelt-data kafka scala spark

Last synced: 18 Jan 2025

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 08 Feb 2025

https://github.com/brooksian/epaairnow

Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin

spark sparksql time-series zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/angelotc/MacroDAG

A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.

airflow docker spark

Last synced: 06 Nov 2024

https://github.com/emso-exe/comercio_eletronico_brasileiro

Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.

analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql

Last synced: 16 Jan 2025

https://github.com/brooksian/sparkpipeline2mleapbundle

Convert Spark Pipeline Models to MLeap Bundles

mleap-bundle spark

Last synced: 19 Jan 2025

https://github.com/cclient/elasticsearch-spark-upsert-from-kafka

elasticsearch-hadoop does not officially support upsert doc; this is implemented by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }

elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc

Last synced: 16 Jan 2025

https://github.com/brooksian/sparkpipelinesparknlp

Build & Convert a Spark NLP Pipeline to PMML

corenlp nlp pmml spark zeppelin-notebook

Last synced: 19 Jan 2025