Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/vicnesterenko/apache-spark-labs

Base programs with datasets

apache-spark kpi-fict kpi-ua spark

Last synced: 10 Jan 2025

https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer

Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame

kafka kafka-consumer scala spark spark-streaming

Last synced: 07 Feb 2025

https://github.com/hsm207/demo-spark-weaviate

How to set up a dev environment to work with Spark and Weaviate

big-data etl kafka python spark weaviate

Last synced: 14 Jan 2025

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 07 Feb 2025

https://github.com/ronaldkanyepi/log-realtime-analysis

A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.

dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization

Last synced: 25 Dec 2024

https://github.com/fsanaulla/spark-http-rdd

RDD primitive for fetching data from an HTTP source

scala spark

Last synced: 14 Feb 2025

https://github.com/ferranbt/sparkanywhere

Run Apache Spark multicloud and serverless

kubernetes serverless spark

Last synced: 01 Jan 2025

https://github.com/vermicida/data-lake

Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program

aws-s3 data-engineering data-lake etl-pipeline python spark

Last synced: 26 Dec 2024

https://github.com/arun-george-zachariah/twitteranalytics

Web application to visualize analytic Spark SQL queries executed on tweets about five famous brands: Adidas, Nike, Puma, Skechers, and Reebok.

analytics distributed-computing docker spark twitter

Last synced: 26 Dec 2024

https://github.com/tuancamtbtx/etl-spark-k8s

ETL With Apache Spark Deployed on K8s

apache k8s spark spark-sql spark-streaming

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/python-spark-example

Spark template to submit to cluster

python spark

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/bigdata-spark-processing

Spark Batch Process

spark

Last synced: 02 Jan 2025

https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database, using Apache Spark RDDs to implement the queries.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs to implement the queries.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

This repo performs sentiment analysis on an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixing Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025
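
The Jaccard index named in the entry above is the size of the intersection of two sets divided by the size of their union. The following is a minimal sketch in plain Python rather than PySpark, and it is not taken from the repository: the `ratings` data, the movie names, and the choice to compare movies by the sets of users who rated them are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data: movie name -> set of user ids who rated it.
ratings = {
    "movie_a": {1, 2, 3},
    "movie_b": {2, 3, 4},
    "movie_c": {5},
}

# Score every unordered pair of movies once.
similarities = {
    (m1, m2): jaccard(ratings[m1], ratings[m2])
    for m1, m2 in combinations(sorted(ratings), 2)
}

for pair, score in similarities.items():
    print(pair, score)
```

In a distributed setting the pairwise step is usually the expensive part, which is presumably why a framework such as Spark is used for it.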

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on the top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/aldantanneo/bigints

WIP constant-time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/worst001/note_bigdata

A collection of materials, notes, and manuals related to big data

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/morinian/pyspark

Studies with PySpark

spark

Last synced: 18 Jan 2025

https://github.com/dohabanoui/spark-structured-streaming

Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.

docker spark spark-streaming spark-structured-streaming

Last synced: 19 Jan 2025

https://github.com/ewertondrigues02/engenharia-de-dados

Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI

airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark

Last synced: 19 Jan 2025

https://github.com/anant/playbook

Anant Platform Playbook: principles, patterns, tools, a framework, and an approach to designing, building, and managing platforms.

approach cassandra confluent datastax framework kafka platform playbook spark

Last synced: 19 Jan 2025

https://github.com/anant/example-cassandra-spark-sql

Cassandra Data Operations with Spark SQL

cassandra data-operations docker etl spark spark-sql

Last synced: 19 Jan 2025

https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks

Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)

cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql

Last synced: 19 Jan 2025

https://github.com/brooksian/censussipp

Reproducing Census SIPP Reports Using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 19 Jan 2025

https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency

Use Spark with Livy along with Application Insights. Learn to host your external dependencies in data lake.

application-insights azure azure-data-lake hdinsight java livy spark spark2

Last synced: 31 Jan 2025

https://github.com/thom-x/graalvm-java-docker

Example of creating a native image of a Java app with GraalVM, Gradle, and Docker

docker graalvm gradle java spark

Last synced: 25 Jan 2025

https://github.com/imransilvake/semantic-partitioning

RDF Data (N-Triples) Partition and SPARQL Query Layer for SANSA-Stack using Scala and Spark.

big-data n-triples scala spark sparql

Last synced: 31 Jan 2025

https://github.com/riversun/ml-fake-data-maker

Generate fake data for machine learning like regression analysis

arff arff-generator dummy-data fake-data generator machine-learning prediction regression spark weka

Last synced: 01 Feb 2025

https://github.com/morgan-sell/usa-tourism-etl

Coalesced and transformed various data sources to create a comprehensive data lake for the USA tourism sector.

aws data-engineering data-lake emr-cluster etl-pipeline python spark

Last synced: 08 Feb 2025

https://github.com/mounirbs/spark-livy

Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint

apache apache-spark docker docker-compose livy pyspark python spark

Last synced: 08 Feb 2025

https://github.com/viyadb/viyadb-spark

Data processing and ingestion backend for ViyaDB based on Spark Streaming

spark spark-streaming spark-streaming-kafka viyadb

Last synced: 08 Feb 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/damianmarti/7506-spark

Notebooks from the classes of 75-06 Organización de Datos (Data Organization) - FIUBA

apache-spark pyspark spark

Last synced: 09 Feb 2025

https://github.com/duyet/spark-docker

Spark image for running on Kubernetes

docker docker-image hacktoberfest spark

Last synced: 05 Feb 2025

https://github.com/leo-the-nardo/combopurifier

Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse

airflow argocd aws-glue-catalog aws-lake-formation aws-s3 aws-sqs data-lake delta-lake eks minio spark terraform

Last synced: 05 Jan 2025

https://github.com/renardeinside/nocturne

Useful elements and building blocks for scalable Deep Learning applications on Databricks.

databricks deep-learning gpu horovod petastorm spark

Last synced: 06 Feb 2025

https://github.com/gabrielenizzoli/spark_engine

Build a complex spark execution plan by composing many different spark operations.

spark sql yaml

Last synced: 12 Feb 2025

https://github.com/karimosman89/iot-predictive-maintenance

This repository will simulate an IoT-based predictive maintenance system designed to monitor industrial equipment through sensors. It will include data ingestion, processing, and machine learning components to predict potential failures, optimizing maintenance schedules and reducing downtime.

api cloud-platform dashboard data-collection data-processing deployment iot-platform predictive-analytics pressure-sensor real-time-sensor sensors spark temperature-sensor vibration

Last synced: 05 Jan 2025

https://github.com/elfn/data-engineering-machine-learning-predictiveai

[SUPINFO PROJECT] Data science and Big Data (Spark, python, R, ....)

ai jupyter-notebook machine-learning mongodb prediction-ai python r rstudio spark

Last synced: 07 Feb 2025

https://github.com/nagpritam/identification-of-trucks-and-potential-risky-driver-using-databricks-spark-api-

The project is intended to identify trucks based on their model, fuel consumption, driving behaviors, and past records of violations and accidents

databricks hadoop hive powerbi python3 spark

Last synced: 13 Feb 2025

https://github.com/curusarn/spark-context-with

Python guard/wrapper for SparkContext from pyspark that lets you use the Python `with` statement with SparkContext

guard python-operator spark sparkcontext

Last synced: 30 Jan 2025
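
A guard of the kind described in the entry above can be sketched with `contextlib.contextmanager`. This is an illustration of the general pattern, not the repository's code: `FakeContext` is a hypothetical stand-in so the example runs without Spark installed, and `managed` is an assumed helper name. The only behaviour relied on is that the wrapped object exposes a `stop()` method, as `SparkContext` does.

```python
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in for a SparkContext; it only records whether stop() ran."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


@contextmanager
def managed(ctx):
    """Yield ctx to the with-body and call ctx.stop() on exit, even if the body raises."""
    try:
        yield ctx
    finally:
        ctx.stop()


ctx = FakeContext()
with managed(ctx) as sc:
    # Inside the block the context is still live.
    print("stopped inside block:", sc.stopped)
print("stopped after block:", ctx.stopped)
```

The `try`/`finally` is the point of the pattern: cleanup runs whether the block finishes normally or raises.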

https://github.com/omalperera/midget-sparkapps

Independent Spark applications & jobs to discover the various spark functionalities & kafka integrations

kafka-client spark spark-streaming

Last synced: 31 Jan 2025

https://github.com/juanpablo70/arep-taller03

Microframeworks Web

java spark webserver

Last synced: 22 Jan 2025

https://github.com/vitalibo/distributed-alarm-system

Simple distributed alarm system on top of Apache Spark

aws azure spark

Last synced: 27 Dec 2024

https://github.com/flaviostutz/spark-submit-scala

Spark submit extension from bde2020/spark-submit for Scala with SBT

bigdata sbt scala spark spark-cluster spark-submit

Last synced: 06 Feb 2025

https://github.com/flaviostutz/spark-scala-hdfs-docker-example

Spark with Scala reading/writing files to HDFS with automatic additions of new Spark workers using Docker "scale"

datanode docker example hdfs namenodes scala scale spark spark-workers

Last synced: 06 Feb 2025

https://github.com/vjcitn/biocpyinterop

Material for Bioconductor 2023 workshop on interoperation with python

basilisk bioconductor cite-seq genetics hail reticulate scvi-tools single-cell-omics spark

Last synced: 09 Jan 2025

https://github.com/ndleah/stedi

Data Lakehouse solution for machine learning data

aws-athena aws-glue s3-bucket spark

Last synced: 12 Jan 2025

https://github.com/vitalibo/aws-glue-java

Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.

aws glue spark

Last synced: 27 Dec 2024

https://github.com/colinkiama/snippets

Code snippets used by the Spark Community

code-snippets snippets snippets-collection snippets-library spark uwp

Last synced: 14 Jan 2025

https://github.com/samlet/sagas-spark

Use Structured Streaming as an OLAP engine

olap spark

Last synced: 22 Dec 2024

https://github.com/smaddanki/data-science

Code blocks, algorithms, and research snippets in Data Science, Machine Learning, AI & Quant Finance.

deep-learning machine-learning pytorch scikit-learn spark

Last synced: 08 Feb 2025

https://github.com/fsanaulla/terling

Linguistic text analysis for detecting dangerous terrorist content.

scala spark

Last synced: 17 Jan 2025

https://github.com/cn-docker/spark-worker

Spark Worker Docker Image

docker-image spark spark-worker

Last synced: 27 Jan 2025

https://github.com/teo-sl/us_flights_analysis

This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data

big-data classification dash dashboard flights machine-learning plotly regression spark usa

Last synced: 16 Jan 2025

https://github.com/sephiroth7712/k-nearest-neigbours

Implementation of K-Nearest Neighbors algorithm using multiple parallel computing approaches: CUDA (GPU), Hadoop, Spark, MPI, OpenMP, and PThreads. Demonstrates scalable machine learning across different parallel computing paradigms from GPU to distributed frameworks.

cuda cuda-programming hadoop-mapreduce java mpi multiprocessing multithreading openmp pthreads scala spark

Last synced: 06 Feb 2025
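
For reference, the K-Nearest Neighbors algorithm named in the entry above can be sketched sequentially in a few lines. This is a plain-Python illustration under assumed toy data, not code from the repository; `knn_predict` and the sample points are hypothetical. The distance from the query to each training point is computed independently of the others, which is the part that parallel implementations typically distribute.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (point, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
    ((5.0, 6.0), "b"),
]

print(knn_predict(train, (0.2, 0.1), k=3))
print(knn_predict(train, (5.5, 5.5), k=3))
```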

https://github.com/higorcazuza81/courses

Repository showcasing my educational journey in Quantitative Analysis, including projects and coursework in SQL, Python, Data Science, Machine Learning, and financial modeling. Focused on real-world applications in quantitative finance, data analysis, and statistical modeling.

airflow automation database dataengineering python shell-script spark sql

Last synced: 06 Feb 2025

https://github.com/eolecvk/intro_spark_twitter

Introduction to text mining with Spark

pyspark spark text-analysis text-mining

Last synced: 07 Feb 2025

https://github.com/kolia1985/kolia1985

Mykola Melnyk profile

data-engineering data-science spark

Last synced: 06 Feb 2025

https://github.com/konradmalik/scala-seed

Seed project for dockerized Scala with included Spark and Cassandra.

cassandra docker makefile multimodule sbt scala seed spark template typesafe-config

Last synced: 17 Jan 2025

https://github.com/konradmalik/spark

Dockerized Spark with tools

docker hadoop scala spark

Last synced: 17 Jan 2025