Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/brinthat/world-development-indicators

Exploring World Development Indicators: identifying relationships between health indicators using linear regression, and classifying income groups from health indicators using logistic regression.

lasso-regularization linear-regression python ridge-regression-model scatter-plot spark spark-sql

Last synced: 17 Jan 2025

https://github.com/aidenfockens/aiden_neel_healthevents

Consumes Kafka health events and stores them in AWS RDS, then uses Flask to query the data and perform EDA

flask kafka kubernetes spark sql

Last synced: 13 Jan 2025

https://github.com/dexterposh/azurehdinsight

Repository housing the artifacts to deploy the Hadoop clusters on Azure for my learning.

azure hadoop hdinsight-cluster learning-by-doing spark

Last synced: 04 Jan 2025

https://github.com/mustafahakkoz/preprocessing_w_spark

Preprocessing and feature extraction pipeline using Spark and Neo4j

cypher feature-engineering mongodb neo4j nlp spark

Last synced: 28 Jan 2025

https://github.com/spektom/data-formats-samples

Spark-based generator of sample data in different formats

avro json orc parquet spark

Last synced: 20 Jan 2025

https://github.com/dev88jerry/cs450

Bishop's University - CS450 Elements of Big Data

big-data data-science hadoop spark

Last synced: 08 Jan 2025

https://github.com/coreyauger/spark-ashley-madison-ml

Spark ML hacking on the Ashley Madison dataset.

scala spark

Last synced: 16 Jan 2025

https://github.com/maheshwarineeraj/quickcodeblocks

Collection of small reusable code blocks and automations

bigdata python spark

Last synced: 13 Jan 2025

https://github.com/robertdavidwest/blog-post-analysis

Analysing blog post data using Spark/Scala

scala spark

Last synced: 15 Jan 2025

https://github.com/sebastianruizm/cca175-exam-preparation

Backup of my preparation for Cloudera's CCA175 exam

hdfs mysql python spark sqoop

Last synced: 08 Jan 2025

https://github.com/chimera-suite/test-spark-datatypes

Test OntopSpark's OBDA datatype conversions to be compliant with W3C standards

obda spark test

Last synced: 03 Jan 2025

https://github.com/ammahmoudi/mapreduce-examples

MapReduce examples using pure Scala and then using Spark

map-reduce mapreduce scala spark spark-mapreduce

Last synced: 15 Jan 2025

https://github.com/shuuji3/spark-ceph-connector

🌟Spark Ceph Connector: Implementation of Hadoop Filesystem API for Ceph

apache-hadoop apache-spark ceph hadoop spark

Last synced: 27 Jan 2025

https://github.com/jofaval/mnist

Computer Vision with Neural Networks for handwritten digits recognition in 1998

classification computer-vision data-science deep-learning deep-neural-networks google-colab keras matplotlib pyspark python pytorch seaborn spark tensorflow

Last synced: 04 Feb 2025

https://github.com/billxsheng/oubre-sentiment-analysis

Complete data platform that performs sentiment analysis on tweets. Built using Cassandra, Kafka, Spark, Node, and React.

cassandra etl-pipeline java kafka nodejs sentiment-analysis spark twitter-api

Last synced: 17 Jan 2025

https://github.com/seilylook/spark_definition_guide_ch_3

Spark: The Definitive Guide - Chapter 3

spark structured-streaming

Last synced: 16 Jan 2025

https://github.com/jonny-binns/fase-practicals

Contains the practice work and code from lectures in the Formal Approaches to Software Engineering Module

ada ada-spark spark

Last synced: 18 Jan 2025

https://github.com/safaa-p/machine-failure-prediction

Predicting machine failure using machine learning on a synthetic dataset of 10,000 data points based on an existing milling machine

big-data classification clustering decision-tree machine-failure machine-learning neural-network pyspark-mllib spark spark-sql svm-classifier unbalanced-data

Last synced: 16 Jan 2025

https://github.com/hafizhhasyhari/Big-Data_AI_Streaming-Data-Visualization-2024

Big Data with Spark and Scala by hafizhhasyhari. Coursework during semester 3

big-data data-analytic data-besar data-science scala spark

Last synced: 08 Feb 2025

https://github.com/captainirs/hadoop-yarn-k8s

A sandbox for running a Hadoop-YARN cluster on Kubernetes

hadoop kubernetes spark yarn

Last synced: 11 Jan 2025

https://github.com/dutrevis/spark-resources-metrics-plugin

Spark plugin to retrieve metrics from a variety of cluster resources

scala spark

Last synced: 30 Dec 2024

https://github.com/non-neutralzero/spark-feature-engineering-toolkit

Snippets of spark/scala code used to do some handy feature engineering

data-engineering feature-engineering feature-extraction scala spark spark-sql

Last synced: 22 Jan 2025

https://github.com/francesco-biscaccia-carrara/bigdata_projects

Assignment repository for the Big Data Computing course at the University of Padova for the academic year 2023-2024.

big-data k-center-problem map-reduce reservoir-sampling spark spark-streaming sticky-sampling

Last synced: 22 Jan 2025

https://github.com/bst-depractice/spark_play

Setup and practice pyspark transformations

spark

Last synced: 22 Jan 2025

https://github.com/flynn3103/loadhouse

Loading data into the Lakehouse using JSON configuration and utilities for ETL tasks.

delta-lake spark

Last synced: 13 Jan 2025

https://github.com/offthetab/vkapi-ml-dataharvester

Pipeline to harvest data via VK API for ML analysis with hadoop and spark

hadoop hdfs hive linux mariadb python requests spark sqoop

Last synced: 30 Dec 2024

https://github.com/ayresgneto/use-case-gcp-etl

ELT pipeline on GCP. Technologies used: PostgreSQL, GCP Storage, Airflow (local), PySpark (local), BigQuery

airflow big-data bigquery data data-engineering etl gcp pipeline postgresql programming-oriented-object pyspark python spark

Last synced: 21 Jan 2025

https://github.com/springworks/node-scale-reader

Application that reads value off a scale from a local file stream

spark

Last synced: 22 Jan 2025

https://github.com/oyvinddd/scala

For learning purposes only

scala spark

Last synced: 31 Dec 2024

https://github.com/rurumimic/streaming

Streaming Systems

kafka spark

Last synced: 03 Jan 2025

https://github.com/matteofasulo/contact-center_databricks

Analysis of Contact Center data on DataBricks

databricks datamining pandas rdbms spark

Last synced: 20 Jan 2025

https://github.com/atechguide/spark

Spark Scripts Repository

spark spark-sql

Last synced: 31 Dec 2024

https://github.com/vigneshss-07/bigdata_technologies

This repo contains all technical knowledge and implementation of big data technologies.

big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql

Last synced: 16 Jan 2025

https://github.com/alchemine/realtime-trend-pipeline

Data pipeline for collecting and analyzing real-time trending search terms

airflow docker hadoop hive kafka python selenium spark

Last synced: 16 Jan 2025

https://github.com/moriyoshi/dummydf

Emulates the PySpark DataFrame API using pandas

dummy pandas pyspark spark testing

Last synced: 22 Jan 2025

https://github.com/luismanuelamengual/neogroup-sparks

Great Server framework with MVC oriented structure

framework mvc-pattern spark

Last synced: 22 Jan 2025

https://github.com/alexdyysp/sparkscala

learning notebook

spark

Last synced: 19 Jan 2025

https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp

This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc

dataproc pubsublite spark

Last synced: 13 Jan 2025

https://github.com/ankushkhanna/spark-common

Spark Commons, some hacks to simplify programming with Spark.

spark transfomer

Last synced: 17 Jan 2025

https://github.com/teo-sl/denver-crimes-dash

This is a simple dashboard made with Dash and Plotly (for the frontend) and Apache Spark (for the backend)

dash denver plotly spark

Last synced: 16 Jan 2025

https://github.com/teo-sl/us_flights_analysis

This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data

big-data classification dash dashboard flights machine-learning plotly regression spark usa

Last synced: 16 Jan 2025

https://github.com/fsanaulla/terling

Linguistic text analysis for detecting terrorist threats.

scala spark

Last synced: 17 Jan 2025

https://github.com/manuelmtzv/spark_flask_cluster

Docker configuration for setting up a cluster environment with Apache Spark and Flask.

devcontainer docker python spark

Last synced: 31 Jan 2025

https://github.com/afsalthaj/biggudeta-kyukyoku

This project is for personal learning and showcases an end-to-end solution in a Big Data environment.

ansible linux spark yarn

Last synced: 08 Jan 2025

https://github.com/emelis-ptr/sabd1

Project: Systems and Architectures for Big Data

big-data covid19 docker hdfs java spark

Last synced: 08 Jan 2025

https://github.com/yash-chauhan-dev/etl_airbnb_listing

A scalable ETL pipeline for transforming raw Airbnb listings data into a structured format for price and availability analysis. Built using Spark, HDFS, PostgreSQL, Airflow, and Docker.

airflow docker docker-compose etl etl-pipeline hdfs local-development postgresql python spark sql

Last synced: 11 Feb 2025

https://github.com/colinkiama/snippets

Code snippets used by the Spark Community

code-snippets snippets snippets-collection snippets-library spark uwp

Last synced: 14 Jan 2025

https://github.com/evertonsavio/spark-big-data-analitycs

Base Codes Repository for technologies related to Big Data such as Spark, Kafka, Storm and others. Languages: Python, Java

java kafka python spark

Last synced: 02 Jan 2025

https://github.com/manojkarthick/rightfluencer

An interactive web application and dashboard that allows you to find the right influencers for your brand by analyzing their posts, images and videos.

facebook google-cloud instagram marketing spark twitter youtube

Last synced: 12 Nov 2024

https://github.com/benmizrahi/duckspark

duckspark - A DuckDB based distributed data processing engine

data-engineering distributed-systems golang spark

Last synced: 27 Dec 2024

https://github.com/memojja/basic-recommendation-engine

For learning Apache Spark and the MLlib library

java-8 spark sparkmllib

Last synced: 21 Jan 2025

https://github.com/ndleah/stedi

Data Lakehouse solution for machine learning data

aws-athena aws-glue s3-bucket spark

Last synced: 12 Jan 2025

https://github.com/oracle-quickstart/oci-mapr

Terraform module to deploy MapR on Oracle Cloud Infrastructure (OCI)

cloud hadoop mapr oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/milankinen/las-emr

AWS EMR parallelized SeCo Lexical Analysis Services for big data

aws big-data emr finnish nlp spark text-processing

Last synced: 19 Jan 2025

https://github.com/librity/rtjvm_spark_tuning

Rock The JVM - Spark Performance Tuning with Scala

spark sparktuning tuning

Last synced: 08 Jan 2025

https://github.com/librity/rtjvm_spark_optimizations

Rock The JVM - Spark Optimizations with Scala

optimization scala spark

Last synced: 08 Jan 2025

https://github.com/piero24/big-data_hw_23-24

Exercises in Java and Spark for the Big Data Computing course at unipd

big-data clustering fft java mapreduce sampling spark streaming

Last synced: 08 Jan 2025

https://github.com/juanpablo70/arep-taller03

Web microframeworks

java spark webserver

Last synced: 22 Jan 2025

https://github.com/zsomborjoel/pyspark-basics

Teaching and learning the functionality of the Spark Python API on dataframes

basics dataframes spark

Last synced: 11 Feb 2025

https://github.com/anras5/nyc-yellow-taxi

Processing data streams with Kafka + Spark

docker google-cloud kafka postgresql spark spark-streaming

Last synced: 21 Jan 2025

https://github.com/gabrielenizzoli/spark_engine

Build a complex spark execution plan by composing many different spark operations.

spark sql yaml

Last synced: 12 Feb 2025

https://github.com/hadarsharon/compars

DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)

data-engineering data-profiling data-quality dataframe dataframes koalas pandas polars pyspark python rust spark

Last synced: 22 Jan 2025

https://github.com/rikukissa/sql-exercises

Java app for creating SQL exercises

java java-8 react spark sql

Last synced: 28 Jan 2025

https://github.com/banknatchapol/us-immigration-data-pipeline

Creates a data pipeline for US immigration data using Spark.

data-pipeline spark

Last synced: 27 Jan 2025

https://github.com/traunguyentvt/study_big_data_technology

Kafka, Spark Streaming, Spark SQL, Hive, Tableau

api hive json kafka spark spark-sql spark-streaming tableau

Last synced: 14 Jan 2025

https://github.com/mainak431/hadoop

Hadoop basics and different technologies

cassandra hbase hive mapreduce mongodb mysql pig spark

Last synced: 31 Jan 2025

https://github.com/iamdsc/bigdataanalytics

Using Spark with Python for analyzing Big Data.

big-data jupyter-notebook python spark

Last synced: 18 Jan 2025

https://github.com/sirnicholas1st/feedback_processor

This repository contains a simple Flask application that serves as a customer feedback form. The submitted data is sent to a Kafka topic. The Kafka consumer, implemented as a Spark application, processes the data and writes it to a Cassandra table for further analysis.

cassandra flask kafka spark

Last synced: 15 Jan 2025

https://github.com/justinjjlee/simulation-discrete

Employing data transformations and simulations to answer random questions

analytics data data-science julia python simulation spark

Last synced: 28 Jan 2025

https://github.com/sumanthvrao/ipl-spark-analysis

Predicts outcomes of IPL cricket matches for the year 2018 using the Spark MLlib framework.

decision-tree kmeans-clustering pyspark spark spark-mllib-library

Last synced: 08 Jan 2025

https://github.com/queraltsm/ada-spark-exercises

Collection of formal verification exercises in Ada with SPARK 2014

formal-verification spark

Last synced: 28 Jan 2025

https://github.com/owengregson/sparkev-ui

A recreation of the Tesla UI, made in HTML, CSS, and JS for my EV project.

css3 html5 javascript spark spark-ev tesla tesla-clone tesla-ui tesla-ui-clone ui ui-design web

Last synced: 22 Jan 2025

https://github.com/josericodata/mscdataanalyticssecondsemesterassignmentone

Summary of Assignment One from the Second semester of the MSc in Data Analytics program. This repository contains the CA1 assignment guidelines from the college and my submission. To see all original commits and progress, please visit the original repository using the link below.

advanced-data-analysis big-data big-data-storage-and-processing cct-college cnn-keras data-science dropout-layers dublin hadoop ireland jose-maria-rico-leal jose-rico jupyter-notebook machine-learning msc mysql neural-network rdbms spark ubuntu-linux

Last synced: 17 Jan 2025

https://github.com/bryanbill/tracker

Wildlife animal tracking application

animals handlebars java postgresql spark

Last synced: 26 Dec 2024