Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/multivacplatform/multivac-elasticsearch

Demoing Spark 2.2 and Elasticsearch Hadoop connector

elasticsearch hadoop spark

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-nlp

Testing and benchmarking some of the existing NLP libraries in Apache Spark

nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec

Last synced: 12 Jan 2025

https://github.com/f-lab-edu/league-of-legends-data-solution

Benchmarking 'League of Legends', this project provides a data solution so that data flows smoothly in real time through an API that generates player behavior events.

airflow dataengineering spark

Last synced: 14 Feb 2025

https://github.com/renardeinside/databricks-jobs-jsonnet

Example project with Databricks jobs and configuration management via jsonnet

databricks jsonnet spark

Last synced: 06 Feb 2025

https://github.com/vubacktracking/hdfs-stream-processing

Streaming data processing using Hadoop HDFS, Spark, Kafka, Minio, Elasticsearch

airflow elastic hadoop hdfs kafka kibana minio spark

Last synced: 14 Feb 2025

https://github.com/shayartt/streaming-orders

Project to stream real-time orders and apply ETL pipelines and analytics using Databricks, Kafka, and AWS

databricks etl kafka python spark spark-streaming

Last synced: 13 Feb 2025

https://github.com/vasnake/artefacts-2019_2023

Collection of some interesting pieces of my projects. Spark, Scala, Python, sh

catalyst etl ml scala spark udaf udf

Last synced: 17 Jan 2025

https://github.com/divithraju/divith-raju-data-mining

This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.

algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark

Last synced: 17 Jan 2025

https://github.com/morinian/pyspark

Studies with PySpark

spark

Last synced: 18 Jan 2025

https://github.com/drsnowbird/nlp-deeplearning-projects

NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)

chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow

Last synced: 13 Jan 2025

https://github.com/bishalpaudel/sparkhbaseloganalyzer

Spark and HBase based Apache Access Log Analyzer

big-data cloudera hbase scala spark

Last synced: 06 Jan 2025

https://github.com/dohabanoui/spark-structured-streaming

Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.

docker spark spark-streaming spark-structured-streaming

Last synced: 19 Jan 2025

https://github.com/ewertondrigues02/engenharia-de-dados

Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI

airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark

Last synced: 19 Jan 2025

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 15 Feb 2025

https://github.com/annettaqi/spam-detection

Using Stochastic gradient descent to classify emails into spam or ham

spark stochastic-gradient-descent

Last synced: 13 Feb 2025

https://github.com/tallamjr/jetspark

Spark cluster on Jetson TX2 mini-project

gpu nvidia spark tx2-jetpack

Last synced: 10 Feb 2025

https://github.com/zkan/machine-learning-with-spark-and-zeppelin

Machine Learning with Apache Spark & Zeppelin

pyspark python spark zeppelin

Last synced: 12 Feb 2025

https://github.com/mukjepscarlet/bilibili-predict-recommend

[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction

bilibili flask hadoop html javascript prediction pyspark python recommendation spark

Last synced: 18 Jan 2025

https://github.com/mjngxwnj/olympics_data_project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.

airflow docker hadoop python snowflake spark superset

Last synced: 09 Feb 2025

https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra

This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.

cassandra ml-100k mongodb python spark

Last synced: 17 Jan 2025

https://github.com/anant/playbook

Anant Platform Playbook - Consists of principles, patterns, tools, a framework, and an approach to designing, building, and managing platforms.

approach cassandra confluent datastax framework kafka platform playbook spark

Last synced: 19 Jan 2025

https://github.com/anant/example-cassandra-spark-sql

Cassandra Data Operations with Spark SQL

cassandra data-operations docker etl spark spark-sql

Last synced: 19 Jan 2025

https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks

Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)

cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql

Last synced: 19 Jan 2025

https://github.com/chrispyl/learning-latent-representations-for-nitrogen-response-rate-prediction

Implementation for the paper 'Learning latent representations for operational nitrogen response rate prediction'

neural-networks python spark

Last synced: 17 Jan 2025

https://github.com/brooksian/censussipp

Reproducing Census SIPP Reports Using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/imvision12/real-time-tracking

Real time bus tracking using MTA bus API

flask hadoop javascript leaflet python spark

Last synced: 08 Feb 2025

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 19 Jan 2025

https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency

Use Spark with Livy along with Application Insights. Learn to host your external dependencies in data lake.

application-insights azure azure-data-lake hdinsight java livy spark spark2

Last synced: 31 Jan 2025

https://github.com/tallamjr/epfl-functional-scala

Materials and worked assignments for Functional Programming with Scala Specialization on Coursera

big-data scala spark

Last synced: 10 Feb 2025

https://github.com/exasol/spark-connector-common-java

Common library for Exasol Apache Spark based connectors

apache-spark exasol exasol-integration spark streaming

Last synced: 09 Feb 2025

https://github.com/tuancamtbtx/java-spark-example

Spark ETL Generic Processor

etl spark

Last synced: 02 Jan 2025

https://github.com/binwenwu/oge-computation-ogc

A computing project corresponding to an OGC style API

geotrellis scala spark

Last synced: 13 Feb 2025

https://github.com/tomwhite/single-cell-spark-demo

Experiments on Single Cell data from 10x Genomics using Apache Spark.

bioinformatics genomics single-cell spark

Last synced: 17 Jan 2025

https://github.com/fbraza/data-processing-scala-spark

A repository containing Scala code that uses Spark to process a log data file. The full procedure for running the application is in the README.md file.

scala spark

Last synced: 26 Jan 2025

https://github.com/manojpawar94/spark-scala-examples

I have implemented sample programs using Apache Spark. The programs are built on the concepts of Spark RDDs and Spark SQL DataFrames.

apache-spark spark spark-rdd spark-sql

Last synced: 13 Jan 2025

https://github.com/crazybber/go-jupyter

Exploring big data with Spark in JupyterLab

bigdata jupyter-notebook jupyterlab rdd spark

Last synced: 28 Jan 2025

https://github.com/bnvulpe/paperslab

The project aims to automate content classification and knowledge retrieval, as well as to perform analysis on the temporal and thematic impact on research over a time period. In addition, the possibility of performing network analysis to analyze communication in the community is contemplated for users.

api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis

Last synced: 09 Feb 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/wadiebenabdouh/socialmedia-usage-pipeline

Data from Kaggle covering a wide range of users of different ages, genders, and interests.

apache-spark data-visualization jupyter-notebook kaggle pyspark python spark

Last synced: 16 Jan 2025

https://github.com/vicnesterenko/apache-spark-labs

Base programs with datasets

apache-spark kpi-fict kpi-ua spark

Last synced: 10 Jan 2025

https://github.com/gnaneshkunal/scala-hadoop

Hadoop programming using Scala

big-data bigdata hadoop scala spark sql

Last synced: 09 Feb 2025

https://github.com/wtsi-hgi/hgi-cloud

Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger

ansible hail iac openstack packer spark terraform

Last synced: 28 Nov 2024

https://github.com/thom-x/graalvm-java-docker

Example of creating a native image of a Java app with GraalVM, Gradle, and Docker

docker graalvm gradle java spark

Last synced: 25 Jan 2025

https://github.com/tom474/data_pipeline_with_aws

[RMIT 2024C] EEET2574 - Big Data for Engineering - Group Project

aws data-engineering data-science data-visualization machine-learning mongodb python spark

Last synced: 20 Feb 2025

https://github.com/omr5221/kafka-account-fraud-detector

Learning about Kafka and Spark with a project built off an existing project

kafka python spark superset

Last synced: 27 Jan 2025

https://github.com/sankamuk/aws-kinesis-redshift-sparkstream

Spark Structured Streaming from AWS Kinesis and Redshift

aws kinesis pyspark redshift spark structured-streaming terraform

Last synced: 13 Jan 2025

https://github.com/brooksian/twittersentimentsparkcorenlp

Twitter Sentiment Analysis Using Spark CoreNLP

nlp-machine-learning spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/darenr/spark-pca

Dimensional reduction, Scatter, Hexbin and kde plots

pca python spark

Last synced: 05 Feb 2025

https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer

Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame

kafka kafka-consumer scala spark spark-streaming

Last synced: 07 Feb 2025

https://github.com/facaiy/spark-for-the-impatient

A collection of short code snippets for impatient readers who want to start using Spark right away.

spark spark-training tutorial

Last synced: 20 Jan 2025

https://github.com/pierrekieffer/genericsupervisedmachinelearning

Generic supervised machine learning application

machine-learning spark

Last synced: 07 Feb 2025

https://github.com/ezeparziale/big-data-cluster

:elephant: Cluster big data

big-data bigdata hadoop hdfs hive spark zookeeper

Last synced: 20 Jan 2025

https://github.com/bytemedirk/pyspark3-docker

PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.

aws docker docker-image python spark

Last synced: 13 Jan 2025

https://github.com/hsm207/demo-spark-weaviate

How to set up a dev environment to work with Spark and Weaviate

big-data etl kafka python spark weaviate

Last synced: 14 Jan 2025

https://github.com/briansterle/cluster-fastcopy

Copy data between HDFS clusters blazingly fast

bigdata distcp hadoop hdfs spark yarn

Last synced: 13 Feb 2025

https://github.com/wgierke/distributed_data_analytics

Solutions for the hands-on sessions of the course "Distributed Data Analytics" at Hasso-Plattner-Institute using Akka and Spark.

akka data-analytics distributed inclusion-dependency spark

Last synced: 09 Feb 2025

https://github.com/imransilvake/semantic-partitioning

RDF Data (N-Triples) Partition and SPARQL Query Layer for SANSA-Stack using Scala and Spark.

big-data n-triples scala spark sparql

Last synced: 31 Jan 2025

https://github.com/tsovak/spark-demo

The Spark REST API with Spring Boot and MongoDB

docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot

Last synced: 08 Feb 2025

https://github.com/20cent16/airflow-spark

If you want to use airflow with spark, ready to use ;-)

airflow spark

Last synced: 14 Feb 2025

https://github.com/oracle-quickstart/oci-hortonworks

Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)

cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/cn-docker/spark-master

Spark Master Docker Image

docker-image spark spark-master

Last synced: 27 Jan 2025

https://github.com/jimthompson5802/datascience_containers

Personal docker images for various data science software stacks

data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark

Last synced: 13 Feb 2025

https://github.com/ev2900/glue_spark_history_server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws glue spark spark-history-server spark-ui

Last synced: 15 Feb 2025

https://github.com/e2fyi/databricks-utils

`databricks-utils` is a Python package that provides several utility classes and functions to improve ease of use in Databricks notebooks.

aws databricks jupyter-notebooks notebook pyspark s3 spark vega vega-lite

Last synced: 16 Jan 2025

https://github.com/angeligareta/spark-hadoop-hbase-overview

First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.

apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql

Last synced: 22 Jan 2025

https://github.com/darule0/yarndiff

A rudimentary command line utility for contrasting Apache Yarn container logs.

diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2

Last synced: 15 Feb 2025

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 07 Feb 2025

https://github.com/ralgond/bigdata-example

Examples, details, and caveats for Hadoop, Hive, and Spark

bigdata hadoop hdfs hive map-reduce spark

Last synced: 09 Jan 2025

https://github.com/zncdatadev/spark-k8s-operator

Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack

k8s kubernetes spark

Last synced: 19 Nov 2024

https://github.com/mauriciovazquezm/spark_bigdata_architecture_project

Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM

data-stream-processing data-streaming pyspark python spark time-series

Last synced: 13 Jan 2025

https://github.com/lajwithsingh/magelocaldatapipeline

A compact project showcasing local data lake setup using Docker, Mage, Spark, MinIO, Iceberg, and StarRocks. Ideal for learning modern data engineering practices.

docker iceberg mage minio spark starrocks

Last synced: 19 Feb 2025

https://github.com/hwywl/bigdata

Big data learning code: Spark, Hive, Storm, HBase

big-data flume hbase hdfs hive mr spark storm zook

Last synced: 08 Jan 2025

https://github.com/ev2900/emr_studio_deployment

Example Jupyter notebook for EMR Studio

aws emr emr-studio spark

Last synced: 05 Nov 2024

https://github.com/riversun/ml-fake-data-maker

Generate fake data for machine learning like regression analysis

arff arff-generator dummy-data fake-data generator machine-learning prediction regression spark weka

Last synced: 01 Feb 2025

https://github.com/tadod12/airflow-spark-job

A workspace to experiment with Apache Spark and Airflow in a Docker environment

airflow docker rdbms spark

Last synced: 13 Jan 2025

https://github.com/pranavshashidhara/movie-recommendation-system

This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).

big-data recommendation-system spark

Last synced: 13 Jan 2025

https://github.com/ev2900/iceberg_emr_athena

Resources from a virtual tech talk / workshop - Set Up and Use Apache Iceberg Tables on Your Data Lake

apache-iceberg athena aws emr spark

Last synced: 05 Nov 2024

https://github.com/tonyz0x0/parallel-ml

An implementation of parallel machine learning algorithms using Spark

machine-learning python spark

Last synced: 02 Feb 2025