Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/absaoss/pramen
Resilient data pipeline framework running on Apache Spark
big-data data-pipeline etl hacktoberfest scala spark
Last synced: 19 Dec 2024
https://github.com/ibm-cloud/biginsights-on-apache-hadoop
Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix
ambari biginsights bigsql hadoop hbase hive ibm-bluemix knox oozie spark spark-streaming webhdfs zeppelin
Last synced: 17 Nov 2024
https://github.com/trainingbypackt/big-data-analysis-with-python
Combine Spark and Python to process large datasets and unlock the power of parallel computing and machine learning
combine-spark dataset machine-learning python spark
Last synced: 14 Nov 2024
https://github.com/netease/spark-alarm
Alerting and monitoring tool for Apache Spark
alert monitoring monitoring-tool scala spark
Last synced: 16 Nov 2024
https://github.com/lynnlangit/spark-scala-eks
Spark Scala docker container sample for AWS testing - EKS & S3
docker-image scala spark spark-ml
Last synced: 28 Oct 2024
https://github.com/spektom/realtime-dashboard-example
This is a real-time dashboard example using Spark Streaming and Node.js
dashboard-application flink kafka meetup rethinkdb spark spark-streaming
Last synced: 19 Nov 2024
https://github.com/san089/cloudera_material
Cloudera_Material: Study Material to help people preparing for Cloudera CCA Spark and Hadoop Developer Exam (CCA175). Feel free to collaborate.
big-data bigdata cca cca175 certification cloudera flume hadoop hive hive-metastore pyspark spark sqoop sqoop-export sqoop-import sqoop-session
Last synced: 12 Oct 2024
https://github.com/pierrenodet/spark-ensemble
Ensemble Learning for Apache Spark 🌲
bagging boosting ensemble-learning gbm machine-learning scala spark spark-ml stacking
Last synced: 11 Oct 2024
https://github.com/webysther/aws-glue-docker
🐋 Docker image for AWS Glue Spark/Python
apache-arrow aws aws-cli aws-glue aws-glue-docker cdk data-engineering development docker docker-image dockerfile etl glue-catalog glue-pyspark pandas pytest python python-poetry sam spark
Last synced: 13 Nov 2024
https://github.com/moritzkoerber/covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via Github Actions.
apache-airflow apache-spark api aws aws-cdk aws-cloudformation aws-ecr aws-glue aws-lambda aws-redshift aws-s3 docker great-expectations pyspark spark
Last synced: 11 Nov 2024
https://github.com/archivesunleashed/notebooks
Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
juypter-notebook notebooks pyspark-notebook python3 spark web-archives
Last synced: 11 Nov 2024
https://github.com/pdsuwwz/chatgpt-vue3-light-mvp
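💭 A Chat Bot conversation web MVP prototype template ready for further development, built with mainstream technologies such as Vue3, Vite 5, TypeScript, Naive UI, and UnoCSS. 🧤 Simple integration with large-model APIs; uses a single-turn AI Q&A mode in which each question gets an independent response with no context required; supports streaming output with a typewriter effect; integrates markdown-it preview. 💼 Easy to customize for quickly building chat-style large language model products (sample screenshots included).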
ai chat chatbot deepseek event glm gpt llm ollama openai qwen siliconcloud siliconflow source spark stream ts
Last synced: 11 Oct 2024
https://github.com/medmes/twitterstreamingsparkkafkademo
A demo project to analyze the most popular Twitter hashtags using Java 8, Spring Boot, Spark Streaming, Kafka, and Docker.
apache docker java-8 kafka spark spark-streaming spring-boot twitter twitter-streaming-api zookeeper
Last synced: 08 Nov 2024
https://github.com/syedhassaanahmed/databricks-notebooks
Collection of Databricks and Jupyter Notebooks
azure-data-lake azure-databricks azure-event-hubs azure-iothub azure-sql-database azure-storage cosmos-db graphframes hive-udf jupyter-notebooks kafka matplotlib mongodb pandas-dataframe parquet power-bi pyspark spark spark-sql spark-udf
Last synced: 09 Jan 2025
https://github.com/Componolit/gneiss
Framework for platform-independent SPARK components
ada component-based embedded formal-methods formal-verification spark
Last synced: 25 Oct 2024
https://github.com/Componolit/SXML
Formally verified, bounded-stack XML library
ada formal-methods formal-verification parser spark xml
Last synced: 26 Oct 2024
https://github.com/crflynn/pbspark
protobuf pyspark conversion
dataframe protobuf protocol-buffers pyspark spark
Last synced: 08 Nov 2024
https://github.com/ember-sparks/ember-sparks
✨ Ambitious UI components for your Ember app.
addon ember ember-css-modules javascript spark ui ui-components
Last synced: 21 Nov 2024
https://github.com/maropu/datasketches-spark
Data Sketches for Apache Spark
Last synced: 08 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch02
Spark in Action, 2nd edition - chapter 2
apache-spark java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/hoangsonww/moodify-emotion-music-app
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile platforms.
artificial-intelligence django django-rest-framework emotion fullstack-development hadoop kubernetes machine-learning mobile-development mongodb music python pytorch react-native reactjs redis restful-api spark tensorflow torch
Last synced: 01 Nov 2024
https://github.com/ysh329/link-prediction
[UNMAINTAINED] Link prediction for complex networks based on PySpark and MySQL.
link-prediction network pyspark spark
Last synced: 23 Oct 2024
https://github.com/mlr-org/mlr3db
Data Backends to let mlr3 work transparently with (remote) data bases
bigquery data-backend database duckdb machine-learning mariadb mlr3 mysql odbc postgresql r r-package spark sqlite
Last synced: 14 Oct 2024
https://github.com/opensearch-project/opensearch-spark
Spark Accelerator framework; it enables secondary indices to remote data stores.
compute opensearch secondary-index spark
Last synced: 11 Nov 2024
https://github.com/stabrise/spark-pdf
PDF DataSource for Apache Spark
big-data data-engineering data-extraction data-science ocr ocr-recognition pdf pdf-document pdf-document-processor spark spark-datasource tesseract tesseract-ocr
Last synced: 03 Dec 2024
https://github.com/akshitvjain/realtime-twitter-trends-analytics
A big data project to develop a real-time data pipeline for analyzing the popularity and sentiments of trending topics on Twitter.
big-data business-intelligence data-pipeline drill dstream geo-visualization hashtags kafka kafka-producer-consumer mongodb parallel-data-processing rdds realtime-dashboard realtime-data-pipeline spark tableau twitter twitter-sentiment-analysis twitter-streaming-api zookeeper
Last synced: 23 Oct 2024
https://github.com/vmitchell85/spark-kiosk-notify
Adds a notification panel to your Laravel Spark Kiosk, allowing you to send notifications to users.
Last synced: 12 Oct 2024
https://github.com/cognitedata/cdp-spark-datasource
Spark data source for Cognite Data Fusion
cognite datasource scala spark
Last synced: 31 Oct 2024
https://github.com/mahmoudparsian/machine-learning-course
Machine Learning Course @ Santa Clara University
clustering data-algorithms kmeans-clustering linear-regression logistic-regression machine-learning pyspark pyspark-algorithms-book santa-clara-university scikit-learn spark spark-ml supervised-learning unsupervised-learning
Last synced: 06 Nov 2024
https://github.com/geotrellis/geotrellis-netcdf
Scala/Spark Project For Reading NetCDF
Last synced: 11 Nov 2024
https://github.com/hortonworks-spark/cloud-integration
Spark cloud integration: tests, cloud committers and more
apache-spark aws-s3 azure gcs spark
Last synced: 14 Nov 2024
https://github.com/microsoft/masc
Microsoft's contributions for Spark with Apache Accumulo
accumulo apache big-data machine-learning spark
Last synced: 22 Jan 2025
https://github.com/nashtech-labs/spark-graphx-twitter
An example of Spark and GraphX with Twitter as sample
apache-spark graph knoldus sbt spark spark-graphx twitter
Last synced: 05 Nov 2024
https://github.com/snowplow/dataflow-runner
Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
amazon-emr flink golang-application hadoop spark
Last synced: 09 Nov 2024
https://github.com/streamnative/pulsar-hub
The canonical source of StreamNative Hub.
apache-pulsar connector data-processing event-streaming flink messaging offloader opentracing prestosql pubsub pulsar-functions pulsar-io spark tracing
Last synced: 01 Dec 2024
https://github.com/jacopodl/dstar
DHCP attack tool :imp:
dhcp dhcp-starvation-attack dstar hacking-tool mitm rogue-dhcp spark
Last synced: 04 Dec 2024
https://github.com/NashTech-Labs/spark-graphx-twitter
An example of Spark and GraphX with Twitter as sample
apache-spark graph knoldus sbt spark spark-graphx twitter
Last synced: 23 Oct 2024
https://github.com/miraisolutions/sparkbq
Sparklyr extension package to connect to Google BigQuery
Last synced: 18 Nov 2024
https://github.com/dataeval/dingo
Dingo: A Comprehensive Data Quality Evaluation Tool
data-evaluation data-quality data-science data-validation dataquality datascience gpt llm openai opencompass spark vlm
Last synced: 13 Jan 2025
https://github.com/s22s/pre-lt-raster-frames
Spark DataFrames for earth observation data
earth-observation geotrellis image-processing machine-learning scala spark spark-ml sparksql
Last synced: 22 Jan 2025
https://github.com/aphp/spark-etl
A better bridge between Apache Spark and PostgreSQL
Last synced: 25 Nov 2024
https://github.com/vemonet/setup-spark
:octocat:✨ Setup Apache Spark in GitHub Action workflows
apache-spark github-actions setup spark
Last synced: 11 Nov 2024
https://github.com/yj8023xx/xiwenlejian
A deep-learning-based book recommendation system that makes personalized recommendations based on user behavior
deep-learning java python recommender-system spark springcloud vue
Last synced: 14 Nov 2024
https://github.com/hiejulia/data-pipeline-project
Data pipeline project
amazon-web-services azure bigml classification data-pipeline deployment distributed-systems hadoop java kafka machine-learning mapreduce maven spark streaming
Last synced: 16 Dec 2024
https://github.com/zunzhuowei/qs-hadoop
Learning the big data ecosystem
bigdata elasticsearch hadoop mapreduce spark spark-streaming storm
Last synced: 02 Dec 2024
https://github.com/bluejoe2008/spark-http-stream
spark structured streaming via HTTP communication
http spark spark-structured-streaming
Last synced: 23 Oct 2024
https://github.com/gilbitron/spark-create-stripe-plans
A simple Laravel artisan command to create Spark plans in Stripe
laravel laravel-artisan-command spark stripe
Last synced: 14 Oct 2024
https://github.com/jplane/pyspark-devcontainer
A simple VS Code devcontainer setup for local PySpark development
devcontainer devcontainers jupyter jupyter-notebooks pyspark pyspark-notebook python spark vscode
Last synced: 17 Oct 2024
https://github.com/longnguyen010203/youtube-recommend-master-etl-pipeline
💜🌈📊 A Data Engineering Project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Metabase, Dbt, Polars, Docker. Data from kaggle and youtube-api 🌺
cleaning-data dagster data-engineering data-engineering-pipeline dbt docker docker-compose dockerfile etl-pipeline metabase minio mysql polars postgresql processing pyspark spark streamlit youtube youtube-api
Last synced: 22 Nov 2024
https://github.com/romans-weapon/spear-framework
Rapid ETL/ELT-connectors/pipeline development leveraged on top of Apache Spark
docker-compose hadoop kafka scala shell-script spark
Last synced: 10 Oct 2024
https://github.com/natanfelles/codeigniter-db
Database Commands for CodeIgniter 4
cli codeigniter codeigniter4 command-line database mariadb mysql spark
Last synced: 14 Oct 2024
https://github.com/chen0040/spring-boot-spark-integration-demo
Demo on how to integrate Spring Data JPA, Apache Spark and GraphX with mixed Java and Scala code
graphx spark spring-boot spring-jpa
Last synced: 16 Dec 2024
https://github.com/mohamedhmini/d-pandisim
distributed pandemics simulator, uses the power of spark to generate huge bulks of contact-tracing data.
big-data distributed-programming epidemic-simulations epidemics graph-algorithms markov-chain pandemic-simulator pyspark spark
Last synced: 15 Nov 2024
https://github.com/simbafl/interview-notes
Python notes
data-science hadoop hive machine-learning python spark
Last synced: 06 Nov 2024
https://github.com/lovenui/etl-with-aws-emr-and-mwaa
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extracts data from S3, transforms it using Spark, and loads the transformed data back to S3.
airflow aws-ec2 aws-s3 data-engineering etl spark
Last synced: 19 Jan 2025
https://github.com/qubole/streaminglens
Qubole Streaminglens tool for tuning Spark Structured Streaming Pipelines
cluster-management micro-batches scala sla spark spark-streaming sparklens streaming streaming-pipeline structured-streaming
Last synced: 21 Nov 2024
https://github.com/lovenui/dataengineering-capstone-project
airflow aws-redshift aws-s3 data-engineering python spark sql
Last synced: 19 Jan 2025
https://github.com/miztiik/s3-to-rds-with-glue
Extract, transform, and load data for analytic processing using AWS Glue
cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark
Last synced: 04 Dec 2024
https://github.com/tonyz0x0/football-manager
Data Analysis as a Football Manager
Last synced: 01 Jan 2025
https://github.com/flint-bot/sparky
Cisco Spark API for NodeJS (deprecated in favor of https://github.com/webex/webex-bot-node-framework)
Last synced: 27 Oct 2024
https://github.com/Componolit/jwx
JSON/JWK/JWS/JWT/Base64 library in SPARK
ada base64 jose json json-web-signature jwk jws jwt jwt-authentication jwt-token spark
Last synced: 26 Oct 2024
https://github.com/woltapp/spark-osm-datasource
Native Spark OSM PBF data source
Last synced: 11 Oct 2024
https://github.com/zoltan-nz/kafka-spark-project
Distributed System in Docker with Apache Kafka and Spark for big data streaming and visualisation (NodeJS, TypeScript, React, NestJS, Java)
java javascript kafka nodejs spark typescript
Last synced: 12 Oct 2024
https://github.com/amzn/rheoceros
Cloud-based AI / ML workflow and data application development framework
ai aws aws-emr aws-glue aws-lambda bring-your-own-account cloud data-science event-based feature-engineering flow low-code-framework machine-learning pyspark sagemaker-notebook sagemaker-notebook-instance scala-spark serverless spark
Last synced: 11 Nov 2024
https://github.com/hibayesian/spark-lof
A parallel implementation of local outlier factor based on Spark
local-outlier-factor machine-learning outlier-detection spark
Last synced: 23 Nov 2024
https://github.com/qubole/spark-state-store
Rocksdb state storage implementation for Structured Streaming.
performance qubole real-time-processing rocksdb scalability spark state-management streaming structured-streaming
Last synced: 21 Nov 2024
https://github.com/alvertogit/bigdata_docker
Big Data Docker Data Science Spark Spark3 Hadoop HDFS Scala Python Artificial Intelligence Machine Learning Jupyter Lab Notebook
big-data data-science docker jupyter-lab jupyter-notebook machine-learning python scala spark spark3
Last synced: 23 Nov 2024
https://github.com/qubole/s3-sqs-connector
A library for reading data from Amazon S3 with optimised listing using Amazon SQS using Spark SQL Streaming (or Structured Streaming).
s3 scala spark spark-streaming sqs streaming structured-streaming
Last synced: 21 Nov 2024
https://github.com/wazzabeee/pyspark-etl-twitter
Implementation of an ETL process for real-time sentiment analysis of tweets with Docker, Apache Kafka, Spark Streaming, MongoDB and Delta Lake
delta-lake docker etl etl-pipeline etl-process kafka kafka-consumer kafka-producer kafka-streams mongodb nlp pyspark python sentiment-analysis spark spark-streaming tweet-analysis tweet-classification twitter twitter-sentiment-analysis
Last synced: 13 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch03
Spark in Action, 2nd edition - chapter 3
apache-spark dataframe java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/hammerlab/spark-util
low-level helpers for Apache Spark libraries and tests
Last synced: 12 Oct 2024
https://github.com/piotr-kalanski/data-quality-monitoring
Data Quality Monitoring Tool
data-quality monitoring scala spark
Last synced: 27 Oct 2024
https://github.com/dvgodoy/yelpdatasetchallenge
Restaurant recommendations and review text-based quality predictions
dataset lstm-sentiment-analysis recommender-systems sentiment-analysis spark spark-ml yelp-dataset
Last synced: 13 Oct 2024
https://github.com/luckyzxl2016/spark-example
Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrame
dataframe elasticsearch jedis kafka mysql spark spark-example spark-sql spark-streaming spark-structured-streaming
Last synced: 28 Oct 2024
https://github.com/radanalyticsio/oshinko-s2i
This is a place to put s2i images and utilities for spark application builders for openshift
java openshift oshinko-s2i pyspark s2i-image scala spark
Last synced: 05 Nov 2024
https://github.com/camposvinicius/aws-etl
This is an ETL application on AWS with general open sales and customer data that you can find here: https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several CSV files, to which transformations are applied.
airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark
Last synced: 04 Dec 2024
https://github.com/nikoshet/spark-cherry-shuffle-service
Code for the "Cherry: A Distributed Task-Aware Shuffle Service for Serverless Analytics" paper for 2021 IEEE International Conference on Big Data
ansible apache-spark bigdata devops distributed docker ieee kubernetes papers-with-code serverless shuffling spark
Last synced: 09 Nov 2024
https://github.com/steven-matison/HDP3-Hue-Service
A continuation of Ambari Hue Service for HDP 3.x and Hue 4.6.0
ambari ambari-hue-service hbase hdp3 hive hue spark
Last synced: 31 Oct 2024
https://github.com/hashload/freeza-offset
Spark stream consumption commit in kafka consumer group
databricks kafka kafka-commit kafka-offset-commits spark spark-streaming
Last synced: 12 Oct 2024
https://github.com/qiushisun/distributed-computing-systems
2021 Spring (Distributed Computing Systems): Distributed Systems and Programming
distributed-computing distributed-systems ecnu-dase flink hadoop-mapreduce spark
Last synced: 19 Dec 2024
https://github.com/absaoss/spark-hofs
Scala API for Apache Spark SQL high-order functions
high-order-functions scala spark sql
Last synced: 10 Oct 2024
https://github.com/ehsanmok/sparkling-titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
Last synced: 10 Jan 2025
https://github.com/lovenui/marketing_analysis-aws-spark-sql
aws aws-rds aws-s3 data-analysis machine-learning marketing-analytics spark
Last synced: 19 Jan 2025
https://github.com/bluegranite/databrickstraining
Repository for Microsoft Databricks Training Events - Hosted by BlueGranite
apache-spark azure azure-databricks databricks distributed-computing machine-learning pyspark spark spark-streaming
Last synced: 18 Nov 2024
https://github.com/qxzzxq/faker
Generate fake data for Scala and Spark :tophat:
fake fake-data faker faker4s scala spark spark-data-generator test-data test-data-generator testing
Last synced: 18 Dec 2024