Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-23 00:24:54 UTC
https://github.com/santiagortiiz/advanced-data-engineering-with-databricks
Databricks. Incremental data processing, task orchestration, and production job monitoring.
big-data databricks databricks-notebooks kafka spark spark-streaming streaming
Last synced: 08 Jan 2025
https://github.com/navicore/spark-on-kubernetes
Docker image of Spark for Kubernetes (k8s)
apache-spark k8s kubernetes scala spark
Last synced: 06 Nov 2024
https://github.com/fscm/packer-aws-spark
Packer template to build an AWS Apache Spark AMI
Last synced: 07 Nov 2024
https://github.com/flarco/dbnet-python
A Python/VueJS database client (Web GUI) to access Oracle, Spark (Hive), Postgres, etc.
apache-spark database jdbc oracle postgresql spark web-gui
Last synced: 22 Jan 2025
https://github.com/ppatierno/enmasse-iot-demo
EnMasse - IoT demo
internet-of-things iot kafka kafka-streams kubernetes-cluster messaging minikube openshift-cluster spark spark-streaming-jobs
Last synced: 27 Oct 2024
https://github.com/stonezhong/DataManager
Better organize data in a data lake and build ETL pipelines with a web UI tool.
datalake datawarehouse etl spark sparksql
Last synced: 27 Nov 2024
https://github.com/fancellu/graphx-citymap
CityMap coding test plus 3 solutions, 1 with Spark/GraphX
Last synced: 10 Nov 2024
https://github.com/lucasbotang/coursera_big_data_for_data_engineers
Assignments for Big Data for Data Engineers specialization on Coursera by Yandex.
Last synced: 25 Nov 2024
https://github.com/mdrakiburrahman/sgx-pyspark-sql-demo
Demonstrating Confidential Analytics on Azure SGX VMs with Apache Spark and SCONE.
azure azure-sql-database docker kubernetes sgx spark
Last synced: 09 Nov 2024
https://github.com/ibmstreams/streamsx.sparkmllib
Toolkit for real-time scoring using the Apache Spark MLlib library
ibm-streams spark spark-mllib-library stream-processing toolkit
Last synced: 23 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch13
Spark in Action, 2nd edition - chapter 13 - Transforming documents
apache-spark java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch10
Spark in Action, 2e - chapter 10 - Ingestion through structured streaming
bigdata book java java8 manning spark sparkstreaming sparkwithjava
Last synced: 09 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch16
Spark in Action, 2nd edition - chapter 16 - performance, checkpointing, and caching
apache-spark cache checkpoint java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/hamza88-coder/real-time-recruitment-system-with-ai-and-data-analytics
Simulation of job offers and CVs with real-time processing, classification, and analytics using Kafka, Ray, Spark, and Databricks. Includes a Flask-based recommendation system and Tableau visualizations.
apache-nifi chatbot databricks dbt delta-lake docker faiss flask k-means kafka llama3 pinecone postgresql ray redis snowflake spark sparkml
Last synced: 13 Jan 2025
https://github.com/pyaesoneaungrgn/vitepress-pilgrim-starter
Documentation template styled like Forge, Envoyer, Vapor, Jetstream, and Spark
documentation envoyer forge jetstream laravel pilgrim spark tailwindcss vapor vite vitepress vitepress-doc vitepress-starter
Last synced: 02 Jan 2025
https://github.com/coxautomotivedatasolutions/vegalite4s
Vega-Lite4s is a small library over the comprehensive Vega-Lite JavaScript visualisation library, allowing you to create beautiful Vega-Lite visualisations in Scala
apache-spark scala spark vega vega-lite visualization
Last synced: 30 Sep 2024
https://github.com/pierrenodet/aruku
A Random Walk Engine for Apache Spark
deepwalk graph node2vec random-walk spark
Last synced: 10 Oct 2024
https://github.com/drsnowbird/tensorflow-python3-jupyter
tensorflow-python3-jupyter
docker docker-compose hadoop jupyter jupyter-notebook machine-learning python spark tensorflow tensorflow-board tensorflow-tutorials topic-modeling
Last synced: 14 Nov 2024
https://github.com/absaoss/hermes
An E2E test tool for Enceladus. Also a general dataframe comparison tool
atum dataset-comparison e2e-tests enceladus spark
Last synced: 07 Nov 2024
https://github.com/msukmanowsky/drpyspark
Handy utilities for debugging and tuning pyspark programs. A work in progress.
pyspark python spark tuning-pyspark-programs
Last synced: 09 Nov 2024
https://github.com/edyoda/big-data-analytics-pipeline
Build your own Big Data Analytics Pipeline using Kafka-Spark-Cassandra. Videos ->
Last synced: 18 Nov 2024
https://github.com/suvayu/emr-scripts
Shell scripts for AWS EMR clusters
aws-cli aws-emr-clusters cluster spark
Last synced: 12 Oct 2024
https://github.com/nashtech-labs/sparkathon
A library having Java and Scala examples for Spark 2.x
apache-spark java-8 knoldus rdd scala spark spark-dataframes spark-dataset spark-ml spark-mllib spark-sql spark-streaming spark-structured-streaming
Last synced: 05 Nov 2024
https://github.com/hibayesian/spark-optim
A library of scalable optimization algorithms based on Spark
machine-learning optimization-algorithms spark
Last synced: 23 Nov 2024
https://github.com/adidas/lakehouse-engine-docs
The goal of this project is to provide documentation for the Lakehouse Engine framework.
big-data data-engineering data-quality databricks delta-lake framework great-expectations lakehouse lakehouse-engine spark
Last synced: 12 Oct 2024
https://github.com/datumbrain/gossub
Trigger spark-submit in Golang. A Go implementation of the famous SparkLauncher.java.
Last synced: 17 Nov 2024
https://github.com/mvillafuertem/scala
🤓 Examples Advanced 🧐 Projects Akka 🚀 ZIO ⚡️ Algorithms 😼 Cats
akka akka-streams aws cats cdktf kafka slick spark sttp tapir terraform zio zio-streams
Last synced: 07 Nov 2024
https://github.com/iglee/outrunjulesverne
Personalizing unique travel experiences using data science.
data-mining gis-data jules-verne natural-language-processing nlp personalizing-travels python recommender-system scraping spark travel travelling-salesman-problem tripadvisor unsupervised-learning
Last synced: 06 Nov 2024
https://github.com/Nosto/spartann
Hyper performant kNN using Annoy for Apache Spark.
ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark
Last synced: 04 Nov 2024
https://github.com/jaehyeon-kim/general-demos
Data engineering demo projects
aws dataengineering dbt kafka kafkaconnect opensearch serverlessapplicationmodel spark
Last synced: 30 Oct 2024
https://github.com/jklmnn/STOTP
SPARK TOTP library
2fa 2fa-security ada base32 formal-verification hotp spark totp
Last synced: 26 Oct 2024
https://github.com/sneaksanddata/spark-utils
Comfy Utilities for Spark Job Authoring
Last synced: 11 Nov 2024
https://github.com/edgararuiz-zz/sparkvis
Integrates ggvis and sparklyr
ggvis spark sparklyr visualization
Last synced: 09 Nov 2024
https://github.com/brayanjuls/diane
Hive helper functions for Apache Spark users
Last synced: 28 Oct 2024
https://github.com/trainingbypackt/big-data-processing-with-apache-spark-elearning
Efficiently tackle large datasets and perform big data analysis with Spark and Python
dataset python rdds spark spark-mllib structured-streaming
Last synced: 14 Nov 2024
https://github.com/ansrivas/yelp_dataset
Sample analysis of the latest Yelp dataset using Spark
Last synced: 14 Oct 2024
https://github.com/dharmeshkakadia/tpcds-hdinsight
TPCDS benchmark for various engines
benchmarking hive llap presto spark tpcds
Last synced: 18 Nov 2024
https://github.com/nandtel/spark-streaming-kafka-cassandra-starter
Application built on Spark Streaming, Kafka and Cassandra.
cassandra docker docker-compose kafka scala spark spark-streaming
Last synced: 24 Nov 2024
https://github.com/napsternxg/pubmed_selfcitationanalysis
Repository of our paper on Self-citation analysis in PubMed data
citation-analysis medline pubmed-central regression-models spark
Last synced: 13 Oct 2024
https://github.com/gerashegalov/rapids-shell
Utility to run/debug Spark RAPIDS in REPL
Last synced: 12 Oct 2024
https://github.com/naupio/pical
(Work in progress) pita is a general distributed computation system written in Erlang and based on the DAG model. This project is inspired by DouBan's DPark and Apache Spark.
big-data bigdata dag data distributed distributed-computing distributed-systems erlang erlang-otp flink spark
Last synced: 13 Nov 2024
https://github.com/russellspitzer/firstsparkcassandraapp
A quick workshop on building your first Spark Cassandra Stand Alone Application
spark tutorial workshop zeppelin
Last synced: 16 Oct 2024
https://github.com/keks51/spark_plan_as_uml
Visualizing a Spark plan as a UML diagram
graph plan spark spark-streaming uml visualization
Last synced: 12 Oct 2024
https://github.com/agile-lab-dev/literate-programming-articles
Collection of articles, using the Literate Programming style, about Data Engineering and Software Tooling in general
literate-programming ruby spark spark-connect
Last synced: 17 Jan 2025
https://github.com/chen0040/spark-ml-genetic-programming
Package provides a Java implementation of big-data genetic programming for Apache Spark
big-data genetic-programming linear-genetic-programming rdd spark tree-genetic-programming tree-gp
Last synced: 16 Dec 2024
https://github.com/codam-coding-college/spark-sessions
Spark sessions help beginning students dissect the first larger projects of the curriculum.
Last synced: 10 Nov 2024
https://github.com/shivam5992/classification_pipeline
📙 A complete document classification pipeline using Apache Spark in Scala
document-classification-pipeline scala spark text-classification
Last synced: 24 Dec 2024
https://github.com/san089/spark_packaged_project
This project contains PySpark jobs to create data pipelines and shows how to distribute the project package on a cluster.
data-pipeline etl etl-framework etl-pipeline job pyspark spark
Last synced: 16 Nov 2024
https://github.com/varunu28/aadhar-dataset-analysis
Data analysis of AADHAR dataset using Apache Spark
analysis scala spark spark-sql
Last synced: 08 Nov 2024
https://github.com/inbravo/scala-feature-set
-:- My random Scala experiments -:-
Last synced: 24 Nov 2024
https://github.com/vlad-bystrov/spark-user-feedback
conversion dataframes datasets rdd spark
Last synced: 12 Nov 2024
https://github.com/terrier-org/terrier-spark
A Spark API for the Terrier.org information retrieval platform
Last synced: 12 Oct 2024
https://github.com/jerryshao/spark-atlas-connector
A Spark Atlas connector to track data lineage in Apache Atlas
Last synced: 17 Dec 2024
https://github.com/navdeep-g/sdss-2019
Interpretable Machine Learning with rsparkling
data-science h2o-3 machine-learning r rsparkling spark sparklyr xai
Last synced: 06 Nov 2024
https://github.com/sainipray/spark-streaming
This is for Spark Streaming tutorials
pyspark pyspark-tutorial python python3 spark spark-streaming streaming text-stream
Last synced: 02 Dec 2024
https://github.com/akarce/e2e-structured-streaming
End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, sends the data to Kafka, and processes it with Spark before writing to Cassandra. The pipeline, built with Python and Apache Zookeeper, is containerized with Docker for easy deployment and scalability.
airflow apache-airflow apache-kafka apache-spark big-data cassandra docker docker-compose kafka postgresql python spark zookeeper
Last synced: 12 Oct 2024
https://github.com/mikma03/spark-databricks
🔥 Master Apache Spark & Databricks! Dive into a world of big data with exclusive insights from Udemy courses, personal notes, and practical guides. Whether you're starting out or scaling new heights in data engineering, this is your ultimate resource hub! 🌟🚀
apache-spark aws big-data data-engineering databricks delta-lake etl python spark streaming
Last synced: 11 Nov 2024
https://github.com/arbox/learning-scala-for-data-science
Data Science: Scala for the brave and impatient
big-data bigdata data-science datascience scala spark
Last synced: 27 Nov 2024
https://github.com/newrelic-experimental/nri-spark
This New Relic standalone integration polls the Apache Spark REST API for metrics and pushes them into New Relic using the Metrics API. It uses the New Relic Telemetry SDK for Go.
apache-spark databricks databricks-notebooks metrics newrelic nrlabs nrlabs-data nrlabs-odp spark
Last synced: 14 Nov 2024
https://github.com/angeligareta/cheaper-travelling
Project developed with Apache Spark and Kafka that works with different public streaming data APIs, such as SkyScanner, GeoDB Cities, and Flixbus, to find cheaper ways of travelling.
apache-spark flixbus geodb-cities kafka scala skyscanner skyscanner-api skyscanner-flight-search spark
Last synced: 22 Nov 2024
https://github.com/azavea/hiveless
Scala API for Hive UDFs with the GIS extension
geospatial gis scala spark typelevel
Last synced: 10 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch15
Spark in Action, 2nd edition - chapter 15 - Aggregating your data
aggregation apache-spark java java8 manning spark sparkwithjava sql-aggregation udaf
Last synced: 09 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch14
Spark in Action, 2nd edition - chapter 14 - extending data transformation with UDFs
apache-spark java java8 manning spark sparkwithjava udf
Last synced: 09 Nov 2024
https://github.com/brh55/generator-spark-bot
⚡ Yeoman generator that scaffolds out a Cisco Spark bot with usability and simplicity in mind
cisco cisco-spark flint nodejs scaffold spark yeoman
Last synced: 14 Oct 2024
https://github.com/jgperrin/net.jgp.books.spark.ch05
Spark in Action, 2nd edition - chapter 5 - Deployment
apache-spark java manning spark sparkjava sparkwithjava
Last synced: 09 Nov 2024
https://github.com/chezou/amazon-movie-review
Recommendation for Amazon movie review data
factorization-machines recommendations spark
Last synced: 15 Oct 2024
https://github.com/nikoshet/monitoring-spark-on-docker
Spark Monitoring With Prometheus And Grafana Using Docker
docker docker-compose grafana hadoop hdfs monitoring node-exporter prometheus spark
Last synced: 09 Nov 2024
https://github.com/nashtech-labs/spark-streaming-gnip
An Apache Spark utility for pulling Tweets from Gnip's PowerTrack in realtime
gnip gnip-powertrack knoldus pulling-tweets realtime scala spark spark-streaming spark-utility sparkconf sparkcontext tweets
Last synced: 21 Jan 2025
https://github.com/anant/data.engineers.lunch
Resources from weekly Zoom lunches revolving around Data Engineering. Hosted by Anant Corporation.
data-engineering etl kubernetes python spark
Last synced: 18 Nov 2024
https://github.com/eleflow/pyspark-connectors
apache-spark connectors cosmosdb databricks hacktoberfest pipedrive pyspark python rest-api spark
Last synced: 17 Dec 2024
https://github.com/dharmeshkakadia/tpch-hdinsight
TPCH benchmark for various engines
benchmarking hive llap presto spark tpch
Last synced: 18 Nov 2024
https://github.com/dvgodoy/dsr-spark-appliedml
DSR Class - Applied Machine Learning with Apache Spark
Last synced: 13 Oct 2024
https://github.com/trk54ylmz/spark-bigquery
Google BigQuery support for Spark SQL
Last synced: 18 Nov 2024
https://github.com/chabane/mitosis-microservice-spark-cassandra
Microservice application that uses Apache Spark, Kafka and Cassandra
cassandra dockerfile hadoop jenkinsfile kafka sbt scala spark spark-streaming
Last synced: 15 Nov 2024
https://github.com/iaja/scalaLDAvis
Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation
apache lda machine-learning scala spark visulization
Last synced: 13 Nov 2024
https://github.com/kimtth/pyspark-tika-text-extraction
🚴‍♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.
apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python
Last synced: 25 Dec 2024
https://github.com/udao-moo/udao-spark-optimizer
A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning
knobs-tuning modeling multi-objective-optimization optimization spark sparksql
Last synced: 11 Oct 2024
https://github.com/jofaval/tfm-iabd
Master's Final Degree Project on Artificial Intelligence and Big Data
ai-engineering big-data big-data-analytics data-analysis data-architecture data-engineering data-science data-science-project fastapi kafka mongo-db mongodb nlp node-red nodered python sentiment-analysis spark spark-streaming transformers
Last synced: 10 Oct 2024
https://github.com/radanalyticsio/workshop-notebook
Basic Jupyter notebook for learning Spark and OpenShift
containers data-science jupyter openshift spark
Last synced: 05 Nov 2024
https://github.com/ren294/covid-data-process
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql
Last synced: 11 Oct 2024
https://github.com/garystafford/dataproc-workflow-templates
Demonstration of Google Cloud Dataproc Workflow Templates
dataproc gcp google-cloud-platform hadoop pyspark spark
Last synced: 06 Dec 2024
https://github.com/nhsdigital/rap_example_pipeline_python
An example pipeline made in a RAP-friendly way, using Python
aggregation artificial hospital-episode-statistics pyspark python spark
Last synced: 23 Dec 2024
https://github.com/kanchishimono/scopt
Calculate optimized properties of Spark configuration
Last synced: 28 Nov 2024
https://github.com/angadsingh/airflow-ditto
An airflow DAG transformation framework
airflow airflow-dag aws azure dataflow emr extensible framework graph-algorithms graph-manipulation hdinsight isomorphism livy networkx spark yarn
Last synced: 10 Nov 2024
https://github.com/geotrellis/geotrellis-streaming-demo
A demo project that shows a GeoTrellis streaming application example
geotrellis gis kafka spark streaming
Last synced: 11 Nov 2024