Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-23 00:24:54 UTC
https://github.com/franzdiebold/docker-datascience-ultimate
Customized Jupyter Spark Docker images with everything you need
docker jupyter jupyterlab polars pyspark python spark
Last synced: 05 Nov 2024
https://github.com/lovenui/marketing_analysis-aws-spark-sql
aws aws-rds aws-s3 data-analysis machine-learning marketing-analytics spark
Last synced: 19 Jan 2025
https://github.com/qxzzxq/faker
Generate fake data for Scala and Spark :tophat:
fake fake-data faker faker4s scala spark spark-data-generator test-data test-data-generator testing
Last synced: 18 Dec 2024
https://github.com/innfactory/akka-lift-ml
Akka HTTP service for serving Spark machine learning models
akka akka-http data-engineering fast-data machine-learning scala spark
Last synced: 28 Nov 2024
https://github.com/getyourguide/typedpyspark
Type-annotate your spark dataframes and validate them
Last synced: 14 Nov 2024
https://github.com/mattjw/sparkql
sparkql: Apache Spark SQL DataFrame schema management for sensible humans
apache-spark pyspark spark structured-spark
Last synced: 17 Dec 2024
https://github.com/mach-kernel/databricks-kube-operator
A Kubernetes operator to enable GitOps style deploys for Databricks resources
ci cicd databricks gitops helm kubernetes operators rust spark
Last synced: 11 Nov 2024
https://github.com/asuiu/sparkorm
ORM for Apache Spark and DataFrames schema manager
orm pyspark pyspark-python python python3 spark spark-orm spark-sql sparkql sqlalchemy sqlalchemy-orm
Last synced: 27 Dec 2024
https://github.com/dazheng/SparkETL
Implement a complete data warehouse ETL using Spark SQL
datawarehouse etl spark sparksql
Last synced: 13 Nov 2024
https://github.com/DataEval/dingo
Dingo: A Comprehensive Data Quality Evaluation Tool
data-evaluation data-quality data-science data-validation gpt llm spark vlm
Last synced: 06 Jan 2025
https://github.com/azavea/geotrellis-collections-api-research
A research project to investigate using GeoTrellis as a REST service
akka-http geotrellis leaflet react react-leaflet redux scala spark victory
Last synced: 10 Nov 2024
https://github.com/qyu-ai/reina
PySpark-based causal inference package.
big-data causal-inference machine-learning spark
Last synced: 02 Nov 2024
https://github.com/lovenui/weblogs-analysis-system
A big data platform for analyzing web access logs
hbase javascript log-analysis python scala spark
Last synced: 19 Jan 2025
https://github.com/fscm/terraform-module-aws-spark
Terraform module to create an Apache Spark cluster on AWS
Last synced: 07 Nov 2024
https://github.com/zuinnote/spark-hadoopcryptoledger-ds
A Spark datasource for the HadoopCryptoLedger library
altcoin auxpow bitcoin cryptoledger datasource ethereum hadoopcryptoledger read spark
Last synced: 03 Dec 2024
https://github.com/mlverse/pysparklyr
Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect
databricks pyspark r spark spark-connect
Last synced: 22 Nov 2024
https://github.com/AuFeld/Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs
airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark
Last synced: 04 Dec 2024
https://github.com/chezou/sparkavro
Load Avro data into Spark with sparklyr
Last synced: 18 Nov 2024
https://github.com/manuparra/taller_sparkr
SparkR workshop for the Jornadas de Usuarios de R (R users' conference)
artificial-intelligence bigdata data-analysis data-mining hdfs ipynb machine-learning-algorithms r rstudio spark sparklyr sparkr
Last synced: 11 Oct 2024
https://github.com/dmwm/cmsspark
General purpose framework to run CMS experiment workflows on HDFS/Spark platform
analytics bigdata cms-framework hdfs spark
Last synced: 11 Dec 2024
https://github.com/daniel-acuna/pyspark_pipes
Helper functions for building complex Spark ML pipelines
Last synced: 30 Oct 2024
https://github.com/microsoft/azure-synapse-content-recommendations-solution-accelerator
This is a solution accelerator for creating personalized content recommendations based on user activity.
azure-synapse-analytics power-bi spark
Last synced: 02 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch08
Spark in Action, 2nd edition - chapter 8
apache-spark elastic elasticsearch informix java java8 manning mysql spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/analyticalmonk/pyspark_nlp_workshop
Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"
databricks databricks-notebooks distributed-computing nlp pyspark spark spark-nlp workshop
Last synced: 08 Nov 2024
https://github.com/maropu/spark-data-repair-plugin
Provide functionality to build statistical models to repair dirty tabular data in Spark
data-repairing distributed-computing error-detection parallel-computing spark
Last synced: 08 Nov 2024
https://github.com/lovenui/emr-aws-apache-spark
airflow aws big-data data-analysis data-engineering spark
Last synced: 19 Jan 2025
https://github.com/ravi72munde/scala-spark-cab-rides-predictions
A big data project for predicting prices of Uber/Lyft rides depending on the weather
predict-prices scala spark spark-streaming streaming uber weather
Last synced: 16 Dec 2024
https://github.com/archivesunleashed/docker-aut
Docker image for the Archives Unleashed Toolkit
archives-unleashed aut docker docker-image spark webarchives
Last synced: 11 Nov 2024
https://github.com/data-tools/big-data-types
A library to transform Scala product types and Schemes from different systems into other Schemes. Any implemented type automatically gets methods to convert it into the rest of the types and vice versa. E.g: a Spark Schema can be transformed into a BigQuery table.
apache-spark bigquery bigquery-tables cassandra circe database-types scala schemas spark typeclass typeclass-derivation typesafe
Last synced: 12 Oct 2024
https://github.com/exasol/spark-connector
A connector for Apache Spark to access Exasol
apache-spark connector exasol exasol-integration spark streaming
Last synced: 02 Nov 2024
https://github.com/sysgears/akka-spark-pipeline
An example project that implements a data pipeline using Scala, Akka, and Spark and works with document-oriented and graph databases to let you find out how frequently a specific technology is used with different technology stacks.
akka akka-http akka-streams mongodb neo4j scala spark spark-graphx
Last synced: 16 Nov 2024
https://github.com/allwefantasy/mlsql
New Repo: https://github.com/byzer-org/kolo-lang
Last synced: 11 Oct 2024
https://github.com/blaze-init/spark-blaze-extension
Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
Last synced: 01 Nov 2024
https://github.com/xd-deng/diy-a-cluster
How to Do-It-Yourself A Cluster for Spark & Hadoop
cluster-computing hadoop spark
Last synced: 16 Oct 2024
https://github.com/apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 07 Oct 2024
https://github.com/vsouza/spark-kinesis-redshift
Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark
aws aws-kinesis aws-kinesis-stream aws-redshift etl etl-pipeline python shell spark spark-streaming
Last synced: 17 Nov 2024
https://github.com/hadesarchitect/caspark
Cassandra + Spark = ❤️ Machine Learning with Apache Spark & Cassandra
cassandra jupyter machine-learning spark
Last synced: 12 Oct 2024
https://github.com/erikerlandson/spark-kafka-sink
A Kafka metric sink for Apache Spark
apache-kafka apache-spark kafka kafka-producer metric-sink metrics metrics-gathering spark
Last synced: 09 Nov 2024
https://github.com/hbutani/icebergsql
Integration of Iceberg table management into Spark SQL
Last synced: 31 Oct 2024
https://github.com/manuparra/masterdatcom_bdcc_practice
Practice and Workshop on BigData and Cloud Computing using Docker Containers and OpenNebula. HDFS, hadoop and spark+R
bigdata cloudcomputing containers docker hadoop hdfs linux opennebula practices spark sparkr
Last synced: 07 Nov 2024
https://github.com/jhleeeme/fake-data-pipeline
Data Generators -> Kafka -> Spark Streaming -> PostgreSQL -> Grafana
data-engineering data-pipeline docker docker-compose grafana kafka postgresql scala spark
Last synced: 17 Jan 2025
https://github.com/anskarl/parsimonious
Parsimonious is a helper library for encoding/decoding Apache Thrift and Twitter Scrooge classes to Spark Dataframes and Jackson JSON.
deserialization jackson json serialization spark thrift
Last synced: 31 Oct 2024
https://github.com/brunocampos01/data-engineering
algorithms-techniques big-data big-o-notation bigdata cookbook data-engineering data-pipelines data-processing data-sctructures database-fundamentals dataops design-patterns design-systems java mysql paradigms python spark sql storage
Last synced: 16 Nov 2024
https://github.com/selimhorri/spark-application
Java application that uses Apache Spark to handle both batch and stream processing
dataframes-api java mysql spark spark-batch spark-sql spark-streaming
Last synced: 14 Oct 2024
https://github.com/newfront/odsc-west-streaming-trends
All Data, Relevant Information, Scripts, and Applications for the Open Data Science Conference (2018)
Last synced: 02 Dec 2024
https://github.com/baghelamit/video-stream-classification
Video Stream Classification
java kafka opencv spark tensorflow
Last synced: 10 Nov 2024
https://github.com/tupol/spark-tools
Executable Apache Spark Tools: Format Converter & SQL Processor
apache-spark converts format-converter scala spark sql tools
Last synced: 12 Oct 2024
https://github.com/salmon-brain/dead-salmon-brain
Apache Spark-based framework for analyzing A/B experiments
ab-testing abtesting analytics apache-spark experimentation experiments java python scala spark split-testing statistics
Last synced: 12 Oct 2024
https://github.com/ahmetfurkandemir/dataengineering-youtube-project
Data Engineering Youtube Project
amazon amazon-athena amazon-glue aws bash dataengineering iam-role lambda lambda-functions python s3-bucket s3-storage s3api spark
Last synced: 16 Nov 2024
https://github.com/ibm/diem
DIEM Data Integration Engine Multipurpose
cloud cron data-transfers diem diem-tribe docker etl execute-sql-statements pipelines python scheduling slack spark sql-statements typescript
Last synced: 16 Dec 2024
https://github.com/manuzhang/jupyterlab_spark
Spark Application UI extension for JupyterLab
jupyterlab jupyterlab-extension spark typescript
Last synced: 17 Nov 2024
https://github.com/minzhang-1/PointHop-PointHop2_Spark
A fast and low memory requirement version of PointHop and PointHop++, which is built upon Apache Spark.
3d 3d-classification classification feature-extraction knn least-square-regression pca point-cloud pyspark python spark
Last synced: 28 Oct 2024
https://github.com/fabianmurariu/website-categories-nn
Build a deep learning model that predicts categories from the dmoz data source
deep-learning deep-neural-networks keras spark tensorflow
Last synced: 18 Jan 2025
https://github.com/stefen-taime/etl-data-pipeline-rdbms-to-hdfs-using-airflow-apache-sqoop-spark-postgres-and-hive
This project aims to move the data from a Relational database system (RDBMS) to a Hadoop file system (HDFS)
airflow big-data data docker-compose etl-pipeline hdfs hive infrastructure-as-code rdbms spark sql sqoop
Last synced: 17 Jan 2025
https://github.com/miquido/datascience
Useful scripts and notebooks for Data Science. The project was made by Miquido. https://www.miquido.com/
aws-s3 docker machine-learning pipeline pyspark pyspark-mllib pyspark-notebook pyspark-tutorial spark
Last synced: 09 Nov 2024
https://github.com/zhaytam/realtimesentimentanalysis
A real-time sentiment analysis of Youtube comments using Python, Spark and Kafka
kafka python sentiment-analysis spark video webserver youtube
Last synced: 19 Dec 2024
https://github.com/jgperrin/net.jgp.books.spark.ch04
Spark in Action, 2nd edition - chapter 4
java manning spark sparkjava sparkwithjava
Last synced: 09 Nov 2024
https://github.com/jgperrin/net.jgp.books.spark.ch09
Spark in Action, 2e - chapter 9 - Advanced ingestion: finding data sources and building your own
apache-spark ingestion java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/sashgorokhov/pyspark-spy
Collect and aggregate on spark events for profitz
Last synced: 27 Oct 2024
https://github.com/jgperrin/net.jgp.books.spark.ch17
Spark in Action, 2nd edition - chapter 16 - exporting data, using delta lake
apache-spark delta-lake java java8 manning spark sparkwithjava
Last synced: 09 Nov 2024
https://github.com/gvcgo/gogpt
A GPT TUI client with proxy supported.
chatgpt client go golang iflytek iflytek-spark openai proxy spark tui xf-spark xunfei xunfei-spark
Last synced: 11 Nov 2024
https://github.com/andrewpalumbo/mahout-samsara-book
Accompanying code examples for Apache Mahout: Beyond MapReduce. Distributed Algorithm Design.
distributed-algorithm mahout mahout-samsara-book spark spark-mllib-naivebayes
Last synced: 08 Nov 2024
https://github.com/aamend/pathogen
The rooster crows immediately before sunrise, the rooster causes the sun to rise
big-data bigdata causation contagion correlation datascience fcm graph graphx machine-learning spark
Last synced: 08 Nov 2024
https://github.com/aphp/uimaonspark
A way to run UIMA pipelines on Apache Spark
Last synced: 25 Nov 2024
https://github.com/oneoffcoder/pyspark-formula
R-like formula approach to Spark Dataframes
classification clustering dataframes interaction-design patsy pyspark regression rlike-formulas spark
Last synced: 14 Oct 2024
https://github.com/hsiehshujeng/cdk-emrserverless-with-delta-lake
This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome of the EMR Serverless application.
aws aws-cloudformation aws-service-catalog cdk-constructs delta-lake dotnet emr-notebooks emr-serverless emr-studio golang java javacript projen python serverless spark
Last synced: 16 Nov 2024
https://github.com/xianwill/spark-boilerplate
A boilerplate for spark projects with docker support for local development and scripts for emr support.
apache-spark boilerplate docker emr emr-cluster spark
Last synced: 14 Oct 2024
https://github.com/dirkster99/pynotes
My notebook on using Python with Jupyter Notebook, PySpark etc
dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp
Last synced: 01 Jan 2025
https://github.com/getyourguide/ddataflow
A tool to help you to test and develop pyspark code with sampled and local data
Last synced: 14 Nov 2024
https://github.com/miraisolutions/sparkgeo
Sparklyr extension package providing geospatial analytics capabilities
geospatial-analytics r spark sparklyr udf
Last synced: 18 Nov 2024
https://github.com/zekeriyyaa/traffic-data-analysis-with-apache-spark-based-on-mobile-robot-data
Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.
agv apache-spark big-data data-analysis data-visualization industrial-robot mobile-robot mongodb mssql pyqt5 pyspark python spark
Last synced: 09 Nov 2024
https://github.com/codingcat/kittenwhisker
debugging performance issues for Spark applications
apache-spark debugging flamegraph jvm jvm-performance performance spark
Last synced: 13 Oct 2024
https://github.com/r-spark/sparklyr.flint
Sparklyr extension making Flint time series library functionalities (https://github.com/twosigma/flint) easily accessible through R
apache-spark data-analysis data-mining data-science distributed distributed-computing flint r remote-clusters rstats spark sparklyr statistical-analysis statistics stats summarization summary-statistics time-series time-series-analysis twosigma-flint
Last synced: 30 Oct 2024
https://github.com/jgperrin/net.jgp.books.spark.ch11
Spark in Action, 2nd edition - chapter 11 - Working with SQL
apache-spark java java8 manning spark spark-sql sparkwithjava sql
Last synced: 09 Nov 2024
https://github.com/x4ax/lxss-install-zeppelin
Step by step guide on how to install Zeppelin 0.7.3 on Linux subsystem (WSL) for Windows 10
hadoop linux-subsystem lxss spark wsl zeppelin
Last synced: 04 Dec 2024
https://github.com/duhanmin/bigdata-sql-parser
Data lineage: supports lineage parsing for Spark SQL, Hive SQL, PG SQL, Presto SQL, MySQL SQL, TiDB SQL, Flink SQL, DataX, and spark/flink jar run commands; supports WITH syntax
datax flink hive mysql postgresql presto spark tidb trino
Last synced: 05 Nov 2024
https://github.com/brooksian/churnbabychurn
Telco Churn - Ensemble and Stacked Classifier Models
Last synced: 18 Nov 2024
https://github.com/sfproductlabs/tracker
Growth Tracker. GDPR friendly Telemetry. Subsystem of SFPL Experimentation Framework.
cassandra data-residency docker ecs elassandra gdpr gdpr-tracker google-analytics nats posthog privacy privacy-shield residency spark telemetry telemetry-data telemetry-system tracker user-tracker user-tracking
Last synced: 24 Nov 2024
https://github.com/asvyatkovskiy/scabillmatch
Policy diffusion in the US legislature
data-frame graph policy-diffusion spark tf-idf
Last synced: 18 Oct 2024
https://github.com/newfront/odsc-east-2020-decision-intelligence
This is the home of the 2020 Open Data Science Conference workshop (Creating Streaming Predictive Analytics and Decision Intelligence Systems with Apache Spark)
decision-intelligence-systems odsc odsc-east-2020 spark
Last synced: 02 Dec 2024
https://github.com/egemenzeytinci/data-science-notes
My own notes about data science
course-materials data-science machine-learning neo4j pandas python scala spark
Last synced: 20 Oct 2024
https://github.com/agoda-com/spark-hpopt
Bayesian hyperparameter tuning for Spark MLlib
hyperparameter-optimization hyperparameter-tuning machine-learning mllib model-selection scala spark
Last synced: 06 Nov 2024
https://github.com/jason-dai/cvpr2018
CVPR 2018 Tutorial
analytics-zoo bigdl cvpr deep-learning keras neural-network spark tensorflow
Last synced: 12 Oct 2024
https://github.com/jason-dai/aaai2019
AAAI 2019 Tutorial
aaai analytics bigdl deep-learning keras neural-network spark tensorflow
Last synced: 12 Oct 2024
https://github.com/archivesunleashed/twut
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
apache-spark spark spark-packages tweets twitter-data twitter-json
Last synced: 12 Oct 2024
https://github.com/airscholar/sparkingflow
This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala, and Java as examples.
apache-airflow dataengineering docker java pyspark scala spark
Last synced: 14 Nov 2024
https://github.com/da91666/daph
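Daph is a general-purpose, platform-level tool for data synchronization and data processing. It combines rich data synchronization capabilities with powerful data processing capabilities, covers all data development needs in one place, and can be used to build visual, configuration-driven data synchronization and processing platforms.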
Last synced: 11 Oct 2024
https://github.com/fancellu/graphx-citymap
CityMap coding test plus 3 solutions, 1 with Spark/GraphX
Last synced: 10 Nov 2024
https://github.com/lucasbotang/coursera_big_data_for_data_engineers
Assignments for Big Data for Data Engineers specialization on Coursera by Yandex.
Last synced: 25 Nov 2024