Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/flint-bot/flint
Webex Bot SDK for Node.js (deprecated in favor of https://github.com/webex/webex-bot-node-framework)
Last synced: 19 Dec 2024
https://github.com/learningjournal/sparkprogramminginscala
Apache Spark Course Material
apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql
Last synced: 16 Jan 2025
https://github.com/snowch/movie-recommender-demo
This project walks through how you can create recommendations using Apache Spark machine learning. There are a number of Jupyter notebooks that you can run on IBM Data Science Experience, and there is a live demo of a movie recommendation web application you can interact with. The demo also uses IBM Message Hub (Kafka) to push application events to a topic, where they are consumed by a Spark Streaming job running on IBM BigInsights (Hadoop).
alternating-least-squares biginsights bluemix bokeh cloudant collaborative-filtering dsx hadoop hive ibm-biginsights ibm-bluemix jupyter-notebook kafka machine-learning messagehub notebook python-flask-application redis spark spark-streaming
Last synced: 17 Nov 2024
https://github.com/rogaha/data-processing-pipeline
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra
cassandra digital-ocean docker-machine kafka spark twitter twitter-streaming-api visualization
Last synced: 06 Nov 2024
https://github.com/kakao/cuesheet
A framework for writing Spark 2.x applications in a pretty way
apache-spark magic mango scala spark yarn
Last synced: 22 Jan 2025
https://github.com/apache/doris-spark-connector
Spark Connector for Apache Doris
apache connector data-warehousing dbms doris mpp olap spark
Last synced: 19 Jan 2025
https://github.com/apache/spark-kubernetes-operator
Apache Spark Kubernetes Operator
Last synced: 21 Jan 2025
https://github.com/oeljeklaus-you/sparkcore
Spark source code analysis, mainly covering the SparkContext source code, Executor process startup, Stage division, Task execution, and new features in Spark 2.0
scala spark spark-learning sparkcore
Last synced: 12 Nov 2024
https://github.com/huangyueranbbc/SparkDemo
Complete Spark example code demos (Java, Scala)
bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp
Last synced: 30 Oct 2024
https://github.com/mahmoudparsian/pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
algorithms big-data data data-abstractions data-science dataframe distributed-computing graphframes mapreduce monoid nosql partitioning pyspark pyspark-algorithms python rdd spark transformations
Last synced: 06 Nov 2024
https://github.com/chabane/generator-mitosis
A micro-service infrastructure generator based on Yeoman/Chatbot, Kubernetes/Docker Swarm, Traefik, Ansible, Jenkins, Spark, Hadoop, Kafka, etc.
ansible chatbot docker elasticsearch golang jenkins kafka kibana kubernetes logstash machine-learning rust sonarqube spark swarm traefik vagrant yeoman-generator
Last synced: 01 Nov 2024
https://github.com/Chabane/generator-mitosis
A micro-service infrastructure generator based on Yeoman/Chatbot, Kubernetes/Docker Swarm, Traefik, Ansible, Jenkins, Spark, Hadoop, Kafka, etc.
ansible chatbot docker elasticsearch golang jenkins kafka kibana kubernetes logstash machine-learning rust sonarqube spark swarm traefik vagrant yeoman-generator
Last synced: 04 Nov 2024
https://github.com/docandrew/CuBit
General-purpose, formally-verified, 64-bit operating system in SPARK/Ada for x86-64
Last synced: 25 Oct 2024
https://github.com/simplexspatial/osm4scala
Scala and Spark library focused on reading OpenStreetMap Pbf files.
gis openstreetmap openstreetmap-pbf-files osm pbf scala spark
Last synced: 11 Oct 2024
https://github.com/azure/azure-kusto-spark
Apache Spark Connector for Azure Kusto
Last synced: 21 Jan 2025
https://github.com/ibm/kafka-streaming-click-analysis
Use Kafka and Apache Spark streaming to perform click stream analytics
apache-spark clickstream data-science ibm-data-science-experience ibmcode jupyter-notebook kafka spark structured-streaming
Last synced: 22 Jan 2025
https://github.com/coxautomotivedatasolutions/waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
data-engineering hadoop scala spark
Last synced: 12 Oct 2024
https://github.com/ehsanmok/spark-lp
Distributed Linear Programming Solver on top of Apache Spark
distributed-computing distributed-optimization high-performance linear-programming scala spark
Last synced: 10 Jan 2025
https://github.com/yokawasa/databricks-notebooks
Collection of sample Databricks Spark notebooks (mostly for Azure Databricks)
azure azuredatabricks databricks elt python spark streaming
Last synced: 30 Oct 2024
https://github.com/cbilgili/zemberek-nlp-server
REST Docker server built on the Zemberek Turkish NLP Java library
docker javascript nlp part-of-speech-tagger rest sentence-tokenizer spark turkish turkish-language zemberek
Last synced: 12 Nov 2024
https://github.com/nashtech-labs/lambda-arch-spark
apache-spark cassandra kafka lambda-architecture spark
Last synced: 12 Oct 2024
https://github.com/hibayesian/spark-fm
A parallel implementation of factorization machines based on Spark
factorization-machines machine-learning spark
Last synced: 23 Nov 2024
https://github.com/wey-gu/nebulagraph-ai
(Pre-alpha) NebulaGraph AI high-level API: run graph algorithms and analytics in 4 lines of code.
graph graph-algorithms hacktoberfest nebulagraph networkx spark
Last synced: 19 Jan 2025
https://github.com/cloudposse/terraform-aws-emr-cluster
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
emr emr-cluster emr-notebooks emrfs hadoop hcl2 hive presto spark terraform terraform-aws terraform-module terraform-modules
Last synced: 18 Jan 2025
https://github.com/swoop-inc/spark-records
Bulletproof Apache Spark jobs with fast root cause analysis of failures.
apache-spark big-data scala spark spark-records sparksql swoop
Last synced: 12 Oct 2024
https://github.com/wookey-project/ewok-kernel
A secure, high-performance microkernel for building secure MCU-based IoT devices
ada arm armv7m embedded ewok ewok-kernel microcontroller microcontroller-firmware microkernel security spark
Last synced: 25 Oct 2024
https://github.com/zaratsian/spark
Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References
machine-learning nlp pyspark spark text-analysis
Last synced: 07 Nov 2024
https://github.com/jaceklaskowski/spark-kubernetes-book
The Internals of Spark on Kubernetes
apache-spark book internals kubernetes spark
Last synced: 12 Oct 2024
https://github.com/vesoft-inc/nebula-algorithm
Nebula-Algorithm is a Spark application based on GraphX that enables state-of-the-art graph algorithms to run on top of NebulaGraph and write results back to NebulaGraph.
graph-algorithm graph-database graphx hacktoberfest nebulagraph spark
Last synced: 21 Jan 2025
https://github.com/src-d/jgit-spark-connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
datasource git pyspark python scala spark
Last synced: 16 Dec 2024
https://github.com/samelamin/spark-bigquery
Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
bigquery data-frame schema spark
Last synced: 12 Oct 2024
https://github.com/pofulu/Spark-AR-PFTools
Utilities for Meta Spark Studio (Spark AR)
ar facebook meta-spark-studio picker scripting spark spark-ar spark-ar-pftools spark-ar-studio sparkar sparkar-screen
Last synced: 14 Oct 2024
https://github.com/ibm/sparksql-for-hbase
Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers
apache-spark hadoop-hdfs hbase ibmcode nosql spark sql
Last synced: 12 Oct 2024
https://github.com/nielsbasjes/splittablegzip
Splittable Gzip codec for Hadoop
codec gzip gzip-codec gzipped-files hadoop mapreduce-java pig spark splittable
Last synced: 15 Jan 2025
https://github.com/wanghan0501/usersessionbehaviorofflineanalysis
Offline analysis project for user session behavior data (Sichuan University, 拓思爱诺)
Last synced: 11 Nov 2024
https://github.com/jwplayer/sparksteps
CLI tool to launch Spark jobs on AWS EMR
Last synced: 05 Nov 2024
https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio
Last synced: 22 Jan 2025
https://github.com/starlake-ai/starlake
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
bigquery data-engineering data-integration data-pipeline etl hdfs redshift snowflake spark synapse
Last synced: 15 Jan 2025
https://github.com/zuinnote/hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
analyze-office-documents bigdata excel flink hadoop hadoop-ecosystem hadoopoffice hive office poi spark
Last synced: 14 Oct 2024
https://github.com/scylladb/scylla-migrator
Migrate data to Scylla using Spark, normally from Cassandra or Parquet files; alternatively, from DynamoDB to Scylla Alternator.
alternator dynamodb migration scylladb spark
Last synced: 21 Jan 2025
https://github.com/ansrivas/spark-structured-streaming
Spark structured streaming with Kafka data source and writing to Cassandra
cassandra kafka kafka-topic spark
Last synced: 14 Oct 2024
https://github.com/gaglia88/sparker
SparkER: an Entity Resolution framework for Apache Spark
apache apache-spark entity entity-resolution meta-blocking python python-library python27 python3 resolution scala spark
Last synced: 20 Jan 2025
https://github.com/ing-bank/entitymatchingmodel
Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
Last synced: 20 Jan 2025
https://github.com/treebeardtech/kubeflow-bootstrap
🪐 1-click Kubeflow using ArgoCD
ai airflow argocd dask gpu helm jupyter jupyterhub jupyterlab kserve kubeflow kubernetes kustomize llms machine-learning mlflow ray spark terraform
Last synced: 16 Jan 2025
https://github.com/garystafford/kafka-connect-msk-demo
For a series of posts on Amazon MSK, Amazon EKS, and Amazon EMR
aws kafka kafka-connect kubernetes spark spark-streaming
Last synced: 06 Dec 2024
https://github.com/oneoffcoder/books
A collection of online books for data science, computer science and coding!
books coder computer-science data-science docker java python r scikit-learn scratch software software-development software-engineering spark sphinx tutorials
Last synced: 05 Nov 2024
https://github.com/vivek-bombatkar/mylearningnotes
Because it's never too late to start taking notes and make them public...
blockchain hadoop hive pandas python spark sparkml
Last synced: 21 Jan 2025
https://github.com/rubenafo/docker-spark-cluster
A Spark cluster setup running on Docker containers
big-data docker docker-image hadoop openjdk scala spark
Last synced: 13 Oct 2024
https://github.com/spatialx-project/geolake
Universal solution for geospatial data tailored to data lakehouse systems for the first time in the industry
geospatial geospatial-analysis geospatial-processing iceberg spark spatial spatial-data
Last synced: 22 Jan 2025
https://github.com/mrpowers/spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
cosine-distance double-metaphone fuzzy-score hamming-distance jaccard-similarity jaro-winkler nysiis refined-soundex spark
Last synced: 28 Oct 2024
https://github.com/mobiletelesystems/onetl
One ETL tool to rule them all
etl etl-components etl-pipeline hwm plugin-system pydantic spark
Last synced: 17 Nov 2024
https://github.com/zhuyuqing/bestconf
A tool automatically improving the performance of large-scale systems by finding better configuration settings
benchmark cassandra configuration hadoop hive mysql optimization performance spark tomcat tuning
Last synced: 05 Nov 2024
https://github.com/potix2/spark-google-spreadsheets
Google Spreadsheets datasource for SparkSQL and DataFrames
data-frame scala spark sparksql spreadsheet
Last synced: 14 Oct 2024
https://github.com/jaceklaskowski/spark-streaming-notebook
Notes about Spark Streaming in Apache Spark
apache-spark notebook spark spark-streaming
Last synced: 08 Nov 2024
https://github.com/wordpress/openverse-catalog
Identifies and collects data on CC-licensed content across web crawl data and public APIs.
airflow apache-airflow creative-commons hacktoberfest openverse pytest python search-engine spark
Last synced: 19 Jan 2025
https://github.com/turboway/pybigdata
Using Python to work with various big data components
elasticsearch hadoop hbase hive impala kafka mapreduce spark
Last synced: 15 Nov 2024
https://github.com/googlecloudplatform/serverless-spark-workshop
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases
Last synced: 07 Oct 2024
https://github.com/kislerdm/data-engineering-interviews
Data engineering interview Q&A for the data community, by the data community
dataengineering interview-questions kafka linux opensource python spark sql
Last synced: 11 Nov 2024
https://github.com/goldmansachs/tablasco
Tablasco is a JUnit rule for comparing tables and a Spark module for comparing large data sets
avro integration java junit regression spark tablasco testing
Last synced: 07 Nov 2024
https://github.com/fancellu/zio-restful-webservice
ZIO 2.0 Restful webservice example using zio, zio-http, zio-json, quill, H2, twirl, zio-streams, zio-cache, zio-logging, zio-actors, zio-spark, openai, DallE
dalle2 h2-database openai quill scala spark twirl zio zio-actors zio-cache zio-http zio-logging zio-spark zio-streams
Last synced: 10 Nov 2024
https://github.com/ashleymarkfletcher/spark-ar-boilerplate
A boilerplate Spark AR project with Webpack
augmented-reality spark spark-ar spark-ar-studio sparkar webpack
Last synced: 14 Oct 2024
https://github.com/myamafuj/hadoop-hive-spark-docker
Hadoop-Hive-Spark cluster + Jupyter on Docker
docker hadoop hive jupyter jupyter-notebook pyspark spark
Last synced: 11 Nov 2024
https://github.com/dimajix/spark-training
Repository used for Spark Trainings
hadoop hadoop-training hive pyspark python scala spark spark-ml spark-streaming spark-training sqoop
Last synced: 09 Nov 2024
https://github.com/aws/sagemaker-sparkml-serving-container
This code is used to build & run a Docker container for performing predictions against a Spark ML Pipeline.
inference inference-pipeline machine-learning mleap mleap-serialized-spark pipeline sagemaker serving spark sparkml
Last synced: 07 Oct 2024
https://github.com/iobruno/data-engineering-zoomcamp
Data Engineering examples for Airflow, Prefect, and Mage.ai; dbt for BigQuery, Redshift, ClickHouse, PostgreSQL; Spark/PySpark for Batch processing; and Kafka for Stream processing
airflow airflow-dags dbt-bigquery dbt-clickhouse dbt-postgres dbt-redshift kafka ksqldb mageai prefect pyspark spark typer-cli
Last synced: 14 Dec 2024
https://github.com/distributedsystemsgroup/zoe
Zoe: Container Analytics as a Service -- mirror of https://gitlab.eurecom.fr/zoe/main/
analytics containers data jupyter python spark
Last synced: 13 Nov 2024
https://github.com/vigneshss-07/cloud-ai-analytics
This repo contains details on data engineering tech stacks in GCP
apachebeam bigdata bigquery clouddataflow cloudsql datalab google-cloud-platform spark
Last synced: 20 Jan 2025
https://github.com/yaooqinn/spark-ranger
Already merged into apache/incubator-kyuubi. ACL management for Apache Spark SQL with Apache Ranger.
acl authorization data-masking ranger row-level-security spark sparksql
Last synced: 01 Oct 2024
https://github.com/squashql/squashql
Official repository of SquashQL, the SQL query engine for multi-dimensional and hierarchical analysis that empowers your SQL database
bigquery clickhouse database duckdb java jdbc query querybuilder snowflake spark sql typescript
Last synced: 14 Dec 2024
https://github.com/logicalclocks/feature-store-api
Python - Java/Scala API for the Hopsworks feature store
feature-store hopsworks hsfs python scala spark
Last synced: 21 Jan 2025
https://github.com/paypal/PPExtensions
Set of IPython and Jupyter extensions to improve user experience
gimel hive ipython-magic jupyer jupyter-extension magics notebooks spark tableau teradata
Last synced: 07 Nov 2024
https://github.com/pnavaro/big-data
Python tools for big data
dask data-science hadoop jupyter-book notebooks python spark
Last synced: 02 Nov 2024
https://github.com/geotrellis/geotrellis-chatta-demo
Demo of GeoTrellis - weighted overlay and zonal summary for University of Tennessee at Chattanooga.
chattanooga geodocker-cluster geotrellis s3 spark
Last synced: 11 Nov 2024
https://github.com/zaleslaw/spark-tutorial
How do you build your first Spark application with MLlib, Structured Streaming, GraphFrames, Datasets and so on? The answer is here!
kafka spark streaming structured-streaming
Last synced: 17 Nov 2024
https://github.com/zhonghuasheng/java_love_go
Focused on Java and Golang! Java basics, Java Core, JVM, the Spring family, the Golang language, and various middleware (such as rabbitmq, netty, mybatis, redis, mongodb, Spark, etc.)
java java-8 jdbc mybatis spark spring springboot springmvc
Last synced: 16 Jan 2025
https://github.com/Merck/rdf2x
RDF2X converts big RDF datasets to the relational database model, CSV, JSON and ElasticSearch.
conversion json linked-data postgresql rdf spark sparql sql
Last synced: 18 Jan 2025
https://github.com/contiamo/rhombic
SQL parsing, lineage extraction and manipulation
lineage parser postgresql spark sql sql-lineage
Last synced: 07 Nov 2024
https://github.com/hydrospheredata/spark-ml-serving
Spark ML Lib serving library
inference scoring serving spark
Last synced: 28 Nov 2024
https://github.com/dcfjs/dcf
Yet another distributed compute framework
distributed-computing nodejs spark
Last synced: 22 Jan 2025
https://github.com/jonathandinu/spark-ray-data-science
Supporting content (slides and exercises) for the Pearson video series covering best practices for developing scalable applications with Spark and Ray in the context of a data scientist's standard workflow.
artificial-intelligence data-science distributed-computing machine-learning python ray spark
Last synced: 15 Nov 2024
https://github.com/mozilla/emr-bootstrap-spark
AWS bootstrap scripts for Mozilla's flavoured Spark setup.
Last synced: 29 Sep 2024
https://github.com/rstudio/sparkxgb
R interface for XGBoost on Spark
apache-spark machine-learning r rstats spark xgboost
Last synced: 10 Nov 2024
https://github.com/tomaztk/azure-databricks
Azure Databricks - Advent of 2020 Blogposts
azure-data-factory azure-databricks azure-machine-learnning data-analytics data-engineerg databricks databricks-notebooks machine-learning mlflow mllib notebook notebooks pyspark python r-language scala spark spark-structured-streaming sparkr sql
Last synced: 19 Nov 2024
https://github.com/merck/rdf2x
RDF2X converts big RDF datasets to the relational database model, CSV, JSON and ElasticSearch.
conversion json linked-data postgresql rdf spark sparql sql
Last synced: 16 Nov 2024
https://github.com/johnsnowlabs/johnsnowlabs
Gateway into the John Snow Labs Ecosystem
bert databricks gpt machine-learning natural-language-processing nlp python seq2seq spark t5
Last synced: 18 Nov 2024
https://github.com/xd-deng/spark-ml-intro
PySpark Machine Learning Examples
Last synced: 16 Oct 2024