Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-01-22 00:29:18 UTC
https://github.com/apache/carbondata
High performance data store solution
apache big-data carbondata data-format hadoop java scala spark
Last synced: 21 Jan 2025
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 20 Jan 2025
https://github.com/jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
cluster jupyter jupyter-notebook kerberos kernel livy magic notebook pandas-dataframe pyspark spark sql-query
Last synced: 21 Jan 2025
https://github.com/moj-analytical-services/splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science
Last synced: 21 Jan 2025
https://github.com/DTStack/Taier
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system
Last synced: 30 Oct 2024
https://github.com/harisekhon/dockerfiles
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper
Last synced: 16 Jan 2025
https://github.com/HariSekhon/Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper
Last synced: 04 Nov 2024
https://github.com/kwai/blaze
A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
big-data datafusion rust-lang spark
Last synced: 16 Jan 2025
https://github.com/dtstack/taier
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system
Last synced: 16 Jan 2025
https://github.com/databricks/learningsparkv2
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming
Last synced: 18 Jan 2025
https://github.com/obenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 16 Jan 2025
https://github.com/mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
big-data big-data-analytics data-algorithms pyspark spark spark-dataframes spark-rdd
Last synced: 19 Jan 2025
https://github.com/apachecn/spark-doc-zh
Chinese edition of the official Apache Spark documentation
big-data documentation java spark
Last synced: 18 Jan 2025
https://github.com/OBenner/data-engineering-interview-questions
More than 2,000 data engineering interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 07 Nov 2024
https://github.com/abhishek-ch/around-dataengineering
A Data Engineering & Machine Learning Knowledge Hub
airflow data-engineering datascience devops infrastructure machine-learning mlops spark
Last synced: 19 Jan 2025
https://github.com/teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 17 Jan 2025
https://github.com/Teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 05 Nov 2024
https://github.com/datavane/datasophon
The next generation of cloud-native big data management expert, aiming to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.
cloudnative doris easy-to-use kubernetes spark yarn
Last synced: 16 Jan 2025
https://github.com/jacksu/utils4s
A collection of test cases and related reference material gathered while using Scala and Spark
akka breeze json4s scala scala-demo scala-spark spark spark-streaming
Last synced: 19 Jan 2025
https://github.com/pixiedust/pixiedust
Python Helper library for Jupyter Notebooks
data-science jupyter-notebook pixiedust python python-notebook scala-notebooks spark visualization
Last synced: 17 Jan 2025
https://github.com/TIBCOSoftware/snappydata
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
analytics memory-database scale snappydata spark stream transaction
Last synced: 18 Nov 2024
https://github.com/tibcosoftware/snappydata
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
analytics memory-database scale snappydata spark stream transaction
Last synced: 17 Jan 2025
https://ibm-cds-labs.github.io/pixiedust
Python Helper library for Jupyter Notebooks
data-science jupyter-notebook pixiedust python python-notebook scala-notebooks spark visualization
Last synced: 04 Oct 2024
https://github.com/projectnessie/nessie
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
aws-lambda data git iceberg java spark
Last synced: 15 Jan 2025
https://github.com/josonle/coding-now
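Study notes, along with eBooks I have read, video resources, and blogs, websites, and tools I have collected and found useful. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.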
bigdata coding ebook-collection hadoop-hive java linux notes spark
Last synced: 19 Jan 2025
https://github.com/deanwampler/spark-scala-tutorial
A free tutorial for Apache Spark.
Last synced: 19 Jan 2025
https://github.com/oeljeklaus-you/useractionanalyzeplatform
Big data platform for e-commerce user behavior analysis
accumulator hadoop java kyro spark spark-sql sparkjava
Last synced: 20 Jan 2025
https://github.com/h2oai/sparkling-water
Sparkling Water provides H2O functionality inside a Spark cluster
big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark
Last synced: 21 Jan 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 16 Jan 2025
https://github.com/sparklyr/sparklyr
R interface for Apache Spark
apache-spark distributed dplyr ide livy machine-learning r remote-clusters rstats spark sparklyr
Last synced: 16 Jan 2025
https://github.com/Microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 25 Oct 2024
https://github.com/redislabs/spark-redis
A connector for Spark that allows reading and writing to/from Redis cluster
Last synced: 16 Jan 2025
https://github.com/microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 06 Nov 2024
https://github.com/microsoft/mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 19 Jan 2025
https://github.com/RedisLabs/spark-redis
A connector for Spark that allows reading and writing to/from Redis cluster
Last synced: 31 Oct 2024
https://github.com/rstudio/sparklyr
R interface for Apache Spark
apache-spark distributed dplyr ide livy machine-learning r remote-clusters rstats spark sparklyr
Last synced: 07 Oct 2024
https://github.com/apache/celeborn
Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
Last synced: 17 Jan 2025
https://github.com/wzhe06/sparkctr
CTR prediction model based on Spark (LR, GBDT, DNN)
computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib
Last synced: 22 Jan 2025
https://github.com/wzhe06/SparkCTR
CTR prediction model based on Spark (LR, GBDT, DNN)
computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib
Last synced: 17 Dec 2024
https://github.com/apache/incubator-livy
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Last synced: 21 Jan 2025
https://github.com/alanchn31/data-engineering-projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 22 Jan 2025
https://github.com/pingcap/tispark
TiSpark is built for running Apache Spark on top of TiDB/TiKV
Last synced: 16 Jan 2025
https://github.com/typelevel/frameless
Expressive types for Spark.
fp functional-programming scala spark typelevel
Last synced: 16 Jan 2025
https://github.com/apache/datafusion-comet
Apache DataFusion Comet Spark Accelerator
Last synced: 16 Jan 2025
https://github.com/alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 08 Nov 2024
https://github.com/nvidia/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 16 Jan 2025
https://github.com/IBM/elasticsearch-spark-recommender
Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch
elasticsearch ibmcode jupyter python spark
Last synced: 12 Nov 2024
https://github.com/jadianes/spark-movie-lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
big-data bigdata flask movie-recommendation movielens-dataset python spark
Last synced: 22 Jan 2025
https://github.com/NVIDIA/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 05 Nov 2024
https://github.com/WeBankFinTech/Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin
Last synced: 23 Nov 2024
https://github.com/delta-io/delta-sharing
An open protocol for secure data sharing
big-data data-sharing delta-lake pandas spark
Last synced: 15 Jan 2025
https://github.com/lyhue1991/eat_pyspark_in_10_days
pyspark🍒🥭 is delicious, just eat it!😋😋
Last synced: 17 Jan 2025
https://github.com/cdapio/cdap
An open source framework for building data analytic applications.
cdap dataset integration java java-8 mapreduce middleware platform python spark spark-streaming unified
Last synced: 21 Jan 2025
https://github.com/HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 07 Nov 2024
https://github.com/mrpowers-io/spark-daria
Essential Spark extensions and helper methods ✨😲
Last synced: 20 Jan 2025
https://github.com/miguno/kafka-storm-starter
[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm
Last synced: 22 Jan 2025
https://github.com/a616567126/gpt-web-java
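AI chatbot based on JDK 8! WeChat official account: Midjourney image generation and card-key redemption. Web: supports ChatGPT, Midjourney image generation, SD image generation, card-key redemption, Yipay (易支付) payments, official-account traffic acquisition, and email registration🔥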
bard-api chatgpt google midjourney-api spark stable-diffusion
Last synced: 09 Nov 2024
https://github.com/mongodb/mongo-spark
The MongoDB Spark Connector
connector mongo-spark mongodb spark spark-packages
Last synced: 16 Jan 2025
https://github.com/lucacanali/sparkmeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
apache-spark performance-metrics performance-troubleshooting python scala spark
Last synced: 15 Jan 2025
https://github.com/LucaCanali/sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
apache-spark performance-metrics performance-troubleshooting python scala spark
Last synced: 25 Nov 2024
https://github.com/metabrainz/listenbrainz-server
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
big-data database listenbrainz-server music python react spark typescript web
Last synced: 08 Nov 2024
https://github.com/deanwampler/justenoughscalaforspark
A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.
Last synced: 15 Jan 2025
https://github.com/deanwampler/JustEnoughScalaForSpark
A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.
Last synced: 25 Oct 2024
https://github.com/WeBankFinTech/WeDataSphere
WeDataSphere is a financial grade, one-stop big data platform suite.
analytics bigdata data-analysis datafabric datagovernance dataspherestudio exchangis flink hadoop hive ide linkis prophecis qualitis schedulis scriptis spark streamis visualis
Last synced: 30 Oct 2024
https://github.com/jupyter-server/enterprise_gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
enterprise gateway hacktoberfest jupyter jupyter-enterprise-gateway jupyter-kernels jupyter-notebook kernel kubernetes remote-kernels spark spark-on-kubernetes yarn
Last synced: 17 Jan 2025
https://github.com/yanagishima/yanagishima
Web UI for Trino, Hive and SparkSQL
elasticsearch hive spark trino
Last synced: 30 Oct 2024
https://github.com/frees-io/freestyle
A cohesive & pragmatic framework of FP centric Scala libraries
architectural-patterns cassandra free-monads freestyle functional-programming kafka monads redis rpc scala spark tagless-final
Last synced: 20 Jan 2025
https://github.com/xubo245/SparkLearning
Learning Apache Spark, including code and data. Most parts can run locally.
Last synced: 31 Oct 2024
https://github.com/awslabs/data-on-eks
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform
Last synced: 08 Nov 2024
https://github.com/absaoss/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 19 Jan 2025
https://github.com/AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 05 Nov 2024
https://github.com/igniterealtime/spark
Cross-platform real-time collaboration client optimized for business and organizations.
collaboration cross-platform jabber java openfire spark xmpp xmpp-client
Last synced: 17 Jan 2025
https://github.com/qubole/sparklens
Qubole Sparklens tool for performance tuning Apache Spark
cluster performance performance-analysis performance-metrics performance-tuning performance-visualization scala scheduler scheduling simulation spark spark-applications spark-job spark-ml spark-mllib spark-sql sparkjava
Last synced: 18 Jan 2025
https://github.com/mvillarrealb/docker-spark-cluster
A simple Spark standalone cluster for your testing environment purposes
bigdata developer-tools docker-compose spark
Last synced: 22 Jan 2025
https://github.com/minio/sidekick
High Performance HTTP Sidecar Load Balancer
bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark
Last synced: 01 Nov 2024
https://github.com/harsha2010/magellan
Geo Spatial Data Analytics on Spark
big-data geojson geometric-algorithms geospatial geospatial-analysis geospatial-analytics geospatial-processing magellan shapefile spark sparksql
Last synced: 13 Nov 2024
https://github.com/housepower/ClickHouse-Native-JDBC
ClickHouse Native Protocol JDBC implementation
analytics clickhouse clickhouse-client database jdbc spark tcp-protocol
Last synced: 12 Nov 2024
https://github.com/Stratio/sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
analytics hdfs kafka lambda olap real-time scala spark spark-streaming sparksql sparta stratio stratio-sparta streaming streaming-data triggers workflow
Last synced: 16 Nov 2024
https://github.com/ankurchavda/streamify
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
airflow data-engineering dbt gcp kafka python spark
Last synced: 27 Nov 2024
https://github.com/running-elephant/moonbox
Moonbox is a DVtaaS (Data Virtualization as a Service) Platform
data-virtualization hive kudu moonbox spark virtual-database
Last synced: 18 Jan 2025
https://github.com/polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 16 Jan 2025
https://github.com/ing-bank/popmon
Monitor the stability of a Pandas or Spark dataframe ⚙︎
covariate-shift data-analysis data-distributions data-profiling data-science dataset-shifts drift-detection hacktoberfest ing-bank ipython jupyter mlops monitoring pandas population-monitoring python spark statistical-process-control statistical-tests statistics
Last synced: 16 Jan 2025
https://github.com/capitalone/datacompy
Pandas, Polars, and Spark DataFrame comparison for humans and more!
compare dask data data-science dataframes fugue numpy pandas polars pyspark python spark
Last synced: 16 Jan 2025
https://github.com/uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
avro-schema data-lake hadoop ingest-data schema-format spark
Last synced: 28 Oct 2024
https://github.com/nightscape/spark-excel
A Spark plugin for reading and writing Excel files
data-frame etl excel scala spark
Last synced: 17 Jan 2025
https://github.com/kotlin/kotlin-spark-api
This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
bigdata kotlin nullability scala spark
Last synced: 17 Jan 2025
https://github.com/Kotlin/kotlin-spark-api
This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
bigdata kotlin nullability scala spark
Last synced: 18 Nov 2024
https://github.com/rjurney/agile_data_code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant
Last synced: 20 Jan 2025
https://github.com/japila-books/spark-sql-internals
The Internals of Spark SQL
apache-spark book internals mkdocs-material spark spark-sql
Last synced: 18 Jan 2025
https://github.com/rjurney/Agile_Data_Code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant
Last synced: 27 Nov 2024
https://github.com/azure/azuredatabricksbestpractices
Version 1 of Technical Best Practices of Azure Databricks based on real-world customer and technical SME inputs
azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark
Last synced: 17 Jan 2025
https://github.com/Azure/AzureDatabricksBestPractices
Version 1 of Technical Best Practices of Azure Databricks based on real-world customer and technical SME inputs
azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark
Last synced: 04 Dec 2024
https://github.com/tweag/sparkle
Haskell on Apache Spark.
analytics apache-spark haskell spark
Last synced: 19 Jan 2025