Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/collabH/bigdata-growth

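A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.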

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 31 Oct 2024

https://github.com/apache/carbondata

High performance data store solution

apache big-data carbondata data-format hadoop java scala spark

Last synced: 21 Jan 2025

https://github.com/jupyter-incubator/sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

cluster jupyter jupyter-notebook kerberos kernel livy magic notebook pandas-dataframe pyspark spark sql-query

Last synced: 21 Jan 2025

https://github.com/moj-analytical-services/splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science

Last synced: 21 Jan 2025

https://github.com/DTStack/Taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 30 Oct 2024

https://github.com/harisekhon/dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 16 Jan 2025

https://github.com/HariSekhon/Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 04 Nov 2024

https://github.com/kwai/blaze

A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

big-data datafusion rust-lang spark

Last synced: 16 Jan 2025

https://github.com/dtstack/taier

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 16 Jan 2025

https://github.com/databricks/learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming

Last synced: 18 Jan 2025

https://github.com/mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark

big-data big-data-analytics data-algorithms pyspark spark spark-dataframes spark-rdd

Last synced: 19 Jan 2025

https://github.com/apachecn/spark-doc-zh

Chinese translation of the official Apache Spark documentation

big-data documentation java spark

Last synced: 18 Jan 2025

https://github.com/teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 17 Jan 2025

https://github.com/Teradata/kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 05 Nov 2024

https://github.com/datavane/datasophon

A next-generation cloud-native big data management expert that aims to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.

cloudnative doris easy-to-use kubernetes spark yarn

Last synced: 16 Jan 2025

https://github.com/jacksu/utils4s

A collection of test cases and related materials gathered while using Scala and Spark

akka breeze json4s scala scala-demo scala-spark spark spark-streaming

Last synced: 19 Jan 2025

https://github.com/TIBCOSoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 18 Nov 2024

https://github.com/tibcosoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 17 Jan 2025

https://github.com/projectnessie/nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

aws-lambda data git iceberg java spark

Last synced: 15 Jan 2025

https://github.com/josonle/coding-now

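Study notes, plus eBooks and video resources I have gone through, along with blogs, websites, and tools I have collected and found useful. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more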

bigdata coding ebook-collection hadoop-hive java linux notes spark

Last synced: 19 Jan 2025

https://github.com/twosigma/flint

A Time Series Library for Apache Spark

spark timeseries

Last synced: 20 Jan 2025

https://github.com/bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

avro big-data bioinformatics genomics java parquet python r scala spark

Last synced: 21 Jan 2025

https://github.com/deanwampler/spark-scala-tutorial

A free tutorial for Apache Spark.

jupyter scala spark tutorial

Last synced: 19 Jan 2025

https://github.com/oeljeklaus-you/useractionanalyzeplatform

Big data platform for e-commerce user behavior analysis

accumulator hadoop java kyro spark spark-sql sparkjava

Last synced: 20 Jan 2025

https://github.com/h2oai/sparkling-water

Sparkling Water provides H2O functionality inside a Spark cluster

big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark

Last synced: 21 Jan 2025

https://github.com/redislabs/spark-redis

A connector for Spark that allows reading from and writing to a Redis cluster

dataframe java redis spark

Last synced: 16 Jan 2025

https://github.com/RedisLabs/spark-redis

A connector for Spark that allows reading from and writing to a Redis cluster

dataframe java redis spark

Last synced: 31 Oct 2024

https://github.com/apache/celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

bigdata shuffle spark

Last synced: 17 Jan 2025

https://github.com/wzhe06/sparkctr

CTR prediction models based on Spark (LR, GBDT, DNN)

computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib

Last synced: 15 Jan 2025

https://github.com/wzhe06/SparkCTR

CTR prediction models based on Spark (LR, GBDT, DNN)

computational-advertising ctr-prediction machine-learning scala spark spark-ml spark-mllib

Last synced: 17 Dec 2024

https://github.com/fayson/cdhproject

Usage of the various Hadoop components, continuously updated

java scala spark

Last synced: 19 Jan 2025

https://github.com/apache/incubator-livy

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

apachelivy bigdata livy spark

Last synced: 21 Jan 2025

https://github.com/pingcap/tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV

bigdata spark tidb tikv

Last synced: 16 Jan 2025

https://github.com/typelevel/frameless

Expressive types for Spark.

fp functional-programming scala spark typelevel

Last synced: 16 Jan 2025

https://github.com/apache/datafusion-comet

Apache DataFusion Comet Spark Accelerator

arrow datafusion rust spark

Last synced: 16 Jan 2025

https://github.com/nvidia/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 16 Jan 2025

https://github.com/IBM/elasticsearch-spark-recommender

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

elasticsearch ibmcode jupyter python spark

Last synced: 12 Nov 2024

https://github.com/jadianes/spark-movie-lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

big-data bigdata flask movie-recommendation movielens-dataset python spark

Last synced: 15 Jan 2025

https://github.com/NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 05 Nov 2024

https://github.com/WeBankFinTech/Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin

Last synced: 23 Nov 2024

https://github.com/delta-io/delta-sharing

An open protocol for secure data sharing

big-data data-sharing delta-lake pandas spark

Last synced: 15 Jan 2025

https://github.com/lyhue1991/eat_pyspark_in_10_days

pyspark 🍒🥭 is delicious, just eat it! 😋😋

pyspark spark

Last synced: 17 Jan 2025

https://github.com/cdapio/cdap

An open source framework for building data analytic applications.

cdap dataset integration java java-8 mapreduce middleware platform python spark spark-streaming unified

Last synced: 21 Jan 2025

https://github.com/HariSekhon/DevOps-Python-tools

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 07 Nov 2024

https://github.com/mrpowers-io/spark-daria

Essential Spark extensions and helper methods ✨😲

dataframe spark

Last synced: 20 Jan 2025

https://github.com/miguno/kafka-storm-starter

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm

Last synced: 22 Jan 2025

https://github.com/a616567126/gpt-web-java

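A JDK 8-based AI chatbot! WeChat official account: Midjourney image generation and redemption-code exchange. Web: supports ChatGPT, Midjourney image generation, SD image generation, redemption-code exchange, EPay payments, official-account traffic referral, and email registration 🔥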

bard-api chatgpt google midjourney-api spark stable-diffusion

Last synced: 09 Nov 2024

https://github.com/mongodb/mongo-spark

The MongoDB Spark Connector

connector mongo-spark mongodb spark spark-packages

Last synced: 16 Jan 2025

https://github.com/lucacanali/sparkmeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 15 Jan 2025

https://github.com/LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 25 Nov 2024

https://github.com/metabrainz/listenbrainz-server

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

big-data database listenbrainz-server music python react spark typescript web

Last synced: 08 Nov 2024

https://github.com/deanwampler/justenoughscalaforspark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

jupyter scala spark tutorial

Last synced: 15 Jan 2025

https://github.com/deanwampler/JustEnoughScalaForSpark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

jupyter scala spark tutorial

Last synced: 25 Oct 2024

https://github.com/jupyter-server/enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.

enterprise gateway hacktoberfest jupyter jupyter-enterprise-gateway jupyter-kernels jupyter-notebook kernel kubernetes remote-kernels spark spark-on-kubernetes yarn

Last synced: 17 Jan 2025

https://github.com/yanagishima/yanagishima

Web UI for Trino, Hive and SparkSQL

elasticsearch hive spark trino

Last synced: 30 Oct 2024

https://github.com/xubo245/SparkLearning

Learning Apache Spark, including code and data. Most parts can run locally.

learning ml spark sparkcore

Last synced: 31 Oct 2024

https://github.com/awslabs/data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform

Last synced: 08 Nov 2024

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 05 Nov 2024

https://github.com/absaoss/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 19 Jan 2025

https://github.com/igniterealtime/spark

Cross-platform real-time collaboration client optimized for business and organizations.

collaboration cross-platform jabber java openfire spark xmpp xmpp-client

Last synced: 17 Jan 2025

https://github.com/mvillarrealb/docker-spark-cluster

A simple Spark standalone cluster for your testing environment purposes

bigdata developer-tools docker-compose spark

Last synced: 15 Jan 2025

https://github.com/minio/sidekick

High Performance HTTP Sidecar Load Balancer

bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark

Last synced: 01 Nov 2024

https://github.com/housepower/ClickHouse-Native-JDBC

ClickHouse Native Protocol JDBC implementation

analytics clickhouse clickhouse-client database jdbc spark tcp-protocol

Last synced: 12 Nov 2024

https://github.com/ankurchavda/streamify

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

airflow data-engineering dbt gcp kafka python spark

Last synced: 27 Nov 2024

https://github.com/running-elephant/moonbox

Moonbox is a DVtaaS (Data Virtualization as a Service) Platform

data-virtualization hive kudu moonbox spark virtual-database

Last synced: 18 Jan 2025

https://github.com/raray-chuan/xichuan_note

xichuan's study summary notes, covering Java, Spring, other commonly used Java frameworks, and big data components 📚

bigdata elk flink hadoop hbase hive java juc jvm kafaka kafka redis spark spring springcloud zabbix zookeeper

Last synced: 19 Jan 2025

https://github.com/capitalone/datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!

compare dask data data-science dataframes fugue numpy pandas polars pyspark python spark

Last synced: 16 Jan 2025

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 28 Oct 2024

https://github.com/nightscape/spark-excel

A Spark plugin for reading and writing Excel files

data-frame etl excel scala spark

Last synced: 17 Jan 2025

https://github.com/houshanren/big_data_architect_skills

Skills a big data architect should master

analytics bigdata hadoop skills spark xuan-xing

Last synced: 20 Jan 2025

https://github.com/spotify/featran

A Scala feature transformation library for data science and machine learning

algebird breeze data flink ml scala scalding scio spark tensorflow xgboost

Last synced: 17 Jan 2025

https://github.com/kotlin/kotlin-spark-api

This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

bigdata kotlin nullability scala spark

Last synced: 17 Jan 2025

https://github.com/Kotlin/kotlin-spark-api

This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

bigdata kotlin nullability scala spark

Last synced: 18 Nov 2024

https://github.com/azure/azuredatabricksbestpractices

Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs

azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark

Last synced: 17 Jan 2025

https://github.com/Azure/AzureDatabricksBestPractices

Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs

azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark

Last synced: 04 Dec 2024

https://github.com/tweag/sparkle

Haskell on Apache Spark.

analytics apache-spark haskell spark

Last synced: 19 Jan 2025