Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

big-data java jdbc python r scala spark sql

Last synced: 20 Jan 2025

https://github.com/donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano

Last synced: 21 Jan 2025

https://github.com/getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

analytics athena bi bigquery business-intelligence dashboard databricks hacktoberfest javascript mysql postgresql python redash redshift spark spark-sql visualization

Last synced: 20 Jan 2025

https://github.com/yeasy/docker_practice

Learn and understand Docker & container technologies, with real DevOps practice!

book cloud-computing container devops docker kubernetes linux mesos spark swarm

Last synced: 20 Jan 2025

https://github.com/GaiZhenbiao/ChuanhuChatGPT

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

chatbot chatglm chatgpt-api claude dalle3 ernie gemini gemma inspurai llama midjourney minimax moss ollama qwen spark stablelm

Last synced: 25 Oct 2024

https://github.com/gaizhenbiao/chuanhuchatgpt

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

chatbot chatglm chatgpt-api claude dalle3 ernie gemini gemma inspurai llama midjourney minimax moss ollama qwen spark stablelm

Last synced: 20 Jan 2025

https://github.com/zhisheng17/flink-learning

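flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large-scale production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for my column 《大数据实时计算引擎 Flink 实战与性能优化》 ("Flink in Practice and Performance Tuning: A Real-Time Big Data Computing Engine") is welcome.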

clickhouse elasticsearch flink hbase influxdb kafka loki mysql opentsdb rabbitmq redis rocketmq spark stream-processing streaming

Last synced: 20 Jan 2025

https://github.com/FavioVazquez/ds-cheatsheets

List of Data Science Cheatsheets to rule the world

cheatsheet datascience jupyter programming python r spark

Last synced: 29 Oct 2024

https://github.com/faviovazquez/ds-cheatsheets

List of Data Science Cheatsheets to rule the world

cheatsheet datascience jupyter programming python r spark

Last synced: 29 Oct 2024

https://github.com/horovod/horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

baidu deep-learning deeplearning keras machine-learning machinelearning mpi mxnet pytorch ray spark tensorflow uber

Last synced: 20 Jan 2025

https://github.com/aalansehaiyang/technology-talk

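[Big Tech Interview Column] A technical guide for Java programmers, with interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a stronger version of yourself!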

dubbo es6 git hbase java kafka mycat spark spring springboot

Last synced: 21 Jan 2025

https://github.com/deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...

artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark

Last synced: 20 Jan 2025

https://github.com/apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 20 Jan 2025

https://github.com/apache/incubator-doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 14 Dec 2024

https://github.com/wangzhiwubigdata/god-of-bigdata

Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 21 Jan 2025

https://github.com/wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 30 Oct 2024

https://github.com/delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

acid analytics big-data delta-lake spark

Last synced: 20 Jan 2025

https://github.com/alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 20 Jan 2025

https://github.com/h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark

Last synced: 20 Jan 2025

https://github.com/Alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 29 Oct 2024

https://github.com/angel-ml/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming

Last synced: 15 Jan 2025

https://github.com/Angel-ML/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming

Last synced: 30 Oct 2024

https://github.com/apache/zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

big-data database flink java javascript nosql scala spark zeppelin

Last synced: 20 Jan 2025

https://github.com/donnemartin/dev-setup

macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

android-development aws bash cli cloud elasticsearch git iterm2 linux mac macos mongodb mysql nodejs postgresql python redis spark sublime-text vim

Last synced: 16 Jan 2025

https://github.com/yahoo/TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

cluster featured machine-learning python scala spark tensorflow yahoo

Last synced: 28 Oct 2024

https://github.com/yahoo/tensorflowonspark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

cluster featured machine-learning python scala spark tensorflow yahoo

Last synced: 21 Jan 2025

https://github.com/tencentmusic/cube-studio

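cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. It supports SSO login, multi-tenancy, big data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, vGPU inference serving, edge computing, serverless, an annotation platform, automated annotation, dataset management, large-model fine-tuning, vllm large-model inference, llmops, a private knowledge base, and an AI model app store. It also supports one-click model development/inference/fine-tuning, Chinese domestic CPU/GPU/NPU chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.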

ai aihub argo automl gpt inference kubeflow kubernetes llmops mlops notebook pipeline pytorch spark vgpu workflow

Last synced: 21 Jan 2025

https://github.com/RoaringBitmap/RoaringBitmap

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

bitset druid java lucene roaring-bitmaps roaringbitmap spark

Last synced: 08 Nov 2024

https://github.com/roaringbitmap/roaringbitmap

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

bitset druid java lucene roaring-bitmaps roaringbitmap spark

Last synced: 20 Jan 2025

https://github.com/lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 05 Nov 2024

https://github.com/lw-lin/coolplayspark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 17 Jan 2025

https://github.com/liyupi/sql-generator

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (heavy on logic, light on UI) that is well suited for practice.

ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3

Last synced: 16 Jan 2025

https://github.com/databricks/koalas

Koalas: pandas API on Apache Spark

big-data data-science dataframe mlflow pandas pydata spark

Last synced: 21 Jan 2025

https://github.com/awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

dataquality scala spark unit-testing

Last synced: 21 Jan 2025

https://github.com/apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

application-manager context-service engine hive hive-table impala jdbc jobserver linkis livy presto pyspark resource-manager rest-api scriptis spark sql storage thrift-server udf

Last synced: 21 Jan 2025

https://github.com/spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 16 Jan 2025

https://github.com/andypetrella/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 12 Oct 2024

https://github.com/webankfintech/dataspherestudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 21 Jan 2025

https://github.com/WeBankFinTech/DataSphereStudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 26 Oct 2024

https://github.com/spark-jobserver/spark-jobserver

REST job server for Apache Spark

rest-api scala spark spark-jobserver

Last synced: 21 Jan 2025

https://github.com/moran1607/bigdataguide

Big data learning: learn big data from scratch, with study videos for each stage of learning and interview materials.

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 15 Jan 2025

https://github.com/kubeflow/spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

Last synced: 21 Jan 2025

https://github.com/MoRan1607/BigDataGuide

Big data learning: learn big data from scratch, with study videos for each stage of learning and interview materials.

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 05 Nov 2024

https://github.com/douban/dpark

Python clone of Spark, a MapReduce-like framework in Python

bigdata dpark mapreduce python spark stream-processing

Last synced: 12 Oct 2024

https://github.com/vector4wang/spring-boot-quick

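:herb: Quick-start learning examples based on springboot, integrating open source frameworks the author has come across, such as rabbitmq (delayed queues), Kafka, jpa, redis, oauth2, swagger, jsp, docker, k3s, k3d, k8s, a mybatis encryption/decryption plugin, exception handling, log output, multi-module development, multi-environment packaging, caching, web crawlers, jwt, GraphQL, dubbo, zookeeper, Async, and more :pushpin: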

assembly druid dubbo elasticsearch hbase jwt k3d k3s k8s maven modules multi-data mybatis oauth2 rabbitmq spark spring-boot springboot sse swagger

Last synced: 17 Jan 2025

https://github.com/apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

big-data data-ingestion flink paimon real-time-analytics spark streaming-datalake table-store

Last synced: 21 Jan 2025

https://github.com/apache/incubator-paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

big-data data-ingestion flink paimon real-time-analytics spark streaming-datalake table-store

Last synced: 18 Dec 2024

https://github.com/lakesoul-io/lakesoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 15 Jan 2025

https://github.com/lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 30 Oct 2024

https://github.com/salesforce/transmogrifai

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

ai automated-machine-learning automl dsl einstein estimators feature-engineering features machine-learning ml pipelines salesforce scala spark sparkml structured-data transformations transformers transmogrification transmogrify

Last synced: 17 Jan 2025

https://github.com/salesforce/TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

ai automated-machine-learning automl dsl einstein estimators feature-engineering features machine-learning ml pipelines salesforce scala spark sparkml structured-data transformations transformers transmogrification transmogrify

Last synced: 30 Oct 2024

https://github.com/zio/zio-quill

Compile-time Language Integrated Queries for Scala

cassandra database jdbc linq mysql postgres scala scalajs spark sparksql

Last synced: 21 Jan 2025

https://github.com/apache/kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 21 Jan 2025

https://github.com/Qihoo360/Quicksql

A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources

flink hive spark sql

Last synced: 30 Oct 2024

https://github.com/qihoo360/quicksql

A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources

flink hive spark sql

Last synced: 17 Jan 2025

https://github.com/fugue-project/fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

dask data-practitioners distributed distributed-computing distributed-systems machine-learning pandas spark sql

Last synced: 21 Jan 2025

https://github.com/endymecy/spark-ml-source-analysis

Analysis of Spark ML algorithm principles and their concrete source code implementations

machine-learning source-analysis spark

Last synced: 18 Jan 2025

https://github.com/datastax/spark-cassandra-connector

DataStax Connector for Apache Spark to Apache Cassandra

cassandra scala spark

Last synced: 21 Jan 2025

https://github.com/xzb-1248/spark

✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to monitor and control all your devices anytime, anywhere.

dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell

Last synced: 16 Jan 2025

https://github.com/ytsaurus/ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

big-data clickhouse distributed-database lakehouse olap-database spark sql ytsaurus

Last synced: 15 Jan 2025

https://github.com/szilard/benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

data-science deep-learning gradient-boosting-machine h2o machine-learning python r random-forest spark xgboost

Last synced: 18 Jan 2025

https://github.com/XZB-1248/Spark

✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to monitor and control all your devices anytime, anywhere.

dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell

Last synced: 19 Nov 2024

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 13 Nov 2024

https://github.com/gchq/gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 21 Jan 2025

https://github.com/broadinstitute/gatk

Official code repository for GATK versions 4 and up

bioinformatics dna gatk genome genomics ngs science sequencing spark

Last synced: 22 Jan 2025

https://github.com/alexioannides/pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

data-engineering data-science etl etl-job etl-pipeline pyspark python spark

Last synced: 18 Jan 2025

https://github.com/apachecn/.github

ApacheCN open source organization: announcements, introduction, members, events, and contact channels

dl ml python pytorch solidity spark

Last synced: 16 Jan 2025

https://github.com/zhonghuasheng/tutorial

A summary of the full-stack backend (Java, Golang) knowledge architecture

emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial

Last synced: 16 Jan 2025

https://github.com/jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 18 Jan 2025

https://github.com/zhonghuasheng/Tutorial

A summary of the full-stack backend (Java, Golang) knowledge architecture

emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial

Last synced: 29 Oct 2024

https://github.com/o-gs/dji-firmware-tools

Tools for handling firmware of DJI products, with a focus on quadcopters.

ambarella dji elf firmware inspire mavic modding phantom reverse-engineering spark tools

Last synced: 16 Jan 2025

https://github.com/water8394/bigdata-interview

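:dart: :star2: [Big data interview questions] A collection of big data interview questions gathered from the web, together with my own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.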

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 18 Jan 2025

https://github.com/water8394/BigData-Interview

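:dart: :star2: [Big data interview questions] A collection of big data interview questions gathered from the web, together with my own answer summaries. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.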

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 30 Oct 2024

https://github.com/maxpumperla/elephas

Distributed Deep learning with Keras & Spark

deep-learning distributed-computing keras neural-networks spark

Last synced: 18 Jan 2025

https://github.com/collabh/bigdata-growth

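A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.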

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 16 Jan 2025

https://github.com/combust/mleap

MLeap: Deploy ML Pipelines to Production

data-pipelines python scala scikit-learn spark tensorflow transformers

Last synced: 21 Jan 2025

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 16 Jan 2025