Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/douban/dpark

Python clone of Spark, a MapReduce-like framework in Python

bigdata dpark mapreduce python spark stream-processing

Last synced: 26 Mar 2024

https://github.com/minzhang-1/PointHop-PointHop2_Spark

A fast, low-memory version of PointHop and PointHop++, built on Apache Spark.

3d 3d-classification classification feature-extraction knn least-square-regression pca point-cloud pyspark python spark

Last synced: 26 Mar 2024

https://github.com/tencentmusic/cube-studio

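Cube Studio is an open-source, cloud-native, one-stop machine learning/deep learning AI platform. It supports SSO login, multi-tenancy and multiple project groups, big data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference serving, edge computing, serverless, an annotation platform, automated annotation, dataset management, large model fine-tuning, vLLM large model inference, LLMOps, private knowledge bases, an AI model app store, one-click model development/inference/fine-tuning, Chinese domestic CPU/GPU/NPU chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.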

ai aihub argo automl gpt inference kubeflow kubernetes llmops mlops notebook pipeline pytorch spark vgpu workflow

Last synced: 26 Mar 2024

https://github.com/deanwampler/JustEnoughScalaForSpark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

jupyter scala spark tutorial

Last synced: 26 Mar 2024

https://github.com/awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

dataquality scala spark unit-testing

Last synced: 23 Mar 2024

https://github.com/yahoo/TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

cluster featured machine-learning python scala spark tensorflow yahoo

Last synced: 23 Mar 2024

https://github.com/basin-etl/basin

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.

emr etl hadoop informatica odi pipeline pyspark spark

Last synced: 23 Mar 2024

https://github.com/combust/mleap

MLeap: Deploy ML Pipelines to Production

data-pipelines python scala scikit-learn spark tensorflow transformers

Last synced: 23 Mar 2024

https://github.com/delta-io/delta-sharing

An open protocol for secure data sharing

big-data data-sharing delta-lake pandas spark

Last synced: 21 Mar 2024

https://github.com/getyourguide/TypedPyspark

Type-annotate your Spark DataFrames and validate them

pyspark python spark typing

Last synced: 19 Mar 2024

https://github.com/Ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 18 Mar 2024

https://github.com/projectglow/glow

An open-source toolkit for large-scale genomic analysis

delta genomics gwas machine-learning population-genetics regression spark

Last synced: 18 Mar 2024

https://github.com/manuzhang/jupyterlab_spark

Spark Application UI extension for JupyterLab

jupyterlab jupyterlab-extension spark typescript

Last synced: 18 Mar 2024

https://github.com/itsjafer/jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark

Last synced: 18 Mar 2024

https://github.com/jupyter-server/enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.

enterprise gateway hacktoberfest jupyter jupyter-enterprise-gateway jupyter-kernels jupyter-notebook kernel kubernetes remote-kernels spark spark-on-kubernetes yarn

Last synced: 18 Mar 2024

https://github.com/vericast/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 18 Mar 2024

https://github.com/paypal/PPExtensions

Set of IPython and Jupyter extensions to improve user experience

gimel hive ipython-magic jupyer jupyter-extension magics notebooks spark tableau teradata

Last synced: 18 Mar 2024

https://github.com/krishnan-r/sparkmonitor

Monitor Apache Spark from Jupyter Notebook

extension jupyter spark

Last synced: 18 Mar 2024

https://github.com/asavinov/prosto

Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow

Last synced: 18 Mar 2024

https://github.com/garystafford/kafka-connect-msk-demo

For a series of posts on Amazon MSK, Amazon EKS, and Amazon EMR

aws kafka kafka-connect kubernetes spark spark-streaming

Last synced: 18 Mar 2024

https://github.com/Chabane/generator-mitosis

A micro-service infrastructure generator based on Yeoman/Chatbot, Kubernetes/Docker Swarm, Traefik, Ansible, Jenkins, Spark, Hadoop, Kafka, etc.

ansible chatbot docker elasticsearch golang jenkins kafka kibana kubernetes logstash machine-learning rust sonarqube spark swarm traefik vagrant yeoman-generator

Last synced: 16 Mar 2024

https://github.com/trK54Ylmz/kafka-spark-streaming-example

Simple example for Spark Streaming over a Kafka topic

java kafka spark stream-processing

Last synced: 16 Mar 2024