An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/uber/marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 23 Mar 2025

https://github.com/Kotlin/kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.

bigdata kotlin nullability scala spark

Last synced: 13 May 2025

https://github.com/houshanren/big_data_architect_skills

Skills a big data architect should master

analytics bigdata hadoop skills spark xuan-xing

Last synced: 05 Apr 2025

https://github.com/spotify/featran

A Scala feature transformation library for data science and machine learning

algebird breeze data flink ml scala scalding scio spark tensorflow xgboost

Last synced: 15 May 2025

https://github.com/kotlin/kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.

bigdata kotlin nullability scala spark

Last synced: 12 Apr 2025

https://github.com/azure/azuredatabricksbestpractices

Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs

azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark

Last synced: 04 Apr 2025

https://github.com/Azure/AzureDatabricksBestPractices

Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs

azure azuredatabricks deployment grafana performance performance-monitoring provisioning python scalability security spark

Last synced: 04 Dec 2024

https://github.com/tweag/sparkle

Haskell on Apache Spark.

analytics apache-spark haskell spark

Last synced: 16 May 2025

https://github.com/lucidworks/spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.

solr spark

Last synced: 08 May 2025

https://github.com/mrpowers-io/spark-fast-tests

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

spark testing-framework

Last synced: 15 May 2025

https://github.com/cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

apache-spark big-data pyspark spark

Last synced: 14 Feb 2025

https://github.com/supercowpowers/zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis

Last synced: 09 Apr 2025

https://github.com/commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

common-crawl commoncrawl pyspark spark sparksql warc-files wat-files wet

Last synced: 12 Jun 2025

https://github.com/datavane/datavines

Know your data better! Datavines is a next-generation data observability platform that supports metadata management and data quality.

dataobservability dataprofile dataquality datascience doris metadata spark

Last synced: 09 Apr 2025

https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 05 Apr 2025

https://github.com/SuperCowPowers/zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

bro data-analysis kafka networking pandas python scikit-learn security spark zeek zeek-analysis

Last synced: 27 Nov 2024

https://github.com/microsoft/hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 17 Jan 2025

https://github.com/gacwr/openuba

A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]

analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning nodejs react security siem sklearn spark tensorflow threathunting uba ueba user-behaviour

Last synced: 04 Apr 2025

https://github.com/apache/uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 15 May 2025

https://github.com/USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler

Last synced: 25 Mar 2025

https://github.com/zhaoyachao/zdh_web

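Big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, private-domain marketing, and more.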

bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl

Last synced: 04 Apr 2025

https://github.com/apache/incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 10 Mar 2025

https://github.com/kevinliao159/mydatascienceportfolio

Applying Data Science and Machine Learning to Solve Real World Business Problems

api data-science data-visualization machine-learning neural-networks nlp recommendation-system spark

Last synced: 05 Apr 2025

https://github.com/googleclouddataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 14 May 2025

https://github.com/teeyog/IQL

An ad hoc query service based on the Spark SQL engine.

spark sparksql

Last synced: 27 Mar 2025

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 15 May 2025

https://github.com/GoogleCloudDataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 25 Jan 2025

https://github.com/XuefengHuang/RecommendationSystem

Book recommender system using collaborative filtering based on Spark

collaborative-filtering python-flask recommendation-system spark

Last synced: 26 Mar 2025

https://github.com/groupon/sparklint

A tool for monitoring and tuning Spark jobs for efficiency.

performance-analysis scala spark

Last synced: 05 Apr 2025

https://github.com/jorgebucaran/spark.fish

▁▂▄▆▇█▇▆▄▂▁

fish fish-plugin spark

Last synced: 09 Apr 2025

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library which implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 06 Apr 2025

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 22 Jan 2025

https://github.com/twosigma/Cook

Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark

cluster gke kubernetes mesos scheduler spark

Last synced: 14 Mar 2025

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 07 Apr 2025

https://github.com/oap-project/raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.

ray spark

Last synced: 13 Apr 2025

https://github.com/miguno/wirbelsturm

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant

Last synced: 22 Jan 2025

https://github.com/alshdavid/crayon-router

Simple framework agnostic UI router for SPAs

react router spark svelte svelte-v3 vue

Last synced: 06 Apr 2025

https://github.com/lightbend/cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.

akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes

Last synced: 12 Apr 2025

https://github.com/sderosiaux/every-single-day-i-tldr

A daily digest of the articles or videos I've found interesting, that I want to share with you.

akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch

Last synced: 16 May 2025

https://github.com/kamu-data/kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network

blockchain data-as-code data-management data-science datafusion flink jupyter kamu open-data open-data-fabric spark sql

Last synced: 15 May 2025

https://github.com/neo4j/neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

bolt cypher hacktoberfest neo4j-connector neo4j-driver spark

Last synced: 15 May 2025

https://github.com/datawhalechina/juicy-bigdata

🎉🎉🐳 Datawhale Introduction to Big Data Processing tutorial | An introductory course for the big data technology track 🎉🎉

bigdata hadoop hbase hdfs hive mapreduce spark

Last synced: 09 Apr 2025

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 15 May 2025

https://github.com/aws/sagemaker-spark

A Spark library for Amazon SageMaker.

amazon-sagemaker aws machine-learning python sagemaker scala spark

Last synced: 14 May 2025

https://github.com/melin/superior-sql-parser

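ANTLR4-based SQL parsers for multiple databases that extract metadata from SQL, usable in several data platform scenarios: extracting metadata from DDL statements, SQL permission checks, table-level lineage, SQL syntax validation, and more. Supports Spark, Flink, Gauss, StarRocks, Oracle, MySQL, PostgreSQL, SQL Server, DB2, and others.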

flink gauss lineage metadata mysql parser postgres spark sql starrocks

Last synced: 04 Apr 2025

https://github.com/spotify/big-data-rosetta-code

Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code

bigdata scala scalding scio spark

Last synced: 16 May 2025

https://github.com/azure/azure-event-hubs

☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs

amqp apache azure c dotnet event-hubs eventhub eventhubs go golang java messaging microsoft node node-js nodejs python spark stream streaming

Last synced: 14 May 2025

https://github.com/Ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 25 Nov 2024

https://github.com/ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 16 May 2025

https://github.com/hbase-rdd/hbase-rdd

Spark RDD to read, write and delete from HBase

hbase scala spark

Last synced: 06 Apr 2025

https://github.com/apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

big-data data-orchestration etl graph graph-analysis graph-storage pyspark spark

Last synced: 16 May 2025

https://github.com/xd-deng/spark-practice

Apache Spark (PySpark) Practice on Real Data

pyspark spark

Last synced: 07 Apr 2025

https://github.com/flyteorg/flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows

Last synced: 14 May 2025

https://github.com/projectglow/glow

An open-source toolkit for large-scale genomic analysis

delta genomics gwas machine-learning population-genetics regression spark

Last synced: 25 Nov 2024

https://github.com/XD-DENG/Spark-practice

Apache Spark (PySpark) Practice on Real Data

pyspark spark

Last synced: 06 Mar 2025

https://github.com/jelmerk/hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

algorithm java k-nearest-neighbors knn-search pyspark scala spark

Last synced: 15 Apr 2025

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 27 Nov 2024

https://github.com/WeBankFinTech/Visualis

Visualis is a BI tool for data visualization. It provides financial-grade data visualization capabilities on the basis of data security and permissions, based on the open source project Davinci contributed by CreditEase.

appjoint datasource dataspherestudio davinci linkis scriptis spark superset tableau visualization

Last synced: 28 Mar 2025

https://github.com/piercingdan/spark-jupyter-aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 12 May 2025

https://github.com/oap-project/gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.

arrow native-kernels native-sql-engine spark vectorized-simd-optimizations

Last synced: 19 Mar 2025

https://github.com/bytedance/CloudShuffleService

Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 04 Apr 2025

https://github.com/bytedance/cloudshuffleservice

Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 07 Apr 2025

https://github.com/oeljeklaus-you/javaorbigdata-interview

A compilation of interview topics for Java developers and big data developers

bigdata hadoop interview java spark storm

Last synced: 08 May 2025

https://github.com/MLWhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 05 May 2025

https://github.com/mlwhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 06 Apr 2025

https://github.com/tencent/firestorm

Firestorm is a Remote Shuffle Service, and provides the capability for Apache Spark and Apache Hadoop MapReduce applications to store shuffle data on remote servers

mapreduce remoteshuffle shuffle spark

Last synced: 06 Apr 2025

https://github.com/paypal/gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata

Last synced: 16 May 2025

https://github.com/databrickslabs/dqx

Databricks framework to validate Data Quality of PySpark DataFrames

data-profiling data-quality data-quality-checks data-quality-monitoring databricks dlt spark spark-streaming

Last synced: 08 Apr 2025

https://github.com/saurfang/spark-knn

k-Nearest Neighbors algorithm on Spark

knn spark

Last synced: 06 Apr 2025

https://github.com/adidas/lakehouse-engine

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark

Last synced: 12 Apr 2025

https://github.com/mellanox/sparkrdma

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark

Last synced: 22 Jan 2025

https://github.com/mgalarnyk/installations_mac_ubuntu_windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

anaconda aws-ec2 ec2-instance python rstudio spark

Last synced: 06 Apr 2025

https://github.com/mGalarnyk/Installations_Mac_Ubuntu_Windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

anaconda aws-ec2 ec2-instance python rstudio spark

Last synced: 27 Nov 2024

https://github.com/absaoss/abris

Avro SerDe for Apache Spark structured APIs.

avro avro-schema kafka schema-registry spark

Last synced: 04 Apr 2025

https://github.com/ondra-m/ruby-spark

Ruby wrapper for Apache Spark

distributed rdd ruby ruby-spark spark

Last synced: 05 Apr 2025

https://github.com/huangfox/dpkb

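A collection of big data material, covering distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse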

flink hadoop hbase hive presto spark

Last synced: 27 Mar 2025

https://github.com/mkuthan/example-spark

Spark, Spark Streaming and Spark SQL unit testing strategies

spark spark-streaming testing

Last synced: 13 Apr 2025

https://github.com/iimeta/fastapi

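智元 Fast API is a one-stop API management system that brings various LLM APIs under a unified format, unified specification, and unified management, aiming for the best possible functionality, performance, and user experience.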

api chatgpt ernie-bot fast fastapi glm gpt gpt-4 openai qwen realtime spark

Last synced: 26 Nov 2024

https://github.com/neoremind/kraps-rpc

An RPC framework leveraging the Spark RPC module

rpc spark

Last synced: 09 Apr 2025