Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with spark

A curated list of projects in awesome lists tagged with spark.

https://github.com/teeyog/IQL

An ad hoc query service based on the Spark SQL engine.

spark sparksql

Last synced: 31 Jul 2024

https://github.com/gacwr/openuba

A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]

analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning nodejs react security siem sklearn spark tensorflow threathunting uba ueba user-behaviour

Last synced: 26 Sep 2024

https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 03 Aug 2024

https://github.com/XuefengHuang/RecommendationSystem

Book recommender system using collaborative filtering based on Spark

collaborative-filtering python-flask recommendation-system spark

Last synced: 31 Jul 2024

https://github.com/apache/incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 01 Aug 2024

https://github.com/groupon/sparklint

A tool for monitoring and tuning Spark jobs for efficiency.

performance-analysis scala spark

Last synced: 26 Sep 2024

https://github.com/GoogleCloudDataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 30 Sep 2024

https://github.com/googleclouddataproc/spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark

Last synced: 28 Sep 2024

https://github.com/kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library which implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 04 Aug 2024

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 28 Sep 2024

https://github.com/twosigma/Cook

Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark

cluster gke kubernetes mesos scheduler spark

Last synced: 30 Jul 2024

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 01 Aug 2024

https://github.com/miguno/wirbelsturm

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant

Last synced: 28 Sep 2024

https://github.com/jorgebucaran/spark.fish

▁▂▄▆▇█▇▆▄▂▁

fish fish-plugin spark

Last synced: 01 Aug 2024

https://github.com/alshdavid/crayon-router

Simple framework agnostic UI router for SPAs

react router spark svelte svelte-v3 vue

Last synced: 21 Sep 2024

https://github.com/lightbend/cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.

akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes

Last synced: 26 Sep 2024

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 01 Aug 2024

https://github.com/sderosiaux/every-single-day-i-tldr

A daily digest of the articles or videos I've found interesting, that I want to share with you.

akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch

Last synced: 04 Sep 2024

https://github.com/neo4j/neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

bolt cypher hacktoberfest neo4j-connector neo4j-driver spark

Last synced: 29 Sep 2024

https://github.com/neo4j-contrib/neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

bolt cypher hacktoberfest neo4j-connector neo4j-driver spark

Last synced: 01 Aug 2024

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy-to-use experience to help with creation, editing and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 28 Sep 2024

https://github.com/oap-project/raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.

ray spark

Last synced: 03 Aug 2024

https://github.com/Ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 06 Aug 2024

https://github.com/ibotta/sk-dist

Distributed scikit-learn meta-estimators in PySpark

data-science machine-learning ml scikit-learn spark

Last synced: 29 Sep 2024

https://github.com/azure/azure-event-hubs

☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs

amqp apache azure c dotnet event-hubs eventhub eventhubs go golang java messaging microsoft node node-js nodejs python spark stream streaming

Last synced: 29 Sep 2024

https://github.com/kamu-data/kamu-cli

New generation decentralized data lake and a streaming data pipeline

blockchain data-as-code data-management data-science datafusion flink jupyter kamu open-data open-data-fabric spark sql

Last synced: 30 Sep 2024

https://github.com/hbase-rdd/hbase-rdd

Spark RDD to read, write and delete from HBase

hbase scala spark

Last synced: 28 Sep 2024

https://github.com/xd-deng/spark-practice

Apache Spark (PySpark) Practice on Real Data

pyspark spark

Last synced: 01 Oct 2024

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 07 Aug 2024

https://github.com/piercingdan/spark-jupyter-aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 28 Sep 2024

https://github.com/WeBankFinTech/Visualis

Visualis is a BI tool for data visualization. It provides financial-grade data visualization capabilities on the basis of data security and permissions, based on the open source project Davinci contributed by CreditEase.

appjoint datasource dataspherestudio davinci linkis scriptis spark superset tableau visualization

Last synced: 31 Jul 2024

https://github.com/oap-project/gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.

arrow native-kernels native-sql-engine spark vectorized-simd-optimizations

Last synced: 31 Jul 2024

https://github.com/jelmerk/hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

algorithm java k-nearest-neighbors knn-search pyspark scala spark

Last synced: 28 Sep 2024

https://github.com/mlwhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 26 Sep 2024

https://github.com/MLWhiz/data_science_blogs

A repository to keep track of all the code that I end up writing for my blog posts.

blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost

Last synced: 02 Aug 2024

https://github.com/paypal/gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata

Last synced: 29 Sep 2024

https://github.com/bytedance/CloudShuffleService

Cloud Shuffle Service (CSS) is a general-purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 01 Aug 2024

https://github.com/mGalarnyk/Installations_Mac_Ubuntu_Windows

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

anaconda aws-ec2 ec2-instance python rstudio spark

Last synced: 07 Aug 2024

https://github.com/ondra-m/ruby-spark

Ruby wrapper for Apache Spark

distributed rdd ruby ruby-spark spark

Last synced: 03 Aug 2024

https://github.com/melin/superior-sql-parser

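An ANTLR4-based SQL parser for multiple databases that extracts metadata from SQL. It can be used in several data platform scenarios: extracting metadata from DDL statements, SQL permission checks, table-level lineage, and SQL syntax validation. Supports Spark, Flink, Gauss, StarRocks, Oracle, MySQL, PostgreSQL, SQL Server, DB2, and others.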

flink gauss lineage metadata mysql parser postgres spark sql starrocks

Last synced: 01 Aug 2024

https://github.com/neoremind/kraps-rpc

An RPC framework leveraging the Spark RPC module

rpc spark

Last synced: 28 Sep 2024

https://github.com/huangfox/dpkb

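A collection of big data material, including distributed storage engines, distributed compute engines, and data warehouse construction. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse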

flink hadoop hbase hive presto spark

Last synced: 31 Jul 2024

https://github.com/flyteorg/flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows

Last synced: 28 Sep 2024

https://github.com/WeBankFinTech/WeBank-all-Project

A collection of links to all the projects that WeBank has established or participated in.

ai bigdata blockchain could dpr fate finance frontend linkis spark

Last synced: 31 Jul 2024

https://github.com/JahstreetOrg/spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo

helm history-server jupyter kubernetes livy spark

Last synced: 03 Aug 2024

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, Information Gain selection, Distributed SMOTE, model selection and training, hyperparameter optimization and selection, model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 28 Sep 2024

https://github.com/karakanb/vue-info-card

Simple and beautiful card component with an elegant spark line, for VueJS.

card card-component component info-card spark vue vue-components vuejs vuejs2

Last synced: 27 Sep 2024

https://github.com/syzer/js-spark

Realtime calculation distributed system. AKA distributed lodash

distributed distributed-computing multicore realtime spark

Last synced: 28 Sep 2024

https://github.com/polomarcus/spark-structured-streaming-examples

Spark Structured Streaming / Kafka / Cassandra / Elastic

cassandra kafka spark spark-sql structured-streaming

Last synced: 29 Sep 2024

https://github.com/swoop-inc/spark-alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

data-engineering data-science scala spark

Last synced: 28 Sep 2024

https://github.com/vericast/spylon-kernel

Jupyter kernel for Scala and Spark

jupyter-kernels kernel metakernel scala spark team-platform

Last synced: 01 Aug 2024

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 28 Sep 2024

https://github.com/apple/batch-processing-gateway

The gateway component to make Spark on K8s much easier for Spark users.

batch-processing k8s kubernetes spark

Last synced: 28 Sep 2024

https://github.com/mc2-project/opaque-sql

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 31 Jul 2024

https://github.com/ClickHouse/spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

arrow clickhouse datasourcev2 grpc http spark

Last synced: 02 Aug 2024

https://github.com/benfradet/spark-kafka-writer

Write your Spark data to Kafka seamlessly

kafka spark

Last synced: 28 Sep 2024

https://github.com/capeprivacy/cape-python

Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark

Last synced: 03 Aug 2024

https://github.com/leobenkel/zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 28 Sep 2024

https://github.com/leobenkel/Zparkio

Boilerplate framework to use Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 02 Aug 2024

https://github.com/dsaidgovsg/airflow-pipeline

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 31 Jul 2024

https://github.com/krishnan-r/sparkmonitor

Monitor Apache Spark from Jupyter Notebook

extension jupyter spark

Last synced: 28 Sep 2024

https://github.com/aliyun/aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.

aliyun datasources e-mapreduce hadoop kafka spark

Last synced: 26 Sep 2024

https://github.com/yaooqinn/spark-authorizer

A Spark SQL extension which provides SQL Standard Authorization for Apache Spark | This repo has been contributed to Apache Kyuubi | The project has migrated to Apache Kyuubi

acl hive ranger ranger-hive-plugin spark

Last synced: 01 Oct 2024

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 07 Aug 2024

https://github.com/radanalyticsio/spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 28 Sep 2024

https://github.com/henridf/apache-spark-node

Node.js bindings for Apache Spark DataFrame APIs

data-frame node spark

Last synced: 01 Aug 2024

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 01 Aug 2024

https://github.com/sansa-stack/sansa-stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 28 Sep 2024

https://github.com/absaoss/cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

cobol cobol-parser copybook ebcdic etl mainframe scalable spark

Last synced: 28 Sep 2024

https://github.com/eto-ai/rikai

Parquet-based ML data format optimized for working with unstructured data

deep-learning machine-learning pytorch spark tensorflow

Last synced: 02 Aug 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 28 Sep 2024

https://github.com/easysql/easy_sql

A library developed to ease the data ETL development process.

clickhouse etl postgres postgresql python spark sql

Last synced: 02 Aug 2024

https://github.com/clustering4ever/clustering4ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 30 Sep 2024

https://github.com/Clustering4Ever/Clustering4Ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 04 Aug 2024

https://github.com/memverge/splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

apache-spark bigdata disaggregation elasticity java scala shuffle spark storage

Last synced: 28 Sep 2024

https://github.com/apache/spark-website

Apache Spark Website

big-data java jdbc python r scala spark sql

Last synced: 30 Sep 2024