Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with spark
A curated list of projects in awesome lists tagged with spark.
https://github.com/teeyog/IQL
An ad hoc query service based on the Spark SQL engine.
Last synced: 31 Jul 2024
https://github.com/gacwr/openuba
A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning nodejs react security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 26 Sep 2024
https://github.com/XuefengHuang/RecommendationSystem
Book recommender system using collaborative filtering based on Spark
collaborative-filtering python-flask recommendation-system spark
Last synced: 31 Jul 2024
https://github.com/apache/incubator-uniffle
Uniffle is a high performance, general purpose Remote Shuffle Service.
mapreduce remote-shuffle-service rss shuffle spark tez
Last synced: 01 Aug 2024
https://github.com/groupon/sparklint
A tool for monitoring and tuning Spark jobs for efficiency.
performance-analysis scala spark
Last synced: 26 Sep 2024
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark
Last synced: 30 Sep 2024
https://github.com/googleclouddataproc/spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
bigquery bigquery-storage-api google-bigquery google-cloud google-cloud-dataproc spark
Last synced: 28 Sep 2024
https://github.com/kanyun-inc/ytk-learn
Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark
Last synced: 04 Aug 2024
https://github.com/kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
cheat cheatsheet cheatsheets data data-science docs documentation guide guides pyspark pyspark-tutorial quickstart reference references spark spark-sql
Last synced: 31 Jul 2024
https://github.com/datamechanics/delight
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui
Last synced: 28 Sep 2024
https://github.com/twosigma/Cook
Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark
cluster gke kubernetes mesos scheduler spark
Last synced: 30 Jul 2024
https://github.com/elasticluster/elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark
Last synced: 01 Aug 2024
https://github.com/miguno/wirbelsturm
[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant
Last synced: 28 Sep 2024
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 28 Sep 2024
https://github.com/lightbend/cloudflow
Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.
akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes
Last synced: 26 Sep 2024
https://github.com/sderosiaux/every-single-day-i-tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch
Last synced: 04 Sep 2024
https://github.com/neo4j/neo4j-spark-connector
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
bolt cypher hacktoberfest neo4j-connector neo4j-driver spark
Last synced: 29 Sep 2024
https://github.com/neo4j-contrib/neo4j-spark-connector
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
bolt cypher hacktoberfest neo4j-connector neo4j-driver spark
Last synced: 01 Aug 2024
https://github.com/microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data
Last synced: 28 Sep 2024
https://github.com/oap-project/raydp
RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Last synced: 03 Aug 2024
https://github.com/Ibotta/sk-dist
Distributed scikit-learn meta-estimators in PySpark
data-science machine-learning ml scikit-learn spark
Last synced: 06 Aug 2024
https://github.com/ibotta/sk-dist
Distributed scikit-learn meta-estimators in PySpark
data-science machine-learning ml scikit-learn spark
Last synced: 29 Sep 2024
https://github.com/zero-one-group/geni
A Clojure dataframe library that runs on Spark
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark
Last synced: 31 Jul 2024
https://github.com/kamu-data/kamu-cli
New generation decentralized data lake and a streaming data pipeline
blockchain data-as-code data-management data-science datafusion flink jupyter kamu open-data open-data-fabric spark sql
Last synced: 30 Sep 2024
https://github.com/hbase-rdd/hbase-rdd
Spark RDD to read, write and delete from HBase
Last synced: 28 Sep 2024
https://github.com/Hydrospheredata/hydro-serving
MLOps Platform
machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow
Last synced: 31 Jul 2024
https://github.com/DTStack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 01 Aug 2024
https://github.com/dtstack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 28 Sep 2024
https://github.com/xd-deng/spark-practice
Apache Spark (PySpark) Practice on Real Data
Last synced: 01 Oct 2024
https://github.com/PiercingDan/spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 07 Aug 2024
https://github.com/piercingdan/spark-jupyter-aws
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 28 Sep 2024
https://github.com/WeBankFinTech/Visualis
Visualis is a BI tool for data visualization. It provides financial-grade data visualization capabilities on the basis of data security and permissions, based on the open source project Davinci contributed by CreditEase.
appjoint datasource dataspherestudio davinci linkis scriptis spark superset tableau visualization
Last synced: 31 Jul 2024
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 28 Sep 2024
https://github.com/oap-project/gazelle_plugin
Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
arrow native-kernels native-sql-engine spark vectorized-simd-optimizations
Last synced: 31 Jul 2024
https://github.com/jelmerk/hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
algorithm java k-nearest-neighbors knn-search pyspark scala spark
Last synced: 28 Sep 2024
https://github.com/mlwhiz/data_science_blogs
A repository to keep track of all the code that I end up writing for my blog posts.
blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost
Last synced: 26 Sep 2024
https://github.com/MLWhiz/data_science_blogs
A repository to keep track of all the code that I end up writing for my blog posts.
blogging chatbot data datascience gan graphs machine-learning mcmc python spark streamlit time-series xgboost
Last synced: 02 Aug 2024
https://github.com/FirelyTeam/spark
Firely and Incendi's open source FHIR server
c-sharp docker dstu2 fhir fhir-api fhir-server fhir-spec fhir-specification r4 spark spark-fhir-server stu3
Last synced: 31 Jul 2024
https://github.com/paypal/gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata
Last synced: 29 Sep 2024
https://github.com/locationtech/rasterframes
Geospatial Raster support for Spark DataFrames
earth-observation geotrellis image-processing machine-learning scala spark spark-ml sparksql
Last synced: 31 Jul 2024
https://github.com/bytedance/CloudShuffleService
Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Last synced: 01 Aug 2024
https://github.com/mGalarnyk/Installations_Mac_Ubuntu_Windows
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
anaconda aws-ec2 ec2-instance python rstudio spark
Last synced: 07 Aug 2024
https://github.com/azure/azure-event-hubs-spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
apache apache-spark azure bigdata connector continuous databricks event-hubs eventhubs ingestion kafka microsoft real-time scala spark spark-streaming stream streaming structured-streaming
Last synced: 28 Sep 2024
https://github.com/ondra-m/ruby-spark
Ruby wrapper for Apache Spark
distributed rdd ruby ruby-spark spark
Last synced: 03 Aug 2024
https://github.com/neoremind/kraps-rpc
An RPC framework leveraging the Spark RPC module
Last synced: 28 Sep 2024
https://github.com/dylan-profiler/visions
Type System for Data Analysis in Python
data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system
Last synced: 28 Sep 2024
https://github.com/azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 28 Sep 2024
https://github.com/flyteorg/flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows
Last synced: 28 Sep 2024
https://github.com/JahstreetOrg/spark-on-kubernetes-helm
Spark on Kubernetes infrastructure Helm charts repo
helm history-server jupyter kubernetes livy spark
Last synced: 03 Aug 2024
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 03 Aug 2024
https://github.com/zio/zio-protoquill
Quill for Scala 3
cassandra jdbc language-integrated-query linq postgresql scala spark sparksql sql
Last synced: 26 Sep 2024
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 28 Sep 2024
https://github.com/karakanb/vue-info-card
Simple and beautiful card component with an elegant spark line, for VueJS.
card card-component component info-card spark vue vue-components vuejs vuejs2
Last synced: 27 Sep 2024
https://github.com/syzer/js-spark
Realtime calculation distributed system. AKA distributed lodash
distributed distributed-computing multicore realtime spark
Last synced: 28 Sep 2024
https://github.com/polomarcus/spark-structured-streaming-examples
Spark Structured Streaming / Kafka / Cassandra / Elastic
cassandra kafka spark spark-sql structured-streaming
Last synced: 29 Sep 2024
https://github.com/swoop-inc/spark-alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
data-engineering data-science scala spark
Last synced: 28 Sep 2024
https://github.com/vericast/spylon-kernel
Jupyter kernel for Scala and Spark
jupyter-kernels kernel metakernel scala spark team-platform
Last synced: 01 Aug 2024
https://github.com/nareshk1290/udacity-data-engineering
Udacity Data Engineering Nano Degree (DEND)
airflow aws cassandra etl postgresql redshift s3 spark star-schema udacity-dend
Last synced: 29 Sep 2024
https://github.com/apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark
Last synced: 29 Sep 2024
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 28 Sep 2024
https://github.com/locationtech-labs/geopyspark
GeoTrellis for PySpark
big-data geospatial geotrellis python spark tile-server
Last synced: 07 Aug 2024
https://github.com/apple/batch-processing-gateway
The gateway component to make Spark on K8s much easier for Spark users.
batch-processing k8s kubernetes spark
Last synced: 28 Sep 2024
https://github.com/setl-framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 28 Sep 2024
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 01 Aug 2024
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 28 Sep 2024
https://github.com/mc2-project/opaque-sql
An encrypted data analytics platform
analytics enclave machine-learning privacy security spark spark-sql
Last synced: 31 Jul 2024
https://github.com/ClickHouse/spark-clickhouse-connector
Spark ClickHouse Connector built on the DataSourceV2 API
arrow clickhouse datasourcev2 grpc http spark
Last synced: 02 Aug 2024
https://github.com/benfradet/spark-kafka-writer
Write your Spark data to Kafka seamlessly
Last synced: 28 Sep 2024
https://github.com/capeprivacy/cape-python
Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.
collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark
Last synced: 03 Aug 2024
https://github.com/leobenkel/zparkio
Boilerplate framework to use Spark and ZIO together.
boiler-plate functional-programming helpers scala spark template zio
Last synced: 28 Sep 2024
https://github.com/leobenkel/Zparkio
Boilerplate framework to use Spark and ZIO together.
boiler-plate functional-programming helpers scala spark template zio
Last synced: 02 Aug 2024
https://github.com/dsaidgovsg/airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Last synced: 31 Jul 2024
https://github.com/krishnan-r/sparkmonitor
Monitor Apache Spark from Jupyter Notebook
Last synced: 28 Sep 2024
https://github.com/aliyun/aliyun-emapreduce-datasources
Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
aliyun datasources e-mapreduce hadoop kafka spark
Last synced: 26 Sep 2024
https://github.com/yaooqinn/spark-authorizer
A Spark SQL extension which provides SQL Standard Authorization for Apache Spark | This repo is contributed to Apache Kyuubi | The project has been migrated to Apache Kyuubi
acl hive ranger ranger-hive-plugin spark
Last synced: 01 Oct 2024
https://github.com/unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
data-engineer data-pipeline data-science luigi machine-learning rest-api spark
Last synced: 07 Aug 2024
https://github.com/baghelamit/iot-traffic-monitor
cassandra java kafka spark spring-boot
Last synced: 29 Sep 2024
https://github.com/radanalyticsio/spark-operator
Operator for managing the Spark clusters on Kubernetes and OpenShift.
apache-spark kubernetes kubernetes-operator openshift spark
Last synced: 28 Sep 2024
https://github.com/qubole/spark-on-lambda
Apache Spark on AWS Lambda
apache-spark aws aws-cloud aws-lambda big-data lambda serverless spark
Last synced: 28 Sep 2024
https://github.com/henridf/apache-spark-node
Node.js bindings for Apache Spark DataFrame APIs
Last synced: 01 Aug 2024
https://github.com/helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark internet-archive spark spark-framework warc web-archiving webarchive
Last synced: 01 Aug 2024
https://github.com/sansa-stack/sansa-stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 28 Sep 2024
https://github.com/absaoss/cobrix
A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
cobol cobol-parser copybook ebcdic etl mainframe scalable spark
Last synced: 28 Sep 2024
https://github.com/eto-ai/rikai
Parquet-based ML data format optimized for working with unstructured data
deep-learning machine-learning pytorch spark tensorflow
Last synced: 02 Aug 2024
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 28 Sep 2024
https://github.com/easysql/easy_sql
A library developed to ease the data ETL development process.
clickhouse etl postgres postgresql python spark sql
Last synced: 02 Aug 2024
https://github.com/clustering4ever/clustering4ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 30 Sep 2024
https://github.com/Clustering4Ever/Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 04 Aug 2024
https://github.com/memverge/splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
apache-spark bigdata disaggregation elasticity java scala shuffle spark storage
Last synced: 28 Sep 2024
https://github.com/Qihoo360/XLearning-XDML
Extremely distributed machine learning
ai distributed hadoop hazelcast kudu machine-learning parameter-server spark
Last synced: 31 Jul 2024