Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/intel-analytics/BigDL-2.x

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala

Last synced: 26 Jun 2024

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML: feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 24 Jun 2024

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 17 Jun 2024

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 13 Jun 2024

https://github.com/jhole89/aws-glue-sbt-quickstart

Example of how to set SBT up for local development of AWS Glue Scripts

apache-spark aws-glue quickstart sbt

Last synced: 10 Jun 2024

https://github.com/RBC-DSAI-IITM/DCEIL

A fast, scalable and distributed community detection algorithm based on the CEIL scoring function.

apache-hadoop apache-spark community-detection

Last synced: 09 Jun 2024

https://github.com/thangdnsf/BigCLAM-ApacheSpark

Overlapping community detection in large-scale networks using the BigCLAM model, built on Apache Spark

apache-spark bigclam bigclam-model community-detection graph-mining graphx large-scale latex machine-learning scala scale-networks spark

Last synced: 09 Jun 2024

https://github.com/hortonworks-spark/spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas

apache-atlas apache-spark

Last synced: 07 Jun 2024

https://github.com/jaceklaskowski/spark-kubernetes-book

The Internals of Spark on Kubernetes

apache-spark book internals kubernetes spark

Last synced: 07 Jun 2024

https://github.com/Nosto/spartann

Hyper-performant kNN using Annoy for Apache Spark.

ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark

Last synced: 07 Jun 2024

https://github.com/abhirockzz/cosmosdb-synapse-workshop

Near Real Time Analytics with Azure Synapse Link for Azure Cosmos DB

apache-spark azure-cosmos-db azure-synapse-analytics mongodb pyspark python

Last synced: 04 Jun 2024

https://github.com/intel-analytics/analytics-zoo

Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala

Last synced: 01 Jun 2024

https://github.com/lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 26 May 2024

https://github.com/radanalyticsio/spark-operator

Operator for managing Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 22 May 2024

https://github.com/Hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 16 May 2024

https://github.com/harryprince/awesome-sparklyr

An awesome collection of sparklyr-related packages

apache-spark awesome big-data dbi machine-learning r r-stats spark-sql sparklyr

Last synced: 14 May 2024

https://github.com/SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 13 May 2024

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 11 May 2024

https://github.com/archivesunleashed/twut

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

apache-spark spark spark-packages tweets twitter-data twitter-json

Last synced: 07 May 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 07 May 2024

https://github.com/awesome-spark/awesome-spark

A curated list of awesome Apache Spark packages and resources.

apache-spark awesome pyspark sparkr

Last synced: 05 May 2024

https://github.com/mlflow/mlflow

Open source platform for the machine learning lifecycle

ai apache-spark machine-learning ml mlflow model-management

Last synced: 05 May 2024

https://github.com/OryxProject/oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx

Last synced: 02 May 2024

https://github.com/spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 30 Apr 2024

https://github.com/zero323/pyspark-stubs

Apache (Py)Spark type annotations (stub files).

apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations

Last synced: 28 Apr 2024

https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

apache-spark data-processing data-science python

Last synced: 28 Apr 2024

https://github.com/liquidSVM/liquidSVM

Support vector machines (SVMs) and related kernel-based learning algorithms are a well-known class of machine learning algorithms, for non-parametric classification and regression. liquidSVM is an implementation of SVMs whose key features are: fully integrated hyper-parameter selection, extreme speed on both small and large data sets, full flexibility for experts, and inclusion of a variety of different learning scenarios: multi-class classification, ROC, and Neyman-Pearson learning, and least-squares, quantile, and expectile regression.

apache-spark c-plus-plus classification expectile-regression machine-learning matlab ml octave python quantile-regression r r-package regression rstats svm

Last synced: 28 Apr 2024

https://github.com/lifeomic/sparkflow

Easy-to-use library that brings TensorFlow to Apache Spark

apache-spark dataframe deep-learning lifeomic pipeline spark-ml tensorflow

Last synced: 27 Apr 2024

https://github.com/opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

apache-spark apache2 big-data cypher graph scala

Last synced: 27 Apr 2024

https://github.com/rstudio/sparkxgb

R interface for XGBoost on Spark

apache-spark machine-learning r rstats spark xgboost

Last synced: 25 Apr 2024

https://github.com/ognis1205/spark-tda

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.

apache-spark machine-learning ml mllib spark tda topological-data-analysis

Last synced: 19 Apr 2024

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 17 Apr 2024

https://github.com/awesome-spark/spark-gotchas

Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

apache-spark book guide pyspark

Last synced: 11 Apr 2024

https://github.com/nchammas/flintrock

A command-line tool for launching Apache Spark clusters.

apache-spark apache-spark-cluster ec2 orchestration spark-ec2

Last synced: 11 Apr 2024

https://github.com/cerndb/dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow

Last synced: 11 Apr 2024

https://github.com/databricks/spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

apache-spark grid-search machine-learning parameter-tuning scikit-learn

Last synced: 11 Apr 2024

https://github.com/mrpowers/quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Last synced: 11 Apr 2024

https://github.com/rstudio/sparktf

R interface to Spark TensorFlow Connector

apache-spark keras r rstats sparklyr sparklyr-extension tensorflow

Last synced: 02 Apr 2024

https://github.com/BitwiseInc/Hydrograph

A visual ETL development and debugging tool for big data

apache-spark big-data cascading etl etl-framework

Last synced: 01 Apr 2024

https://github.com/LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 31 Mar 2024

https://github.com/apache-spark-on-k8s/spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end is now on https://github.com/apache/spark/

apache-spark kubernetes kubernetes-cluster

Last synced: 28 Mar 2024

https://github.com/IBMStreams/streamsx.kafka

Repository for integration with Apache Kafka

apache-spark ibm-streams kafka messaging stream-processing toolkit

Last synced: 26 Mar 2024

https://github.com/openscoring/openscoring

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

apache-spark api lightgbm pmml r real-time scikit-learn xgboost

Last synced: 23 Mar 2024

https://github.com/itsjafer/jupyterlab-sparkmonitor

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark

Last synced: 18 Mar 2024