Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/feathr-ai/feathr
Feathr – A scalable, unified data and AI engineering platform for enterprise
apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops
Last synced: 29 Jun 2024
https://github.com/intel-analytics/BigDL-2.x
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala
Last synced: 26 Jun 2024
https://github.com/mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
apache-hadoop apache-spark data-algorithms design-patterns distributed-algorithms distributed-computing hadoop-mapreduce java machine-learning mappers mapreduce partitioning pyspark python reducers scala
Last synced: 26 Jun 2024
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 26 Jun 2024
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for Feature clean-up, Feature Importance calculation suite, Information Gain selection, Distributed SMOTE, Model selection and training, Hyperparameter optimization and selection, Model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 24 Jun 2024
https://github.com/1duo/awesome-ai-infrastructures
Infrastructures™ for Machine Learning Training/Inference in Production.
apache-arrow apache-mesos apache-spark artificial-intelligence awesome-list deep-learning deep-learning-framework federated-learning knowledge-distillation kubernetes machine-learning machine-learning-systems model-compression pruning quantization
Last synced: 20 Jun 2024
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 19 Jun 2024
https://github.com/Chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 17 Jun 2024
https://github.com/japila-books/apache-spark-internals
The Internals of Apache Spark
apache-spark book internals spark
Last synced: 13 Jun 2024
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 13 Jun 2024
https://github.com/jhole89/aws-glue-sbt-quickstart
Example of how to set SBT up for local development of AWS Glue Scripts
apache-spark aws-glue quickstart sbt
Last synced: 10 Jun 2024
https://github.com/RBC-DSAI-IITM/DCEIL
A fast, scalable and distributed community detection algorithm based on CEIL scoring function.
apache-hadoop apache-spark community-detection
Last synced: 09 Jun 2024
https://github.com/thangdnsf/BigCLAM-ApacheSpark
Overlapping community detection in large-scale networks using the BigCLAM model, built on Apache Spark
apache-spark bigclam bigclam-model community-detection graph-mining graphx large-scale latex machine-learning scala scale-networks spark
Last synced: 09 Jun 2024
https://github.com/hortonworks-spark/spark-atlas-connector
A Spark Atlas connector to track data lineage in Apache Atlas
Last synced: 07 Jun 2024
https://github.com/japila-books/spark-sql-internals
The Internals of Spark SQL
apache-spark book internals mkdocs-material spark spark-sql
Last synced: 07 Jun 2024
https://github.com/lw-lin/streaming-readings
Papers and reading materials on streaming systems
apache-spark dataflow drizzle flink heron millwheel s4 spark-streaming spe storm stream-processing stream-processing-engine streaming streaming-engine
Last synced: 07 Jun 2024
https://github.com/jaceklaskowski/spark-kubernetes-book
The Internals of Spark on Kubernetes
apache-spark book internals kubernetes spark
Last synced: 07 Jun 2024
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 07 Jun 2024
https://github.com/Nosto/spartann
Hyper performant kNN using Annoy for Apache Spark.
ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark
Last synced: 07 Jun 2024
https://github.com/abhirockzz/cosmosdb-synapse-workshop
Near Real Time Analytics with Azure Synapse Link for Azure Cosmos DB
apache-spark azure-cosmos-db azure-synapse-analytics mongodb pyspark python
Last synced: 04 Jun 2024
https://github.com/big-data-europe/docker-spark
Apache Spark docker image
apache-spark docker k8s-spark kubernetes spark-kubernetes
Last synced: 01 Jun 2024
https://github.com/intel-analytics/analytics-zoo
Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala
Last synced: 01 Jun 2024
https://github.com/MingChen0919/learning-apache-spark
Notes on Apache Spark (pyspark)
apache-spark machine-learning pyspark-tutorial
Last synced: 31 May 2024
https://github.com/lw-lin/CoolplaySpark
Coolplay Spark: Spark source code analysis, Spark libraries, and more
apache-spark spark spark-streaming sparkcore structured-streaming
Last synced: 26 May 2024
https://github.com/radanalyticsio/spark-operator
Operator for managing Spark clusters on Kubernetes and OpenShift.
apache-spark kubernetes kubernetes-operator openshift spark
Last synced: 22 May 2024
https://github.com/Hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 16 May 2024
https://github.com/mikeroyal/Apache-Spark-Guide
Apache Spark Guide
apache-spark awesome awesome-automations awesome-list big-data data-engineering data-engineering-pipeline data-science machine-learning pyspark spark spark-streaming
Last synced: 14 May 2024
https://github.com/harryprince/awesome-sparklyr
An awesome collection of sparklyr-related packages
apache-spark awesome big-data dbi machine-learning r r-stats spark-sql sparklyr
Last synced: 14 May 2024
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 13 May 2024
https://github.com/SANSA-Stack/SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 13 May 2024
https://github.com/datamechanics/delight
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui
Last synced: 11 May 2024
https://github.com/infoslack/awesome-kafka
A list about Apache Kafka
apache-kafka apache-spark data-pipeline data-processing infrastructure kafka kafka-streams stream-processing streaming-data
Last synced: 07 May 2024
https://github.com/archivesunleashed/twut
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
apache-spark spark spark-packages tweets twitter-data twitter-json
Last synced: 07 May 2024
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 07 May 2024
https://github.com/Microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 05 May 2024
https://github.com/awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
apache-spark awesome pyspark sparkr
Last synced: 05 May 2024
https://github.com/mlflow/mlflow
Open source platform for the machine learning lifecycle
ai apache-spark machine-learning ml mlflow model-management
Last synced: 05 May 2024
https://github.com/OryxProject/oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx
Last synced: 02 May 2024
https://github.com/spark-notebook/spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
apache-spark data-science notebook reactive scala spark
Last synced: 30 Apr 2024
https://github.com/zero323/pyspark-stubs
Apache (Py)Spark type annotations (stub files).
apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations
Last synced: 28 Apr 2024
https://github.com/svenkreiss/pysparkling
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
apache-spark data-processing data-science python
Last synced: 28 Apr 2024
https://github.com/liquidSVM/liquidSVM
Support vector machines (SVMs) and related kernel-based learning algorithms are a well-known class of machine learning algorithms, for non-parametric classification and regression. liquidSVM is an implementation of SVMs whose key features are: fully integrated hyper-parameter selection, extreme speed on both small and large data sets, full flexibility for experts, and inclusion of a variety of different learning scenarios: multi-class classification, ROC, and Neyman-Pearson learning, and least-squares, quantile, and expectile regression.
apache-spark c-plus-plus classification expectile-regression machine-learning matlab ml octave python quantile-regression r r-package regression rstats svm
Last synced: 28 Apr 2024
https://github.com/lensacom/sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
apache-spark distributed-computing machine-learning python scikit-learn
Last synced: 28 Apr 2024
https://github.com/lifeomic/sparkflow
Easy-to-use library for bringing TensorFlow to Apache Spark
apache-spark dataframe deep-learning lifeomic pipeline spark-ml tensorflow
Last synced: 27 Apr 2024
https://github.com/opencypher/morpheus
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
apache-spark apache2 big-data cypher graph scala
Last synced: 27 Apr 2024
https://github.com/rstudio/sparkxgb
R interface for XGBoost on Spark
apache-spark machine-learning r rstats spark xgboost
Last synced: 25 Apr 2024
https://github.com/dmmiller612/sparktorch
Train and run PyTorch models on Apache Spark.
apache-spark deep-learning distributed-computing inference pipelines pytorch sparktorch
Last synced: 19 Apr 2024
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 19 Apr 2024
https://github.com/ognis1205/spark-tda
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
apache-spark machine-learning ml mllib spark tda topological-data-analysis
Last synced: 19 Apr 2024
https://github.com/PiercingDan/spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 17 Apr 2024
https://github.com/rjurney/Agile_Data_Code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant
Last synced: 17 Apr 2024
https://github.com/microsoft/SynapseML
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 17 Apr 2024
https://github.com/sumitarora/awesome-spark
Apache Spark Awesome List
apache-spark spark spark-fundamentals spark-resources
Last synced: 11 Apr 2024
https://github.com/streamnative/awesome-pulsar
A curated list of Pulsar tools, integrations and resources.
apache-bookkeeper apache-flink apache-kafka apache-pulsar apache-spark apache-storm elastic-beats grafana-dashboard messaging prometheus pub-sub spark spark-sql spark-structured-streaming
Last synced: 11 Apr 2024
https://github.com/awesome-spark/spark-gotchas
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
apache-spark book guide pyspark
Last synced: 11 Apr 2024
https://github.com/nchammas/flintrock
A command-line tool for launching Apache Spark clusters.
apache-spark apache-spark-cluster ec2 orchestration spark-ec2
Last synced: 11 Apr 2024
https://github.com/cerndb/dist-keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow
Last synced: 11 Apr 2024
https://github.com/databricks/spark-sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
apache-spark grid-search machine-learning parameter-tuning scikit-learn
Last synced: 11 Apr 2024
https://github.com/mrpowers/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Last synced: 11 Apr 2024
https://github.com/sparklyr/sparklyr
R interface for Apache Spark
apache-spark distributed dplyr ide livy machine-learning r remote-clusters rstats spark sparklyr
Last synced: 11 Apr 2024
https://github.com/dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch
Last synced: 11 Apr 2024
https://github.com/rstudio/sparktf
R interface to Spark TensorFlow Connector
apache-spark keras r rstats sparklyr sparklyr-extension tensorflow
Last synced: 02 Apr 2024
https://github.com/BitwiseInc/Hydrograph
A visual ETL development and debugging tool for big data
apache-spark big-data cascading etl etl-framework
Last synced: 01 Apr 2024
https://github.com/LucaCanali/sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
apache-spark performance-metrics performance-troubleshooting python scala spark
Last synced: 31 Mar 2024
https://github.com/apache-spark-on-k8s/spark
Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
apache-spark kubernetes kubernetes-cluster
Last synced: 28 Mar 2024
https://github.com/IBMStreams/streamsx.kafka
Repository for integration with Apache Kafka
apache-spark ibm-streams kafka messaging stream-processing toolkit
Last synced: 26 Mar 2024
https://github.com/gtkcyber/griffon-vm
Griffon Data Science Virtual Machine
apache-drill apache-spark big-data data-science database elasticsearch hadoop jupyter-notebook mysql node-js python r ruby scala virtual-machine
Last synced: 26 Mar 2024
https://github.com/openscoring/openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
apache-spark api lightgbm pmml r real-time scikit-learn xgboost
Last synced: 23 Mar 2024
https://github.com/harryprince/geospark
Bring sf to Spark in production
apache-spark gis large-scale-spatial-analysis r spark-sql sparklyr-extension spatial-analysis spatial-queries
Last synced: 20 Mar 2024
https://github.com/itsjafer/jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark
Last synced: 18 Mar 2024
https://github.com/aloneguid/parquet-dotnet
Fully managed Apache Parquet implementation
apache-parquet apache-spark dotnet dotnet-core dotnet-standard ios linux windows xamarin xbox
Last synced: 15 Mar 2024
https://github.com/Azure/mmlspark
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 13 Mar 2024