
An open API service indexing awesome lists of open source software.

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

deep-learning machine-learning parquet parquet-files pyarrow pyspark pytorch sysml tensorflow

Last synced: 29 Jun 2024

Tumult Core is a collection of composable components for implementing algorithms to perform differentially private computations.

differential-privacy privacy pyspark

Last synced: 28 Jun 2024

Twitter Spark Streaming using PySpark

apache pyspark spark-streaming tweepy-api twitter

Last synced: 27 Jun 2024

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 27 Jun 2024

Toolkit for Apache Spark ML covering feature clean-up, a feature importance calculation suite, Information Gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 24 Jun 2024

Hierarchical record linkage at scale

machine-learning pyspark python record-linkage

Last synced: 22 Jun 2024

An ETL application on AWS that works with general open sales and customer data, distributed as a zipped file of .csv files to which transformations are applied.

airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark

Last synced: 16 Jun 2024

PySpark test helper methods with beautiful error messages

pyspark testing

Last synced: 16 Jun 2024

Delta Lake helper methods in PySpark

deltalake pyspark

Last synced: 16 Jun 2024

Near Real Time Analytics with Azure Synapse Link for Azure Cosmos DB

apache-spark azure-cosmos-db azure-synapse-analytics mongodb pyspark python

Last synced: 04 Jun 2024

Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we bring third-party data into data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.

admin-boundaries data data-integration data-science dbt elt google-trends jupyter kuwala no-code open-data open-source population postgres pyspark python react react-flow scraping spatial-analysis

Last synced: 02 Jun 2024

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark.

distributed-computing estimate mapreduce percentile pyspark python quantile

Last synced: 31 May 2024

This is a repo documenting the best practices in PySpark.

best-practices pyspark

Last synced: 31 May 2024

Local Development of AWS Glue with Docker and Visual Studio Code

aws docker glue pyspark visual-studio-code vscode-extension

Last synced: 27 May 2024

Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 26 May 2024

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

application-manager context-service engine hive hive-table impala jdbc jobserver linkis livy presto pyspark resource-manager rest-api scriptis spark sql storage thrift-server udf

Last synced: 16 May 2024

Sparkling Water provides H2O functionality inside a Spark cluster

big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark

Last synced: 13 May 2024

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 07 May 2024

A curated list of awesome Apache Spark packages and resources.

apache-spark awesome pyspark sparkr

Last synced: 05 May 2024

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, and resource management, and intelligent diagnosis.

errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin

Last synced: 02 May 2024

Apache (Py)Spark type annotations (stub files).

apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations

Last synced: 28 Apr 2024

Open Source Contributor Index

analytics azure-functions open-source pyspark python

Last synced: 26 Apr 2024

Gathers Python deployment, infrastructure and practices.

airflow docker docker-compose kafka pyflink pyspark python tensorflow

Last synced: 15 Apr 2024

A comprehensive Spark guide, collated from multiple sources, that can be used to learn more about Spark or as an interview refresher.

big-data pyspark spark

Last synced: 15 Apr 2024

BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics

anomaly-detection aws-emr-clusters aws-rds bert bert-model emr-cluster googleassistant googlespeech hiveql keras-tensorflow mathe mathematica pyspark raspberry-pi-3 sql-server tensorflow wolfram-mathematica

Last synced: 13 Apr 2024

Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

apache-spark book guide pyspark

Last synced: 11 Apr 2024

Helpers & syntactic sugar for PySpark.

pyspark python spark

Last synced: 11 Apr 2024

PySpark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Last synced: 11 Apr 2024

Jupyter magics and kernels for working with remote Spark clusters

cluster jupyter jupyter-notebook kerberos kernel livy magic notebook pandas-dataframe pyspark spark sql-query

Last synced: 11 Apr 2024

Pandas and Spark DataFrame comparison for humans and more!

compare dask data data-science dataframes fugue numpy pandas polars pyspark python spark

Last synced: 31 Mar 2024

A fast, low-memory version of PointHop and PointHop++, built on Apache Spark.

3d 3d-classification classification feature-extraction knn least-square-regression pca point-cloud pyspark python spark

Last synced: 26 Mar 2024

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.

emr etl hadoop informatica odi pipeline pyspark spark

Last synced: 23 Mar 2024

Type-annotate your Spark DataFrames and validate them

pyspark python spark typing

Last synced: 19 Mar 2024

JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook

apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark

Last synced: 18 Mar 2024

Hadoop-Hive-Spark cluster + Jupyter on Docker

docker hadoop hive jupyter jupyter-notebook pyspark spark

Last synced: 14 Mar 2024