Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
deep-learning machine-learning parquet parquet-files pyarrow pyspark pytorch sysml tensorflow
Last synced: 29 Jun 2024
https://github.com/quintoandar/butterfree
A tool for building feature stores.
data-engineering data-science etl etl-framework feature-store package pyspark python
Last synced: 29 Jun 2024
https://gitlab.com/tumult-labs/core
Tumult Core is a collection of composable components for implementing algorithms to perform differentially private computations.
differential-privacy privacy pyspark
Last synced: 28 Jun 2024
https://github.com/prakhar21/spark-streaming
Twitter Spark Streaming using PySpark
apache pyspark spark-streaming tweepy-api twitter
Last synced: 27 Jun 2024
https://github.com/HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 27 Jun 2024
https://github.com/mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
apache-hadoop apache-spark data-algorithms design-patterns distributed-algorithms distributed-computing hadoop-mapreduce java machine-learning mappers mapreduce partitioning pyspark python reducers scala
Last synced: 26 Jun 2024
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 24 Jun 2024
https://github.com/ipums/hlink
Hierarchical record linkage at scale
machine-learning pyspark python record-linkage
Last synced: 22 Jun 2024
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 19 Jun 2024
https://github.com/camposvinicius/aws-etl
An ETL application on AWS that uses open sales and customer data, available at https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. The zipped file contains several CSV files to which the transformations are applied.
airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark
Last synced: 16 Jun 2024
https://github.com/MrPowers/chispa
PySpark test helper methods with beautiful error messages
Last synced: 16 Jun 2024
https://github.com/ibis-project/ibis
the portable Python dataframe library
bigquery clickhouse dask database datafusion duckdb impala mssql mysql pandas polars postgresql pyarrow pyspark python snowflake sql sqlalchemy sqlite trino
Last synced: 05 Jun 2024
https://github.com/abhirockzz/cosmosdb-synapse-workshop
Near Real Time Analytics with Azure Synapse Link for Azure Cosmos DB
apache-spark azure-cosmos-db azure-synapse-analytics mongodb pyspark python
Last synced: 04 Jun 2024
https://github.com/kuwala-io/kuwala
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.
admin-boundaries data data-integration data-science dbt elt google-trends jupyter kuwala no-code open-data open-source population postgres pyspark python react react-flow scraping spatial-analysis
Last synced: 02 Jun 2024
https://github.com/CamDavidsonPilon/tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark.
distributed-computing estimate mapreduce percentile pyspark python quantile
Last synced: 31 May 2024
https://github.com/ericxiao251/spark-syntax
This is a repo documenting the best practices in PySpark.
Last synced: 31 May 2024
https://github.com/cevoaustralia/glue-vscode
Local Development of AWS Glue with Docker and Visual Studio Code
aws docker glue pyspark visual-studio-code vscode-extension
Last synced: 27 May 2024
https://github.com/alanchn31/Movalytics-Data-Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
airflow analytics aws-redshift aws-s3 data-engineer-nanodegree data-engineering data-engineering-pipeline data-modelling data-warehouse-cloud docker movie-database movie-recommendation movie-reviews pyspark python3 redshift spark sql udacity
Last synced: 27 May 2024
https://github.com/jadianes/spark-py-notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark
Last synced: 26 May 2024
https://github.com/apache/linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
application-manager context-service engine hive hive-table impala jdbc jobserver linkis livy presto pyspark resource-manager rest-api scriptis spark sql storage thrift-server udf
Last synced: 16 May 2024
https://github.com/mikeroyal/Apache-Spark-Guide
Apache Spark Guide
apache-spark awesome awesome-automations awesome-list big-data data-engineering data-engineering-pipeline data-science machine-learning pyspark spark spark-streaming
Last synced: 14 May 2024
https://github.com/h2oai/sparkling-water
Sparkling Water provides H2O functionality inside a Spark cluster
big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark
Last synced: 13 May 2024
https://github.com/logicalclocks/hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
aws azure data-science feature-engineering feature-management feature-store gcp governance hopsworks kserve machine-learning ml mlops model-serving pyspark python serverless
Last synced: 11 May 2024
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 07 May 2024
https://github.com/awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
apache-spark awesome pyspark sparkr
Last synced: 05 May 2024
https://github.com/WeBankFinTech/Scriptis
Scriptis is for interactive data analysis, with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management, and intelligent diagnosis.
errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin
Last synced: 02 May 2024
https://github.com/JohnSnowLabs/spark-nlp
State of the Art Natural Language Processing
albert bert entity-extraction language-detection language-model lemmatizer llm machine-translation named-entity-recognition natural-language-processing nlp part-of-speech-tagger pyspark question-answering sentiment-analysis spark spell-checker tensorflow text-classification transformers
Last synced: 30 Apr 2024
https://github.com/zero323/pyspark-stubs
Apache (Py)Spark type annotations (stub files).
apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations
Last synced: 28 Apr 2024
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 28 Apr 2024
https://github.com/epam/OSCI
Open Source Contributor Index
analytics azure-functions open-source pyspark python
Last synced: 26 Apr 2024
https://github.com/sberbank-ai-lab/RePlay
RecSys Library
machine-learning pyspark pytorch recommender-systems recsys
Last synced: 22 Apr 2024
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 19 Apr 2024
https://github.com/microsoft/SynapseML
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 17 Apr 2024
https://github.com/huseinzol05/Gather-Deployment
Gathers Python deployment, infrastructure and practices.
airflow docker docker-compose kafka pyflink pyspark python tensorflow
Last synced: 15 Apr 2024
https://github.com/ankurchavda/SparkLearning
A comprehensive Spark guide, collated from multiple sources, that can be used to learn more about Spark or as an interview refresher.
Last synced: 15 Apr 2024
https://github.com/RubensZimbres/Repo-2019
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, Tensorflow, Mathematics
anomaly-detection aws-emr-clusters aws-rds bert bert-model emr-cluster googleassistant googlespeech hiveql keras-tensorflow mathe mathematica pyspark raspberry-pi-3 sql-server tensorflow wolfram-mathematica
Last synced: 13 Apr 2024
https://github.com/awesome-spark/spark-gotchas
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
apache-spark book guide pyspark
Last synced: 11 Apr 2024
https://github.com/mrpowers/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Last synced: 11 Apr 2024
https://github.com/jupyter-incubator/sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
cluster jupyter jupyter-notebook kerberos kernel livy magic notebook pandas-dataframe pyspark spark sql-query
Last synced: 11 Apr 2024
https://github.com/kevinschaich/pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
cheat cheatsheet cheatsheets data data-science docs documentation guide guides pyspark pyspark-tutorial quickstart reference references spark spark-sql
Last synced: 10 Apr 2024
https://github.com/capitalone/datacompy
Pandas and Spark DataFrame comparison for humans and more!
compare dask data data-science dataframes fugue numpy pandas polars pyspark python spark
Last synced: 31 Mar 2024
https://github.com/minzhang-1/PointHop-PointHop2_Spark
A fast, low-memory version of PointHop and PointHop++, built on Apache Spark.
3d 3d-classification classification feature-extraction knn least-square-regression pca point-cloud pyspark python spark
Last synced: 26 Mar 2024
https://github.com/basin-etl/basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.
emr etl hadoop informatica odi pipeline pyspark spark
Last synced: 23 Mar 2024
https://github.com/getyourguide/TypedPyspark
Type-annotate your spark dataframes and validate them
Last synced: 19 Mar 2024
https://github.com/itsjafer/jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
apache-spark jupyter jupyter-lab jupyterlab jupyterlab-extension pyspark spark
Last synced: 18 Mar 2024
https://github.com/myamafuj/hadoop-hive-spark-docker
Hadoop-Hive-Spark cluster + Jupyter on Docker
docker hadoop hive jupyter jupyter-notebook pyspark spark
Last synced: 14 Mar 2024
https://github.com/Azure/mmlspark
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 13 Mar 2024
https://github.com/aipredict/ai-deployment
Focuses on putting AI models into production and on model deployment.
deploy keras lightgbm mxnet onnx pmml pyspark pytorch scikit-learn spark-ml tensorflow xgboost
Last synced: 13 Mar 2024