Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
- Host: GitHub
- URL: https://github.com/viniciusmsousa/pyspark-ds-toolbox
- Owner: viniciusmsousa
- License: gpl-3.0
- Created: 2021-12-03T14:19:28.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-17T16:55:53.000Z (almost 2 years ago)
- Last Synced: 2025-01-31T02:36:19.655Z (14 days ago)
- Topics: data-science, spark
- Language: Python
- Homepage: https://viniciusmsousa.github.io/pyspark-ds-toolbox/index.html
- Size: 4.32 MB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# Pyspark DS Toolbox
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![PyPI Latest Release](https://img.shields.io/pypi/v/pyspark-ds-toolbox.svg)](https://pypi.org/project/pyspark-ds-toolbox/)
[![CodeFactor](https://www.codefactor.io/repository/github/viniciusmsousa/pyspark-ds-toolbox/badge)](https://www.codefactor.io/repository/github/viniciusmsousa/pyspark-ds-toolbox)
[![Maintainability](https://api.codeclimate.com/v1/badges/9a85a662305167c5aba1/maintainability)](https://codeclimate.com/github/viniciusmsousa/pyspark-ds-toolbox/maintainability)
[![Codecov test coverage](https://codecov.io/gh/viniciusmsousa/pyspark-ds-toolbox/branch/main/graph/badge.svg)](https://codecov.io/gh/viniciusmsousa/pyspark-ds-toolbox?branch=main)
[![Package Tests](https://github.com/viniciusmsousa/pyspark-ds-toolbox/actions/workflows/package-tests.yml/badge.svg)](https://github.com/viniciusmsousa/pyspark-ds-toolbox/actions)
[![Downloads](https://pepy.tech/badge/pyspark-ds-toolbox)](https://pepy.tech/project/pyspark-ds-toolbox)

The objective of the package is to provide a set of tools that help with the daily work of data science in Spark. The documentation can be found [here](https://viniciusmsousa.github.io/pyspark-ds-toolbox/index.html) and notebooks with usage examples [here](https://github.com/viniciusmsousa/pyspark-ds-toolbox/tree/main/examples).
Feel free to contribute :)
## Installation
Directly from PyPI:
```
pip install pyspark-ds-toolbox
```

or from GitHub (note that this installs the latest development version):
```
pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
```

## Organization
The package is organized by the nature of the task, such as data wrangling, model/prediction evaluation, and so on.
```
pyspark_ds_toolbox         # Main package
├─ causal_inference        # Sub-package dedicated to causal inference
│  ├─ diff_in_diff.py
│  └─ ps_matching.py
├─ ml                      # Sub-package dedicated to ML
│  ├─ data_prep            # Sub-package with ML data preparation tools
│  │  ├─ class_weights.py
│  │  └─ features_vector.py
│  ├─ classification       # Sub-package dedicated to classification tasks
│  │  ├─ eval.py
│  │  └─ baseline_classifiers.py
│  ├─ feature_importance   # Sub-package with feature importance tools
│  │  ├─ native_spark.py
│  │  └─ shap_values.py
│  └─ feature_selection    # Sub-package with feature selection tools
│     └─ information_value.py
├─ wrangling               # Sub-package dedicated to data wrangling tasks
│  ├─ reshape.py
│  └─ data_quality.py
└─ stats                   # Sub-package dedicated to basic statistical functionality
   └─ association.py
```
```
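
As a quick sanity check after installing, the tree above can be read as a list of dotted import paths. The sketch below is only illustrative: it assumes each file in the tree is importable under the `pyspark_ds_toolbox` top-level name exactly as laid out (this has not been verified against a published release), and the helper names `installed_version`, `is_available`, and `report` are hypothetical, not part of the package.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional

# Distribution name as used in the `pip install` command above.
DIST_NAME = "pyspark-ds-toolbox"

# Dotted paths derived from the directory tree (assumption: every file
# maps one-to-one to an importable module).
MODULES = [
    "pyspark_ds_toolbox.causal_inference.diff_in_diff",
    "pyspark_ds_toolbox.causal_inference.ps_matching",
    "pyspark_ds_toolbox.ml.data_prep.class_weights",
    "pyspark_ds_toolbox.ml.data_prep.features_vector",
    "pyspark_ds_toolbox.ml.classification.eval",
    "pyspark_ds_toolbox.ml.classification.baseline_classifiers",
    "pyspark_ds_toolbox.ml.feature_importance.native_spark",
    "pyspark_ds_toolbox.ml.feature_importance.shap_values",
    "pyspark_ds_toolbox.ml.feature_selection.information_value",
    "pyspark_ds_toolbox.wrangling.reshape",
    "pyspark_ds_toolbox.wrangling.data_quality",
    "pyspark_ds_toolbox.stats.association",
]


def installed_version(dist_name: str = DIST_NAME) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_available(dotted_path: str) -> bool:
    """Return True if the module can be located in this environment.

    find_spec imports the parent packages of a dotted path, so this
    also returns False when a parent is missing or fails to import.
    """
    try:
        return importlib.util.find_spec(dotted_path) is not None
    except (ImportError, ValueError):
        return False


def report() -> Dict[str, bool]:
    """Map each module path from the tree to whether it was found."""
    return {name: is_available(name) for name in MODULES}


if __name__ == "__main__":
    print(f"{DIST_NAME} version: {installed_version()}")
    for name, found in report().items():
        print(f"{'ok     ' if found else 'missing'}  {name}")
```

Running it in a fresh environment should report every module as missing; after installing the package, any path still reported as missing points to a difference between the tree shown here and the installed release.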