https://github.com/joshweiner/ml-impute
A package for synthetic data generation for imputation using single and multiple imputation methods.
https://github.com/joshweiner/ml-impute
imputation imputation-methods jax machine-learning multiple-imputation numpy pandas parallelization singular-value-decomposition synthetic-data synthetic-dataset-generation
Last synced: 3 months ago
JSON representation
A package for synthetic data generation for imputation using single and multiple imputation methods.
- Host: GitHub
- URL: https://github.com/joshweiner/ml-impute
- Owner: JoshWeiner
- License: mit
- Created: 2022-07-11T13:50:17.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-22T01:19:20.000Z (over 2 years ago)
- Last Synced: 2025-04-08T04:04:24.432Z (6 months ago)
- Topics: imputation, imputation-methods, jax, machine-learning, multiple-imputation, numpy, pandas, parallelization, singular-value-decomposition, synthetic-data, synthetic-dataset-generation
- Language: Python
- Homepage: https://test.pypi.org/project/ml-impute
- Size: 58.6 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ML-Impute
### A python package for synthetic data generation using single and multiple imputation.
Ml-Impute is a library for generating synthetic data for null-value imputation, notably with the ability to handle mixed datatypes. This package is based off of the research of [Audigier, Husson, and Josse](https://arxiv.org/pdf/1301.4797.pdf) and their method of iterative factor analysis for singular data imputation.
The goal of this package is to:
**(a)** provide an open source package for use of this method in Python for the first time, and;
**(b)** to provide an efficient parallelization of the algorithm when extending it to both single and multiple imputation.> Note: I am currently a university student and may not have the time to continue to release updates and changes as fast as some other packages might. In the spirit of open-source code, please feel free to add pull requests or open a new issue if you have bug fixes or improvements. Thank you for your understanding and for your contributions.
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Usage](#usage)
- [Example](#example)
- [License](#license)
## Installation
ML-Impute is currently available on PyPi.
**Unix/Mac OS/Windows**
```
pip install ml-impute
```
## Usage
Currently, ML-Impute can handle both single and multiple imputation.To follow a demonstration of both methods, proceed to the Example Section.
The following subsections provide an overview into each method along with their usage information.
To use the package post-installation via pip, instantiate the following object as follows:
```
from mpute import generatorgen = generator.Generator()
```> #### **Generator.generate**(self, dataframe, encode_cols, exclude_cols, max_iter, tol, explained_var, method, n_versions, noise)
| Parameter | Description |
| :--- | :--- |
| dataframe | (__*required*__) Pandas dataframe object |
| encode_cols | (*optional*, default=[]) Categorical columns to be encoded.
By default, ml-impute will encode all columns with *object* or *category* dtypes. However, many datasets contain numerical categorical data (ex/ Likert scales, classification types, etc.) that should be encoded. |
| exclude_cols | (*optional*, default=[]) Categorical columns to be excluded from encoding and/or imputation.
On occastion, datasets will contain unique non-ordinal data (such as unique IDs) that, if encoded, will lead to large increases in memory usage and runtime. These columns should be excluded. |
| max_iter | (*optional*, default=1000) The maximum number of iterations of imputation before exit. |
| tol | (*optional*, default=1e-4) Tolerance bound for convergence.
If Frobenius norm relative error is < tol before max_iter is reached, exit.|
| explained_var | (*optional*, default=0.95) Percentage of the total variance kept when reconstructing the dataframe after performing Singular Value Decomposition. |
| method | (*optional*, default="single") Specification for use of single or multiple imputation method.
**Possible values**: ["single", "multiple"] |
| n_versions | (*optional*, default=20) If performing multiple imputation, the number of generated dataframes.
If performing singular imputation, n_versions=1|
| noise | (*optional*, default="gaussian") If performing multiple impuation, specify the type of noise added to each generated dataset to create variation. Gaussian noise is centered around 0 with a standard deviation of 0.1.
If performing singular imputation, noise=None |
| engine | (*optional*, default="default") For either singular or multiple imputation, choose the engine through which the SVD is calculated.
**Possible values**: ["default", "dask"]
*"default"* utilizes the JAX numpy library for efficient SVD calculation and multiprocessing, and is recommended for speed.
*"dask"* creates a dask distributed scheduler which is used to compute the SVD. Given that this is an iterative method, this is recommended only when working with very large datasets. || Method | Return Value |
| :--- | :--- |
| "single" | **imputed_df**: a copy of the dataframe argument with synthetic data imputed for all null values |
| "multiple" | **df_dict**: a dictionary containing each of the n_versions of generated datasets with variable synthetic data.
keys: [0, n_versions)
values: [dataframes]|
### **Single Imputation**
Single imputation works with the following line:
```
imputed_df = gen.generate(dataframe)
```
### **Multiple Imputation**
Multiple imputation is as simple as the following:
```
imputed_dfs = gen.generate(dataframe method="multiple")
```
## Example
For the following example, we will use the titanic example-dataset available in [sklearn.datasets openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml).
Build the titanic dataset and create a Generator object as follows:
```
import pandas as pd
from mpute import generator
from sklearn import datasetstitanic, target = datasets.fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
titanic['survived'] = targetgen = generator.Generator()
```
### **Single Imputation**```
imputed_df = gen.generate(titanic, exclude_cols=['name', 'cabin', 'ticket'])
```
> **Note**: 'name', 'cabin', and 'ticket' are excluded as they mainly contain unique identifiers, therefore unnecessary for imputation and if encoded, would result in a significant increase in memory usage.
> It is possible to replace the cabin column with two columns such as 'deck' and 'position', as these may be a determinant of survival. However, this preprocessing would have to occur beforehand
### **Multiple Imputation**
Multiple imputation is as simple as the following:
```
imputed_dfs = gen.generate(titanic method="multiple")
```That's all there is to it. Happy using!
## License
ML-Impute is published under the MIT License. Please see the LICENSE file for more information.