{"id":25431932,"url":"https://github.com/joshweiner/ml-impute","last_synced_at":"2025-07-18T17:33:33.836Z","repository":{"id":56089373,"uuid":"512775705","full_name":"JoshWeiner/ml-impute","owner":"JoshWeiner","description":"A package for synthetic data generation for imputation using single and multiple imputation methods.","archived":false,"fork":false,"pushed_at":"2023-02-22T01:19:20.000Z","size":60,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-08T04:04:24.432Z","etag":null,"topics":["imputation","imputation-methods","jax","machine-learning","multiple-imputation","numpy","pandas","parallelization","singular-value-decomposition","synthetic-data","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":"https://test.pypi.org/project/ml-impute","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JoshWeiner.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-11T13:50:17.000Z","updated_at":"2024-06-24T09:31:02.000Z","dependencies_parsed_at":"2025-02-17T04:40:53.549Z","dependency_job_id":null,"html_url":"https://github.com/JoshWeiner/ml-impute","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JoshWeiner/ml-impute","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshWeiner%2Fml-impute","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshWeiner%2Fml-impute/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/JoshWeiner%2Fml-impute/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshWeiner%2Fml-impute/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JoshWeiner","download_url":"https://codeload.github.com/JoshWeiner/ml-impute/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshWeiner%2Fml-impute/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265801965,"owners_count":23830506,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imputation","imputation-methods","jax","machine-learning","multiple-imputation","numpy","pandas","parallelization","singular-value-decomposition","synthetic-data","synthetic-dataset-generation"],"created_at":"2025-02-17T04:30:30.202Z","updated_at":"2025-07-18T17:33:33.810Z","avatar_url":"https://github.com/JoshWeiner.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML-Impute\n\n### A python package for synthetic data generation using single and multiple imputation.\n\n\u003cdiv align=\"center\" style=\"display: flex; justify-content: center;\"\u003e\n\n\u003ca href=\"https://pypi.python.org/pypi/\"\u003e\n\u003cimg src =\"https://img.shields.io/badge/python-3.x-blue.svg?style=for-the-badge\" alt=\"Python version\" /\u003e\u003c/a\u003e\n\n\u003c!-- Build status --\u003e\n\u003ca href=\"https://pypi.org/project/ml-impute\"\u003e\n\u003cimg src =\"https://img.shields.io/pypi/v/ml-impute?style=for-the-badge\" alt=\"PyPi 
version\"/\u003e\u003c/a\u003e\n\n\u003c!-- Test coverage --\u003e\n\u003c!--\n\u003ca href=\"https://coveralls.io/\"\u003e\n\u003cimg src =\"https://img.shields.io/codecov/c/gh/JoshWeiner/ml-impute.svg?style=for-the-badge\" alt=\"Coverage Status\"/\u003e\u003c/a\u003e\n--\u003e\n\n\u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\n\u003cimg src =\"https://img.shields.io/:license-mit-ff69b4.svg?style=for-the-badge\" alt=\"license\" /\u003e\u003c/a\u003e\n\n\u003c/div\u003e\n\nML-Impute is a library for generating synthetic data for null-value imputation, notably with the ability to handle mixed datatypes. This package is based on the research of [Audigier, Husson, and Josse](https://arxiv.org/pdf/1301.4797.pdf) and their method of iterative factor analysis for single imputation. \u003cbr\u003e\nThe goals of this package are to: \u003cbr\u003e\n**(a)** provide an open-source implementation of this method in Python for the first time; and \u003cbr\u003e\n**(b)** provide an efficient parallelization of the algorithm when extending it to both single and multiple imputation.\n\n\u003e Note: I am currently a university student and may not have the time to release updates and changes as fast as some other packages might. In the spirit of open-source code, please feel free to submit pull requests or open a new issue if you have bug fixes or improvements. 
Thank you for your understanding and for your contributions.\n\u003chr\u003e\n\n## Table of Contents\n- [Table of Contents](#table-of-contents)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Example](#example)\n- [License](#license)\n\n\u003chr\u003e\n\n## Installation\n\nML-Impute is currently available on PyPI.\n\n**Unix/Mac OS/Windows**\n```\npip install ml-impute\n```\n\u003chr\u003e\n\n## Usage\nCurrently, ML-Impute can handle both single and multiple imputation.\n\nTo follow a demonstration of both methods, proceed to the \u003ca href=\"#Example\"\u003eExample\u003c/a\u003e section. \n\nThe following subsections provide an overview of each method along with its usage information.\n\nTo use the package after installing it via pip, instantiate a Generator object as follows:\n```\nfrom mpute import generator\n\ngen = generator.Generator()\n```\n\n\u003e #### **Generator.generate**(self, dataframe, encode_cols, exclude_cols, max_iter, tol, explained_var, method, n_versions, noise)\n| Parameter | Description |\n| :--- | :--- |\n| dataframe | (__*required*__) Pandas dataframe object |\n| encode_cols | (*optional*, default=[]) Categorical columns to be encoded. \u003cbr\u003e By default, ml-impute will encode all columns with *object* or *category* dtypes. However, many datasets contain numerical categorical data (e.g., Likert scales, classification types, etc.) that should be encoded. |\n| exclude_cols | (*optional*, default=[]) Categorical columns to be excluded from encoding and/or imputation. \u003cbr\u003e On occasion, datasets will contain unique non-ordinal data (such as unique IDs) that, if encoded, will lead to large increases in memory usage and runtime. These columns should be excluded. |\n| max_iter | (*optional*, default=1000) The maximum number of iterations of imputation before exit. |\n| tol | (*optional*, default=1e-4) Tolerance bound for convergence. 
\u003cbr\u003eIf the Frobenius-norm relative error is \u003c tol before max_iter is reached, the iteration exits. |\n| explained_var | (*optional*, default=0.95) Fraction of the total variance kept when reconstructing the dataframe after performing Singular Value Decomposition. |\n| method | (*optional*, default=\"single\") Selects the single or multiple imputation method. \u003cbr\u003e **Possible values**: [\"single\", \"multiple\"] |\n| n_versions | (*optional*, default=20) If performing multiple imputation, the number of generated dataframes. \u003cbr\u003e If performing single imputation, n_versions=1 |\n| noise | (*optional*, default=\"gaussian\") If performing multiple imputation, specify the type of noise added to each generated dataset to create variation. Gaussian noise is centered around 0 with a standard deviation of 0.1. \u003cbr\u003e If performing single imputation, noise=None |\n| engine | (*optional*, default=\"default\") For either single or multiple imputation, choose the engine through which the SVD is calculated. \u003cbr\u003e **Possible values**: [\"default\", \"dask\"]\u003cbr\u003e*\"default\"* utilizes the JAX numpy library for efficient SVD calculation and multiprocessing, and is recommended for speed. \u003cbr\u003e *\"dask\"* creates a dask distributed scheduler which is used to compute the SVD. Given that this is an iterative method, this is recommended only when working with very large datasets. |\n\n| Method | Return Value |\n| :--- | :--- |\n| \"single\" | **imputed_df**: a copy of the dataframe argument with synthetic data imputed for all null values |\n| \"multiple\" | **df_dict**: a dictionary containing each of the n_versions of generated datasets with variable synthetic data. 
\u003cbr\u003e keys: [0, n_versions) \u003cbr\u003e values: [dataframes]|\n\n\u003chr\u003e\n\n### **Single Imputation**\nSingle imputation works with the following line:\n```\nimputed_df = gen.generate(dataframe)\n```\n### **Multiple Imputation**\nMultiple imputation is as simple as the following:\n```\nimputed_dfs = gen.generate(dataframe, method=\"multiple\")\n```\n\n\u003chr\u003e\n\n## Example\n\nFor the following example, we will use the Titanic example dataset available in [sklearn.datasets openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml).\n\nBuild the Titanic dataset and create a Generator object as follows:\n```\nimport pandas as pd\nfrom mpute import generator\nfrom sklearn import datasets\n\ntitanic, target = datasets.fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\ntitanic['survived'] = target\n\ngen = generator.Generator()\n```\n### **Single Imputation**\n\n```\nimputed_df = gen.generate(titanic, exclude_cols=['name', 'cabin', 'ticket'])\n```\n\u003e **Note**: 'name', 'cabin', and 'ticket' are excluded because they mainly contain unique identifiers. They are therefore unnecessary for imputation and, if encoded, would result in a significant increase in memory usage. \u003cbr\u003e\n\u003e It is possible to replace the cabin column with two columns such as 'deck' and 'position', as these may be determinants of survival. However, this preprocessing would have to occur beforehand. \n\u003chr\u003e\n\n### **Multiple Imputation**\nMultiple imputation is as simple as the following:\n```\nimputed_dfs = gen.generate(titanic, method=\"multiple\")\n```\n\nThat's all there is to it. Happy using!\n\u003chr\u003e\n\n## License\nML-Impute is published under the MIT License. 
Please see the LICENSE file for more information.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshweiner%2Fml-impute","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoshweiner%2Fml-impute","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshweiner%2Fml-impute/lists"}