{"id":13583367,"url":"https://github.com/awslabs/datawig","last_synced_at":"2025-04-06T18:32:21.941Z","repository":{"id":32930419,"uuid":"143926972","full_name":"awslabs/datawig","owner":"awslabs","description":"Imputation of missing values in tables.","archived":false,"fork":false,"pushed_at":"2024-06-17T23:16:42.000Z","size":6825,"stargazers_count":486,"open_issues_count":25,"forks_count":70,"subscribers_count":22,"default_branch":"master","last_synced_at":"2025-03-13T02:37:49.831Z","etag":null,"topics":["imputation","missing-value-handling"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-07T21:07:59.000Z","updated_at":"2025-03-06T01:40:49.000Z","dependencies_parsed_at":"2024-06-19T02:29:56.310Z","dependency_job_id":"f4c7f279-e915-41a5-ba34-51d423c8a5dc","html_url":"https://github.com/awslabs/datawig","commit_stats":{"total_commits":131,"total_committers":14,"mean_commits":9.357142857142858,"dds":0.8091603053435115,"last_synced_commit":"90697c90c3422ea48bb7723d1a4f064501987dc9"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdatawig","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdatawig/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdatawig/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdatawig/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/awslabs","download_url":"https://codeload.github.com/awslabs/datawig/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247531215,"owners_count":20953913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imputation","missing-value-handling"],"created_at":"2024-08-01T15:03:25.794Z","updated_at":"2025-04-06T18:32:16.926Z","avatar_url":"https://github.com/awslabs.png","language":"JavaScript","funding_links":[],"categories":["JavaScript","Others"],"sub_categories":[],"readme":"DataWig - Imputation for Tables\n================================\n\n[![PyPI version](https://badge.fury.io/py/datawig.svg)](https://badge.fury.io/py/datawig.svg)\n[![GitHub license](https://img.shields.io/github/license/awslabs/datawig.svg)](https://github.com/awslabs/datawig/blob/master/LICENSE)\n[![GitHub issues](https://img.shields.io/github/issues/awslabs/datawig.svg)](https://github.com/awslabs/datawig/issues)\n[![Build Status](https://travis-ci.org/awslabs/datawig.svg?branch=master)](https://travis-ci.org/awslabs/datawig)\n\nDataWig learns Machine Learning models to impute missing values in tables.\n\nSee our user-guide and extended documentation [here](https://datawig.readthedocs.io/en/latest).\n\n## Installation\n\n### CPU\n```bash\npip3 install datawig\n```\n\n### GPU\nIf you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU 
bindings.\nDepending on your version of CUDA, you can do this by running the following:\n\n```bash\nwget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt\npip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt\nrm requirements.gpu-cu${CUDA_VERSION}.txt\n```\nwhere `${CUDA_VERSION}` can be `75` (7.5), `80` (8.0), `90` (9.0), or `91` (9.1).\n\n## Running DataWig\nThe DataWig API expects your data as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Here is an example of how the dataframe might look:\n\n|Product Type | Description           | Size | Color |\n|-------------|-----------------------|------|-------|\n|   Shoe      | Ideal for Running     | 12UK | Black |\n| SDCards     | Best SDCard ever ...  | 8GB  | Blue  |\n| Dress       | This **yellow** dress | M    | **?** |\n\n### Quickstart Example\n\nFor most use cases, the `SimpleImputer` class is the best starting point. For convenience there is the function [SimpleImputer.complete](https://datawig.readthedocs.io/en/latest/source/API.html#datawig.simple_imputer.SimpleImputer.complete) that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:\n\n```python\nimport datawig, numpy\n\n# generate some data with simple nonlinear dependency\ndf = datawig.utils.generate_df_numeric() \n# mask 10% of the values\ndf_with_missing = df.mask(numpy.random.rand(*df.shape) \u003e .9)\n\n# impute missing values\ndf_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)\n\n```\n\nYou can also impute values in specific columns only (called `output_column` below) using values in other columns (called `input_columns` below). 
DataWig currently supports imputation of categorical and numerical columns.\n\n### Imputation of categorical columns\n\n```python\nimport datawig\n\ndf = datawig.utils.generate_df_string(\n    num_samples=200,\n    data_column_name='sentences',\n    label_column_name='label'\n)\n\ndf_train, df_test = datawig.utils.random_split(df)\n\n# Initialize a SimpleImputer model\nimputer = datawig.SimpleImputer(\n    input_columns=['sentences'],  # column(s) containing information about the column we want to impute\n    output_column='label',  # the column we'd like to impute values for\n    output_path='imputer_model'  # stores model data and metrics\n)\n\n# Fit an imputer model on the train data\nimputer.fit(train_df=df_train)\n\n# Impute missing values and return the original dataframe with predictions\nimputed = imputer.predict(df_test)\n```\n\n### Imputation of numerical columns\n\n```python\nimport datawig\n\ndf = datawig.utils.generate_df_numeric(\n    num_samples=200,\n    data_column_name='x',\n    label_column_name='y'\n)\n\ndf_train, df_test = datawig.utils.random_split(df)\n\n# Initialize a SimpleImputer model\nimputer = datawig.SimpleImputer(\n    input_columns=['x'],  # column(s) containing information about the column we want to impute\n    output_column='y',  # the column we'd like to impute values for\n    output_path='imputer_model'  # stores model data and metrics\n)\n\n# Fit an imputer model on the train data\nimputer.fit(train_df=df_train, num_epochs=50)\n\n# Impute missing values and return the original dataframe with predictions\nimputed = imputer.predict(df_test)\n```\n\nFor more control over model types and preprocessing, the `Imputer` class lets you specify all relevant model features and parameters directly.
\n\nFor details on usage, refer to the provided [examples](./examples).\n\n### Acknowledgments\nThanks to [David Greenberg](https://github.com/dgreenberg) for the package name.\n\n### Building documentation\n\n```bash\ngit clone git@github.com:awslabs/datawig.git\ncd datawig/docs\nmake html\nopen _build/html/index.html\n```\n\n\n### Executing Tests\n\nClone the repository and create a virtual environment in the root directory of the package:\n\n```bash\npython3 -m venv venv\n```\n\nInstall the package from local sources:\n\n```bash\n./venv/bin/pip install -e .\n```\n\nInstall the development requirements and run the tests:\n\n```bash\n./venv/bin/pip install -r requirements/requirements.dev.txt\n./venv/bin/python -m pytest\n```\n\n\n### Updating PyPI distribution\n\nBefore updating, increment the version in `setup.py`.\n\n```bash\ngit clone git@github.com:awslabs/datawig.git\ncd datawig\n# build local distribution for current version\npython setup.py sdist\n# upload to PyPI\ntwine upload --skip-existing dist/*\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fdatawig","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Fdatawig","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fdatawig/lists"}