{"id":13416289,"url":"https://github.com/rhiever/datacleaner","last_synced_at":"2025-05-15T21:03:04.611Z","repository":{"id":48746071,"uuid":"52671170","full_name":"rhiever/datacleaner","owner":"rhiever","description":"A Python tool that automatically cleans data sets and readies them for analysis.","archived":false,"fork":false,"pushed_at":"2019-05-22T13:53:35.000Z","size":633,"stargazers_count":1067,"open_issues_count":12,"forks_count":204,"subscribers_count":58,"default_branch":"master","last_synced_at":"2025-05-14T16:14:53.177Z","etag":null,"topics":["automation","data-science","machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rhiever.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-02-27T14:45:22.000Z","updated_at":"2025-05-12T01:34:35.000Z","dependencies_parsed_at":"2022-09-23T21:22:14.087Z","dependency_job_id":null,"html_url":"https://github.com/rhiever/datacleaner","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rhiever%2Fdatacleaner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rhiever%2Fdatacleaner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rhiever%2Fdatacleaner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rhiever%2Fdatacleaner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rhiever","download_url":"https://codeload.github.com/rhiever/datacleaner/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254422754,"owners_count":22068678,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","data-science","machine-learning","python"],"created_at":"2024-07-30T21:00:56.509Z","updated_at":"2025-05-15T21:03:04.562Z","avatar_url":"https://github.com/rhiever.png","language":"Python","readme":"[![Build Status](https://travis-ci.org/rhiever/datacleaner.svg?branch=master)](https://travis-ci.org/rhiever/datacleaner)\n[![Code Health](https://landscape.io/github/rhiever/datacleaner/master/landscape.svg?style=flat)](https://landscape.io/github/rhiever/datacleaner/master)\n[![Coverage Status](https://coveralls.io/repos/github/rhiever/datacleaner/badge.svg?branch=master)](https://coveralls.io/github/rhiever/datacleaner?branch=master)\n![Python 2.7](https://img.shields.io/badge/python-2.7-blue.svg)\n![Python 3.5](https://img.shields.io/badge/python-3.5-blue.svg)\n![License](https://img.shields.io/badge/license-MIT%20License-blue.svg)\n[![PyPI version](https://badge.fury.io/py/datacleaner.svg)](https://badge.fury.io/py/datacleaner)\n\n\n# datacleaner\n\n[![Join the chat at https://gitter.im/rhiever/datacleaner](https://badges.gitter.im/rhiever/datacleaner.svg)](https://gitter.im/rhiever/datacleaner?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nA Python tool that automatically cleans data sets and readies them for analysis.\n\n## datacleaner is not magic\n\ndatacleaner works with data in [pandas 
DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).\n\ndatacleaner is not magic, and it won't take an unorganized blob of text and automagically parse it out for you.\n\nWhat datacleaner *will* do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.\n\nCurrently, datacleaner does the following:\n\n* Optionally drops any row with a missing value\n\n* Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis\n\n* Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents\n\nWe plan to add more cleaning features as the project grows.\n\n## License\n\nPlease see the [repository license](https://github.com/rhiever/datacleaner/blob/master/LICENSE) for the licensing and usage information for datacleaner.\n\nGenerally, we have licensed datacleaner to make it as widely usable as possible.\n\n## Installation\n\ndatacleaner is built to use pandas DataFrames and some scikit-learn modules for data preprocessing. As such, we recommend installing the [Anaconda Python distribution](https://www.continuum.io/downloads) prior to installing datacleaner.\n\nOnce the prerequisites are installed, datacleaner can be installed with a simple `pip` command:\n\n```\npip install datacleaner\n```\n\n## Usage\n\n### datacleaner on the command line\n\ndatacleaner can be used on the command line. 
Use `--help` to see its usage instructions.\n\n```\nusage: datacleaner [-h] [-cv CROSS_VAL_FILENAME] [-o OUTPUT_FILENAME]\n                   [-cvo CV_OUTPUT_FILENAME] [-is INPUT_SEPARATOR]\n                   [-os OUTPUT_SEPARATOR] [--drop-nans]\n                   [--ignore-update-check] [--version]\n                   INPUT_FILENAME\n\nA Python tool that automatically cleans data sets and readies them for analysis\n\npositional arguments:\n  INPUT_FILENAME        File name of the data file to clean\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -cv CROSS_VAL_FILENAME\n                        File name for the validation data set if performing\n                        cross-validation\n  -o OUTPUT_FILENAME    Data file to output the cleaned data set to\n  -cvo CV_OUTPUT_FILENAME\n                        Data file to output the cleaned cross-validation data\n                        set to\n  -is INPUT_SEPARATOR   Column separator for the input file(s) (default: \\t)\n  -os OUTPUT_SEPARATOR  Column separator for the output file(s) (default: \\t)\n  --drop-nans           Drop all rows that have a NaN in any column (default: False)\n  --ignore-update-check\n                        Do not check for the latest version of datacleaner\n                        (default: False)\n  --version             show program's version number and exit\n```\n\nAn example command-line call to datacleaner may look like:\n\n```\ndatacleaner my_data.csv -o my_clean.data.csv -is , -os ,\n```\n\nwhich will read the data from `my_data.csv` (assuming columns are separated by commas), clean the data set, then output the resulting data set to `my_clean.data.csv`.\n\n### datacleaner in scripts\n\ndatacleaner can also be used as part of a script. 
There are two primary functions implemented in datacleaner: `autoclean` and `autoclean_cv`.\n\n```\nautoclean(input_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)\n    Performs a series of automated data cleaning transformations on the provided data set\n\n    Parameters\n    ----------\n    input_dataframe: pandas.DataFrame\n        Data set to clean\n    drop_nans: bool\n        Drop all rows that have a NaN in any column (default: False)\n    copy: bool\n        Make a copy of the data set (default: False)\n    encoder: category_encoders transformer\n        A valid category_encoders transformer, which is passed an inferred list of categorical columns (default: None, which uses LabelEncoder)\n    encoder_kwargs: dict\n        Keyword arguments passed to the encoder (default: None)\n    ignore_update_check: bool\n        Do not check for the latest version of datacleaner\n\n    Returns\n    -------\n    output_dataframe: pandas.DataFrame\n        Cleaned data set\n```\n\n```\nautoclean_cv(training_dataframe, testing_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)\n    Performs a series of automated data cleaning transformations on the provided training and testing data sets\n\n    Unlike `autoclean()`, this function takes cross-validation into account by learning the data transformations\n    from only the training set, then applying those transformations to both the training and testing sets.\n    By doing so, this function prevents information from the testing set leaking into the learned transformations.\n\n    Parameters\n    ----------\n    training_dataframe: pandas.DataFrame\n        Training data set\n    testing_dataframe: pandas.DataFrame\n        Testing data set\n    drop_nans: bool\n        Drop all rows that have a NaN in any column (default: False)\n    copy: bool\n        Make a copy of the data set (default: False)\n    encoder: category_encoders transformer\n        A valid 
category_encoders transformer, which is passed an inferred list of categorical columns (default: None, which uses LabelEncoder)\n    encoder_kwargs: dict\n        Keyword arguments passed to the encoder (default: None)\n    ignore_update_check: bool\n        Do not check for the latest version of datacleaner\n\n    Returns\n    -------\n    output_training_dataframe: pandas.DataFrame\n        Cleaned training data set\n    output_testing_dataframe: pandas.DataFrame\n        Cleaned testing data set\n```\n\nBelow is an example of datacleaner performing basic cleaning on a data set.\n\n```python\nfrom datacleaner import autoclean\nimport pandas as pd\n\nmy_data = pd.read_csv('my_data.csv', sep=',')\nmy_clean_data = autoclean(my_data)\nmy_clean_data.to_csv('my_clean_data.csv', sep=',', index=False)\n```\n\nNote that because datacleaner works directly on [pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/10min.html), all [DataFrame operations](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) are still available to the resulting data sets.\n\n## Contributing to datacleaner\n\nWe welcome you to [check the existing issues](https://github.com/rhiever/datacleaner/issues/) for bugs or enhancements to work on. 
If you have an idea for an extension to datacleaner, please [file a new issue](https://github.com/rhiever/datacleaner/issues/new) so we can discuss it.\n\n## Citing datacleaner\n\nIf you use datacleaner as part of your workflow in a scientific publication, please consider citing the datacleaner repository with the following DOI:\n\n[![DOI](https://zenodo.org/badge/20747/rhiever/datacleaner.svg)](https://zenodo.org/badge/latestdoi/20747/rhiever/datacleaner)\n","funding_links":[],"categories":["Python","🐍 Python","Feature Extraction"],"sub_categories":["Useful Python Tools for Data Analysis","General Feature Extraction"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frhiever%2Fdatacleaner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frhiever%2Fdatacleaner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frhiever%2Fdatacleaner/lists"}