{"id":34077518,"url":"https://github.com/datapreprocessing/datacleaning","last_synced_at":"2025-12-14T10:06:14.472Z","repository":{"id":52406297,"uuid":"362174500","full_name":"DataPreprocessing/DataCleaning","owner":"DataPreprocessing","description":"Data Cleaning is a python package for data preprocessing. This cleans the CSV file and returns the cleaned data frame.  It does the work of imputation, removing duplicates, replacing special characters, and many more. ","archived":false,"fork":false,"pushed_at":"2021-04-29T22:36:14.000Z","size":120,"stargazers_count":8,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-30T11:02:50.376Z","etag":null,"topics":["data","data-cleaning","data-cleansing","data-preprocessing","data-wrangling","imputation","python","threshold"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataPreprocessing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-27T16:10:38.000Z","updated_at":"2025-04-28T06:25:58.000Z","dependencies_parsed_at":"2022-08-26T02:42:55.361Z","dependency_job_id":null,"html_url":"https://github.com/DataPreprocessing/DataCleaning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DataPreprocessing/DataCleaning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataPreprocessing%2FDataCleaning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataPreprocessing%2FDataCleaning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/DataPreprocessing%2FDataCleaning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataPreprocessing%2FDataCleaning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DataPreprocessing","download_url":"https://codeload.github.com/DataPreprocessing/DataCleaning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataPreprocessing%2FDataCleaning/sbom","scorecard":{"id":37842,"data":{"date":"2025-08-11","repo":{"name":"github.com/DataPreprocessing/DataCleaning","commit":"d890e8adf3d94d30b62246738d6c9abaa28e865c"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.1,"checks":[{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/22 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published 
as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":3,"reason":"7 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2022-288 / GHSA-6hrg-qmvc-2xh8","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: GHSA-jxfp-4rvq-9h9m","Warn: Project is vulnerable to: PYSEC-2023-102","Warn: Project is vulnerable 
to: PYSEC-2023-114"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 13 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-14T20:43:25.621Z","repository_id":52406297,"created_at":"2025-08-14T20:43:25.621Z","updated_at":"2025-08-14T20:43:25.621Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27725966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-14T02:00:11.348Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-cleaning","data-cleansing","data-preprocessing","data-wrangling","imputation","python","threshold"],"created_at":"2025-12-14T10:06:13.267Z","updated_at":"2025-12-14T10:06:14.466Z","avatar_url":"https://github.com/DataPreprocessing.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cimg 
src=\"https://github.com/DataPreprocessing/DataCleaning/blob/mopidevimu/img/datacleaning.png\" width=\"100%\"/\u003e\n\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/data-cleaning)\n![PyPI - License](https://img.shields.io/pypi/l/data-cleaning)\n![PyPI](https://img.shields.io/pypi/v/data-cleaning)\n![GitHub repo size](https://img.shields.io/github/repo-size/DataPreprocessing/DataCleaning)\n\n\u003ch1 align=\"center\"\u003eDATA CLEANING\u003c/h1\u003e\n## Description\n\u003cp\u003eIn any machine learning process, data preprocessing is the primary step, in which raw/unclean data are transformed \ninto clean data so that machine learning algorithms can be applied at a later stage. This Python package makes \ndata preprocessing very easy, in just two lines of code. All you have to do is provide the raw data (a CSV file); this library\nwill clean your data and return the cleaned dataframe, on which you can then apply feature engineering, \nfeature selection, and modeling.\n\n- What does this do?\n    * Cleans special characters\n    * Removes duplicates\n    * Fixes abnormalities in column names \n    * Imputes the data (categorical \u0026 numerical)\n    \n\u003c/p\u003e\n\n## Data Cleaning\n\n\u003cp\u003eData-cleaning is a Python package for data preprocessing. It cleans the CSV file and returns the \u003cb\u003ecleaned data frame\u003c/b\u003e. \nIt handles imputation, removing duplicates, replacing special characters, and more.\u003c/p\u003e\n\n## How to use:\n\nStep 1:\n  Install the library\n\n````shell\npip install data-cleaning\n````\nStep 2:\n\n  Import the library and specify the path of the CSV file. 
\n````python\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv')\ncleaned_df = dp.start_cleaning()\n````\n\nThere are some optional parameters you can specify, as listed below.\n\n## Usage:\n\n````python\nfrom datacleaning import DataCleaning\n\nDataCleaning(file_upload='filename.csv', separator=\",\", row_threshold=None, col_threshold=None,\n         special_character=None, action=None, ignore_columns=None, imputation_type=\"RDF\")\n````\n\n## Parameters\n\n------\n\n| Parameter | Default Value | Limit | Example |\n| ------ | ------ | ------ | ------ |\n| file_upload | ***none*** | Provide a CSV file. | filename.csv |\n| separator | ***,*** | Separator used in the CSV file | ***;*** |\n| row_threshold | ***none*** | 0 to 100 | 80 | \n| col_threshold | ***none*** | 0 to 100 | 80 | \n| special_character | Check the list below | Specify a character \u003cbr\u003e that is not listed in default_list (see below) | [ '$' , '?' ] | \n| action | ***none*** | ***add*** or ***remove*** | add | \n| ignore_columns | ***none*** | Provide a list of column names \u003cbr\u003e to exclude from the special-character operation. | [ 'column1', 'column2' ] | \n| imputation_type | ***RDF*** | Select your preferred imputation: \u003cbr\u003e ***RDF***, ***KNN***, ***mean***, ***median***, ***most_frequent***, ***constant***. 
| KNN | \n\n\n## Examples of using parameters\n\n### - Appending extra special characters to the existing default_list\n\nThe DEFAULT SPECIAL CHARACTERS included in the package are shown below:\n\n````python\ndefault_list = [\"!\", '\"', \"#\", \"%\", \"\u0026\", \"'\", \"(\", \")\",\n                  \"*\", \"+\", \",\", \"-\", \".\", \"/\", \":\", \";\", \"\u003c\",\n                  \"=\", \"\u003e\", \"?\", \"@\", \"[\", \"\\\\\", \"]\", \"^\", \"_\",\n                  \"`\", \"{\", \"|\", \"}\", \"~\", \"–\", \"//\", \"%*\", \":/\", \".;\", \"Ø\", \"§\", '$', \"£\"]\n````\nTo remove special characters, for example \"?\" and \"%\":\n\n\u003ci\u003eNote: Do not forget to set \u003cb\u003e action = 'remove' \u003c/b\u003e\u003c/i\u003e\n\n````python\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv', special_character=['?', '%'], action='remove')\ncleaned_df = dp.start_cleaning()\n````\nTo add a special character that is not in the default_list given above, for example \"é\":\n\n\u003ci\u003eNote: Do not forget to set \u003cb\u003e action = 'add' \u003c/b\u003e\u003c/i\u003e\n\n````python\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv', special_character=['é'], action='add')\ncleaned_df = dp.start_cleaning()\n````\n\n### - Ignoring particular columns and adding a special character\nSay, for example, the columns \"timestamp\" and \"date\" need to be ignored and the special character 'é' needs to be added:\n\n````python\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv', special_character=['é'],\n              action='add', ignore_columns=['timestamp', 'date'])\ncleaned_df = dp.start_cleaning()\n````\n\n### - Changing thresholds to remove rows/columns whose percentage of null values exceeds the given value\n\n````python\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv', 
row_threshold=50, col_threshold=90)\ncleaned_df = dp.start_cleaning()\n````    \n\n### - Imputation methods available\n\n  - RDF (RandomForest) -\u003e (DEFAULT)\n  - KNN (k-nearest neighbors)\n  - mean\n  - median\n  - most_frequent\n  - constant\n  \n````python\n# Example for KNN imputation.\nfrom datacleaning import DataCleaning\n\ndp = DataCleaning(file_upload='filename.csv', imputation_type='KNN')\ncleaned_df = dp.start_cleaning()\n````\n\n\u003ch2 align=\"center\"\u003e \u003e\u003e THANK YOU \u003c\u003c \u003c/h2\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatapreprocessing%2Fdatacleaning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatapreprocessing%2Fdatacleaning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatapreprocessing%2Fdatacleaning/lists"}