{"id":18734666,"url":"https://github.com/elysian01/data-purifier","last_synced_at":"2025-10-04T06:08:58.166Z","repository":{"id":43183950,"uuid":"347637048","full_name":"Elysian01/Data-Purifier","owner":"Elysian01","description":"A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.","archived":false,"fork":false,"pushed_at":"2022-05-06T21:19:03.000Z","size":7880,"stargazers_count":45,"open_issues_count":2,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-02T10:47:50.944Z","etag":null,"topics":["data-analysis","data-cleaning","data-cleaning-pipeline","data-preprocessing","data-science","data-visualization","datapurifier","eda","exploratory-data-analysis","jupyter","python-lib","python-library","python3"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/data-purifier/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Elysian01.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-14T12:56:50.000Z","updated_at":"2025-09-15T16:48:41.000Z","dependencies_parsed_at":"2022-08-31T21:01:45.443Z","dependency_job_id":null,"html_url":"https://github.com/Elysian01/Data-Purifier","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/Elysian01/Data-Purifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elysian01%2FData-Purifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elysian01%2FData-Purifier/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elysian01%2FData-Purifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elysian01%2FData-Purifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Elysian01","download_url":"https://codeload.github.com/Elysian01/Data-Purifier/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elysian01%2FData-Purifier/sbom","scorecard":{"id":45358,"data":{"date":"2025-08-11","repo":{"name":"github.com/Elysian01/Data-Purifier","commit":"d5f10cb5ac5eea4b45c6e4f9887c1963d4ffc1a5"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 
0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"18 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: 
GHSA-cpwx-vrp4-4pq7","Warn: Project is vulnerable to: PYSEC-2021-66 / GHSA-g3rq-g295-4j3m","Warn: Project is vulnerable to: GHSA-h5c8-rqwp-cp95","Warn: Project is vulnerable to: GHSA-h75v-3vvj-5mfj","Warn: Project is vulnerable to: GHSA-q2x7-8rv6-6q7h","Warn: Project is vulnerable to: PYSEC-2022-288 / GHSA-6hrg-qmvc-2xh8","Warn: Project is vulnerable to: PYSEC-2021-856 / GHSA-5545-2q6w-2gh6","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: PYSEC-2019-108 / GHSA-9fq2-x9r6-wfmf","Warn: Project is vulnerable to: PYSEC-2021-857 / GHSA-f7c7-j99h-c22f","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: GHSA-jxfp-4rvq-9h9m","Warn: Project is vulnerable to: PYSEC-2023-102","Warn: Project is vulnerable to: PYSEC-2023-114"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-14T22:36:29.026Z","repository_id":43183950,"created_at":"2025-08-14T22:36:29.026Z","updated_at":"2025-08-14T22:36:29.026Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278272747,"owners_count":25959634,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-cleaning","data-cleaning-pipeline","data-preprocessing","data-science","data-visualization","datapurifier","eda","exploratory-data-analysis","jupyter","python-lib","python-library","python3"],"created_at":"2024-11-07T15:14:28.992Z","updated_at":"2025-10-04T06:08:58.149Z","avatar_url":"https://github.com/Elysian01.png","language":"Jupyter Notebook","readme":"# Data-Purifier\n\nA Python library for Automated Exploratory Data Analysis, Automated Data Cleaning and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.\n\n[![PyPI version](https://badge.fury.io/py/data-purifier.svg)](https://badge.fury.io/py/data-purifier)\n[![License](https://img.shields.io/pypi/l/ansicolortags.svg)](https://img.shields.io/pypi/l/ansicolortags.svg) \n[![Python 
Version](https://img.shields.io/pypi/pyversions/data-purifier)](https://pypi.org/project/data-purifier/)\n[![PyPi Downloads](https://static.pepy.tech/personalized-badge/data-purifier?period=total\u0026units=international_system\u0026left_color=black\u0026right_color=orange\u0026left_text=Downloads)](https://pepy.tech/project/data-purifier)\n\n\nDemo Output of Auto EDA\n\u003cbr\u003e\u003cbr\u003e\n\u003cimg src = \"./static/demo.gif\" width=\"600px\" height = \"300px\"\u003e\n\n\nTable of Contents\n- [Data-Purifier](#data-purifier)\n  - [Installation](#installation)\n  - [Get Started](#get-started)\n    - [Tutorial](#tutorial)\n    - [Automated EDA for NLP](#automated-eda-for-nlp)\n    - [Automated Data Preprocessing for NLP](#automated-data-preprocessing-for-nlp)\n    - [Automated EDA for Machine Learning](#automated-eda-for-machine-learning)\n    - [Automated Report Generation for Machine Learning](#automated-report-generation-for-machine-learning)\n  - [Example](#example)\n\n\n## Installation\n\n**Prerequisites**\n\n- [Anaconda](https://docs.anaconda.com/anaconda/install/)\n\nTo use Data-Purifier, it is recommended to create a new environment and install the required dependencies.\n\nTo install from PyPI:\n\n```sh\nconda create -n \u003cyour_env_name\u003e python=3.6 anaconda\nconda activate \u003cyour_env_name\u003e # ON WINDOWS: `source activate \u003cyour_env_name\u003e`\n\npip install data-purifier\npython -m spacy download en_core_web_sm\n```\n\nTo install from source:\n\n```sh\ncd \u003cData-Purifier_Destination\u003e\ngit clone https://github.com/Elysian01/Data-Purifier.git\n# or download and unzip https://github.com/Elysian01/Data-Purifier/archive/master.zip\n\nconda create -n \u003cyour_env_name\u003e python=3.6 anaconda\nconda activate \u003cyour_env_name\u003e # ON WINDOWS: `source activate \u003cyour_env_name\u003e`\ncd Data-Purifier\n\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n## Get Started\n\nLoad the 
module\n```python\nimport datapurifier as dp\nfrom datapurifier import Mleda, Nlpeda, Nlpurifier, NLAutoPurifier\n\nprint(dp.__version__)\n```\n\nGet the list of example datasets  \n```python\nprint(dp.get_dataset_names()) # to get all dataset names\nprint(dp.get_text_dataset_names()) # to get all text dataset names\n```\n\nLoad an example dataset by passing one of the dataset names from the list as an argument.\n```python\ndf = dp.load_dataset(\"womens_clothing_e-commerce_reviews\")\n```\n\n### [Tutorial](https://youtu.be/gDI6m1foHD8)\n\n[![Data-Purifier Tutorial](https://img.youtube.com/vi/gDI6m1foHD8/0.jpg)](https://www.youtube.com/watch?v=gDI6m1foHD8)\n\n[Automated NLP Pre-Processing using Data-Purifier Library Blog](https://medium.com/@abhig0209/automated-nlp-pre-processing-using-data-purifier-library-183678fabc8e)\n\n\n### Automated EDA for NLP\n\n**Basic NLP**\n\n* It will check for null rows, drop them (if any), and then perform the following analyses row by row, returning a dataframe containing them:\n   1. Word Count \n   2. Character Count\n   3. Average Word Length\n   4. Stop Word Count\n   5. 
Uppercase Word Count\n\nLater you can also observe the distribution of the above-mentioned analyses just by selecting a column from the dropdown list, and the system will automatically plot it.\n\n* It can also perform `sentiment analysis` on the dataframe row by row, giving the polarity of each sentence (or row); later you can also view the `distribution of polarity`.\n\n**Word Analysis**\n\n* Finds the count of a `specific word` entered by the user in the textbox.\n* Plots a `wordcloud plot`\n* Performs `Unigram, Bigram, and Trigram` analysis, returning a dataframe for each and also showing its respective distribution plot.\n\n**Code Implementation**\n\n\nFor Automated EDA and Automated Data Cleaning of an NLP dataset, load the dataset and pass the dataframe along with the target column containing textual data.\n\n```python\nimport pandas as pd\n\nnlp_df = pd.read_csv(\"./datasets/twitter16m.csv\", header=None, encoding='latin-1')\nnlp_df.columns = [\"tweets\",\"sentiment\"]\n```\n\n**Basic Analysis**\n\nFor basic EDA, pass `basic` as the `analyse` argument in the constructor\n```python\neda = Nlpeda(nlp_df, \"tweets\", analyse=\"basic\")\neda.df\n```\n**Word Analysis**\n\nFor word-based EDA, pass `word` as the `analyse` argument in the constructor\n```python\neda = Nlpeda(nlp_df, \"tweets\", analyse=\"word\")\neda.unigram_df # to view the unigram dataframe\n```\n\n\n### **Automated Data Preprocessing for NLP**\n\n* In automated data preprocessing, the data goes through the following pipeline, and the cleaned dataframe is returned\n    1. Drops null rows\n    2. Converts everything to lowercase\n    3. Removes digits/numbers\n    4. Removes HTML tags\n    5. Converts accented characters to normal letters\n    6. Removes special and punctuation characters\n    7. Removes stop words\n    8. 
Removes multiple spaces\n\n**Code Implementation**\n\nPass in the dataframe along with the name of the column you want to clean\n```python\ncleaned_df = NLAutoPurifier(nlp_df, target = \"tweets\")\n```\n   \n### **Widget Based Automated Data Preprocessing for NLP**\n\n* Here you can choose the preprocessing methods from a GUI\n\n* It provides the following cleaning techniques; just tick a checkbox and the system will automatically perform the operation for you.\n\n| Features                                   | Features                              | Features                         |\n| ------------------------------------------ | ------------------------------------- | -------------------------------- |\n| Drop Null Rows                             | Lower all Words                       | Contraction to Expansion         |\n| Removal of emojis                          | Removal of emoticons                  | Conversion of emoticons to words |\n| Count Urls                                 | Get Word Count                        | Count Mails                      |\n| Conversion of emojis to words              | Remove Numbers and Alphanumeric words | Remove Stop Words                |\n| Remove Special Characters and Punctuations | Remove Mails                          | Remove Html Tags                 |\n| Remove Urls                                | Remove Multiple Spaces                | Remove Accented Characters       |\n\n\n* You can convert words to their base form by selecting either the `stemming` or `lemmatization` option.\n\n* Remove Top Common Word: by giving a range of words, you can `remove top common word`\n  \n* Remove Top Rare Word: by giving a range of words, you can `remove top rare word`\n\nAfter you are done selecting your cleaning methods or techniques, click the `Start Purifying` button to let the magic begin. 
Upon its completion you can access the cleaned dataframe via `\u003cobj\u003e.df`\n\n**Code Implementation**\n\n```python\npure = Nlpurifier(nlp_df, \"tweets\")\n```\n\nView the processed and purified dataframe\n\n```python\npure.df\n```\n\n\n### Automated EDA for Machine Learning\n\n* It gives the shape, the number of categorical and numerical features, a description of the dataset, and information about the number of null values and their respective percentages.\n\n* To help you understand the distribution of the dataset and gain useful insights, many interactive plots are generated; the user can select a desired column and the system will automatically plot it. Plots include\n   1. Count plot\n   2. Correlation plot\n   3. Joint plot\n   4. Pair plot\n   5. Pie plot \n\n**Code Implementation**\n\nLoad the dataset and let the magic of automated EDA begin\n\n```python\ndf = pd.read_csv(\"./datasets/iris.csv\")\nae = Mleda(df)\nae\n```\n\n### Automated Report Generation for Machine Learning\n\nThe report contains a sample of the data, its shape, the number of numerical and categorical features, data uniqueness information, a description of the data, and null-value information.\n\n```python\ndf = pd.read_csv(\"./datasets/iris.csv\")\nreport = MlReport(df)\n```\n\n\n## Example\n[Colab Notebook](https://colab.research.google.com/drive/1J932G1uzqxUHCMwk2gtbuMQohYZsze8U?usp=sharing)\n\nOfficial Documentation: https://cutt.ly/CbFT5Dw\n\nPython Package: https://pypi.org/project/data-purifier/\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felysian01%2Fdata-purifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felysian01%2Fdata-purifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felysian01%2Fdata-purifier/lists"}