{"id":15157828,"url":"https://github.com/cego669/dirtycategoriesencoding","last_synced_at":"2026-02-11T16:02:29.409Z","repository":{"id":251266516,"uuid":"836895616","full_name":"cego669/DirtyCategoriesEncoding","owner":"cego669","description":"Repository containing two classes (StringAgglomerativeEncoder and StringDistanceEncoder) useful for grouping or visualizing the distance between dirty categorical variables. They are compatible with the scikit-learn API.","archived":false,"fork":false,"pushed_at":"2024-08-01T23:44:52.000Z","size":1132,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T14:47:14.979Z","etag":null,"topics":["category","clustering","dimensionality-reduction","dirty","hierarchical-clustering","machine-learning","scikit-learn","singular-value-decomposition","svd"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cego669.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-01T19:30:51.000Z","updated_at":"2024-08-01T23:44:55.000Z","dependencies_parsed_at":"2024-08-01T22:02:00.637Z","dependency_job_id":"744c8b31-b384-490f-9246-f1cb5f29de9f","html_url":"https://github.com/cego669/DirtyCategoriesEncoding","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"c1252092f92cf6db17a47f8bde989372a4809833"},"previous_names":["cego669/dirtycategoriesencoding"],"tags_count":0,"template":false,"template_full_name":null,"reposi
tory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cego669%2FDirtyCategoriesEncoding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cego669%2FDirtyCategoriesEncoding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cego669%2FDirtyCategoriesEncoding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cego669%2FDirtyCategoriesEncoding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cego669","download_url":"https://codeload.github.com/cego669/DirtyCategoriesEncoding/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675631,"owners_count":20977376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["category","clustering","dimensionality-reduction","dirty","hierarchical-clustering","machine-learning","scikit-learn","singular-value-decomposition","svd"],"created_at":"2024-09-26T20:03:59.340Z","updated_at":"2026-02-11T16:02:29.379Z","avatar_url":"https://github.com/cego669.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- ABOUT THE PROJECT --\u003e\n## About The Project\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"dirty_categories_encoding.png\" alt=\"dirtycategoriesencoding\" title=\"Dirty Categories Visualization and Clustering\" width=\"1000\"/\u003e\n\u003c/p\u003e\n\nInspired by the methodology exposed in the article [\"Similarity encoding for learning with dirty categorical 
variables\"](https://link.springer.com/article/10.1007/s10994-018-5724-2), I wrote two Python classes, **compatible with scikit-learn**, for handling \"dirty categories\": categories with typos or a complex, implicit hierarchy.\n\n\"Dirty categories\" are a challenge in both the data cleaning and modeling stages. In modeling, they inflate the cardinality of a categorical variable, which is especially harmful with methods such as One Hot Encoding, where each distinct spelling becomes its own column. In that regard:\n\n- The `StringAgglomerativeEncoder` class **clusters similar \"dirty categories\"** and, thus, **can serve to speed up and automate the data cleaning process**. The class vectorizes the unique categories using the n-gram technique and computes the pairwise distance between the vectors using the Dice metric. Hierarchical Clustering is then applied to the resulting distance matrix.\n\n- The `StringDistanceEncoder` class, instead of calculating the distance matrix, uses the n-gram vectors representing each category to **extract components by the Singular Value Decomposition (SVD) method**, which is commonly used for dimensionality reduction in machine learning. If two components are extracted, the \"dirty categories\" can be projected onto a plot to **visualize the distance between them**!\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- GETTING STARTED --\u003e\n## Getting Started\n\nTo start using the classes, download the `.py` files (`StringAgglomerativeEncoder.py` and `StringDistanceEncoder.py`) and move them to your working directory. 
Then just import the classes as follows:\n\n```python \nfrom StringAgglomerativeEncoder import StringAgglomerativeEncoder\n```\n\nor...\n\n```python \nfrom StringDistanceEncoder import StringDistanceEncoder\n```\n\n\u003c!-- USAGE EXAMPLES --\u003e\n## Usage\n\nYou can find examples of how to properly use the classes in this repository by accessing the example notebooks.\n\n- The `prediction_example.ipynb` notebook exemplifies the use of the `StringDistanceEncoder` class for prediction problems and compares the performance of this method with what would be obtained using the `OneHotEncoder` class.\n- The notebook in `visualization_and_clustering_example.ipynb` exemplifies the use of the `StringAgglomerativeEncoder` class for clustering categories with typos. Then, the clusters are visualized in a two-dimensional space through the use of the `StringDistanceEncoder` class.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- CONTRIBUTING --\u003e\n## Contributing\n\nContributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.\n\nIf you have a suggestion that would make this project better, please fork the repo and create a pull request. You can also simply open an issue with the tag \"enhancement\".\nDon't forget to give the project a star! **Thanks!**\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- LICENSE --\u003e\n## License\n\nDistributed under the MIT License. 
See `LICENSE` for more information.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- CONTACT --\u003e\n## Contact\n\nCarlos Eduardo Gonçalves de Oliveira - [linkedin](https://www.linkedin.com/in/cego669/) - carlosedgonc@gmail.com\n\nProject Link: [https://github.com/cego669/DirtyCategoriesEncoding](https://github.com/cego669/DirtyCategoriesEncoding)\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcego669%2Fdirtycategoriesencoding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcego669%2Fdirtycategoriesencoding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcego669%2Fdirtycategoriesencoding/lists"}