{"id":13642622,"url":"https://github.com/jfilter/clean-text","last_synced_at":"2026-01-28T14:06:38.904Z","repository":{"id":41512222,"uuid":"160742633","full_name":"jfilter/clean-text","owner":"jfilter","description":"🧹 Python package for text cleaning","archived":false,"fork":false,"pushed_at":"2023-05-09T09:40:35.000Z","size":161,"stargazers_count":976,"open_issues_count":19,"forks_count":79,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-05-12T06:03:54.557Z","etag":null,"topics":["natural-language-processing","nlp","python","python-package","scraping","text-cleaning","text-normalization","text-preprocessing","user-generated-content"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfilter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-12-06T22:54:15.000Z","updated_at":"2025-05-10T13:52:03.000Z","dependencies_parsed_at":"2024-01-21T03:59:04.921Z","dependency_job_id":"6a0a4463-96fb-4dc8-bd1a-9aa355486b4d","html_url":"https://github.com/jfilter/clean-text","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fclean-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fclean-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fclean-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fclean-text/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/jfilter","download_url":"https://codeload.github.com/jfilter/clean-text/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254292004,"owners_count":22046426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","python","python-package","scraping","text-cleaning","text-normalization","text-preprocessing","user-generated-content"],"created_at":"2024-08-02T01:01:34.039Z","updated_at":"2026-01-28T14:06:38.898Z","avatar_url":"https://github.com/jfilter.png","language":"Python","readme":"# `clean-text` [![Build Status](https://img.shields.io/github/actions/workflow/status/jfilter/clean-text/test.yml)](https://github.com/jfilter/clean-text/actions/workflows/test.yml) [![PyPI](https://img.shields.io/pypi/v/clean-text.svg)](https://pypi.org/project/clean-text/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/clean-text.svg)](https://pypi.org/project/clean-text/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/clean-text)](https://pypistats.org/packages/clean-text)\n\nUser-generated content on the Web and in social media is often dirty. Preprocess your scraped data with `clean-text` to create a normalized text representation. 
For instance, turn this corrupted input:\n\n```txt\nA bunch of \\\\u2018new\\\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).\n\n\n»Yóù àré     rïght \u0026lt;3!«\n```\n\ninto this clean output:\n\n```txt\nA bunch of 'new' references, including [moana](\u003cURL\u003e).\n\n\"you are right \u003c3!\"\n```\n\n`clean-text` uses [ftfy](https://github.com/LuminosoInsight/python-ftfy), [unidecode](https://github.com/takluyver/Unidecode) and numerous hand-crafted rules, i.e., regular expressions.\n\n## Installation\n\nTo install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode) alongside:\n\n```bash\npip install clean-text[gpl]\n```\n\nIf you prefer to avoid the GPL dependency:\n\n```bash\npip install clean-text\n```\n\nNB: This package is named `clean-text` and not `cleantext`.\n\nIf [unidecode](https://github.com/takluyver/Unidecode) is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize) for [transliteration](https://en.wikipedia.org/wiki/Transliteration).\nTransliteration to the closest ASCII symbols involves manual mappings, e.g., `ê` to `e`.\n`unidecode`'s mappings are superior, but `unicodedata`'s are sufficient.\nHowever, you may want to disable this feature altogether depending on your data and use case.\n\nTo make it clear: There are **inconsistencies** between processing text with or without `unidecode`.\n\n## Usage\n\n```python\nfrom cleantext import clean\n\nclean(\"some input\",\n    fix_unicode=True,               # fix various unicode errors\n    to_ascii=True,                  # transliterate to closest ASCII representation\n    lower=True,                     # lowercase text\n    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them\n    no_code=False,                  # replace all code snippets with a special token\n    no_urls=False,                  # replace all URLs with a 
special token\n    no_emails=False,                # replace all email addresses with a special token\n    no_phone_numbers=False,         # replace all phone numbers with a special token\n    no_ip_addresses=False,          # replace all IP addresses with a special token\n    no_file_paths=False,            # replace all file paths with a special token\n    no_numbers=False,               # replace all numbers with a special token\n    no_digits=False,                # replace all digits with a special token\n    no_currency_symbols=False,      # replace all currency symbols with a special token\n    no_punct=False,                 # remove punctuations\n    replace_with_punct=\"\",          # instead of removing punctuations you may replace them\n    exceptions=None,                # list of regex patterns to preserve verbatim\n    replace_with_code=\"\u003cCODE\u003e\",\n    replace_with_url=\"\u003cURL\u003e\",\n    replace_with_email=\"\u003cEMAIL\u003e\",\n    replace_with_phone_number=\"\u003cPHONE\u003e\",\n    replace_with_ip_address=\"\u003cIP\u003e\",\n    replace_with_file_path=\"\u003cFILE_PATH\u003e\",\n    replace_with_number=\"\u003cNUMBER\u003e\",\n    replace_with_digit=\"0\",\n    replace_with_currency_symbol=\"\u003cCUR\u003e\",\n    lang=\"en\"                       # set to 'de' for German special handling\n)\n```\n\nCarefully choose the arguments that fit your task. 
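Each `no_*` option above swaps every match of a pattern for its placeholder token. As a rough illustration of that idea (a hand-rolled regex sketch, not the package's actual, far more robust rules), URL replacement could look like:

```python
import re

# Naive URL pattern -- for illustration only; clean-text's real rules
# are hand-crafted and considerably more robust.
URL_RE = re.compile(r"https?://\S+")

def replace_urls(text: str, token: str = "<URL>") -> str:
    """Replace every URL-like substring with a placeholder token."""
    return URL_RE.sub(token, text)

print(replace_urls("see https://example.com for details"))
# -> see <URL> for details
```

The same substitute-with-token pattern underlies `no_emails`, `no_phone_numbers`, and the other replacement options.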
The default parameters are listed above.\n\n### Preserving patterns with exceptions\n\nUse `exceptions` to protect specific text patterns from being modified during cleaning.\nEach entry is a regex pattern string; all matches are preserved **verbatim** (not lowered, not\ntransliterated — exactly as they appeared in the input).\n\n```python\nfrom cleantext import clean\n\n# Preserve a literal compound word while removing other punctuation\nclean(\"drive-thru and text---cleaning\", no_punct=True, exceptions=[\"drive-thru\"])\n# =\u003e 'drive-thru and textcleaning'\n\n# Preserve all hyphenated compound words using a regex\nclean(\"drive-thru and pick-up\", no_punct=True, exceptions=[r\"\\w+-\\w+\"])\n# =\u003e 'drive-thru and pick-up'\n\n# Multiple exception patterns\nclean(\"drive-thru costs $5\", no_punct=True, no_currency_symbols=True,\n      exceptions=[r\"\\w+-\\w+\", r\"\\$\\d+\"])\n# =\u003e 'drive-thru costs $5'\n```\n\nYou may also use only specific cleaning functions. For this, take a look at the [source code](https://github.com/jfilter/clean-text/blob/main/cleantext/clean.py).\n\n### Cleaning multiple texts in parallel\n\nUse `clean_texts()` to clean a list of strings. 
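Conceptually, `clean_texts()` maps the cleaning function over the list, optionally through a process pool. A simplified stand-in (using a trivial function in place of `clean()`, so the sketch runs without the package installed):

```python
from multiprocessing import Pool

def fake_clean(text: str) -> str:
    """Trivial stand-in for cleantext.clean() -- real cleaning does far more."""
    return text.strip().lower()

def clean_texts_sketch(texts, n_jobs=1):
    """Map the cleaning function over texts; use a worker pool when n_jobs > 1."""
    if n_jobs is None or n_jobs == 1:
        return [fake_clean(t) for t in texts]  # sequential: no pool overhead
    with Pool(processes=n_jobs) as pool:
        return pool.map(fake_clean, texts)

print(clean_texts_sketch(["  Hello ", "WORLD"]))
# -> ['hello', 'world']
```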
Set `n_jobs` to enable parallel processing via Python's built-in `multiprocessing`:\n\n```python\nfrom cleantext import clean_texts\n\n# Sequential (default) — no multiprocessing overhead\nclean_texts([\"text one\", \"text two\", \"text three\"])\n\n# Use all available CPU cores\nclean_texts([\"text one\", \"text two\", \"text three\"], n_jobs=-1)\n\n# Use a specific number of workers\nclean_texts([\"text one\", \"text two\", \"text three\"], n_jobs=4)\n\n# All clean() keyword arguments are supported\nclean_texts(texts, n_jobs=-1, no_urls=True, lang=\"de\", lower=False)\n```\n\n`n_jobs` semantics:\n- `1` or `None` — sequential processing (default, zero overhead)\n- `-1` — use all available CPU cores\n- `-2` — use all cores except one, etc.\n- Any positive integer — use exactly that many workers\n- `0` — raises `ValueError`\n\n### Supported languages\n\nSo far, only English and German are fully supported.\nIt should work for the majority of Western languages.\nIf you need some special handling for your language, feel free to contribute. 
🙃\n\n### Using `clean-text` with `scikit-learn`\n\nThere is also a **scikit-learn**-compatible API to use in your pipelines.\nAll of the parameters above work here as well.\n\n```bash\npip install clean-text[gpl,sklearn]  # with the GPL-licensed unidecode\npip install clean-text[sklearn]      # without unidecode\n```\n\n```python\nfrom cleantext.sklearn import CleanTransformer\n\ncleaner = CleanTransformer(no_punct=False, lower=False)\n\ncleaner.transform(['Happily clean your text!', 'Another Input'])\n```\n\n## Development\n\n[Use poetry.](https://python-poetry.org/)\n\nSee [RELEASING.md](RELEASING.md) for how to publish a new version.\n\n## Contributing\n\nIf you have a **question**, have found a **bug**, or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/clean-text/issues).\n\n**Pull requests** are especially welcome when they fix bugs or improve the code quality.\n\nIf you don't like the output of `clean-text`, consider adding a [test](https://github.com/jfilter/clean-text/tree/main/tests) with your specific input and desired output.\n\n## Related Work\n\n### Generic text cleaning packages\n\n-   https://github.com/pudo/normality\n-   https://github.com/davidmogar/cucco\n-   https://github.com/lyeoni/prenlp\n-   https://github.com/s/preprocessor\n-   https://github.com/artefactory/NLPretext\n-   https://github.com/cbaziotis/ekphrasis\n\n### Full-blown NLP libraries with some text cleaning\n\n-   https://github.com/chartbeat-labs/textacy\n-   https://github.com/jbesomi/texthero\n\n### Remove or replace strings\n\n-   https://github.com/vi3k6i5/flashtext\n-   https://github.com/ddelange/retrie\n\n### Detect dates\n\n-   https://github.com/scrapinghub/dateparser\n\n### Clean massive Common Crawl data\n\n-   https://github.com/facebookresearch/cc_net\n\n## Acknowledgements\n\nBuilt upon the work by [Burton DeWilde](https://github.com/bdewilde) for [Textacy](https://github.com/chartbeat-labs/textacy).\n\n## 
License\n\nApache\n","funding_links":[],"categories":["Python","Vorverarbeitungstools"],"sub_categories":["Textnormalisierung"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fclean-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfilter%2Fclean-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fclean-text/lists"}