https://github.com/ternaus/ternaus-cleantext
Cleans text as in the CLIP model
python text-cleaning
- Host: GitHub
- URL: https://github.com/ternaus/ternaus-cleantext
- Owner: ternaus
- License: mit
- Created: 2022-12-08T19:19:28.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-08T21:03:22.000Z (almost 3 years ago)
- Last Synced: 2025-03-26T15:43:04.619Z (7 months ago)
- Topics: python, text-cleaning
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Cleantextclip
Library to prepare text for machine learning and NLP tasks. It originated from the text preparation of the CLIP model, with a few extra rules added.

## Installation
```bash
pip install -U ternaus_cleantext
```

Cleans text in a way similar to, but stricter than, the CLIP model:
1. Escapes HTML characters
2. Removes HTML tags
3. Removes URLs
4. Removes extra whitespace
5. Converts text to lower case

```python
from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com bold"))
```
returns
`this is a test bold`
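
To make the five listed steps concrete, here is a minimal standard-library sketch of a comparable pipeline. It is not the package's implementation: the function name `clean_text_sketch` and the three regular expressions are hypothetical, and step 1 is read here as decoding HTML entities such as `&amp;`, which is an assumption about what the README means.

```python
import html
import re

# Hypothetical patterns for illustration only; the library's real rules may differ.
TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http(s) links and bare www. links
WHITESPACE_RE = re.compile(r"\s+")         # any run of whitespace


def clean_text_sketch(text: str) -> str:
    """Approximate the five documented steps with the standard library."""
    # Step 1 (assumed meaning): decode HTML entities; twice, to catch double encoding.
    text = html.unescape(html.unescape(text))
    # Step 2: replace HTML tags with a space so neighbouring words stay separated.
    text = TAG_RE.sub(" ", text)
    # Step 3: replace URLs with a space.
    text = URL_RE.sub(" ", text)
    # Step 4: collapse whitespace runs and trim both ends.
    text = WHITESPACE_RE.sub(" ", text).strip()
    # Step 5: convert to lower case.
    return text.lower()


if __name__ == "__main__":
    print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

Tags are removed before URLs so that a link immediately followed by a closing tag is not swallowed together with the tag text. For real use, prefer the package's own `clean_text`, whose rules may be stricter than this sketch.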