Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aflah02/cleansetext
This is a simple library to help you clean your textual data
cleaning-data nlp preprocessing pypi text
- Host: GitHub
- URL: https://github.com/aflah02/cleansetext
- Owner: aflah02
- License: mit
- Created: 2022-12-27T02:19:18.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-01-02T14:18:47.000Z (almost 2 years ago)
- Last Synced: 2024-11-13T15:23:18.855Z (7 days ago)
- Topics: cleaning-data, nlp, preprocessing, pypi, text
- Language: Python
- Homepage: https://pypi.org/project/cleansetext/
- Size: 129 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# CleanseText
![](https://github.com/aflah02/cleansetext/actions/workflows/python-publish.yml/badge.svg)
![](https://github.com/aflah02/cleansetext/actions/workflows/python-package.yml/badge.svg)

This is a simple library to help you clean your textual data.
## Why do I need this?
Honestly, there are several packages out there that do similar things, but they have never worked well for my use cases or lacked features I need, so I decided to make my own.
The API is designed for readability: I don't hesitate to create functions even for trivial tasks, because they make it easier to reach the end goal.
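To illustrate the one-step-per-task design, here is a minimal sketch of the pipeline pattern the library follows. The class and method names below are hypothetical stand-ins, not cleansetext's actual API:

```python
# Hypothetical sketch of a step/pipeline pattern: each step is a small
# callable that does exactly one thing, and the pipeline applies them in order.

class LowercaseStep:
    """Lowercase every token."""
    def __call__(self, tokens):
        return [t.lower() for t in tokens]

class RemoveEmptyStep:
    """Drop empty tokens."""
    def __call__(self, tokens):
        return [t for t in tokens if t]

class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps

    def process(self, tokens):
        # Feed each step's output into the next step
        for step in self.steps:
            tokens = step(tokens)
        return tokens

pipe = SimplePipeline([LowercaseStep(), RemoveEmptyStep()])
print(pipe.process(["Hello", "", "WORLD"]))  # ['hello', 'world']
```

Keeping each step this small is what makes a pipeline definition read like a checklist of cleaning operations.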
## How to Install?
`pip install cleansetext`
## Sample usage
```python
from cleansetext.pipeline import Pipeline
from cleansetext.steps import *
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()

# Create a pipeline with a list of preprocessing steps
pipeline = Pipeline([
    RemoveEmojis(),
    RemoveAllPunctuations(),
    RemoveTokensWithOnlyPunctuations(),
    ReplaceURLsandHTMLTags(),
    ReplaceUsernames(),
    RemoveWhiteSpaceOrChunksOfWhiteSpace()
], track_diffs=True)

# Process text
text = "@Mary I hate you and everything about you ...... 🎉🎉 google.com"
text = tk.tokenize(text)

print(text)
# Output: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com']

print(pipeline.process(text))
# Output:
# ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']

pipeline.explain(show_diffs=True)
# Output:
# Step 1: Remove emojis from text | Language: en
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 2: Remove all punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 3: Remove tokens with only punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com']
# Step 4: Remove URLs and HTML tags from a sentence | Replace with:
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
# Step 5: Remove usernames from a sentence | Replace with:
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', ''] -> ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
# Step 6: Remove whitespace from a sentence or chunks of whitespace
# Diff: ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', ''] -> ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
```
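The `track_diffs=True` flag lets `explain(show_diffs=True)` print each step's before/after token lists. A simplified sketch of that kind of bookkeeping is shown below; this is illustrative only and not cleansetext's actual implementation:

```python
import string

# Hypothetical diff-tracking pipeline: record each step's input and output
# so they can be replayed later by explain().

class DiffTrackingPipeline:
    def __init__(self, steps, track_diffs=False):
        self.steps = steps
        self.track_diffs = track_diffs
        self.diffs = []

    def process(self, tokens):
        self.diffs = []
        for step in self.steps:
            before = list(tokens)
            tokens = step(tokens)
            if self.track_diffs:
                # Store a (before, after) snapshot for this step
                self.diffs.append((before, list(tokens)))
        return tokens

    def explain(self, show_diffs=False):
        for i, step in enumerate(self.steps, 1):
            print(f"Step {i}: {step.__doc__}")
            if show_diffs and i <= len(self.diffs):
                before, after = self.diffs[i - 1]
                print(f"# Diff: {before} -> {after}")

def strip_punct_tokens(tokens):
    """Remove tokens made only of punctuation"""
    return [t for t in tokens
            if not all(c in string.punctuation for c in t)]

pipe = DiffTrackingPipeline([strip_punct_tokens], track_diffs=True)
print(pipe.process(["hi", "...", "there"]))  # ['hi', 'there']
pipe.explain(show_diffs=True)
```

Snapshotting the token list at every step costs memory proportional to pipeline length, which is why opt-in via a flag is a reasonable design.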