Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aflah02/cleansetext
This is a simple library to help you clean your textual data
cleaning-data nlp preprocessing pypi text
- Host: GitHub
- URL: https://github.com/aflah02/cleansetext
- Owner: aflah02
- License: mit
- Created: 2022-12-27T02:19:18.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-01-02T14:18:47.000Z (almost 2 years ago)
- Last Synced: 2024-11-13T15:23:18.855Z (7 days ago)
- Topics: cleaning-data, nlp, preprocessing, pypi, text
- Language: Python
- Homepage: https://pypi.org/project/cleansetext/
- Size: 129 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# CleanseText
![](https://github.com/aflah02/cleansetext/actions/workflows/python-publish.yml/badge.svg)
![](https://github.com/aflah02/cleansetext/actions/workflows/python-package.yml/badge.svg)

This is a simple library to help you clean your textual data.
## Why do I need this?
Honestly, there are several packages out there that do similar things, but they have never worked well for my use cases or lacked features I need, so I decided to make my own.
The API is designed for readability: I don't hesitate to create functions even for trivial tasks, because they make it easier to reach the end goal.
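To illustrate the one-step-per-task design, here is a minimal sketch of the pipeline pattern the library follows. The class and method names below are hypothetical stand-ins, not cleansetext's actual API:

```python
# Hypothetical sketch of a step/pipeline pattern: each step is a small
# callable that does exactly one thing, and the pipeline applies them in order.

class LowercaseStep:
    """Lowercase every token."""
    def __call__(self, tokens):
        return [t.lower() for t in tokens]

class RemoveEmptyStep:
    """Drop empty tokens."""
    def __call__(self, tokens):
        return [t for t in tokens if t]

class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps

    def process(self, tokens):
        # Feed each step's output into the next step
        for step in self.steps:
            tokens = step(tokens)
        return tokens

pipe = SimplePipeline([LowercaseStep(), RemoveEmptyStep()])
print(pipe.process(["Hello", "", "WORLD"]))  # ['hello', 'world']
```

Keeping each step this small is what makes a pipeline definition read like a checklist of cleaning operations.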
## How to Install?
`pip install cleansetext`
## Sample usage
```python
from cleansetext.pipeline import Pipeline
from cleansetext.steps import *
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()

# Create a pipeline with a list of preprocessing steps
pipeline = Pipeline([
    RemoveEmojis(),
    RemoveAllPunctuations(),
    RemoveTokensWithOnlyPunctuations(),
    ReplaceURLsandHTMLTags(),
    ReplaceUsernames(),
    RemoveWhiteSpaceOrChunksOfWhiteSpace()
], track_diffs=True)

# Process text
text = "@Mary I hate you and everything about you ...... 🎉🎉 google.com"
text = tk.tokenize(text)

print(text)
# Output: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com']

print(pipeline.process(text))
# Output:
# ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']

pipeline.explain(show_diffs=True)
# Output:
# Step 1: Remove emojis from text | Language: en
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 2: Remove all punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 3: Remove tokens with only punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com']
# Step 4: Remove URLs and HTML tags from a sentence | Replace with:
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
# Step 5: Remove usernames from a sentence | Replace with:
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', ''] -> ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
# Step 6: Remove whitespace from a sentence or chunks of whitespace
# Diff: ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', ''] -> ['', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '']
```
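The `track_diffs=True` flag lets `explain(show_diffs=True)` print each step's before/after token lists. A simplified sketch of that kind of bookkeeping is shown below; this is illustrative only and not cleansetext's actual implementation:

```python
import string

# Hypothetical diff-tracking pipeline: record each step's input and output
# so they can be replayed later by explain().

class DiffTrackingPipeline:
    def __init__(self, steps, track_diffs=False):
        self.steps = steps
        self.track_diffs = track_diffs
        self.diffs = []

    def process(self, tokens):
        self.diffs = []
        for step in self.steps:
            before = list(tokens)
            tokens = step(tokens)
            if self.track_diffs:
                # Store a (before, after) snapshot for this step
                self.diffs.append((before, list(tokens)))
        return tokens

    def explain(self, show_diffs=False):
        for i, step in enumerate(self.steps, 1):
            print(f"Step {i}: {step.__doc__}")
            if show_diffs and i <= len(self.diffs):
                before, after = self.diffs[i - 1]
                print(f"# Diff: {before} -> {after}")

def strip_punct_tokens(tokens):
    """Remove tokens made only of punctuation"""
    return [t for t in tokens
            if not all(c in string.punctuation for c in t)]

pipe = DiffTrackingPipeline([strip_punct_tokens], track_diffs=True)
print(pipe.process(["hi", "...", "there"]))  # ['hi', 'there']
pipe.explain(show_diffs=True)
```

Snapshotting the token list at every step costs memory proportional to pipeline length, which is why opt-in via a flag is a reasonable design.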