https://github.com/rainergo/nlp-cleantext
Part of a larger NLP Machine Learning project. Python scripts to clean and preprocess raw, unprocessed and messy text that mostly use regular expressions (Python re package, Python 3.11).
nlp re regular-expressions text-processing
- Host: GitHub
- URL: https://github.com/rainergo/nlp-cleantext
- Owner: rainergo
- Created: 2023-09-11T08:36:45.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2023-10-29T19:30:31.000Z (almost 2 years ago)
- Last Synced: 2025-01-28T22:41:27.341Z (8 months ago)
- Topics: nlp, re, regular-expressions, text-processing
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NLP: Clean raw and messy text with regular expressions
## Background
This code is part of a larger NLP Machine Learning project. To train or fine-tune a Machine Learning model, or to make predictions with it,
the input text first needs to be cleaned before it can be tokenized and used.

## General Info
The code can be used to clean and preprocess raw, unprocessed and messy text and mostly uses regular
expressions (Python re package, Python 3.11) to do that.
In "**main.py**", there is a messy sample text to be cleaned. The methods in the class CleanText in "**funcs/clean.py**" transform
those parts of the text that are found by the compiled re objects/patterns defined in "**funcs/re_patterns.py**".
These patterns and functions can be adjusted to specific needs.
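
The interplay described above (compiled patterns defined in one place, class methods applying them) could look roughly like the following minimal sketch. The pattern names, the class `CleanTextSketch` and its methods are hypothetical and are not taken from the repository; the actual definitions live in "**funcs/re_patterns.py**" and "**funcs/clean.py**".

```python
import re

# Hypothetical compiled patterns, in the spirit of funcs/re_patterns.py
# (assumed names, not the repository's actual definitions).
URL_PATTERN = re.compile(r"https?://\S+")
MULTI_SPACE_PATTERN = re.compile(r"[ \t]{2,}")


class CleanTextSketch:
    """Illustrative stand-in for CleanText; method names are assumptions."""

    def __init__(self, text: str) -> None:
        self.text = text

    def remove_urls(self) -> "CleanTextSketch":
        # Delete every substring matched by the compiled URL pattern.
        self.text = URL_PATTERN.sub("", self.text)
        return self

    def collapse_spaces(self) -> "CleanTextSketch":
        # Replace runs of spaces/tabs with a single space and trim the ends.
        self.text = MULTI_SPACE_PATTERN.sub(" ", self.text).strip()
        return self


cleaned = (
    CleanTextSketch("Visit   https://example.com  for    details.")
    .remove_urls()
    .collapse_spaces()
    .text
)
print(cleaned)
```

Compiling the patterns once at module level and reusing them inside the methods keeps the regular expressions in a single place, which is what makes them easy to adjust.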
In addition to the Python re package, some other Python string functions (such as "maketrans") are used.

## Setup
The **main.py** script contains sample text in the variable "*messy_text*".

1. Go to **main.py** and paste the text you want cleaned into the variable "*messy_text*", then run the script. The cleaned text will be printed.
1. Adjust the class methods in "**funcs/clean.py**" and the regular expressions in "**funcs/re_patterns.py**" according to your needs.
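
As a rough illustration of the kind of adjustment the last step refers to, the sketch below adds one custom compiled pattern and one translation table built with `str.maketrans`. The names `EMAIL_PATTERN`, `QUOTE_TABLE` and `clean_custom` are assumptions made for this example and do not exist in the repository; the e-mail pattern is deliberately simple and will not cover every valid address.

```python
import re

# Hypothetical custom pattern: matches simple e-mail-like tokens.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

# Translation table: map typographic quotes to ASCII and drop zero-width spaces.
QUOTE_TABLE = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2019": "'", "\u200b": None}
)


def clean_custom(text: str) -> str:
    # Character-level replacements first, then the regex substitution.
    text = text.translate(QUOTE_TABLE)
    return EMAIL_PATTERN.sub("<EMAIL>", text)


messy_text = "\u201cWrite to jane.doe@example.org\u201d, she said. It\u2019s easy."
print(clean_custom(messy_text))
```

A translation table suits one-to-one character replacements, while a compiled pattern is the better fit for anything that depends on context or length.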