https://github.com/hardchor/ai-text-cleaner
- Host: GitHub
- URL: https://github.com/hardchor/ai-text-cleaner
- Owner: hardchor
- Created: 2025-05-05T08:36:39.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-05T09:12:23.000Z (about 1 year ago)
- Last Synced: 2025-05-05T09:50:55.491Z (about 1 year ago)
- Language: Python
- Size: 33.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AI Text Cleaner
A simple web UI, CLI, and Python script for normalizing text, which is often useful as a preprocessing step for AI models or text analysis tasks. It cleans up common typographic inconsistencies.

## Features
The script performs the following normalization steps:
1. **Replace Dashes:** Converts em-dash (—, U+2014), en-dash (–, U+2013), non-breaking hyphen (‑, U+2011), and minus sign (−, U+2212) to standard hyphens (-).
2. **Normalize Quotes:** Converts curly double quotes (“ ”, U+201C/U+201D) and angle double quotes (« » U+00AB/U+00BB) to straight double quotes ("). Converts curly single quotes (‘ ’, U+2018/U+2019) and angle single quotes (‹ › U+2039/U+203A) to straight single quotes (').
3. **Strip Special Spaces:** Removes non-breaking spaces (U+00A0), narrow non-breaking spaces (U+202F), and zero-width spaces (U+200B).
4. **Replace Ellipsis:** Converts single-character ellipsis (…, U+2026) to three periods (...).
5. **Replace Ligatures:** Replaces ﬁ (U+FB01) → 'fi', ﬂ (U+FB02) → 'fl', ﬀ (U+FB00) → 'ff', ﬃ (U+FB03) → 'ffi', ﬄ (U+FB04) → 'ffl'.
6. **Replace Soft Hyphens:** Removes soft hyphens (U+00AD).
7. **Replace Bullets:** Converts bullets (•, U+2022) to hyphens (-).
8. **Collapse Whitespace (optional):** Collapses runs of whitespace characters into a single space. This step is currently commented out in the code, so it does not run by default.
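
The steps above can be sketched with a single translation table. This is a minimal illustration and not the repository's actual code: the names `REPLACEMENTS` and `normalize_text` and the `collapse_whitespace` parameter are assumptions made for this example.

```python
import re

# Hypothetical mapping built from the feature list above (not the project's code).
REPLACEMENTS = {
    # 1. Dashes -> ASCII hyphen
    "\u2014": "-", "\u2013": "-", "\u2011": "-", "\u2212": "-",
    # 2. Double and single quotes -> straight quotes
    "\u201c": '"', "\u201d": '"', "\u00ab": '"', "\u00bb": '"',
    "\u2018": "'", "\u2019": "'", "\u2039": "'", "\u203a": "'",
    # 3. Special spaces are removed, as the feature list states
    "\u00a0": "", "\u202f": "", "\u200b": "",
    # 4. Ellipsis -> three periods
    "\u2026": "...",
    # 5. Ligatures -> plain letters
    "\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff",
    "\ufb03": "ffi", "\ufb04": "ffl",
    # 6. Soft hyphen is removed
    "\u00ad": "",
    # 7. Bullet -> hyphen
    "\u2022": "-",
}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_text(text: str, collapse_whitespace: bool = False) -> str:
    """Apply the character replacements; optionally collapse whitespace (step 8)."""
    text = text.translate(_TABLE)
    if collapse_whitespace:
        # Off by default, mirroring the step that is commented out upstream.
        text = re.sub(r"\s+", " ", text)
    return text
```

For example, `normalize_text("\u201cHi\u201d \u2014 ok\u2026")` returns `"Hi" - ok...`. A translation table handles every single-character rule in one pass over the text, which is why the sketch uses it instead of chained `str.replace` calls.
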
## Requirements
- Python 3
No external libraries are required.
## Usage
### Web UI
Use Docker Compose to build and run services:
```bash
docker-compose up --build
```
Then, access the app at `http://localhost:8501`.
### Command Line Interface
The script can read text from a file specified as a command-line argument or from standard input. The normalized text is always written to standard output.
**1. From a file:**
```bash
python main.py input.txt > output.txt
```
Replace `input.txt` with the path to your text file. The normalized output will be saved to `output.txt`.
**2. From standard input (e.g., piping):**
```bash
cat input.txt | python main.py > output.txt
```
Or type directly into the terminal, then press Ctrl+D (Ctrl+Z followed by Enter on Windows) to signal end-of-input:
```bash
python main.py
^D
```
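The file-or-stdin behavior described above could be implemented along the following lines. This is a sketch under assumptions, not the contents of `main.py`: the function name `run` and its `transform` parameter are hypothetical, and the identity default stands in for the real normalization step.

```python
from typing import Callable, Sequence, TextIO


def run(
    argv: Sequence[str],
    stdin: TextIO,
    stdout: TextIO,
    transform: Callable[[str], str] = lambda text: text,
) -> None:
    """Read from the file named in argv[1] if given, otherwise from stdin."""
    if len(argv) > 1:
        # A path was passed on the command line: read the whole file.
        with open(argv[1], encoding="utf-8") as handle:
            text = handle.read()
    else:
        # No path: consume standard input until end-of-input.
        text = stdin.read()
    # Output always goes to the given stream (standard output in a real CLI).
    stdout.write(transform(text))
```

A script entry point would call `run(sys.argv, sys.stdin, sys.stdout)` after `import sys`. Passing the streams in as arguments keeps the function easy to test with `io.StringIO`.
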
## Building an Executable
You can build a standalone executable with PyInstaller, run through `uvx` (uv's tool runner) so that PyInstaller does not need to be installed separately. The command below writes the built binary to `~/.local/bin`:
```bash
uvx pyinstaller normalize.py \
--onefile \
--name normalize \
--distpath ~/.local/bin
```
After building, ensure `~/.local/bin` is in your `PATH` to run `normalize` directly from the command line.
## License
This project does not currently have a license. Consider adding one if distributing.