https://github.com/pszemraj/rehuman
Unicode-safe text cleaning & typographic normalization for Rust
https://github.com/pszemraj/rehuman
no-emoji-in-code rust-crate text-normalization text-processing unicode
Last synced: 28 days ago
JSON representation
Unicode-safe text cleaning & typographic normalization for Rust
- Host: GitHub
- URL: https://github.com/pszemraj/rehuman
- Owner: pszemraj
- License: mit
- Created: 2025-10-26T03:03:39.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-10-27T12:30:17.000Z (4 months ago)
- Last Synced: 2026-01-13T14:58:08.417Z (30 days ago)
- Topics: no-emoji-in-code, rust-crate, text-normalization, text-processing, unicode
- Language: Rust
- Homepage: https://crates.io/crates/rehuman
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# rehuman
Unicode-safe text cleaning & normalization for Rust.
Strip invisible characters, normalize typography, and enforce consistent formatting-ideal for text sourced from web scraping, user input, or [LLMs](https://archive.fo/PrRYl).
> This crate is a Rust rewrite and expansion of [humanize-ai-lib](https://github.com/Nordth/humanize-ai-lib) by [Nordth](https://github.com/Nordth).
## Why rehuman?
Untrusted text often contains:
- Zero-width spaces and control characters that break parsers
- Mixed quote styles that defeat string matching
- Non-breaking spaces that masquerade as regular spaces
- Inconsistent Unicode normalization that produces duplicate keys
**rehuman fixes this** in a single pass with predictable, measurable output.
## Installation
**Library crate**: add `rehuman` to your project with `cargo add rehuman` or edit `Cargo.toml`:
```toml
[dependencies]
rehuman = "0.1.0" # replace with the latest published version
```
**CLI binaries**: install the published release (installs both `rehuman` and `ishuman`):
```bash
cargo install rehuman
```
Click to Expand: Build from Source
For the latest version(s), clone this repo and run `cargo install --path .`:
```bash
git clone https://github.com/pszemraj/rehuman.git
cd rehuman
cargo install --path .
```
Binaries will be installed to `~/.cargo/bin` by default.[^1]
[^1]: You may need to add `~/.cargo/bin` to your `PATH` if it is not already there; add `export PATH="$HOME/.cargo/bin:$PATH"` to your shell profile (`.bashrc`, `.zshrc`, etc.).
## Quick Start
> [!WARNING]
> This is an **early release** focused on correctness. Performance optimizations are in progress. Use `--stream` or `StreamCleaner` to stream large files.
### Library
```rust
use rehuman::{clean, humanize};
let cleaned = clean("Hello\u{200B}there"); // -> "Hello there"
let humanized = humanize("“Quote”—and…more"); // -> "\"Quote\"-and...more"
```
> [!IMPORTANT]
> By default `rehuman::clean` removes emoji to guarantee ASCII-only output[^2].
[^2]: This is a deliberate design choice given the propensity of today's LLMs to spam emoji in their outputs.
```rust
use rehuman::clean;
// Default behavior removes emoji
let cleaned = clean("Thanks 👍"); // -> "Thanks "
```
To keep emoji, construct a cleaner with `CleaningOptions::builder().keyboard_only(false)` (or pass `--keep-emoji` on the CLI).
### CLI
`rehuman` reads the input and emits cleaned text to STDOUT-your source file stays untouched unless you pass `--inplace`:
```bash
# Stream-clean to STDOUT and capture stats
rehuman notes.txt --stream --stats > notes.cleaned.txt
# Overwrite the original file in place
rehuman notes.txt --inplace
```
> [!TIP]
> Both CLI tools act as filters, so you can drop them into pipelines
```bash
cat notes.txt | rehuman --stream | tee notes.cleaned.txt
curl https://example.com/raw.txt | rehuman --stream --stats-json >/tmp/clean.txt
```
Use `ishuman` when you only need detection:
```bash
# Exit status 0 when clean, 1 when changes would be made (no stdout by default)
ishuman notes.txt
# Add --stats or --json to explain what would change
ishuman notes.txt --stats
```
Run `rehuman --help` or `ishuman --help` for the full list of flags (_emoji policy, line endings, configs, streaming, etc._).
## Documentation
More details are available in the [`docs/`](docs/) folder:
- [API Reference](docs/api.md) - all functions, options, and statistics
- [CLI Guide](docs/cli.md) - usage of `rehuman` and `ishuman`
- [Examples](docs/examples.md) - recipes for common workloads
- [Development Notes](docs/development.md) - roadmap & implementation details
## Detailed Features
- **Invisible character removal**: ZWSP, BOM, bidi isolates, control characters
- **Space normalization**: NBSP, figure space, ideographic space → ASCII space
- **Typography fixes**: curly quotes → ASCII, em/en dash → hyphen, ellipsis → three dots
- **Unicode normalization**: NFC/NFD/NFKC/NFKD (`unorm` feature, enabled by default)
- **Whitespace controls**: optional collapsing, trimming, and line-ending normalization
- **Keyboard-only enforcement**: ASCII output with configurable emoji policy
- **Detailed stats**: every cleaning run reports what changed
- **CLI tooling**: `rehuman` (cleaner) and `ishuman` (detector) with streaming & in-place modes
## License
MIT