https://github.com/open-technology-foundation/strip_tags
A simple utility to strip HTML tags from files or standard input.
- Host: GitHub
- URL: https://github.com/open-technology-foundation/strip_tags
- Owner: Open-Technology-Foundation
- License: gpl-3.0
- Created: 2025-03-22T07:02:36.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-22T07:03:12.000Z (over 1 year ago)
- Last Synced: 2025-03-22T08:18:44.563Z (over 1 year ago)
- Topics: bash, bash-scripting
- Language: Python
- Homepage: https://yatti.id/
- Size: 14.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# strip_tags
Strip HTML tags from files or stdin while preserving text content.
Available in three versions with near-identical command-line interfaces:
- **`strip_tags`** - Python + BeautifulSoup (robust, handles edge cases)
- **`strip_tags.bash`** - Pure Bash + sed (fast, portable, no dependencies)
- **`strip_tags-c`** - Single C binary (fastest, zero dependencies beyond libc, fixes `>`-in-attribute bug)
## Quick Start
```bash
# Strip all HTML tags
echo "<p>Hello <b>world</b></p>" | strip_tags
# Output: Hello world
# Preserve specific tags
echo "<p>Hello <b>world</b></p>" | strip_tags -a b
# Output: Hello <b>world</b>
# Process a file
strip_tags index.html > clean.txt
# Pipe from curl
curl -s https://example.com | strip_tags -a h1,p
```
## Features
- Remove HTML tags while preserving text content
- Selectively preserve tags with `-a/--allow`
- Automatic whitespace normalization (collapse multiple blank lines)
- Process files or piped stdin
- Bash tab completion for options and common HTML tags
- Full Unicode support
## Installation
### Requirements
- **Python version**: Python 3.10+ and BeautifulSoup4
- **Bash version**: Bash 5.2+ and GNU sed (no other dependencies)
- **C version**: any C11-capable compiler and libc; no runtime dependencies. Build with `make compile`.
### User Install
```bash
git clone https://github.com/Open-Technology-Foundation/strip_tags.git && cd strip_tags && make install
```
### System Install
```bash
git clone https://github.com/Open-Technology-Foundation/strip_tags.git && cd strip_tags && sudo make install PREFIX=/usr/local
```
Optional: pre-build the Python venv with `make install-venv`.
### Development (Symlink Only)
For development, create symlinks without copying files:
```bash
make link
# Or for system-wide: sudo make link BINDIR=/usr/local/bin
```
### Update
Pull latest changes and refresh symlinks:
```bash
make update
```
### Installation (C)
Build and install the C binary as `strip_tags-c`:
```bash
make compile # produces ./strip_tags-c
sudo make install-c # installs to /usr/local/bin/strip_tags-c
```
Uninstall the C binary only:
```bash
sudo make uninstall-c
```
`make install` now builds and installs the C variant alongside the Bash and Python variants. The C binary has no runtime dependencies beyond libc and is portable to any POSIX system (it does not require GNU sed, unlike the Bash variant).
### Uninstall
```bash
make uninstall
# Or: sudo make uninstall PREFIX=/usr/local
```
### Tab Completion
Add to `~/.bashrc`:
```bash
source ~/.local/share/yatti/strip_tags/.bash_completion
# Or for system install: source /usr/local/share/yatti/strip_tags/.bash_completion
```
## Usage
```
strip_tags [OPTIONS] [FILE]
Options:
  -a, --allow TAGS   Comma-separated list of tags to preserve
      --no-squeeze   Disable collapsing of repeated blank lines
  -v, --version      Show version and exit
  -h, --help         Show this help and exit
```
## Examples
### Basic Tag Stripping
```bash
# From stdin
echo "<p>Text</p>" | strip_tags
# From file
strip_tags document.html
# Save output
strip_tags document.html > clean.txt
```
### Preserve Specific Tags
```bash
# Keep bold tags
strip_tags -a b < input.html
# Keep multiple tags (comma-separated)
strip_tags --allow "a,p,h1,h2,h3" page.html
# Spaces allowed around commas
strip_tags -a "p, div, span" page.html
# Namespaced tags (SVG, etc.)
strip_tags -a "svg:rect,svg:circle" drawing.svg
```
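The `--allow` value above tolerates spaces around commas and accepts namespaced names. As a rough illustration of how such a list could be normalized, here is a minimal sketch. The function name `parse_allow_list` is hypothetical, and lowercasing the names is an assumption of this sketch, not documented behavior of any of the three variants.

```python
def parse_allow_list(value: str) -> set[str]:
    """Split a comma-separated tag list, tolerating spaces around commas."""
    # Assumption: tag names are compared case-insensitively, so lowercase them.
    # Empty items (e.g. from a trailing comma) are dropped.
    return {name.strip().lower() for name in value.split(",") if name.strip()}


print(sorted(parse_allow_list("p, div, span")))
print(sorted(parse_allow_list("svg:rect,svg:circle")))
```

Using a set keeps the membership check cheap when the list is consulted once per tag.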
### Pipeline Usage
```bash
# Fetch and clean a webpage
curl -s https://example.com | strip_tags -a p,h1
# Extract text from HTML email
cat email.html | strip_tags | less
# Clean multiple files
for f in *.html; do strip_tags "$f" > "${f%.html}.txt"; done
```
### Whitespace Control
```bash
# Default: collapse 3+ blank lines to 2
strip_tags document.html
# Preserve all whitespace
strip_tags --no-squeeze document.html
```
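The default squeeze rule (collapse 3+ blank lines to 2) can be pictured with a short Python sketch. The helper name `squeeze_blank_lines` is hypothetical, and the sketch assumes a blank line is a completely empty line; the real tools may also treat whitespace-only lines as blank.

```python
import re


def squeeze_blank_lines(text: str) -> str:
    """Collapse runs of three or more blank lines down to two."""
    # N blank lines in a row appear as N + 1 consecutive newlines,
    # so "3 or more blank lines" means "4 or more newlines".
    return re.sub(r"\n{4,}", "\n\n\n", text)
```

With this rule, two blank lines between paragraphs are left alone, while longer gaps shrink to exactly two.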
## Performance
Tested on 33KB real-world HTML (averaged over 5 runs):
| Scenario | Python | Bash | C | C speedup (vs Python) |
|----------|--------|------|---|---------------------|
| Simple tags | 57 ms | 10 ms | 2-4 ms | **15-25x** |
| With `--allow` | 58 ms | 13 ms | 2-4 ms | **15-25x** |
| 33KB HTML | 68 ms | 18 ms | 2-4 ms | **15-25x** |
| 33KB + allow | 66 ms | 59 ms | 2-4 ms | **15-25x** |
Bash is typically 4-5x faster than Python, although the gap nearly disappears on 33KB input with `--allow` (59 ms vs 66 ms). The C binary takes 2-4 ms in every scenario and closes the `--allow` gap entirely (single-pass DFA, no separate "allow" code path).
## Accuracy
| Feature | Python | Bash | C | Notes |
|---------|--------|------|---|-------|
| Basic HTML | 100% | 100% | 100% | Identical output |
| Nested tags | 100% | 100% | 100% | All handle correctly |
| Multi-line tags | Yes | Yes | Yes | Tags spanning lines |
| Self-closing | Yes | Yes | Yes | `<br/>`, `<img/>` |
| Namespaced tags | Yes | Yes | Yes | `svg:rect`, `xlink:href` |
| Script blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |
| Style blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |
| HTML comments | Preserves | **Removes** | **Removes** (or `--keep-comments`) | C adds opt-in flag |
| DOCTYPE | Preserves | **Removes** | **Removes** (or `--keep-doctype`) | C adds opt-in flag |
| `>` in attributes | Handles | **Breaks** | **Handles** | C fixes Bash limitation |
| CDATA sections | Preserves wrappers | Mishandled | Emits body verbatim | |
| Malformed HTML | Robust recovery | Best-effort | Best-effort | Python most forgiving |
| HTML entities | Decodes some | Preserves as-is | Preserves as-is | |
| Portable beyond Linux | Yes | No (needs GNU sed) | Yes | C needs only POSIX and libc |
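Since the Bash and C variants leave HTML entities as-is, a consumer that needs plain characters may want to decode them after stripping. The following is a minimal sketch using Python's standard `html.unescape`; the wrapper name `decode_entities` is hypothetical and not part of this project.

```python
import html


def decode_entities(text: str) -> str:
    """Turn entity references such as &amp; or &#233; into characters."""
    return html.unescape(text)


print(decode_entities("Fish &amp; chips &lt;3"))  # prints: Fish & chips <3
```

Decoding should happen after tag stripping, because `&lt;` and `&gt;` decode to characters that would otherwise look like tag delimiters.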
## When to Use Which
### Use Python (`strip_tags`) when:
- Processing malformed or complex HTML
- You need script/style content preserved (not removed)
- Accuracy is more important than speed
- HTML contains `>` inside attribute values
### Use Bash (`strip_tags.bash`) when:
- Speed is the priority (4-5x faster than Python)
- You want script/style blocks fully removed
- Running on minimal systems without Python
- Processing clean, well-formed HTML
- In containers or constrained environments
### Use C (`strip_tags-c`) when:
- You want the lowest possible latency (~2-4 ms per invocation)
- Target system has no Python and no GNU sed (e.g., BusyBox, BSD, minimal containers)
- HTML may contain `>` inside attribute values (Bash version mis-tokenizes these)
- You want CDATA bodies preserved and DOCTYPE/comments toggleable via `--keep-doctype` / `--keep-comments`
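The `>`-in-attribute problem mentioned above is easy to reproduce with any pattern that treats the first `>` as the end of a tag. The sketch below assumes a typical `<[^>]*>` pattern written as a Python regex; the actual sed expressions in `strip_tags.bash` are not shown in this README and may differ.

```python
import re

# Assumption: a naive "everything up to the next '>'" tag pattern.
NAIVE_TAG = re.compile(r"<[^>]*>")


def naive_strip(html: str) -> str:
    """Remove tags with the naive pattern, ignoring quoted attribute values."""
    return NAIVE_TAG.sub("", html)


# The '>' inside the title attribute ends the match too early,
# so part of the attribute leaks into the output.
print(repr(naive_strip('<a title="x > y">link</a>')))  # prints: ' y">link'
```

A quote-aware parser avoids this by ignoring `>` while it is inside a quoted attribute value.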
## Testing
Run the full test suite (111 tests):
```bash
source .venv/bin/activate
pytest tests/ -v
```
Run specific test modules:
```bash
# Python tests only (65 tests)
pytest tests/test_python_strip_tags.py -v
# Bash tests only (46 tests)
pytest tests/test_bash_strip_tags.py -v
```
Run performance comparison:
```bash
python tests/performance_matrix.py
```
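The contents of `tests/performance_matrix.py` are not shown here. As a rough idea of how an "averaged over 5 runs" figure can be obtained, here is a minimal sketch; the function name `mean_runtime_ms` and its parameters are hypothetical and not taken from the project.

```python
import statistics
import subprocess
import sys
import time


def mean_runtime_ms(cmd: list[str], stdin_data: bytes = b"", runs: int = 5) -> float:
    """Run `cmd` several times and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded so terminal I/O does not skew the timing.
        subprocess.run(cmd, input=stdin_data, stdout=subprocess.DEVNULL, check=True)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python timing_sketch.py strip_tags index.html
    print(f"{mean_runtime_ms(sys.argv[1:]):.1f} ms")
```

Timings taken this way include process start-up, which is likely to dominate for inputs as small as the ones in the table.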
## License
GPL-3.0