# strip_tags

Strip HTML tags from files or stdin while preserving text content.

Available in three versions with near-identical command-line interfaces:
- **`strip_tags`** - Python + BeautifulSoup (robust, handles edge cases)
- **`strip_tags.bash`** - Pure Bash + GNU sed (fast, no dependencies beyond Bash and GNU sed)
- **`strip_tags-c`** - Single C binary (fastest, zero dependencies beyond libc, fixes `>`-in-attribute bug)

## Quick Start

```bash
# Strip all HTML tags
echo "<p>Hello <b>world</b></p>" | strip_tags
# Output: Hello world

# Preserve specific tags
echo "<p>Hello <b>world</b></p>" | strip_tags -a b
# Output: Hello <b>world</b>

# Process a file
strip_tags index.html > clean.txt

# Pipe from curl
curl -s https://example.com | strip_tags -a h1,p
```

## Features

- Remove HTML tags while preserving text content
- Selectively preserve tags with `-a/--allow`
- Automatic whitespace normalization (runs of 3+ blank lines collapse to 2; disable with `--no-squeeze`)
- Process files or piped stdin
- Bash tab completion for options and common HTML tags
- Full Unicode support

## Installation

### Requirements

- **Python version**: Python 3.10+ and BeautifulSoup4
- **Bash version**: Bash 5.2+ and GNU sed (no other dependencies)
- **C version**: any C11-capable compiler and libc; no runtime dependencies. Build with `make compile`.

### User Install

```bash
git clone https://github.com/Open-Technology-Foundation/strip_tags.git && cd strip_tags && make install
```

### System Install

```bash
git clone https://github.com/Open-Technology-Foundation/strip_tags.git && cd strip_tags && sudo make install PREFIX=/usr/local
```

Optional: pre-build the Python virtual environment with `make install-venv`.

### Development (Symlink Only)

For development, create symlinks without copying files:

```bash
make link
# Or for system-wide: sudo make link BINDIR=/usr/local/bin
```

### Update

Pull latest changes and refresh symlinks:

```bash
make update
```

### Installation (C)

Build and install the C binary as `strip_tags-c`:

```bash
make compile # produces ./strip_tags-c
sudo make install-c # installs to /usr/local/bin/strip_tags-c
```

Uninstall the C binary only:

```bash
sudo make uninstall-c
```

`make install` builds and installs the C variant alongside the Bash and Python variants. The C binary has no runtime dependencies beyond libc and is portable to any POSIX system (unlike the Bash variant, it does not require GNU sed).

### Uninstall

```bash
make uninstall
# Or: sudo make uninstall PREFIX=/usr/local
```

### Tab Completion

Add to `~/.bashrc`:

```bash
source ~/.local/share/yatti/strip_tags/.bash_completion
# Or for system install: source /usr/local/share/yatti/strip_tags/.bash_completion
```

## Usage

```
strip_tags [OPTIONS] [FILE]

Options:
  -a, --allow TAGS   Comma-separated list of tags to preserve
      --no-squeeze   Disable collapsing of repeated blank lines
  -v, --version      Show version and exit
  -h, --help         Show this help and exit
```

## Examples

### Basic Tag Stripping

```bash
# From stdin
echo "<p>Text</p>" | strip_tags

# From file
strip_tags document.html

# Save output
strip_tags document.html > clean.txt
```

### Preserve Specific Tags

```bash
# Keep bold tags
strip_tags -a b < input.html

# Keep multiple tags (comma-separated)
strip_tags --allow "a,p,h1,h2,h3" page.html

# Spaces allowed around commas
strip_tags -a "p, div, span" page.html

# Namespaced tags (SVG, etc.)
strip_tags -a "svg:rect,svg:circle" drawing.svg
```
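The comma-separated `--allow` syntax, including optional spaces around commas, can be illustrated with a short shell sketch. `parse_allow_list` is a hypothetical helper written for this example; it is not taken from the project's source and only shows one way such a list could be split and trimmed.

```bash
# Hypothetical helper (not part of strip_tags): split a comma-separated
# tag list into one tag per line, trimming whitespace around each entry
# and dropping empty entries.
parse_allow_list() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Spaces around commas are ignored; namespaced names pass through unchanged.
parse_allow_list "p, div, span"
parse_allow_list "svg:rect,svg:circle"
```

Normalizing the list once up front keeps the later matching logic simple, since every entry can then be compared against a tag name exactly.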

### Pipeline Usage

```bash
# Fetch and clean a webpage
curl -s https://example.com | strip_tags -a p,h1

# Extract text from HTML email
cat email.html | strip_tags | less

# Clean multiple files
for f in *.html; do strip_tags "$f" > "${f%.html}.txt"; done
```

### Whitespace Control

```bash
# Default: collapse 3+ blank lines to 2
strip_tags document.html

# Preserve all whitespace
strip_tags --no-squeeze document.html
```
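The default squeeze behavior can be approximated with a few lines of awk. This is a sketch only: `squeeze_blank_lines` is a hypothetical helper, and it assumes that "collapse 3+ blank lines to 2" means at most two consecutive empty lines survive. The project's own implementation may differ, for example in how it treats whitespace-only lines.

```bash
# Hypothetical helper (not the project's code): keep at most two
# consecutive empty lines, dropping any further ones in the same run.
squeeze_blank_lines() {
  awk '
    /^$/ { if (++blank <= 2) print; next }  # count empty lines in the current run
    { blank = 0; print }                     # any non-empty line resets the counter
  '
}

# Four empty lines between "a" and "b" are reduced to two.
printf 'a\n\n\n\n\nb\n' | squeeze_blank_lines
```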

## Performance

Tested on 33KB real-world HTML (averaged over 5 runs):

| Scenario | Python | Bash | C | Speedup (vs Python) |
|----------|--------|------|---|---------------------|
| Simple tags | 57 ms | 10 ms | 2-4 ms | **15-25x** |
| With `--allow` | 58 ms | 13 ms | 2-4 ms | **15-25x** |
| 33KB HTML | 68 ms | 18 ms | 2-4 ms | **15-25x** |
| 33KB + allow | 66 ms | 59 ms | 2-4 ms | **15-25x** |

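Bash is roughly 4-6x faster than Python in the first three scenarios, but the gap nearly vanishes on 33KB input with `--allow` (59 ms vs 66 ms). The C binary stays at 2-4 ms in every scenario and closes the `--allow` gap entirely (single-pass DFA, no separate "allow" code path).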

## Accuracy

| Feature | Python | Bash | C | Notes |
|---------|--------|------|---|-------|
| Basic HTML | 100% | 100% | 100% | Identical output |
| Nested tags | 100% | 100% | 100% | All handle correctly |
| Multi-line tags | Yes | Yes | Yes | Tags spanning lines |
| Self-closing | Yes | Yes | Yes | `<br/>`, `<hr/>` |
| Namespaced tags | Yes | Yes | Yes | `svg:rect`, `xlink:href` |
| Script blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |
| Style blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |
| HTML comments | Preserves | **Removes** | **Removes** (or `--keep-comments`) | C adds opt-in flag |
| DOCTYPE | Preserves | **Removes** | **Removes** (or `--keep-doctype`) | C adds opt-in flag |
| `>` in attributes | Handles | **Breaks** | **Handles** | C fixes Bash limitation |
| CDATA sections | Preserves wrappers | Mishandled | Emits body verbatim | |
| Malformed HTML | Robust recovery | Best-effort | Best-effort | Python most forgiving |
| HTML entities | Decodes some | Preserves as-is | Preserves as-is | |
| Portable beyond Linux | Yes | No (needs GNU sed) | Yes | C needs only POSIX and libc |
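
The `>`-in-attributes row is easy to reproduce with a naive regex-based stripper. The sketch below is not the code used by `strip_tags.bash`; `naive_strip` is a hypothetical helper that assumes a simple `<[^>]*>` pattern, chosen only to show why a regex that stops at the first `>` mis-tokenizes a quoted attribute value.

```bash
# Hypothetical helper (NOT the strip_tags.bash implementation):
# delete everything from '<' up to the first following '>'.
naive_strip() {
  sed 's/<[^>]*>//g'
}

# Input without '>' inside attributes is stripped cleanly.
printf '%s\n' '<p>Hello <b>world</b></p>' | naive_strip
# prints: Hello world

# A '>' inside a quoted attribute ends the match early,
# so the rest of the attribute leaks into the output.
printf '%s\n' '<a title="x > y">link</a>' | naive_strip
# prints:  y">link
```

Handling this case correctly requires tracking whether the scanner is inside a quoted attribute value, which is what a parser or a state machine does and a single regex substitution does not.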

## When to Use Which

### Use Python (`strip_tags`) when:

- Processing malformed or complex HTML
- You need script/style content preserved (not removed)
- Accuracy is more important than speed
- HTML contains `>` inside attribute values

### Use Bash (`strip_tags.bash`) when:

- Speed is the priority (roughly 4-5x faster than Python)
- You want script/style blocks fully removed
- Running on minimal systems without Python
- Processing clean, well-formed HTML
- In containers or constrained environments

### Use C (`strip_tags-c`) when:

- You want the lowest possible latency (~2-4 ms per invocation)
- Target system has no Python and no GNU sed (e.g., BusyBox, BSD, minimal containers)
- HTML may contain `>` inside attribute values (Bash version mis-tokenizes these)
- You want CDATA bodies preserved and DOCTYPE/comments toggleable via `--keep-doctype` / `--keep-comments`
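
For scripts that run on machines with different variants installed, the guidance in this section can be folded into a small wrapper. The sketch below is illustrative: `pick_strip_tags` is a hypothetical helper, and its preference order (C, then Bash, then Python) assumes speed matters most. Because the variants differ in how they treat script blocks, comments, and entities, reverse the order if Python's behavior is required.

```bash
# Hypothetical helper: print the name of the first strip_tags variant
# found on PATH, preferring the fastest one. Returns 1 if none is installed.
pick_strip_tags() {
  for candidate in strip_tags-c strip_tags.bash strip_tags; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage: run whichever variant is available, or report that none was found.
if tool=$(pick_strip_tags); then
  printf '<p>Hello</p>\n' | "$tool"
else
  echo "no strip_tags variant found on PATH" >&2
fi
```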

## Testing

Run the full test suite (111 tests):

```bash
source .venv/bin/activate
pytest tests/ -v
```

Run specific test modules:

```bash
# Python tests only (65 tests)
pytest tests/test_python_strip_tags.py -v

# Bash tests only (46 tests)
pytest tests/test_bash_strip_tags.py -v
```

Run performance comparison:

```bash
python tests/performance_matrix.py
```

## License

GPL-3.0