{"id":26581585,"url":"https://github.com/open-technology-foundation/strip_tags","last_synced_at":"2026-05-16T21:12:32.608Z","repository":{"id":283791329,"uuid":"952929356","full_name":"Open-Technology-Foundation/strip_tags","owner":"Open-Technology-Foundation","description":"A simple utility to strip HTML tags from files or standard input.","archived":false,"fork":false,"pushed_at":"2025-03-22T07:03:12.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-22T08:18:44.563Z","etag":null,"topics":["bash","bash-scripting"],"latest_commit_sha":null,"homepage":"https://yatti.id/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Open-Technology-Foundation.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-22T07:02:36.000Z","updated_at":"2025-03-22T07:04:01.000Z","dependencies_parsed_at":"2025-03-22T08:28:52.323Z","dependency_job_id":null,"html_url":"https://github.com/Open-Technology-Foundation/strip_tags","commit_stats":null,"previous_names":["open-technology-foundation/strip_tags"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Open-Technology-Foundation/strip_tags","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstrip_tags","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstrip_tags/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Fo
undation%2Fstrip_tags/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstrip_tags/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Open-Technology-Foundation","download_url":"https://codeload.github.com/Open-Technology-Foundation/strip_tags/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstrip_tags/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263715328,"owners_count":23500242,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","bash-scripting"],"created_at":"2025-03-23T07:21:42.558Z","updated_at":"2026-05-16T21:12:32.603Z","avatar_url":"https://github.com/Open-Technology-Foundation.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# strip_tags\n\nStrip HTML tags from files or stdin while preserving text content.\n\nAvailable in three versions with near-identical CLI interfaces:\n- **`strip_tags`** - Python + BeautifulSoup (robust, handles edge cases)\n- **`strip_tags.bash`** - Pure Bash + sed (fast, portable, no dependencies)\n- **`strip_tags-c`** - Single C binary (fastest, zero dependencies beyond libc, fixes `\u003e`-in-attribute bug)\n\n## Quick Start\n\n```bash\n# Strip all HTML tags\necho \"\u003cp\u003eHello \u003cb\u003eworld\u003c/b\u003e\u003c/p\u003e\" | strip_tags\n# Output: Hello world\n\n# Preserve specific tags\necho \"\u003cp\u003eHello \u003cb\u003eworld\u003c/b\u003e\u003c/p\u003e\" | strip_tags -a 
b\n# Output: Hello \u003cb\u003eworld\u003c/b\u003e\n\n# Process a file\nstrip_tags index.html \u003e clean.txt\n\n# Pipe from curl\ncurl -s https://example.com | strip_tags -a h1,p\n```\n\n## Features\n\n- Remove HTML tags while preserving text content\n- Selectively preserve tags with `-a/--allow`\n- Automatic whitespace normalization (collapse multiple blank lines)\n- Process files or piped stdin\n- Bash tab completion for options and common HTML tags\n- Full Unicode support\n\n## Installation\n\n### Requirements\n\n- **Python version**: Python 3.10+ and BeautifulSoup4\n- **Bash version**: Bash 5.2+ and GNU sed (no other dependencies)\n- **C version**: any C11-capable compiler and libc; no runtime dependencies. Build with `make compile`.\n\n### User Install\n\n```bash\ngit clone https://github.com/Open-Technology-Foundation/strip_tags.git \u0026\u0026 cd strip_tags \u0026\u0026 make install\n```\n\n### System Install\n\n```bash\ngit clone https://github.com/Open-Technology-Foundation/strip_tags.git \u0026\u0026 cd strip_tags \u0026\u0026 sudo make install PREFIX=/usr/local\n```\n\nOptional: pre-build Python venv with `make install-venv`\n\n### Development (Symlink Only)\n\nFor development, create symlinks without copying files:\n\n```bash\nmake link\n# Or for system-wide: sudo make link BINDIR=/usr/local/bin\n```\n\n### Update\n\nPull latest changes and refresh symlinks:\n\n```bash\nmake update\n```\n\n### Installation (C)\n\nBuild and install the C binary as `strip_tags-c`:\n\n```bash\nmake compile          # produces ./strip_tags-c\nsudo make install-c   # installs to /usr/local/bin/strip_tags-c\n```\n\nUninstall the C binary only:\n\n```bash\nsudo make uninstall-c\n```\n\n`make install` now builds and installs the C variant alongside the Bash and Python variants. 
The C binary has no runtime dependencies beyond libc and is portable to any POSIX system (it does not require GNU sed, unlike the Bash variant).\n\n### Uninstall\n\n```bash\nmake uninstall\n# Or: sudo make uninstall PREFIX=/usr/local\n```\n\n### Tab Completion\n\nAdd to `~/.bashrc`:\n\n```bash\nsource ~/.local/share/yatti/strip_tags/.bash_completion\n# Or for system install: source /usr/local/share/yatti/strip_tags/.bash_completion\n```\n\n## Usage\n\n```\nstrip_tags [OPTIONS] [FILE]\n\nOptions:\n  -a, --allow TAGS   Comma-separated list of tags to preserve\n  --no-squeeze       Disable collapsing of repeated blank lines\n  -v, --version      Show version and exit\n  -h, --help         Show this help and exit\n```\n\n## Examples\n\n### Basic Tag Stripping\n\n```bash\n# From stdin\necho \"\u003cdiv\u003e\u003cp\u003eText\u003c/p\u003e\u003c/div\u003e\" | strip_tags\n\n# From file\nstrip_tags document.html\n\n# Save output\nstrip_tags document.html \u003e clean.txt\n```\n\n### Preserve Specific Tags\n\n```bash\n# Keep bold tags\nstrip_tags -a b \u003c input.html\n\n# Keep multiple tags (comma-separated)\nstrip_tags --allow \"a,p,h1,h2,h3\" page.html\n\n# Spaces allowed around commas\nstrip_tags -a \"p, div, span\" page.html\n\n# Namespaced tags (SVG, etc.)\nstrip_tags -a \"svg:rect,svg:circle\" drawing.svg\n```\n\n### Pipeline Usage\n\n```bash\n# Fetch and clean a webpage\ncurl -s https://example.com | strip_tags -a p,h1\n\n# Extract text from HTML email\ncat email.html | strip_tags | less\n\n# Clean multiple files\nfor f in *.html; do strip_tags \"$f\" \u003e \"${f%.html}.txt\"; done\n```\n\n### Whitespace Control\n\n```bash\n# Default: collapse 3+ blank lines to 2\nstrip_tags document.html\n\n# Preserve all whitespace\nstrip_tags --no-squeeze document.html\n```\n\n## Performance\n\nTested on 33KB real-world HTML (averaged over 5 runs):\n\n| Scenario | Python | Bash | C | Speedup (vs Python) |\n|----------|--------|------|---|---------------------|\n| Simple tags | 
57 ms | 10 ms | 2-4 ms | **15-25x** |\n| With `--allow` | 58 ms | 13 ms | 2-4 ms | **15-25x** |\n| 33KB HTML | 68 ms | 18 ms | 2-4 ms | **15-25x** |\n| 33KB + allow | 66 ms | 59 ms | 2-4 ms | **15-25x** |\n\nBash is roughly 4-6x faster than Python in most scenarios but nearly ties it on 33KB input with `--allow` (59 ms vs 66 ms); the C binary is faster again at 2-4 ms in every scenario, which closes the `--allow` gap entirely (single-pass DFA, no separate \"allow\" code path).\n\n## Accuracy\n\n| Feature | Python | Bash | C | Notes |\n|---------|--------|------|---|-------|\n| Basic HTML | 100% | 100% | 100% | Identical output |\n| Nested tags | 100% | 100% | 100% | All handle correctly |\n| Multi-line tags | Yes | Yes | Yes | Tags spanning lines |\n| Self-closing | Yes | Yes | Yes | `\u003cbr/\u003e`, `\u003chr/\u003e` |\n| Namespaced tags | Yes | Yes | Yes | `svg:rect`, `xlink:href` |\n| Script blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |\n| Style blocks | Preserves content | **Removes entirely** | **Removes entirely** | C matches Bash |\n| HTML comments | Preserves | **Removes** | **Removes** (or `--keep-comments`) | C adds opt-in flag |\n| DOCTYPE | Preserves | **Removes** | **Removes** (or `--keep-doctype`) | C adds opt-in flag |\n| `\u003e` in attributes | Handles | **Breaks** | **Handles** | C fixes Bash limitation |\n| CDATA sections | Preserves wrappers | Mishandled | Emits body verbatim | |\n| Malformed HTML | Robust recovery | Best-effort | Best-effort | Python most forgiving |\n| HTML entities | Decodes some | Preserves as-is | Preserves as-is | |\n| Portable beyond Linux | Yes | No (needs GNU sed) | Yes | C/POSIX-only |\n\n## When to Use Which\n\n### Use Python (`strip_tags`) when:\n\n- Processing malformed or complex HTML\n- You need script/style content preserved (not removed)\n- Accuracy is more important than speed\n- HTML contains `\u003e` inside attribute values\n\n### Use Bash (`strip_tags.bash`) when:\n\n- Speed is the priority (roughly 4-5x faster than Python)\n- You want script/style blocks fully removed\n- Running on 
minimal systems without Python\n- Processing clean, well-formed HTML\n- In containers or constrained environments\n\n### Use C (`strip_tags-c`) when:\n\n- You want the lowest possible latency (~2-4 ms per invocation)\n- Target system has no Python and no GNU sed (e.g., BusyBox, BSD, minimal containers)\n- HTML may contain `\u003e` inside attribute values (Bash version mis-tokenizes these)\n- You want CDATA bodies preserved and DOCTYPE/comments toggleable via `--keep-doctype` / `--keep-comments`\n\n## Testing\n\nRun the full test suite (111 tests):\n\n```bash\nsource .venv/bin/activate\npytest tests/ -v\n```\n\nRun specific test modules:\n\n```bash\n# Python tests only (65 tests)\npytest tests/test_python_strip_tags.py -v\n\n# Bash tests only (46 tests)\npytest tests/test_bash_strip_tags.py -v\n```\n\nRun performance comparison:\n\n```bash\npython tests/performance_matrix.py\n```\n\n## License\n\nGPL-3.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-technology-foundation%2Fstrip_tags","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-technology-foundation%2Fstrip_tags","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-technology-foundation%2Fstrip_tags/lists"}