Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/simonw/strip-tags
CLI tool for stripping tags from HTML
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/simonw/strip-tags
- Owner: simonw
- License: apache-2.0
- Created: 2023-05-18T15:44:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-19T14:26:47.000Z (9 months ago)
- Last Synced: 2024-07-10T08:43:37.377Z (3 months ago)
- Language: Python
- Homepage:
- Size: 39.1 KB
- Stars: 183
- Watchers: 4
- Forks: 4
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
# strip-tags
[![PyPI](https://img.shields.io/pypi/v/strip-tags.svg)](https://pypi.org/project/strip-tags/)
[![Changelog](https://img.shields.io/github/v/release/simonw/strip-tags?include_prereleases&label=changelog)](https://github.com/simonw/strip-tags/releases)
[![Tests](https://github.com/simonw/strip-tags/workflows/Test/badge.svg)](https://github.com/simonw/strip-tags/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/strip-tags/blob/master/LICENSE)

Strip tags from HTML, optionally from areas identified by CSS selectors
See [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.
## Installation
Install this tool using `pip`:
```bash
pip install strip-tags
```
## Usage
Pipe content into this tool to strip tags from it:
```bash
cat input.html | strip-tags > output.txt
```
Or pass a filename:
```bash
strip-tags -i input.html > output.txt
```
To run against just specific areas identified by CSS selectors:
```bash
strip-tags '.content' -i input.html > output.txt
```
This can be called with multiple selectors:
```bash
cat input.html | strip-tags '.content' '.sidebar' > output.txt
```
To return just the first element on the page that matches one of the selectors, use `--first`:
```bash
cat input.html | strip-tags .content --first > output.txt
```
To remove content contained by specific selectors - e.g. the `<nav>` section of a page - use `-r` or `--remove`:
```bash
cat input.html | strip-tags -r nav > output.txt
```
To minify whitespace - reducing multiple space and tab characters to a single space, and multiple newlines and spaces to a maximum of two newlines - add `-m` or `--minify`:
```bash
cat input.html | strip-tags -m > output.txt
```
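The minify rules described above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour, not the tool's actual implementation; the helper name `minify_whitespace` and both regular expressions are assumptions made for this example.

```python
import re


def minify_whitespace(text: str) -> str:
    """Hypothetical helper illustrating the -m/--minify rules (not the tool's code)."""
    # Reduce runs of space and tab characters to a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce runs of two or more newlines (optionally separated by spaces)
    # to exactly two newlines, i.e. at most one blank line.
    text = re.sub(r" ?\n(?: ?\n)+ ?", "\n\n", text)
    return text.strip()


sample = "Title\t\t here\n\n\n   \nBody   text"
print(repr(minify_whitespace(sample)))
# prints 'Title here\n\nBody text'
```

Single newlines are left alone in this sketch, so ordinary line breaks survive while large vertical gaps shrink to one blank line.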
You can also run this command using `python -m` like this:
```bash
python -m strip_tags --help
```
### Keeping the markup for specified tags
When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags - `<h1>This is the heading</h1>` for example - to provide extra hints to the model.

The `-t/--keep-tag` option can be passed multiple times to specify tags that should be kept.

This example looks at the `<header>` section of https://datasette.io/ and keeps the tags around the list items and `<h1>` elements:
```
curl -s https://datasette.io/ | strip-tags header -t h1 -t li
```
```html
Datasette
Find stories in data
```
All attributes will be removed from the tags, except for the `id=` and `class=` attributes, since those may provide further useful hints to the language model.
The `href` attribute on links, the `alt` attribute on images and the `name` and `value` attributes on `meta` tags are kept as well.
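As a rough illustration of the attribute rules just described, the filtering could look something like the sketch below. The function `filter_attrs` and the two lookup tables are hypothetical names invented for this example, and it assumes that "links" means `a` elements and "images" means `img` elements; the `all_attrs` flag mirrors the `--all-attrs` option.

```python
from typing import Dict

# Attributes kept on every kept tag, per the description above.
ALWAYS_KEPT = {"id", "class"}

# Extra attributes kept on specific tags (assumption: links are <a>, images are <img>).
KEPT_BY_TAG = {
    "a": {"href"},
    "img": {"alt"},
    "meta": {"name", "value"},
}


def filter_attrs(tag: str, attrs: Dict[str, str], all_attrs: bool = False) -> Dict[str, str]:
    """Hypothetical helper: return only the attributes that would be kept on a tag."""
    if all_attrs:
        # Mirrors --all-attrs: keep everything.
        return dict(attrs)
    allowed = ALWAYS_KEPT | KEPT_BY_TAG.get(tag, set())
    return {name: value for name, value in attrs.items() if name in allowed}


print(filter_attrs("a", {"href": "/docs", "class": "nav", "style": "color:red"}))
# prints {'href': '/docs', 'class': 'nav'}
```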
You can also specify a bundle of tags. For example, `strip-tags -t hs` will keep the tag markup for all levels of headings.
The following bundles can be used:
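- `-t hs`: `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`
- `-t metadata`: `<title>`, `<meta>`
- `-t structure`: `<header>`, `<nav>`, `<main>`, `<article>`, `<section>`, `<aside>`, `<footer>`
- `-t tables`: `<table>`, `<tr>`, `<td>`, `<th>`, `<thead>`, `<tbody>`, `<tfoot>`, `<caption>`, `<colgroup>`, `<col>`
- `-t lists`: `<ul>`, `<ol>`, `<li>`, `<dl>`, `<dt>`, `<dd>`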
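One way to picture how a bundle name stands in for several tags is the small sketch below. It is not the tool's own code: `BUNDLES` and `expand_keep_tags` are hypothetical names, and the table is deliberately abbreviated to two bundles whose contents are assumed for illustration.

```python
from typing import Iterable, List

# Abbreviated, assumed bundle table for illustration only.
BUNDLES = {
    "hs": ("h1", "h2", "h3", "h4", "h5", "h6"),
    "metadata": ("title", "meta"),
}


def expand_keep_tags(keep_tags: Iterable[str]) -> List[str]:
    """Hypothetical helper: replace bundle names with their tags, preserving order."""
    expanded: List[str] = []
    for name in keep_tags:
        # A name that is not a bundle is treated as a plain tag name.
        for tag in BUNDLES.get(name, (name,)):
            if tag not in expanded:
                expanded.append(tag)
    return expanded


print(expand_keep_tags(["hs", "li", "h1"]))
# prints ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']
```

Under this sketch, bundle names and individual tag names can be mixed freely, and duplicates are dropped.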
## As a Python library
You can use `strip-tags` from Python code too. The function signature looks like this:
```python
def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:
```
Here's an example:
```python
from strip_tags import strip_tags

html = """
<div>
<h1>This has tags</h1>

<p>And whitespace too</p>
</div>
Ignore this bit.
"""
# Only the <div> is processed; <h1> markup is kept and whitespace is minified.
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)
```
Output:
```
<h1>This has tags</h1>

And whitespace too
```
## strip-tags --help
```
Usage: strip-tags [OPTIONS] [SELECTORS]...

  Strip tags from HTML, optionally from areas identified by CSS selectors

  Example usage:

      cat input.html | strip-tags > output.txt

  To run against just specific areas identified by CSS selectors:

      cat input.html | strip-tags .entry .footer > output.txt

Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.
```
## Development
To contribute to this tool, first check out the code. Then create a new virtual environment:
```bash
cd strip-tags
python -m venv venv
source venv/bin/activate
```
Now install the dependencies and test dependencies:
```bash
pip install -e '.[test]'
```
To run the tests:
```bash
pytest
```