Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stevebauman/hypertext
A PHP HTML to pure text transformer.
- Host: GitHub
- URL: https://github.com/stevebauman/hypertext
- Owner: stevebauman
- License: MIT
- Created: 2023-10-21T18:30:54.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-12T16:28:40.000Z (4 months ago)
- Last Synced: 2025-01-05T06:06:41.257Z (21 days ago)
- Language: PHP
- Homepage:
- Size: 2.65 MB
- Stars: 153
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Hypertext

A PHP HTML to pure text transformer that beautifully handles varied and malformed HTML.

---
Hypertext extracts the text content from any HTML-based document and automatically:
- Removes CSS
- Removes scripts
- Removes headers
- Removes non-HTML based content
- Preserves spacing
- Preserves links (optional)
- Preserves new lines (optional)

It is designed for using the output in LLM-related tasks, such as prompts and embeddings.
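As a rough illustration of what removing CSS, scripts, and headers involves, the sketch below uses PHP's built-in DOM extension. It is not Hypertext's implementation: `extractVisibleText()` is a hypothetical helper, and it assumes that "headers" refers to the `<head>` element.

```php
<?php

// Hypothetical helper for illustration only; not part of Hypertext's API.
function extractVisibleText(string $html): string
{
    $document = new DOMDocument();

    // Collect libxml's complaints about malformed markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Assumption: <head>, <script>, and <style> never hold readable text.
    foreach (iterator_to_array($xpath->query('//head | //script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Gather the remaining text nodes and join them with single spaces.
    $parts = [];
    foreach ($xpath->query('//text()') as $textNode) {
        $value = trim(preg_replace('/\s+/', ' ', $textNode->nodeValue));
        if ($value !== '') {
            $parts[] = $value;
        }
    }

    return implode(' ', $parts);
}

$html = <<<'HTML'
<html>
<head>
    <title>My Blog</title>
    <style>h1 { color: red; }</style>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <script>console.log("tracking");</script>
    <p>This is a paragraph of text on my webpage.</p>
</body>
</html>
HTML;

echo extractVisibleText($html), PHP_EOL;
```

Joining every text node with a space keeps the sketch short, but it would also insert spaces around inline elements such as `<b>`, which a real transformer has to handle more carefully.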
## Installation
```bash
composer require stevebauman/hypertext
```

## Usage
```php
use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
```

## Example
> For larger examples, please view the [tests/Fixtures](https://github.com/stevebauman/hypertext/tree/master/tests/Fixtures) directory.
**Input**:
```html
<html>
<head>
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="…">Click here</a> to view my posts.
</body>
</html>
```
**Output (Pure Text)**:
```php
echo (new Transformer)->toText($html);
```

```text
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.
```

**Output (Keep New Lines)**:
```php
echo (new Transformer)->keepNewLines()->toText($html);
```

```text
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.
```

**Output (Keep Links)**:
```php
echo (new Transformer)->keepLinks()->toText($html);
```

```text
Welcome to My Blog This is a paragraph of text on my webpage. <a href="…">Click Here</a> to view my posts.
```

**Output (Keep Both)**:
```php
echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
```

```text
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="…">Click Here</a> to view my posts.
```
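
The difference between the pure text output and the new line output above comes down to how whitespace is collapsed. The sketch below is an assumption about that behavior written with plain string functions, not Hypertext's own code; `collapseWhitespace()` is a hypothetical helper.

```php
<?php

// Hypothetical helper for illustration only; Hypertext's internals may differ.
function collapseWhitespace(string $text, bool $keepNewLines = false): string
{
    if (! $keepNewLines) {
        // Treat every run of whitespace, line breaks included, as one space.
        return trim(preg_replace('/\s+/', ' ', $text));
    }

    // Collapse spaces and tabs within each line, then drop the empty lines.
    $lines = array_map(
        fn (string $line): string => trim(preg_replace('/[ \t]+/', ' ', $line)),
        preg_split('/\R/', $text)
    );

    return implode("\n", array_filter($lines, fn (string $line): bool => $line !== ''));
}

$text = "Welcome to My Blog\n\n   This is a paragraph of text on my webpage.\n";

// Everything on a single line.
echo collapseWhitespace($text), PHP_EOL;

// One line per non-empty source line.
echo collapseWhitespace($text, true), PHP_EOL;
```

Flattening to a single line keeps the text compact for embeddings, while keeping line breaks preserves a hint of the document's block structure for prompts.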