https://github.com/stevebauman/hypertext


        

# Hypertext


A PHP HTML to pure text transformer that gracefully handles varied and malformed HTML.






---

Hypertext extracts text content from any HTML-based document and automatically:

- Removes CSS
- Removes scripts
- Removes headers
- Removes non-HTML-based content
- Preserves spacing
- Preserves links (optional)
- Preserves new lines (optional)

It is designed to produce output for LLM-related tasks, such as prompts and embeddings.
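
To make the list above concrete, here is a minimal sketch of the same kind of cleanup using only PHP's bundled DOM extension. It is not Hypertext's implementation, and the function name `roughHtmlToText` and the sample markup are hypothetical. It simply skips text inside `<head>`, `<script>`, and `<style>` and collapses whitespace.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper for illustration only; this is NOT how Hypertext works internally.
 * Extracts readable text from HTML, skipping <head>, <script>, and <style> content.
 */
function roughHtmlToText(string $html): string
{
    // DOMDocument::loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();

    // Malformed markup makes libxml emit warnings; buffer them instead.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Select every text node that is not inside a non-content element.
    $nodes = $xpath->query(
        '//text()[not(ancestor::head or ancestor::script or ancestor::style)]'
    );

    $parts = [];

    foreach ($nodes as $node) {
        // Collapse runs of whitespace, then skip nodes that were only whitespace.
        $text = trim(preg_replace('/\s+/', ' ', $node->nodeValue));

        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Joining with a single space keeps words from adjacent elements apart.
    return implode(' ', $parts);
}

$sample = <<<'HTML'
<html>
<head><title>My Blog</title><style>h1 { color: red; }</style></head>
<body>
<h1>Welcome to My Blog</h1>
<script>console.log("tracking");</script>
<p>This is a   paragraph.</p>
</body>
</html>
HTML;

echo roughHtmlToText($sample), PHP_EOL;
```

Because every text node is joined with a space, this sketch would also split text such as `foo<b>bar</b>` into two words, which is one of the spacing details a dedicated transformer has to handle more carefully.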

## Installation

```bash
composer require stevebauman/hypertext
```

## Usage

```php
use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
```
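
The `filter()` call takes an XPath expression. As a sketch of what an expression like `//*[@id='some-element']` selects, the snippet below evaluates it with PHP's built-in `DOMXPath` on a small document. How Hypertext applies the filter internally is not shown here; the helper name `removeByXPath` and the sample markup are assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: removes every node matched by $expression and
 * returns the names of the removed elements plus the remaining body text.
 *
 * @return array{removed: string[], text: string}
 */
function removeByXPath(string $html, string $expression): array
{
    $dom = new DOMDocument();

    // Buffer libxml warnings raised by imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Copy matches into an array so removal does not disturb iteration.
    $matches = iterator_to_array($xpath->query($expression));

    $removed = [];

    foreach ($matches as $node) {
        $removed[] = $node->nodeName;
        $node->parentNode->removeChild($node);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body === null
        ? ''
        : trim(preg_replace('/\s+/', ' ', $body->textContent));

    return ['removed' => $removed, 'text' => $text];
}

$html = <<<'HTML'
<html><body>
<div id="some-element">Cookie banner</div>
<p id="content">Article text.</p>
</body></html>
HTML;

$result = removeByXPath($html, "//*[@id='some-element']");

print_r($result);
```

In the expression, `//*` matches elements of any tag name anywhere in the document, and `[@id='some-element']` narrows the match to those whose `id` attribute equals `some-element`.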

## Example

> For larger examples, please view the [tests/Fixtures](https://github.com/stevebauman/hypertext/tree/master/tests/Fixtures) directory.

**Input**:

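```html
<!DOCTYPE html>
<html>
<head>
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="…">Click here</a> to view my posts.
</body>
</html>
```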

**Output (Pure Text)**:

```php
echo (new Transformer)->toText($html);
```

```text
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.
```

**Output (Keep New Lines)**:

```php
echo (new Transformer)->keepNewLines()->toText($html);
```

```text
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.
```

**Output (Keep Links)**:

```php
echo (new Transformer)->keepLinks()->toText($html);
```

```text
Welcome to My Blog This is a paragraph of text on my webpage. <a href="…">Click here</a> to view my posts.
```

**Output (Keep Both)**:

```php
echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
```

```text
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="…">Click here</a> to view my posts.
```