https://github.com/bent10/stophtml

Extracts plain text from an HTML string. It's useful for Natural Language Processing (NLP) tasks.

# stophtml

A utility for Node.js (`0.32 kB`) and the browser (`0.43 kB`) that extracts plain text from an HTML string while ignoring HTML tags. It's useful for Natural Language Processing (NLP) tasks that require only the textual content of HTML documents.

## Install

```bash
npm install stophtml
```

Or yarn:

```bash
yarn add stophtml
```

Alternatively, you can include this module directly in your HTML file from a CDN:

```yml
UMD: https://cdn.jsdelivr.net/npm/stophtml/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/stophtml/+esm
CJS: https://cdn.jsdelivr.net/npm/stophtml/dist/index.cjs
```

## Usage

```js
import stophtml from 'stophtml'

const input = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>'
const segments = stophtml(input)

console.log(segments)
```

## API

### `stophtml(input: string): string[]`

Tokenizes an HTML string, extracting plain text while ignoring HTML tags.

- `input`: The input HTML string to tokenize.

Returns an array of plain text segments extracted from the HTML string.
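
To give a feel for what "plain text segments" can mean, here is a minimal sketch. The helper `extractTextSegments` is hypothetical and written only for illustration: it is a naive regex-based stand-in, not stophtml's implementation, so the segments the library actually returns may differ.

```js
// Hypothetical stand-in for illustration only; NOT stophtml's implementation.
// Splits an HTML string on anything that looks like a tag and keeps the
// non-empty pieces of text found in between.
function extractTextSegments(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, ' ') // drop HTML comments
    .split(/<[^>]*>/) // split on tags
    .map(segment => segment.trim())
    .filter(Boolean) // discard empty segments
}

const segments = extractTextSegments(
  '<p>This is <strong>bold</strong> and <em>italic</em>.</p>'
)

console.log(segments) // [ 'This is', 'bold', 'and', 'italic', '.' ]
```

A regex split like this does not handle cases such as `>` inside attribute values or the contents of `<script>` and `<style>` elements, which is one reason to prefer a dedicated tokenizer.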

## Related

- [boox](https://github.com/bent10/boox) – Performs full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.
- [stopmarkdown](https://github.com/bent10/stopmarkdown) – Extracts plain text from a Markdown string.
- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.
- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).

## Benchmark

```bash
 ✓ test/index.bench.ts (2) 1305ms
     name                 hz     min     max    mean     p75     p99    p995    p999     rme  samples
   · stophtml     136,571.33  0.0064  0.3648  0.0073  0.0069  0.0241  0.0263  0.1222  ±0.70%    68286   fastest
   · htmlparser2   68,310.52  0.0131  2.0111  0.0146  0.0138  0.0348  0.0458  0.0769  ±0.96%    34156

 BENCH  Summary

  stophtml - test/index.bench.ts >
    2.00x faster than htmlparser2
```
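
The summary line can be checked against the table with a little arithmetic. The sketch below assumes that `hz` is operations per second and that the timing columns are in milliseconds, which is how the figures appear to relate; the variable names are illustrative.

```js
// Figures copied from the benchmark table above.
const hz = { stophtml: 136571.33, htmlparser2: 68310.52 }

// Relative speed is the ratio of operations per second.
const speedup = hz.stophtml / hz.htmlparser2

// Assuming hz is ops/sec, the mean time per call in ms is 1000 / hz.
const meanMs = Object.fromEntries(
  Object.entries(hz).map(([name, ops]) => [name, 1000 / ops])
)

console.log(`${speedup.toFixed(2)}x`) // 2.00x
console.log(meanMs.stophtml.toFixed(4), meanMs.htmlparser2.toFixed(4)) // 0.0073 0.0146
```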

See benchmark code

```ts
import { bench } from 'vitest'
import { Parser } from 'htmlparser2'
import stophtml from 'stophtml'

const html = getHtml()

bench('stophtml', () => {
  stophtml(html)
})

bench('htmlparser2', () => {
  htmlparser2Parser(html)
})

// Collects every text node emitted by htmlparser2 and joins them
function htmlparser2Parser(text: string) {
  const res: string[] = []

  const parser = new Parser({
    ontext(data) {
      res.push(data)
    }
  })

  parser.write(text)
  parser.end()

  return res.join(' ')
}

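function getHtml() {
  return `
    <html>
      <head>
        <title>HTML Template</title>
      </head>
      <body>
        <h1>Welcome to my HTML Template</h1>
        <p>This is a paragraph within the HTML template.</p>
        <ul>
          <li>List item 1</li>
          <li>List item 2</li>
          <li>List item 3</li>
        </ul>
        <img alt="Example Image">
        <a>Visit our website</a>
      </body>
    </html>
  `
}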
```

## Contributing

We 💛 issues.

When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.

```bash
npm i -g commitizen cz-conventional-changelog
```

Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.

```bash
git add . && git cz
```

## License

![GitHub](https://img.shields.io/github/license/bent10/stophtml)

A project by [Stilearning](https://stilearning.com) © 2024.