Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/bent10/stophtml

Extracts plain text from an HTML string. It's useful for Natural Language Processing (NLP) tasks.

html nlp plaintext strip text token tokenize

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/bent10/stophtml
Owner: bent10
License: mit
Created: 2024-03-08T03:19:32.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-09-22T06:03:39.000Z (3 months ago)
Last Synced: 2024-09-28T14:06:23.201Z (3 months ago)
Topics: html, nlp, plaintext, strip, text, token, tokenize
Language: TypeScript
Homepage: https://www.npmjs.com/package/stophtml
Size: 135 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license

README

# stophtml

A utility for Node.js (`0.32 kB`) and the browser (`0.43 kB`) that extracts plain text from an HTML string while ignoring HTML tags. It's useful for Natural Language Processing (NLP) tasks that require only the textual content of HTML documents.

## Install

```bash
npm install stophtml
```

Or yarn:

```bash
yarn add stophtml
```

Alternatively, you can load this module directly from a CDN:

```yml
UMD: https://cdn.jsdelivr.net/npm/stophtml/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/stophtml/+esm
CJS: https://cdn.jsdelivr.net/npm/stophtml/dist/index.cjs
```

## Usage

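```js
import stophtml from 'stophtml'

// Sample markup for illustration: a paragraph with bold and italic spans
const input = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>'

// Array of plain-text segments with the tags removed
const segments = stophtml(input)

console.log(segments)
```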
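
NLP pipelines usually need one more normalization step after extraction. The following is a minimal sketch, assuming the segments arrive as a `string[]` as the API section describes. The helper `toWordTokens` and the sample array are hypothetical and are not part of stophtml.

```ts
// Hypothetical helper, not part of stophtml: turns text segments into
// lowercase word tokens.
function toWordTokens(segments: string[]): string[] {
  return segments
    .join(' ')
    .toLowerCase()
    // Split on any run of characters outside a-z and 0-9 (ASCII only).
    .split(/[^a-z0-9]+/)
    .filter(token => token.length > 0)
}

// Hand-written stand-in for an array of extracted segments.
const sampleSegments = ['This is ', 'bold', ' and ', 'italic', '.']

console.log(toWordTokens(sampleSegments))
// → ['this', 'is', 'bold', 'and', 'italic']
```

The ASCII-only split keeps the sketch short. Text in other scripts would need a Unicode-aware pattern instead.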

## API

### `stophtml(input: string): string[]`

Tokenizes an HTML string, extracting plain text while ignoring HTML tags.

- `input`: The input HTML string to tokenize.

Returns an array of plain text segments extracted from the HTML string.
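
To make "plain text segments" concrete, here is a naive sketch of the general idea. It is an illustration only and is not stophtml's implementation: the function name `naiveTextSegments` is hypothetical, and the real library may treat whitespace, comments, and edge cases differently.

```ts
// Illustration only: NOT stophtml's implementation. A naive pass that
// cuts the input at anything that looks like a tag and keeps the text
// found in between.
function naiveTextSegments(html: string): string[] {
  return html
    .split(/<[^>]*>/) // cut at every "<...>" run
    .map(part => part.trim())
    .filter(part => part.length > 0)
}

console.log(naiveTextSegments('<p>Hello <b>world</b></p>'))
// → ['Hello', 'world']
```

A regex split like this breaks on `>` inside attribute values and does not skip script content, comments, or decode entities, which is why a dedicated tokenizer is preferable in practice.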

## Related

- [boox](https://github.com/bent10/boox) – Performs full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.

- [stopmarkdown](https://github.com/bent10/stopmarkdown) – Extracts plain text from a Markdown string.

- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.

- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).

## Benchmark

```bash
✓ test/index.bench.ts (2) 1305ms
     name                 hz     min     max    mean     p75     p99    p995    p999     rme  samples
   · stophtml     136,571.33  0.0064  0.3648  0.0073  0.0069  0.0241  0.0263  0.1222  ±0.70%    68286   fastest
   · htmlparser2   68,310.52  0.0131  2.0111  0.0146  0.0138  0.0348  0.0458  0.0769  ±0.96%    34156

 BENCH  Summary

  stophtml - test/index.bench.ts >
    2.00x faster than htmlparser2
```
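
The summary figure can be checked against the table. This sketch assumes that `hz` is operations per second and that `mean` is reported in milliseconds, which is how Vitest's bench reporter is generally understood to present results. The variable names are illustrative.

```ts
// Figures copied from the benchmark table above.
const stophtmlHz = 136571.33
const htmlparser2Hz = 68310.52

// Relative speed: ratio of operations per second.
const speedup = stophtmlHz / htmlparser2Hz

// Mean time per call in milliseconds (assumes hz is ops/sec).
const meanMs = (hz: number): number => 1000 / hz

console.log(speedup.toFixed(2)) // "2.00"
console.log(meanMs(stophtmlHz).toFixed(4)) // "0.0073"
console.log(meanMs(htmlparser2Hz).toFixed(4)) // "0.0146"
```

Both derived means match the `mean` column, so the table is internally consistent with the reported 2.00x ratio.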

Benchmark code:

```ts
import { bench } from 'vitest'
import { Parser } from 'htmlparser2'
import stophtml from 'stophtml'

const html = getHtml()

bench('stophtml', () => {
  stophtml(html)
})

bench('htmlparser2', () => {
  htmlparser2Parser(html)
})

// Baseline: collect every text node reported by htmlparser2
function htmlparser2Parser(text: string) {
  const res: string[] = []

  const parser = new Parser({
    ontext(data) {
      res.push(data)
    }
  })

  parser.write(text)
  parser.end()

  return res.join(' ')
}

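// Sample document used as benchmark input. The markup is representative:
// tag names are assumed and the link target is a placeholder.
function getHtml() {
  return `
    <html>
    <head>
    <title>HTML Template</title>
    </head>
    <body>
    <h1>Welcome to my HTML Template</h1>
    <p>This is a paragraph within the HTML template.</p>
    <ul>
        <li>List item 1</li>
        <li>List item 2</li>
        <li>List item 3</li>
    </ul>
    <a href="#">Visit our website</a>
    </body>
    </html>
`
}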
```

## Contributing

We 💛 issues.

When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.

```bash
npm i -g commitizen cz-conventional-changelog
```

Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.

```bash
git add . && git cz
```

## License

![GitHub](https://img.shields.io/github/license/bent10/stophtml)

A project by [Stilearning](https://stilearning.com) © 2024.