https://github.com/bent10/stophtml
Extracts plain text from an HTML string. It's useful for Natural Language Processing (NLP) tasks.
- Host: GitHub
- URL: https://github.com/bent10/stophtml
- Owner: bent10
- License: mit
- Created: 2024-03-08T03:19:32.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-22T06:03:39.000Z (3 months ago)
- Last Synced: 2024-09-28T14:06:23.201Z (3 months ago)
- Topics: html, nlp, plaintext, strip, text, token, tokenize
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/stophtml
- Size: 135 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license
# stophtml
A utility for Node.js (`0.32 kB`) and the browser (`0.43 kB`) that extracts plain text from an HTML string while ignoring HTML tags. It's useful for Natural Language Processing (NLP) tasks that require only the textual content of HTML documents.
## Install
```bash
npm install stophtml
```

Or yarn:
```bash
yarn add stophtml
```

Alternatively, you can include this module directly in your HTML file from a CDN:
```yml
UMD: https://cdn.jsdelivr.net/npm/stophtml/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/stophtml/+esm
CJS: https://cdn.jsdelivr.net/npm/stophtml/dist/index.cjs
```

## Usage

```js
import stophtml from 'stophtml'

// Illustrative markup: the inline tags are an assumed example.
const input = '<p>This is <strong>bold</strong> and <em>italic</em>.</p>'
const segments = stophtml(input)

console.log(segments)
```

## API
### `stophtml(input: string): string[]`
Tokenizes an HTML string, extracting plain text while ignoring HTML tags.
- `input`: The input HTML string to tokenize.
Returns an array of plain text segments extracted from the HTML string.
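
Because `stophtml` returns separate text segments rather than one string, a downstream NLP step often needs to normalize and join them. The following is a minimal sketch of that step and is not part of the library: `segmentsToText` is a hypothetical helper name, and the sample array is hand-written to stand in for the return value of `stophtml(input)`.

```js
// Hypothetical helper (not part of stophtml): collapse whitespace inside
// each segment, drop segments that end up empty, and join the rest.
function segmentsToText(segments) {
  return segments
    .map(segment => segment.replace(/\s+/g, ' ').trim())
    .filter(segment => segment.length > 0)
    .join(' ')
}

// Hand-written stand-in for the array returned by stophtml(input).
const sampleSegments = ['\n  Welcome ', 'to my', '   ', 'HTML\ttemplate']

console.log(segmentsToText(sampleSegments)) // prints "Welcome to my HTML template"
```

Joining with a single space is a simplification: it can insert a space before punctuation that directly followed an inline tag, so adjust the separator to suit your tokenizer.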
## Related
- [boox](https://github.com/bent10/boox) – Performs full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.
- [stopmarkdown](https://github.com/bent10/stopmarkdown) – Extracts plain text from a Markdown string.
- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.
- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).

## Benchmark
```bash
 ✓ test/index.bench.ts (2) 1305ms
     name                 hz     min     max    mean     p75     p99    p995    p999     rme  samples
   · stophtml     136,571.33  0.0064  0.3648  0.0073  0.0069  0.0241  0.0263  0.1222  ±0.70%    68286   fastest
   · htmlparser2   68,310.52  0.0131  2.0111  0.0146  0.0138  0.0348  0.0458  0.0769  ±0.96%    34156

 BENCH  Summary

  stophtml - test/index.bench.ts >
    2.00x faster than htmlparser2
```

See benchmark code:
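```ts
import { bench } from 'vitest'
import { Parser } from 'htmlparser2'
import stophtml from 'stophtml'

const html = getHtml()

bench('stophtml', () => {
  stophtml(html)
})

bench('htmlparser2', () => {
  htmlparser2Parser(html)
})

// Baseline: collect every text node emitted by htmlparser2.
function htmlparser2Parser(text: string) {
  const res: string[] = []

  const parser = new Parser({
    ontext(data) {
      res.push(data)
    }
  })

  parser.write(text)
  parser.end()

  return res.join(' ')
}

// Sample document (markup reconstructed for illustration from the
// surviving text; the link target is a placeholder).
function getHtml() {
  return `
<!DOCTYPE html>
<html>
  <head>
    <title>HTML Template</title>
  </head>
  <body>
    <h1>Welcome to my HTML Template</h1>
    <p>This is a paragraph within the HTML template.</p>
    <ul>
      <li>List item 1</li>
      <li>List item 2</li>
      <li>List item 3</li>
    </ul>
    <a href="#">Visit our website</a>
  </body>
</html>
`
}
```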
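
The `2.00x` figure in the summary is consistent with the ratio of the two `hz` (operations per second) values reported in the benchmark output. A minimal sketch of that arithmetic, using only the reported numbers; the function name `speedup` is illustrative and not part of vitest:

```js
// Ratio of two throughput figures (operations per second).
function speedup(fasterHz, slowerHz) {
  return fasterHz / slowerHz
}

// hz values copied from the benchmark output.
const ratio = speedup(136571.33, 68310.52)

console.log(`${ratio.toFixed(2)}x`) // prints "2.00x"
```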
## Contributing
We 💛 issues.
When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.
```bash
npm i -g commitizen cz-conventional-changelog
```
Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.
```bash
git add . && git cz
```
## License
![GitHub](https://img.shields.io/github/license/bent10/stophtml)
A project by [Stilearning](https://stilearning.com) © 2024.