{"id":18048520,"url":"https://github.com/bent10/stophtml","last_synced_at":"2025-04-10T09:51:52.012Z","repository":{"id":226528487,"uuid":"768952982","full_name":"bent10/stophtml","owner":"bent10","description":"Extracts plain text from an HTML string. It's useful for Natural Language Processing (NLP) tasks.","archived":false,"fork":false,"pushed_at":"2025-01-05T07:21:12.000Z","size":91,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T18:50:24.344Z","etag":null,"topics":["html","nlp","plaintext","strip","text","token","tokenize"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/stophtml","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bent10.png","metadata":{"files":{"readme":"readme.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"license","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-08T03:19:32.000Z","updated_at":"2024-10-30T13:20:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"4ab6c00a-b467-4998-a496-d13753b124cb","html_url":"https://github.com/bent10/stophtml","commit_stats":null,"previous_names":["bent10/stophtml"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bent10%2Fstophtml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bent10%2Fstophtml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bent10%2Fstophtml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bent10%2Fstophtml/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bent10","download_url":"https://codeload.github.com/bent10/stophtml/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248197973,"owners_count":21063623,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html","nlp","plaintext","strip","text","token","tokenize"],"created_at":"2024-10-30T20:13:15.345Z","updated_at":"2025-04-10T09:51:51.985Z","avatar_url":"https://github.com/bent10.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# stophtml\n\nA utility for Node.js (`0.32 kB`) and the browser (`0.43 kB`) that extracts plain text from an HTML string while ignoring HTML tags. 
It's useful for Natural Language Processing (NLP) tasks that require only the textual content of HTML documents.\n\n## Install\n\n```bash\nnpm install stophtml\n```\n\nOr yarn:\n\n```bash\nyarn add stophtml\n```\n\nAlternatively, include this module directly in your HTML file from a CDN:\n\n```yml\nUMD: https://cdn.jsdelivr.net/npm/stophtml/dist/index.umd.js\nESM: https://cdn.jsdelivr.net/npm/stophtml/+esm\nCJS: https://cdn.jsdelivr.net/npm/stophtml/dist/index.cjs\n```\n\n## Usage\n\n```js\nimport stophtml from 'stophtml'\n\nconst input = '\u003cp\u003eThis is \u003cb\u003ebold\u003c/b\u003e and \u003ci\u003eitalic\u003c/i\u003e.\u003c/p\u003e'\nconst segments = stophtml(input)\n\nconsole.log(segments)\n```\n\n## API\n\n### `stophtml(input: string): string[]`\n\nTokenizes an HTML string, extracting plain text while ignoring HTML tags.\n\n- `input`: The input HTML string to tokenize.\n\nReturns an array of plain text segments extracted from the HTML string.\n\n## Related\n\n- [boox](https://github.com/bent10/boox) – Performing full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.\n- [stopmarkdown](https://github.com/bent10/stopmarkdown) – Extracts plain text from a Markdown string.\n- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.\n- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).\n\n## Benchmark\n\n```bash\n✓ test/index.bench.ts (2) 1305ms\n     name                 hz     min     max    mean     p75     p99    p995    p999     rme  samples\n   · stophtml     136,571.33  0.0064  0.3648  0.0073  0.0069  0.0241  0.0263  0.1222  ±0.70%    68286   fastest\n   · htmlparser2   68,310.52  0.0131  2.0111  0.0146  0.0138  
0.0348  0.0458  0.0769  ±0.96%    34156\n\n\n BENCH  Summary\n\n  stophtml - test/index.bench.ts \u003e\n    2.00x faster than htmlparser2\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eSee benchmark code\u003c/summary\u003e\n\n```js\nimport { bench } from 'vitest'\nimport { Parser } from 'htmlparser2'\nimport stophtml from 'stophtml'\n\nconst html = getHtml()\n\nbench('stophtml', () =\u003e {\n  stophtml(html)\n})\n\nbench('htmlparser2', () =\u003e {\n  htmlparser2Parser(html)\n})\n\nfunction htmlparser2Parser(text: string) {\n  const res: string[] = []\n\n  const parser = new Parser({\n    ontext(data) {\n      res.push(data)\n    }\n  })\n\n  parser.write(text)\n  parser.end()\n\n  return res.join(' ')\n}\n\nfunction getHtml() {\n  return `\u003c!DOCTYPE html\u003e\n\u003chtml lang=\"en\"\u003e\n\u003chead\u003e\n    \u003cmeta charset=\"UTF-8\"\u003e\n    \u003cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"\u003e\n    \u003ctitle\u003eHTML Template\u003c/title\u003e\n\u003c/head\u003e\n\u003cbody\u003e\n    \u003ch1\u003eWelcome to my HTML Template\u003c/h1\u003e\n    \u003cp\u003eThis is a paragraph within the HTML template.\u003c/p\u003e\n    \u003cul\u003e\n        \u003cli\u003eList item 1\u003c/li\u003e\n        \u003cli\u003eList item 2\u003c/li\u003e\n        \u003cli\u003eList item 3\u003c/li\u003e\n    \u003c/ul\u003e\n    \u003cimg src=\"https://example.com/image.jpg\" alt=\"Example Image\"\u003e\n    \u003ca href=\"https://example.com\"\u003eVisit our website\u003c/a\u003e\n\u003c/body\u003e\n\u003c/html\u003e\n`\n}\n```\n\n\u003c/details\u003e\n\n## Contributing\n\nWe 💛\u0026nbsp; issues.\n\nWhen committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.\n\n```bash\nnpm i -g commitizen cz-conventional-changelog\n```\n\nNow you can use `git cz` or just `cz` instead of `git commit` when committing. 
You can also use `git-cz`, which is an alias for `cz`.\n\n```bash\ngit add . \u0026\u0026 git cz\n```\n\n## License\n\n![GitHub](https://img.shields.io/github/license/bent10/stophtml)\n\nA project by [Stilearning](https://stilearning.com) \u0026copy; 2024.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbent10%2Fstophtml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbent10%2Fstophtml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbent10%2Fstophtml/lists"}