https://github.com/bent10/nomark
Transform hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization
html markdown nlp normalize normalizer plaintext text token tokenize transform transformer
- Host: GitHub
- URL: https://github.com/bent10/nomark
- Owner: bent10
- License: mit
- Created: 2024-03-08T06:47:57.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-01T12:49:33.000Z (4 months ago)
- Last Synced: 2024-09-15T14:49:00.936Z (3 months ago)
- Topics: html, markdown, nlp, normalize, normalizer, plaintext, text, token, tokenize, transform, transformer
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/nomark
- Size: 118 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license
README
# nomark
A utility to transform hypertext strings (e.g., HTML, Markdown) into plain text, which is useful for natural language processing (NLP) normalization.
## Install
```bash
npm install nomark
```
Or yarn:
```bash
yarn add nomark
```
Alternatively, you can include this module directly in your HTML file from a CDN:
```yml
UMD: https://cdn.jsdelivr.net/npm/nomark/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/nomark/+esm
CJS: https://cdn.jsdelivr.net/npm/nomark/dist/index.cjs
```
## Usage
````js
import nomark from 'nomark'

const hypertext =
  '# Café du Monde\n\nThis is some **bold**, _italic_, and ~~strikethrough~~ text.\n\n## Headers\n\n### This is an H3 header\n\n#### This is an H4 header\n\n##### This is an H5 header\n\n###### This is an H6 header\n\n## Lists\n\n### Unordered List\n\n- Item 1\n- Item 2\n - Subitem A\n - Subitem B\n - Sub-subitem 1\n - Sub-subitem 2\n\n### Ordered List\n\n1. First item\n2. Second item\n 1. Nested item\n 2. Another nested item\n\n## Links and Images\n\n[Example](https://example.com)\n\n![Example Logo](https://example.com/favicon.ico)\n\n## Blockquotes\n\n> This is a blockquote.\n>\n> - John Doe\n\n## Code Blocks\n\n```javascript\nfunction greet(name) {\n console.log(`Hello, ${name}!`)\n}\n\ngreet(\'World\')\n```\n\n## Tables\n\n| Name | Age | Gender |\n| ---- | --- | ------ |\n| John | 30 | Male |\n| Jane | 25 | Female |\n\n## Task Lists\n\n- [x] Task 1\n- [ ] Task 2\n- [x] Task 3\n\n## Emoji\n\n:smiley: :rocket: :book:\n\n## Strikethrough\n\n~~This text is strikethrough.~~\n\n## HTML tags\n\nThis is a red text.\n\n<p>This is a paragraph.</p>\n\n<blockquote>This is a blockquote in HTML.</blockquote>\n\n<ul>\n<li>HTML List Item 1</li>\n<li>HTML List Item 2</li>\n</ul>\n'
const plaintext = nomark(hypertext, {
  stripMarkdown: true,
  stripHtml: true
})
console.log(plaintext)
````
See the results:
```text
Café du Monde.
This is some bold, italic, and strikethrough text.
Headers.
This is an H3 header.
This is an H4 header.
This is an H5 header.
This is an H6 header.
Lists.
Unordered List.
Item 1.
Item 2.
Subitem A.
Subitem B.
Sub-subitem 1.
Sub-subitem 2.
Ordered List.
First item.
Second item.
Nested item.
Another nested item.
Links and Images.
Example.
Example Logo.
Blockquotes.
This is a blockquote.
John Doe.
Code Blocks.
function greet(name) {
console.log(`Hello, ${name}!`)
}
greet('World')
Tables.
Name, Age, Gender.
John, 30, Male.
Jane, 25, Female.
Task Lists.
Task 1.
Task 2.
Task 3.
Emoji.
:smiley: :rocket: :book:
Strikethrough.
This text is strikethrough.
HTML tags.
This is a red text.
This is a paragraph.
This is a blockquote in HTML.
HTML List Item 1
HTML List Item 2
GitHub Flavored Markdown (GFM) Features.
Code Blocks with Language Highlighting.
interface Person {
name: string
age: number
}
const person: Person = {
name: 'John Doe',
age: 30
}
Task Lists in Tables.
Task, Status.
Task 1, [x].
Task 2, [ ].
Task 3, [x].
Mentioning Users.
Hey @username, could you take a look at this?
URLs Automatically Linked.
https://example.com/foo/bar.
Strikethrough in Tables.
Item, Price.
Apple, $2.
Banana, $1.
Orange, $3.
Emoji in Headers.
:sparkles: Features :sparkles:
```
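Plain text in this shape, with one sentence-like unit per line, is convenient input for later NLP steps. As a rough sketch of one such step, the snippet below splits two lines taken from the output above into lowercase word tokens. It is not part of nomark: the `tokenize` helper is hypothetical and relies only on built-in string methods.

```js
// Hypothetical downstream step, not part of nomark: split plain text into
// lowercase word tokens. The text is normalized to NFC first so that accented
// letters are single code points; \p{L} and \p{N} then match Unicode letters
// and digits, which keeps words such as "Café" intact.
function tokenize(plaintext) {
  return plaintext.normalize('NFC').toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []
}

const sample =
  'Caf\u00e9 du Monde.\nThis is some bold, italic, and strikethrough text.'

console.log(tokenize(sample))
// → café, du, monde, this, is, some, bold, italic, and, strikethrough, text
```

Punctuation and line breaks are dropped because they fall outside the letter and digit classes; a real pipeline would likely need more careful handling of contractions, URLs, and emoji shortcodes.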
## API
### `nomark(input: string, options?: NomarkOptions): string`
This function transforms a hypertext string into plain text by applying [Unicode normalization](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize#form) and, when the corresponding options are enabled, stripping HTML tags and removing Markdown syntax.
- `input`: The hypertext string to transform.
- `options` (optional): Options for transforming the input.
  - `form` (optional): The [Unicode normalization](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize#form) form to apply. Defaults to `'NFC'`.
  - `stripHtml` (optional): Indicates whether to strip HTML tags from the text. Defaults to `false`.
  - `stripMarkdown` (optional): Indicates whether to strip Markdown syntax from the text. Defaults to `false`.
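To illustrate what the `form` option controls, here is a minimal sketch that uses only the built-in `String.prototype.normalize`, not nomark itself. The `applyForm` helper is an assumption made for this example; its `'NFC'` default simply mirrors the documented default. It shows that two visually identical strings can compare unequal until they are normalized.

```js
// Illustrative helper (not part of nomark): applies a Unicode normalization
// form with the built-in String.prototype.normalize, defaulting to 'NFC'
// to mirror the documented default of the `form` option.
function applyForm(input, form = 'NFC') {
  return input.normalize(form)
}

// "Café" written two ways: precomposed é vs. "e" + combining acute accent.
const composed = 'Caf\u00e9'
const decomposed = 'Cafe\u0301'

console.log(composed === decomposed) // false: different code points
console.log(applyForm(composed) === applyForm(decomposed)) // true after NFC
console.log(applyForm(composed, 'NFD').length) // 5: é is split into e + accent

// Compatibility forms also fold presentation variants, e.g. the "fi" ligature.
console.log(applyForm('\ufb01le', 'NFKC')) // "file"
```

Canonical forms (`'NFC'`, `'NFD'`) preserve the characters as written, while compatibility forms (`'NFKC'`, `'NFKD'`) also fold variants such as ligatures, which can be useful when text is compared or indexed.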
## Related
- [boox](https://github.com/bent10/boox) – Performs full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.
- [stophtml](https://github.com/bent10/stophtml) – Extracts plain text from an HTML string.
- [stopmarkdown](https://github.com/bent10/stopmarkdown) – Extracts plain text from a Markdown string.
- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).
## Contributing
We 💛 issues.
When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.
```bash
npm i -g commitizen cz-conventional-changelog
```
Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.
```bash
git add . && git cz
```
## License
![GitHub](https://img.shields.io/github/license/bent10/nomark)
A project by [Stilearning](https://stilearning.com) © 2024.