Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bent10/stopmarkdown
Extracts plain text from Markdown strings. It's useful for Natural Language Processing (NLP) tasks.
markdown nlp plaintext strip text token tokenize
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/bent10/stopmarkdown
- Owner: bent10
- License: mit
- Created: 2024-03-08T03:45:36.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-10-22T09:13:03.000Z (2 months ago)
- Last Synced: 2024-10-23T13:51:26.787Z (2 months ago)
- Topics: markdown, nlp, plaintext, strip, text, token, tokenize
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/stopmarkdown
- Size: 206 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license
# stopmarkdown
A utility for Node.js and the browser that extracts plain text from Markdown strings. It's useful for Natural Language Processing (NLP) tasks that require only the textual content of Markdown documents.
## Install
```bash
npm install stopmarkdown
```

Or yarn:

```bash
yarn add stopmarkdown
```

Alternatively, you can include this module directly in your HTML file from a CDN:

```yml
UMD: https://cdn.jsdelivr.net/npm/stopmarkdown/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/stopmarkdown/+esm
CJS: https://cdn.jsdelivr.net/npm/stopmarkdown/dist/index.cjs
```

## Usage

```js
import stopmarkdown from 'stopmarkdown'

const markdownContent = `
# Heading 1

This is a paragraph with some *italic* and **bold** text.

- Item 1
- Item 2

## Heading 2

> Blockquote

\`\`\`js
console.log('Code block');
\`\`\`
`

const segments = stopmarkdown(markdownContent)
console.log(segments)
```

## API

### `stopmarkdown(input: string): string[]`
Returns an array of text segments extracted from the Markdown string.
- `input`: The Markdown string to tokenize.
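
Because the return value is a plain `string[]`, it can be passed straight to ordinary text-processing code. The following is a minimal sketch of one possible downstream step, counting term frequencies. The `countTerms` helper is hypothetical and not part of stopmarkdown, and the sample `segments` array is hand-written for illustration; the exact segmentation the library produces for a given input is an assumption here.

```js
// Hand-written sample standing in for the result of stopmarkdown(input).
// Assumption: the real segmentation may differ from this illustration.
const segments = [
  'Heading 1',
  'This is a paragraph with some italic and bold text.',
  'Item 1',
  'Item 2'
]

// Hypothetical helper (not part of stopmarkdown): lowercases each segment,
// splits it on runs of characters that are neither letters nor digits,
// and counts how often each resulting term occurs.
function countTerms(segments) {
  const counts = new Map()
  for (const segment of segments) {
    const words = segment
      .toLowerCase()
      .split(/[^\p{L}\p{N}]+/u)
      .filter(Boolean)
    for (const word of words) {
      counts.set(word, (counts.get(word) ?? 0) + 1)
    }
  }
  return counts
}

const counts = countTerms(segments)
console.log(counts.get('item')) // 2
```

In real use, `segments` would come from calling `stopmarkdown(markdownContent)` as in the Usage section.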
## Related
- [boox](https://github.com/bent10/boox) – Performing full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.
- [stophtml](https://github.com/bent10/stophtml) – Extracts plain text from an HTML string.
- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.
- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).

## Contributing

We 💛 issues.
When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.
```bash
npm i -g commitizen cz-conventional-changelog
```

Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.

```bash
git add . && git cz
```

## License

![GitHub](https://img.shields.io/github/license/bent10/stopmarkdown)
A project by [Stilearning](https://stilearning.com) © 2024.