https://github.com/bent10/stopmarkdown

Extracts plain text from Markdown strings. It's useful for Natural Language Processing (NLP) tasks.

Host: GitHub
URL: https://github.com/bent10/stopmarkdown
Owner: bent10
License: mit
Created: 2024-03-08T03:45:36.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-17T02:33:27.000Z (5 months ago)
Last Synced: 2025-03-24T08:47:28.505Z (4 months ago)
Topics: markdown, nlp, plaintext, strip, text, token, tokenize
Language: TypeScript
Homepage: https://www.npmjs.com/package/stopmarkdown
Size: 233 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: readme.md
- Changelog: changelog.md
- License: license

README

# stopmarkdown

A utility for Node.js and the browser that extracts plain text from Markdown strings. It's useful for Natural Language Processing (NLP) tasks that require only the textual content of Markdown documents.

## Install

```bash
npm install stopmarkdown
```

Or with Yarn:

```bash
yarn add stopmarkdown
```

Alternatively, you can include this module directly in your HTML file from a CDN:

```yml
UMD: https://cdn.jsdelivr.net/npm/stopmarkdown/dist/index.umd.js
ESM: https://cdn.jsdelivr.net/npm/stopmarkdown/+esm
CJS: https://cdn.jsdelivr.net/npm/stopmarkdown/dist/index.cjs
```

## Usage

```js
import stopmarkdown from 'stopmarkdown'

// Sample Markdown covering headings, emphasis, a list, a blockquote, and a code block
const markdownContent = `
# Heading 1

This is a paragraph with some *italic* and **bold** text.

- Item 1
- Item 2

## Heading 2

> Blockquote

\`\`\`js
console.log('Code block');
\`\`\`
`

// Extract the text segments and print them
const segments = stopmarkdown(markdownContent)
console.log(segments)
```
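
If a single plain-text string is more convenient than an array, the segments can be joined after extraction. The sketch below is only an illustration: `toPlainText`, the `Extractor` type, and `standInExtract` are hypothetical names that are not part of this package. The stand-in extractor exists so the snippet runs without installing anything, and it does not reproduce the library's real output; in real code you would pass `stopmarkdown` itself.

```typescript
// Any function with the documented shape: (input: string) => string[]
type Extractor = (input: string) => string[]

// Join extracted segments into one newline-separated string,
// dropping segments that are empty after trimming.
function toPlainText(markdown: string, extract: Extractor): string {
  return extract(markdown)
    .map(segment => segment.trim())
    .filter(segment => segment.length > 0)
    .join('\n')
}

// Hypothetical stand-in so the example is self-contained.
// It only strips leading '#', '>', '-' and whitespace from each line.
const standInExtract: Extractor = input =>
  input.split('\n').map(line => line.replace(/^[#>\-\s]+/, ''))

const plainText = toPlainText('# Title\n\n- Item', standInExtract)
console.log(plainText)
```

Passing the extractor as a parameter keeps the helper easy to test and lets you swap in `stopmarkdown` without changing the helper.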

## API

### `stopmarkdown(input: string): string[]`

Returns an array of text segments extracted from the Markdown string.

- `input`: The Markdown string to tokenize.
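
Because the return value is a plain `string[]`, it can feed directly into simple NLP steps. Here is a minimal sketch of a word-frequency count over such an array. The `countTerms` helper and the sample segments are hypothetical and hand-written for illustration; they are not part of this package and not actual library output. The tokenizer is deliberately naive and ASCII-only.

```typescript
// Count how often each lowercase word appears across text segments.
// `segments` has the documented return shape: string[].
function countTerms(segments: string[]): Record<string, number> {
  // Use a prototype-less object so words like "constructor" are counted safely.
  const counts: Record<string, number> = Object.create(null)
  for (const segment of segments) {
    // Split on anything that is not an ASCII letter, digit, or apostrophe.
    const words = segment.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean)
    for (const word of words) {
      counts[word] = (counts[word] ?? 0) + 1
    }
  }
  return counts
}

// Hypothetical segments, written by hand (assumption, not real output).
const sampleSegments: string[] = ['Heading 1', 'This is a paragraph', 'Item 1']
const termCounts = countTerms(sampleSegments)
console.log(termCounts['1']) // 2
console.log(termCounts['paragraph']) // 1
```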

## Related

- [boox](https://github.com/bent10/boox) – Performs full-text search across multiple documents by combining [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score with [inverted index](https://en.wikipedia.org/wiki/Inverted_index) weight.

- [stophtml](https://github.com/bent10/stophtml) – Extracts plain text from an HTML string.

- [nomark](https://github.com/bent10/nomark) – Transforms hypertext strings (e.g., HTML, Markdown) into plain text for natural language processing (NLP) normalization.

- [stopword](https://github.com/fergiemcdowall/stopword) – Allows you to strip stopwords from an input text (supports a ton of languages).

## Contributing

We 💛 issues.

When committing, please conform to [the semantic-release commit standards](https://www.conventionalcommits.org/). Please install `commitizen` and the adapter globally, if you have not already.

```bash
npm i -g commitizen cz-conventional-changelog
```

Now you can use `git cz` or just `cz` instead of `git commit` when committing. You can also use `git-cz`, which is an alias for `cz`.

```bash
git add . && git cz
```

## License

![GitHub](https://img.shields.io/github/license/bent10/stopmarkdown)

A project by [Stilearning](https://stilearning.com) © 2024.
