https://github.com/echogarden-project/icu-segmentation-wasm

WebAssembly port of the ICU library's character, word, line-break, and sentence segmentation methods.

Host: GitHub
URL: https://github.com/echogarden-project/icu-segmentation-wasm
Owner: echogarden-project
License: mit
Created: 2024-11-27T11:34:56.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-12T08:48:11.000Z (7 months ago)
Last Synced: 2025-05-12T08:54:30.837Z (7 months ago)
Topics: character-segmentation, sentence-boundary-detection, sentence-segmentation, word-boundary, word-segmentation
Language: C
Homepage:
Size: 27.1 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

README

# ICU Segmentation Library (WebAssembly port)

Provides natural language text segmentation based on the [ICU (International Components for Unicode) C++ library](https://icu.unicode.org/), ported to WebAssembly.

* **Character segmentation** finds the boundaries between grapheme clusters, which may be longer than a single Unicode codepoint, taking into account various linguistic properties

* **Word segmentation** finds boundaries between words, based on rulesets. It also supports more challenging languages like Chinese, Japanese, Thai and Khmer, which require specialized lexicons to determine the boundaries

* **Line break boundary detection** finds potential locations where a line break can be added, for the purpose of word-wrapping

* **Sentence segmentation** finds sentence boundaries based on rulesets and language-specific lexicons

* Supports all recent JavaScript runtimes (Node.js and browsers)
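
To make the first point above concrete, here is a small sketch that uses only plain string operations (no calls into this package) to show how a single grapheme cluster can span several code points and UTF-16 code units. `countCodePoints` is a hypothetical helper written for this illustration and is not part of the library.

```ts
// Hypothetical helper (not part of this package): counts Unicode code points
// by treating each valid surrogate pair as a single code point.
function countCodePoints(text: string): number {
	let count = 0

	for (let i = 0; i < text.length; i++) {
		const unit = text.charCodeAt(i)
		const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff

		if (isHighSurrogate && i + 1 < text.length) {
			const next = text.charCodeAt(i + 1)

			// Skip the low surrogate that completes the pair
			if (next >= 0xdc00 && next <= 0xdfff) {
				i++
			}
		}

		count++
	}

	return count
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT
const accentedE = 'e\u0301'

// Man + ZERO WIDTH JOINER + woman + ZERO WIDTH JOINER + girl
const family = '\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67'

console.log(accentedE.length, countCodePoints(accentedE)) // 2 2
console.log(family.length, countCodePoints(family)) // 8 5
```

Each of these strings is a single grapheme cluster under the default Unicode segmentation rules, so a grapheme-aware splitter would be expected to return it as one part rather than 2 or 5.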

## Installation

```
npm install @echogarden/icu-segmentation-wasm
```

**Note**: package size is about 32 MB uncompressed (13.5 MB gzipped), due to the size of the ICU data bundled into the WebAssembly binary.

## Usage

### Split operations

Return an array of strings representing the parts.

```ts
import * as ICUSegmentation from '@echogarden/icu-segmentation-wasm'

// Initialize the module before calling the segmentation functions
await ICUSegmentation.initialize()

const str = 'Hello World! How are you doing today?'

console.log(ICUSegmentation.splitToCharacters(str))
// Outputs: [
//   'H','e','l','l','o',' ','W','o','r','l','d','!',' ','H','o','w',' ','a','r','e',' ',
//   'y','o','u',' ','d','o','i','n','g',' ','t','o','d','a','y','?'
// ]

console.log(ICUSegmentation.splitToWords(str))
// Outputs: [
//   'Hello', ' ', 'World', '!', ' ', 'How', ' ', 'are', ' ',
//   'you', ' ', 'doing', ' ', 'today', '?'
// ]

console.log(ICUSegmentation.splitToSentences(str, 'en'))
// Outputs: [
//   'Hello World! ',
//   'How are you doing today?'
// ]
```
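
In the outputs above, the parts concatenate back to the original string (whitespace and punctuation are kept as their own parts). Assuming that holds in general, which this README does not state explicitly, the start and end offset of each part can be recovered by accumulating lengths. `partsWithOffsets` and `PartWithOffsets` are hypothetical names used only for this sketch:

```ts
interface PartWithOffsets {
	text: string
	start: number
	end: number
}

// Hypothetical helper (not part of this package): attaches start/end offsets
// to each part, assuming the parts concatenate back to the original string.
function partsWithOffsets(parts: string[]): PartWithOffsets[] {
	const result: PartWithOffsets[] = []
	let offset = 0

	for (const text of parts) {
		result.push({ text, start: offset, end: offset + text.length })
		offset += text.length
	}

	return result
}

// Parts copied from the sentence-splitting output above
const sentences = ['Hello World! ', 'How are you doing today?']

console.log(partsWithOffsets(sentences))
// Outputs: [
//   { text: 'Hello World! ', start: 0, end: 13 },
//   { text: 'How are you doing today?', start: 13, end: 37 }
// ]
```

The offsets here are JavaScript string offsets (UTF-16 code units), so they can be passed directly to `String.prototype.slice`.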

### Iterator operations

Each of the following functions returns a JavaScript iterator over a sequence of boundary indexes, which can be consumed by a `for..of` loop:

```ts
createCharacterBreakIterator(text, language?)
createWordBreakIterator(text, language?)
createLineBreakIterator(text, language?)
createSentenceBreakIterator(text, language?)
```

Example usage of an iterator:

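```ts
import { initialize, createSentenceBreakIterator } from '@echogarden/icu-segmentation-wasm'

await initialize()

const text = 'Привет, мир! Как у тебя дела сегодня?'

for (const boundaryIndex of createSentenceBreakIterator(text, 'ru')) {
	console.log(boundaryIndex)
}

// Outputs:
// 0
// 13
// 37
```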
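
Boundary indexes can be turned back into substrings by slicing between consecutive boundaries. The sketch below assumes the indexes are offsets compatible with `String.prototype.slice` (UTF-16 code units), which is consistent with the example above but not confirmed by it, since that text contains no characters outside the Basic Multilingual Plane. `sliceAtBoundaries` is a hypothetical helper, not part of this package:

```ts
// Hypothetical helper (not part of this package): slices the text between
// each pair of consecutive boundary indexes.
function sliceAtBoundaries(text: string, boundaries: Iterable<number>): string[] {
	const segments: string[] = []
	let previous: number | undefined

	for (const boundary of boundaries) {
		// Every boundary after the first closes the segment opened by the previous one
		if (previous !== undefined && boundary > previous) {
			segments.push(text.slice(previous, boundary))
		}

		previous = boundary
	}

	return segments
}

const russianText = 'Привет, мир! Как у тебя дела сегодня?'

// Boundary indexes copied from the example output above
console.log(sliceAtBoundaries(russianText, [0, 13, 37]))
// Outputs: [ 'Привет, мир! ', 'Как у тебя дела сегодня?' ]
```

Because the helper accepts any iterable of numbers, the result of one of the `create...BreakIterator` functions could be passed in place of the literal array.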

## Building the WebAssembly module

See [this guide](docs/Building.md), which goes through the process of how ICU is built and linked to the WebAssembly wrapper used in this package.

## License

MIT
