https://github.com/eddieantonio/unicode-default-word-boundary

Split words with Unicode's default word boundary specification
https://github.com/eddieantonio/unicode-default-word-boundary

internationalization split text unicode word-boundary word-break

Last synced: about 1 month ago
JSON representation

Split words with Unicode's default word boundary specification

Host: GitHub
URL: https://github.com/eddieantonio/unicode-default-word-boundary
Owner: eddieantonio
License: mit
Created: 2019-05-03T15:26:46.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-09-12T10:11:02.000Z (about 1 year ago)
Last Synced: 2025-10-05T04:21:02.194Z (about 2 months ago)
Topics: internationalization, split, text, unicode, word-boundary, word-break
Language: TypeScript
Homepage:
Size: 661 KB
Stars: 13
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          Unicode Default Word Boundary

=============================

[![Build status](https://github.com/eddieantonio/unicode-default-word-boundary/actions/workflows/node.js.yml/badge.svg)](https://github.com/eddieantonio/unicode-default-word-boundary/actions/workflows/node.js.yml)

[![npm](https://img.shields.io/npm/v/unicode-default-word-boundary.svg)](https://www.npmjs.com/package/unicode-default-word-boundary)

Implements the [Unicode UAX #29 §4.1 default word boundary

specification][defaultwb], for finding **word breaks** in **multilingual

text**.

Use this to split words in text! Using UAX #29 is a lot smarter than the

`\b` word boundary in JavaScript's regular expressions! Note that

character classes like `\b`, `\w`, `\d` [only work on ASCII

characters][mdnregexp].

Usage

-----

Import the module and use the `split()` function:

```js

const split = require('unicode-default-word-boundary').split;

console.log(split(`The quick (“brown”) fox can’t jump 32.3 feet, right?`));

```

Output:

    [ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]

But that's not all! Try it with non-English text, like Russian:

```javascript

split(`В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!`)

```

    [ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]

...Hebrew:

```javascript

split(`איך בלש תפס גמד רוצח עז קטנה?`);

```

    [ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]

...[nêhiyawêwin][]:

```javascript

split(`ᑕᐻ ᒥᔪ ᑭᓯᑲᐤ ᐊᓄᐦᐨ᙮`);

```

    [ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]

...and many more!

More advanced use cases will want to use the `findSpans()` or the

`findBoundaries()` function.

What doesn't work

-----------------

Languages that do not have obvious word breaks, such as Chinese,

Japanese, Thai, Lao, and Khmer. You'll need to use statistical or

dictionary-based approaches to split words in these languages.

API Documentation

-----------------

The following functions make up the primary API:

### `split(text: string): string[]`

`split()` splits the text at word boundaries, returning an array of all

"words" from the text that contain characters other than whitespace.

See above for examples.

### `findSpans(text: string): Iterable`

`findSpans()` is a generator that yields successive _basic spans_ from

the text. A basic span is a chunk of text that is guaranteed to

start at a word boundary and end at the next word boundary. In other

words, basic spans are _indivisible_ in that there are no word

boundaries contained within a basic span.

A basic span has the following properties:

```typescript

interface BasicSpan {

    /** Where the span starts, relative to the input text. */

    start: number;

    /** At what index does the **next** span begin. */

    end: number;

    /** How many characters are in this span. */

    length: number;

    /** The text contained within this span. */

    text: string;

}

```

Note that unlike, `split()`, `findSpans()` **does** yield spans that

contain whitespace.

#### Example

`Array.from(findSpans("Hello, world🌎!"))`

Will yield spans with the following properties:

```javascript

[ { start: 0, end: 5, length: 5, text: 'Hello' },

  { start: 5, end: 6, length: 1, text: ',' },

  { start: 6, end: 7, length: 1, text: ' ' },

  { start: 7, end: 12, length: 5, text: 'world' },

  { start: 12, end: 14, length: 2, text: '🌎' },

  { start: 14, end: 15, length: 1, text: '!' } ]

```

**N.B.**: `findSpans()` may _not_ yield plain JavaScript objects, as

shown above. The objects that `findSpans()` yield will adhere to the

`BasicSpan` interface, however what `findSpans()` actually yields may

differ from simple objects.

### `findBoundaries(text: string): Generator`

`findBoundaries()` is like `findSpans()` except it yields the _index_ of

each successive word boundary. Anecdotally, using this function directly

may be faster than generating spans objects with `findSpans()`.

Contributing and Maintaining

----------------------------

When maintaining this package, you might notice something strange.

`index.ts` depends on `./src/gen/WordBreakProperty.ts`, but this file

does not exist! It is a **generated** file, created by reading Unicode

property data files, [downloaded from Unicode's website][unicodefiles].

These data files have been compressed and committed to this repository

in `libexec/`:

    libexec/

    libexec/

    ├── WordBreakProperty-15.1.0.txt.gz

    ├── compile-word-break.js

    └── emoji-data-15.1.0.txt.gz

**Note that `compile-word-break.js` actually creates

`./src/gen/WordBreakProperty.ts`!**

### How to generate `./src/gen/WordBreakProperty.ts`

When you have _just_ cloned the repository, this file will be generated

when you run `npm install`:

    npm install

If you want to regenerate it afterwards, you can run the build script:

    npm run build

### Benchmarking

To run the benchmarks, you can run the following:

    npm run benchmarks

If you want to compare the current implementation with a new

implementation, what I do is create a new working tree called `opt/`:

    git worktree add -b «NEW-BRANCH-NAME» opt

Then, I make changes in the working tree inside `opt/`, **compile

and run the tests**, then, in the main working tree, I run the

benchmarks:

    cd opt/

    npm install

    vim       # do whatever you need to do here

    npm test  # this also compiles the TypeScript

    cd ..

    npm run benchmarks

License

-------

TypeScript implementation © 2019 National Research Council Canada,

© 2024 Eddie Antonio Santos. MIT Licensed.

The algorithm comes from [UAX #29: Unicode Text Segmentation, an

integral part of the Unicode Standard, version 15.1][uax29].

[defaultwb]: https://unicode.org/reports/tr29/#Default_Word_Boundaries

[mdnregexp]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types

[nêhiyawêwin]: https://en.wikipedia.org/wiki/Plains_Cree

[uax29]: https://unicode.org/reports/tr29/

[unicodefiles]: https://unicode.org/reports/tr41/tr41-24.html

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/eddieantonio/unicode-default-word-boundary

Awesome Lists containing this project

README