https://github.com/eddieantonio/unicode-default-word-boundary
Split words with Unicode's default word boundary specification
internationalization split text unicode word-boundary word-break
- Host: GitHub
- URL: https://github.com/eddieantonio/unicode-default-word-boundary
- Owner: eddieantonio
- License: mit
- Created: 2019-05-03T15:26:46.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-09-12T10:11:02.000Z (2 months ago)
- Last Synced: 2024-10-13T15:49:32.826Z (about 1 month ago)
- Topics: internationalization, split, text, unicode, word-boundary, word-break
- Language: TypeScript
- Homepage:
- Size: 661 KB
- Stars: 12
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Unicode Default Word Boundary
=============================

[![Build status](https://github.com/eddieantonio/unicode-default-word-boundary/actions/workflows/node.js.yml/badge.svg)](https://github.com/eddieantonio/unicode-default-word-boundary/actions/workflows/node.js.yml)
[![npm](https://img.shields.io/npm/v/unicode-default-word-boundary.svg)](https://www.npmjs.com/package/unicode-default-word-boundary)

Implements the [Unicode UAX #29 §4.1 default word boundary
specification][defaultwb], for finding **word breaks** in **multilingual
text**.

Use this to split words in text! Using UAX #29 is a lot smarter than the
`\b` word boundary in JavaScript's regular expressions! Note that `\b`
and character classes like `\w` and `\d` [only work on ASCII
characters][mdnregexp].

Usage
-----

Import the module and use the `split()` function:

```js
const split = require('unicode-default-word-boundary').split;

console.log(split(`The quick (“brown”) fox can’t jump 32.3 feet, right?`));
```

Output:

    [ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]

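
For comparison, here is a small sketch of what a plain `\w+` regular expression does with similar text. It uses only built-in JavaScript and is not part of this package; the helper name `asciiWords` is made up for this illustration.

```javascript
// Hypothetical helper (not part of this package): "words" as seen by
// JavaScript's ASCII-only \w character class.
function asciiWords(text) {
  return text.match(/\w+/g) || [];
}

// The curly apostrophe and the decimal point both end a \w+ run,
// so "can’t" and "32.3" are each cut in two.
console.log(asciiWords('can’t jump 32.3 feet'));
// [ 'can', 't', 'jump', '32', '3', 'feet' ]

// Cyrillic letters are not matched by \w at all.
console.log(asciiWords('Да, но фальшивый экземпляр!'));
// []
```

The `split()` output shown above keeps `can’t` and `32.3` intact, which is the difference the UAX #29 rules make.
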
But that's not all! Try it with non-English text, like Russian:
```javascript
split(`В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!`)
```

    [ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]

...Hebrew:
```javascript
split(`איך בלש תפס גמד רוצח עז קטנה?`);
```

    [ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]

...[nêhiyawêwin][]:
```javascript
split(`ᑕᐻ ᒥᔪ ᑭᓯᑲᐤ ᐊᓄᐦᐨ᙮`);
```

    [ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]

...and many more!
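
As the outputs above show, `split()` returns punctuation marks as items of their own. If only word-like items are wanted, one option is a filtering step afterwards. The following is a minimal sketch, assuming that a "word" is any item containing at least one Unicode letter or number; `onlyWords` is a hypothetical helper, not part of this package, and the sample array is copied from the Russian output above rather than produced by calling the library.

```javascript
// Hypothetical helper (not part of this package): keep only items that
// contain at least one Unicode letter (\p{L}) or number (\p{N}).
function onlyWords(items) {
  return items.filter((item) => /[\p{L}\p{N}]/u.test(item));
}

// Sample input copied from the Russian split() output shown above.
const items = [ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?',
  'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ];

const words = onlyWords(items);
// words now holds the ten Cyrillic items; '?', ',' and '!' are dropped.
console.log(words.length); // 10
```

The `u` flag is what enables `\p{…}` property escapes, so this filter works across scripts instead of being limited to ASCII.
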
More advanced use cases will want to use the `findSpans()` or the
`findBoundaries()` function.

What doesn't work
-----------------

This package does not work for languages that do not have obvious word
breaks, such as Chinese, Japanese, Thai, Lao, and Khmer. You'll need to
use statistical or dictionary-based approaches to split words in these
languages.

API Documentation
-----------------

The following functions make up the primary API:

### `split(text: string): string[]`
`split()` splits the text at word boundaries, returning an array of all
"words" from the text that contain characters other than whitespace.

See above for examples.

### `findSpans(text: string): Iterable<BasicSpan>`
`findSpans()` is a generator that yields successive _basic spans_ from
the text. A basic span is a chunk of text that is guaranteed to
start at a word boundary and end at the next word boundary. In other
words, basic spans are _indivisible_ in that there are no word
boundaries contained within a basic span.

A basic span has the following properties:

```typescript
interface BasicSpan {
  /** Where the span starts, relative to the input text. */
  start: number;
  /** The index at which the **next** span begins. */
  end: number;
  /** How many UTF-16 code units (JavaScript string length) are in this span. */
  length: number;
  /** The text contained within this span. */
  text: string;
}
```

Note that, unlike `split()`, `findSpans()` **does** yield spans that
contain whitespace.

#### Example

`Array.from(findSpans("Hello, world🌎!"))`
Will yield spans with the following properties:
```javascript
[ { start: 0, end: 5, length: 5, text: 'Hello' },
{ start: 5, end: 6, length: 1, text: ',' },
{ start: 6, end: 7, length: 1, text: ' ' },
{ start: 7, end: 12, length: 5, text: 'world' },
{ start: 12, end: 14, length: 2, text: '🌎' },
{ start: 14, end: 15, length: 1, text: '!' } ]
```

**N.B.**: `findSpans()` may _not_ yield plain JavaScript objects like
the ones shown above. The objects that `findSpans()` yields will adhere
to the `BasicSpan` interface, but they may not be simple object literals.

### `findBoundaries(text: string): Generator`
`findBoundaries()` is like `findSpans()` except it yields the _index_ of
each successive word boundary. Anecdotally, using this function directly
may be faster than generating span objects with `findSpans()`.

Contributing and Maintaining
----------------------------

When maintaining this package, you might notice something strange.
`index.ts` depends on `./src/gen/WordBreakProperty.ts`, but this file
does not exist! It is a **generated** file, created by reading Unicode
property data files, [downloaded from Unicode's website][unicodefiles].
These data files have been compressed and committed to this repository
in `libexec/`:

    libexec/
    ├── WordBreakProperty-15.1.0.txt.gz
    ├── compile-word-break.js
    └── emoji-data-15.1.0.txt.gz

**Note that `compile-word-break.js` actually creates
`./src/gen/WordBreakProperty.ts`!**

### How to generate `./src/gen/WordBreakProperty.ts`

When you have _just_ cloned the repository, this file will be generated
when you run `npm install`:

    npm install

If you want to regenerate it afterwards, you can run the build script:

    npm run build

### Benchmarking
To run the benchmarks, you can run the following:

    npm run benchmarks

If you want to compare the current implementation with a new
implementation, what I do is create a new working tree called `opt/`:

    git worktree add -b «NEW-BRANCH-NAME» opt

Then, I make changes in the working tree inside `opt/`, **compile
and run the tests**, then, in the main working tree, I run the
benchmarks:

    cd opt/
    npm install
    vim # do whatever you need to do here
    npm test # this also compiles the TypeScript
    cd ..
    npm run benchmarks

License
-------

TypeScript implementation © 2019 National Research Council Canada,
© 2024 Eddie Antonio Santos. MIT Licensed.

The algorithm comes from [UAX #29: Unicode Text Segmentation, an
integral part of the Unicode Standard, version 15.1][uax29].

[defaultwb]: https://unicode.org/reports/tr29/#Default_Word_Boundaries
[mdnregexp]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types
[nêhiyawêwin]: https://en.wikipedia.org/wiki/Plains_Cree
[uax29]: https://unicode.org/reports/tr29/
[unicodefiles]: https://unicode.org/reports/tr41/tr41-24.html