https://github.com/ufcpp/graphemesplitter

# GraphemeSplitter

A C# implementation of the Unicode grapheme cluster breaking algorithm.

## Notes

- This library uses the Unicode 10.0 version of the grapheme boundary algorithm.
- As of .NET 5.0, [`StringInfo.GetTextElementEnumerator`](https://docs.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.gettextelementenumerator) enumerates graphemes correctly, using the Unicode 13.0 algorithm.
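
For comparison, here is a minimal sketch of the built-in approach, assuming .NET 5.0 or later (older runtimes may split emoji sequences differently). The `TextElementSample` class and its `TextElements` helper are hypothetical names used only for illustration.

```cs
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSample
{
    // Hypothetical helper: yields each text element (grapheme cluster) of s
    // using the built-in StringInfo enumerator.
    public static IEnumerable<string> TextElements(string s)
    {
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            yield return e.GetTextElement();
        }
    }
}
```

For example, `string.Join(", ", TextElementSample.TextElements("a\u0300b"))` keeps the combining grave accent with the preceding `a`, so the input yields two elements.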

## NuGet package

https://www.nuget.org/packages/GraphemeSplitter/

```powershell
Install-Package GraphemeSplitter
```

## Sample

```cs
using GraphemeSplitter;
using static System.Console;
using static System.String;

public partial class Program
{
    static string Split(string s) => Join(", ", s.GetGraphemes());

    static void Main()
    {
        WriteLine(Split("👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦")); // 👨‍👨‍👧‍👦, 👩‍👩‍👧‍👦, 👨‍👨‍👧‍👦
    }
}
```

[Web Sample](tree/master/RazorPageSample):

![Razor Page Sample](doc/RazorPageSample.png)

## Implementation

This library basically implements the grapheme cluster boundary rules of Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).

Example:

type | text | split result
--- | --- | ---
diacritical marks | à̠́̑ḅ̂̃̒c̣̤̃̄d̥̦̅̆ | "à̠́̑", "ḅ̂̃̒", "c̣̤̃̄", "d̥̦̅̆"
variation selector | 葛葛󠄀葛󠄁 | "葛", "葛󠄀", "葛󠄁"
Hangul syllable | 안녕하세요 | "안", "녕", "하", "세", "요"
family emoji | 👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦 | "👨‍👨‍👧‍👦", "👩‍👩‍👧‍👦", "👨‍👨‍👧‍👦"
emoji skin tone | 👩🏻👱🏼👧🏽👦🏾 | "👩🏻", "👱🏼", "👧🏽", "👦🏾"

However, it relaxes the GB10, GB12, and GB13 rules for simplicity.

original:

- GB10 … (E_Base | EBG) Extend* × E_Modifier
- GB12 … sot (RI RI)* RI × RI
- GB13 … [^RI] (RI RI)* RI × RI

implemented:

- GB10 … (E_Base | EBG) × Extend
- GB10 … (E_Base | EBG | Extend) × E_Modifier
- GB12/GB13 … RI × RI
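
The implemented rules can be read as a check on adjacent pairs only, with no look-back over earlier code points. Below is a minimal sketch restricted to these three rules. The `Prop` enum and `NoBreak` function are hypothetical names for illustration, not the library's actual types, and the other TR29 rules (such as GB9) are omitted.

```cs
// Hypothetical subset of Grapheme_Cluster_Break property values (Unicode 10.0 names).
public enum Prop { Other, Extend, E_Base, EBG, E_Modifier, RI }

public static class SimplifiedRules
{
    // Returns true when the simplified GB10/GB12/GB13 rules forbid a break
    // between two adjacent code points with properties prev and next.
    public static bool NoBreak(Prop prev, Prop next)
    {
        // GB10 (simplified): (E_Base | EBG) × Extend
        if ((prev == Prop.E_Base || prev == Prop.EBG) && next == Prop.Extend)
            return true;

        // GB10 (simplified): (E_Base | EBG | Extend) × E_Modifier
        if ((prev == Prop.E_Base || prev == Prop.EBG || prev == Prop.Extend)
            && next == Prop.E_Modifier)
            return true;

        // GB12/GB13 (simplified): RI × RI
        if (prev == Prop.RI && next == Prop.RI)
            return true;

        return false;
    }
}
```

Because only the previous property is inspected, `NoBreak(Prop.Extend, Prop.E_Modifier)` returns `true` regardless of what preceded the `Extend`, and `NoBreak(Prop.RI, Prop.RI)` returns `true` regardless of how many regional indicators came before.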

The differences are:

sequence | original | implemented
--- | --- | ---
à🏻 (U+61, U+300, U+1F3FB) | × ÷ | × ×
🇯🇵🇺🇸 (U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8) | × ÷ × | × × ×

(where ÷ and × mean a boundary and no boundary, respectively.)
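
To make the regional indicator row concrete, the sketch below generates the boundary marks between adjacent code points in a run of regional indicators under both readings. It works only on the length of the run. `RegionalIndicatorRuns`, `OriginalMarks`, and `SimplifiedMarks` are hypothetical names, not part of the library.

```cs
using System;
using System.Linq;

public static class RegionalIndicatorRuns
{
    // GB12/GB13 as specified: a break is forbidden only when an odd number
    // of RIs precedes the position, so the marks alternate × ÷ × ÷ ...
    public static string OriginalMarks(int riCount) =>
        string.Join(" ", Enumerable.Range(1, Math.Max(0, riCount - 1))
                                   .Select(i => i % 2 == 1 ? "×" : "÷"));

    // Simplified RI × RI: no break anywhere inside the run.
    public static string SimplifiedMarks(int riCount) =>
        string.Join(" ", Enumerable.Repeat("×", Math.Max(0, riCount - 1)));
}
```

For a run of four regional indicators, `OriginalMarks(4)` gives `× ÷ ×` (two flags) while `SimplifiedMarks(4)` gives `× × ×` (one cluster), matching the table.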

## Acknowledgements

This library is influenced by:
- https://github.com/devongovett/grapheme-breaker
- https://github.com/orling/grapheme-splitter
- https://github.com/unicode-rs/unicode-segmentation