https://github.com/ufcpp/graphemesplitter
A C# implementation of the Unicode grapheme cluster breaking algorithm
- Host: GitHub
- URL: https://github.com/ufcpp/graphemesplitter
- Owner: ufcpp
- License: mit
- Created: 2017-10-27T16:55:22.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2020-11-10T03:31:46.000Z (about 5 years ago)
- Last Synced: 2024-11-09T06:56:20.460Z (about 1 year ago)
- Language: C#
- Homepage:
- Size: 179 KB
- Stars: 48
- Watchers: 5
- Forks: 7
- Open Issues: 4
Metadata Files:
- Readme: readme.md
- License: LICENSE
# GraphemeSplitter
A C# implementation of the Unicode grapheme cluster breaking algorithm.
## **Notes**
- This library uses the Unicode 10.0 version of the grapheme boundary algorithm.
- In .NET 5.0, [`StringInfo.GetTextElementEnumerator`](https://docs.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.gettextelementenumerator) can enumerate graphemes correctly with the Unicode 13.0 algorithm.
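
For comparison, here is a minimal sketch of enumerating text elements with the built-in `StringInfo` API mentioned above. The class name `TextElements` and its `Split` method are hypothetical helpers for illustration and are not part of GraphemeSplitter; only `StringInfo.GetTextElementEnumerator`, `MoveNext`, and `GetTextElement` come from the BCL.

```cs
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper (not part of GraphemeSplitter): collects the text
    // elements reported by the BCL's StringInfo enumerator into a list.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            // GetTextElement returns the element at the current position.
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

Calling `string.Join(", ", TextElements.Split(text))` gives a result comparable to the `Split` helper in the sample below. On .NET 5.0 or later, a ZWJ family emoji should come back as a single element; older runtimes may split it into several.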
## NuGet package
https://www.nuget.org/packages/GraphemeSplitter/
```powershell
Install-Package GraphemeSplitter
```
## Sample
```cs
using GraphemeSplitter;
using static System.Console;
using static System.String;

public partial class Program
{
    static string Split(string s) => Join(", ", s.GetGraphemes());

    static void Main()
    {
        WriteLine(Split("👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦")); // 👨‍👨‍👧‍👦, 👩‍👩‍👧‍👦, 👨‍👨‍👧‍👦
    }
}
```
[Web Sample](tree/master/RazorPageSample)

## Implementation
This library basically implements http://unicode.org/reports/tr29/.
Example:
type | text | split result
--- | --- | ---
diacritical marks | à̡̠́ḅ̢̂̃c̣̤̃̄d̥̦̅̆ | "à̡̠́", "ḅ̢̂̃", "c̣̤̃̄", "d̥̦̅̆"
variation selector | 葛葛󠄀葛󠄁 | "葛", "葛󠄀", "葛󠄁"
asian syllable | 안녕하세요 | "안", "녕", "하", "세", "요"
family emoji | 👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦 | "👨‍👨‍👧‍👦", "👩‍👩‍👧‍👦", "👨‍👨‍👧‍👦"
emoji skin tone | 👩🏻👱🏼👧🏽👦🏾 | "👩🏻", "👱🏼", "👧🏽", "👦🏾"

However, the GB10, GB12, and GB13 rules are relaxed for simplicity.

original:
- GB10 … (E_Base | EBG) Extend* × E_Modifier
- GB12 … sot (RI RI)* RI × RI
- GB13 … [^RI] (RI RI)* RI × RI

implemented:
- GB10 … (E_Base | EBG) × Extend
- GB10 … (E_Base | EBG | Extend) × E_Modifier
- GB12/GB13 … RI × RI

The difference is:

sequence | original | implemented
--- | --- | ---
à🏻 (U+61, U+300, U+1F3FB) | × ÷ | × ×
🇯🇵🇺🇸 (U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8) | × ÷ × | × × ×

(where ÷ and × mean boundary and no boundary, respectively.)
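
The regional-indicator case can be sketched in a few lines. The code below is an illustration only, not the library's actual implementation: `RegionalIndicatorSketch`, `Split`, and the `pairUp` flag are hypothetical names, and the sketch handles only strings made up entirely of regional indicators (U+1F1E6 to U+1F1FF). With `pairUp` set to `true` it follows GB12/GB13 as specified and breaks after every second RI; with `false` it applies the relaxed `RI × RI` rule and never breaks between RIs.

```cs
using System;
using System.Collections.Generic;
using System.Text;

public static class RegionalIndicatorSketch
{
    // Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    static bool IsRI(int cp) => cp >= 0x1F1E6 && cp <= 0x1F1FF;

    // Hypothetical helper: splits a string that contains only regional indicators.
    // pairUp = true  : GB12/GB13 as specified (RIs are grouped in pairs).
    // pairUp = false : relaxed "RI × RI" rule (the whole run is one cluster).
    public static List<string> Split(string s, bool pairUp)
    {
        var clusters = new List<string>();
        var current = new StringBuilder();
        int riCount = 0; // number of RIs in the current cluster
        int i = 0;

        while (i < s.Length)
        {
            int cp = char.ConvertToUtf32(s, i);
            int len = char.IsSurrogatePair(s, i) ? 2 : 1;
            if (!IsRI(cp))
                throw new ArgumentException("This sketch handles regional indicators only.", nameof(s));

            // Original rules: break before an RI that follows a complete pair.
            if (pairUp && riCount == 2)
            {
                clusters.Add(current.ToString());
                current.Clear();
                riCount = 0;
            }

            current.Append(s, i, len);
            riCount++;
            i += len;
        }

        if (current.Length > 0) clusters.Add(current.ToString());
        return clusters;
    }
}
```

For the four-code-point sequence in the table (U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8), `Split(s, true)` yields two clusters and `Split(s, false)` yields one, matching the "original" and "implemented" columns.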
## Acknowledgements
This library is influenced by
- https://github.com/devongovett/grapheme-breaker
- https://github.com/orling/grapheme-splitter
- https://github.com/unicode-rs/unicode-segmentation