https://github.com/seanghay/betterkhmer

Regex-free, fast Khmer Encoding normalizer ported to 18 languages

Host: GitHub
URL: https://github.com/seanghay/betterkhmer
Owner: seanghay
Created: 2026-05-15T11:33:55.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-16T03:59:38.000Z (about 2 months ago)
Last Synced: 2026-05-26T12:46:01.005Z (about 2 months ago)
Topics: c, cpp, csharp, dart, flutter, go, java, khmer, khmer-normalize, khmer-normalizer, kotlin, perl, php, python, ruby, rust, zig
Language: Objective-C
Homepage:
Size: 1.18 MB
Stars: 15
Watchers: 0
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-khmer-language - seanghay/betterkhmer - Regex-free, fast Khmer Encoding normalizer ported to 18 languages (Awesome Khmer Language / 2. Toolkit)

README

# BetterKhmer

Khmer Unicode normalizer ported to 18 languages. All implementations expose a single `normalize()` function and pass the same 10,085-line fixture suite.

Normalizes Khmer text according to the [proposed normal encoding structure](https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf) (Unicode L2/22-290). It does not attempt to identify faulty text; it ensures that two strings that would render the same are output as the same string.

Based on the original [khmer-normalizer](https://github.com/seanghay/khmer-normalizer) by [SIL Global](https://software.sil.org/), MIT license.

## Example

ខែ្មរ is corrected to ខ្មែរ:

- Input: ខ `U+1781` ែ `U+17C2` ្ `U+17D2` ម `U+1798` រ `U+179A`

- Output: ខ `U+1781` ្ `U+17D2` ម `U+1798` ែ `U+17C2` រ `U+179A`

## Languages

**This is not published to any package registry.** Each port is one self-contained source file; copy it straight into your project.

| Language    | Source file (copy into your project) |
|-------------|--------------------------------------|
| Python      | `python/betterkhmer/src/betterkhmer/__init__.py` |
| Go          | `go/betterkhmer/betterkhmer.go` |
| Rust        | `rust/betterkhmer/src/lib.rs` |
| Swift       | `swift/betterkhmer/Sources/BetterKhmer/BetterKhmer.swift` |
| Dart        | `dart/betterkhmer/lib/betterkhmer.dart` |
| Ruby        | `ruby/betterkhmer/lib/betterkhmer.rb` |
| PHP         | `php/betterkhmer/src/BetterKhmer.php` |
| Java        | `java/betterkhmer/src/main/java/com/betterkhmer/BetterKhmer.java` |
| Kotlin      | `kotlin/betterkhmer/src/main/kotlin/com/betterkhmer/BetterKhmer.kt` |
| C#          | `csharp/betterkhmer/src/BetterKhmer.cs` |
| C           | `c/betterkhmer/src/betterkhmer.c` (+ `.h`) |
| C++         | `cpp/betterkhmer/src/betterkhmer.cpp` (+ `.hpp`) |
| TypeScript  | `typescript/betterkhmer/src/index.ts` |
| Zig         | `zig/betterkhmer/src/betterkhmer.zig` |
| Perl        | `perl/betterkhmer/lib/BetterKhmer.pm` |
| Elixir      | `elixir/betterkhmer/lib/betterkhmer.ex` |
| VB.NET      | `vbnet/betterkhmer/src/BetterKhmer.vb` |
| Objective-C | `objc/betterkhmer/src/BetterKhmer.m` (+ `.h`) |
| Lua         | `lua/betterkhmer/betterkhmer.lua` |

## API

Each language exposes one function: **`normalize(input, lang="km")`**.

- `lang = "km"` — Modern Khmer (default)

- `lang = "xhm"` — Middle Khmer

```python
# Python
from betterkhmer import normalize

result = normalize("ខ្មែរ")
```

```go
// Go
result := betterkhmer.Normalize("ខ្មែរ")
```

```typescript
// TypeScript / JavaScript
import { normalize } from 'betterkhmer';

const result = normalize('ខ្មែរ');
```

See the per-language `README.md` in each subdirectory for usage and test details.

## Why this exists

Khmer syllables are two-dimensional arrangements of marks surrounding a base consonant. Unicode does not mandate a single encoding order for these marks, so the same rendered word can be stored as multiple distinct byte sequences.

The word ស្ត្រី ("woman") can be encoded at least three ways that look identical on screen:

| Sequence | Codepoints | Sounds like |
|----------|------------|-------------|
| ស ្ត ្រ ី | U+179F U+17D2 U+178F U+17D2 U+179A U+17B8 | s-t-r-ī (correct) |
| ស ្រ ្ត ី | U+179F U+17D2 U+179A U+17D2 U+178F U+17B8 | s-r-t-ī |
| ស ្រ ី ្ត | U+179F U+17D2 U+179A U+17B8 U+17D2 U+178F | s-r-ī-t |
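As far as a program is concerned, the three rows are three different strings, and standard Unicode normalization does not merge them. A short check with Python's standard `unicodedata` module (the variable names are illustrative):

```python
import unicodedata

# The three codepoint sequences from the table, in row order.
variants = [
    "\u179F\u17D2\u178F\u17D2\u179A\u17B8",
    "\u179F\u17D2\u179A\u17D2\u178F\u17B8",
    "\u179F\u17D2\u179A\u17B8\u17D2\u178F",
]

# Count distinct strings before and after NFC normalization.
distinct_raw = len(set(variants))
distinct_nfc = len({unicodedata.normalize("NFC", v) for v in variants})
print(distinct_raw, distinct_nfc)
```

Both counts are 3. NFC only reorders runs of adjacent combining marks by combining class, and each COENG (U+17D2) here is followed by a consonant that acts as a starter, so nothing moves. This is why a Khmer-specific normalizer is needed on top of the standard forms.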

This disorder has real consequences:

- **Search breaks** — Google returns completely different results for visually identical queries typed in different apps.

- **Security spoofing** — `ស្ត្រី.com`, `ស្រ្តី.com`, and `ស្រី្ត.com` look the same in a browser bar but route to different servers.

- **Code review is unreliable** — variable names that appear identical may differ in encoding, making malicious substitutions invisible.

- **Rendering artifacts** — some browsers show dotted-circle error markers for out-of-order marks that others silently accept.

`normalize()` collapses all equivalent forms into one canonical byte sequence, so search, comparison, storage, and security checks behave correctly regardless of which keyboard or app produced the text.
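One way to use this for comparison or deduplication is to key every string by its normalized form. The sketch below is illustrative: `group_by_canonical` is a hypothetical helper, and the normalizer is passed in as a parameter so the snippet does not assume any particular installation of the library.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_by_canonical(
    strings: Iterable[str], normalize: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group strings that share the same normalized form.

    `normalize` is any callable mapping a string to its canonical form,
    for example the `normalize` function copied from the Python port.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for s in strings:
        groups[normalize(s)].append(s)
    return dict(groups)


def looks_alike(a: str, b: str, normalize: Callable[[str], str]) -> bool:
    """True when two strings differ in storage but normalize to the same text."""
    return a != b and normalize(a) == normalize(b)
```

A group with more than one member marks strings that were stored differently but normalize to the same text, which is the case the spoofing and search examples above describe.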

Further reading: [Order and Disorder in Unicode](https://lontar.eu/en/notes/order-and-disorder-in-unicode/) · [Proposed Khmer encoding structure (Unicode L2/22-290)](https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf)

**Talk**: [S3T1 — Discrepancies in Khmer Unicode Character Ordering Rules and a Proposed Solution](https://www.youtube.com/watch?v=mD-nrfvWtgc) — the conference presentation behind the encoding proposal that this library implements.

## What it does

- Sorts character components within each Khmer syllable by Unicode category

- Canonicalizes compound vowel sequences (e.g. េ + ា → ោ)

- Applies consonant shifters (TRIISAP / MUUSIKATOAN) correctly

- Converts lunar date notation to dedicated Unicode symbols
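To give a feel for the first two steps, here is a deliberately simplified toy in Python. It is not the library's algorithm: the ranking (subscript consonants first, COENG RO after other subscripts, then dependent vowels, then everything else) is an assumption inferred from the examples in this README, and the names `toy_reorder`, `split_units`, and `rank` are hypothetical. It covers only the cases shown here.

```python
COENG = "\u17D2"  # KHMER SIGN COENG, introduces a subscript consonant
RO = "\u179A"     # KHMER LETTER RO


def is_consonant(ch: str) -> bool:
    return "\u1780" <= ch <= "\u17A2"


def split_units(text: str) -> list[str]:
    """Keep each COENG together with the consonant that follows it."""
    units, i = [], 0
    while i < len(text):
        if text[i] == COENG and i + 1 < len(text) and is_consonant(text[i + 1]):
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def is_mark_unit(unit: str) -> bool:
    """Subscript pairs, dependent vowels and signs attach to the preceding base."""
    return len(unit) == 2 or "\u17B6" <= unit <= "\u17D3" or unit == "\u17DD"


def rank(unit: str) -> int:
    if len(unit) == 2:                      # COENG + consonant
        return 2 if unit[1] == RO else 1    # assumed: COENG RO goes last
    if "\u17B6" <= unit <= "\u17C5":        # dependent vowel signs
        return 3
    return 4                                # any other sign


def toy_reorder(text: str) -> str:
    out, cluster = [], []
    for unit in split_units(text):
        if is_mark_unit(unit):
            cluster.append(unit)
        else:
            # A base character (or non-Khmer text) closes the previous cluster.
            out.extend(sorted(cluster, key=rank))  # sorted() is stable
            cluster = []
            out.append(unit)
    out.extend(sorted(cluster, key=rank))
    # Compound vowel from the list above: U+17C1 + U+17B6 becomes U+17C4.
    return "".join(out).replace("\u17C1\u17B6", "\u17C4")
```

With these assumptions, the misordered `U+1781 U+17C2 U+17D2 U+1798 U+179A` from the Example section comes out as `U+1781 U+17D2 U+1798 U+17C2 U+179A`, and all three spellings of ស្ត្រី collapse to the first row of the table. The real ports handle many more categories, including the consonant shifters and lunar date symbols that this toy ignores.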

## Fixtures

`fixtures/input.txt` and `fixtures/expected.txt` contain 10,085 test pairs sampled from real Khmer text. Regenerate with:

```sh
python3 scripts/gen_fixtures.py
```
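A port can be checked against the fixtures with a few lines of code. The sketch below assumes the two files are line-aligned (line N of `fixtures/expected.txt` is the expected result for line N of `fixtures/input.txt`); `find_mismatches` is a hypothetical helper, not a script shipped with the repository.

```python
from typing import Callable, Sequence


def find_mismatches(
    inputs: Sequence[str],
    expected: Sequence[str],
    normalize: Callable[[str], str],
) -> list[int]:
    """Return the 1-based line numbers where normalize(input) != expected."""
    if len(inputs) != len(expected):
        raise ValueError("input and expected fixtures must have the same length")
    return [
        line_no
        for line_no, (src, want) in enumerate(zip(inputs, expected), start=1)
        if normalize(src) != want
    ]


# Possible usage, assuming the repository layout described above:
#   from pathlib import Path
#   from betterkhmer import normalize
#   inputs = Path("fixtures/input.txt").read_text(encoding="utf-8").split("\n")
#   expected = Path("fixtures/expected.txt").read_text(encoding="utf-8").split("\n")
#   print(find_mismatches(inputs, expected, normalize))
```

Splitting on `"\n"` rather than calling `splitlines()` keeps other Unicode line separators inside a fixture line intact.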

## Benchmark

Throughput of the current implementation. One **op** is one `normalize()` call on one line. The corpus is all 10,085 lines of `fixtures/input.txt`, held in memory. Each run does 3 untimed warmup passes, then K timed full passes (timed region ≥ 5 s), and the reported figure is the best of two runs. **Only the normalize loop is timed** (file I/O, process start, and JIT/VM warmup are excluded). All builds are release/optimized.

| Language    |        ops/sec |
|-------------|---------------:|
| Java        | 85,888 |
| Kotlin      | 55,406 |
| Go          | 53,802 |
| C#          | 49,693 |
| C           | 49,120 |
| VB.NET      | 48,352 |
| Rust        | 47,181 |
| TypeScript  | 44,230 |
| C++         | 43,540 |
| Objective-C | 42,880 |
| Dart        | 36,599 |
| Swift       | 19,749 |
| Elixir      | 13,613 |
| PHP         |  7,214 |
| Zig         |  6,412 |
| Ruby        |  4,940 |
| Python      |  4,013 |
| Perl        |  3,847 |
| Lua         |  3,591 |

All ports produce identical normalized output (verified by the 10,085-line fixture suite). Absolute numbers are indicative only. They are not strictly comparable across languages because runtimes, GC, and per-call memory models differ: for example, C/C++/Objective-C and Zig allocate and free a result buffer on every call, which dominates Zig's figure, and Ruby/PHP/Perl still use the regex implementation.
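The measurement procedure described at the top of this section can be sketched in Python as follows. This is an approximation of the stated method, not the repository's benchmark harness; `ops_per_sec` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Sequence


def ops_per_sec(
    lines: Sequence[str],
    fn: Callable[[str], str],
    warmup_passes: int = 3,
    min_seconds: float = 5.0,
) -> float:
    """Time full passes over `lines` until at least `min_seconds` have elapsed."""
    if not lines or min_seconds <= 0:
        raise ValueError("need a non-empty corpus and a positive time budget")

    # Untimed warmup passes.
    for _ in range(warmup_passes):
        for line in lines:
            fn(line)

    calls = 0
    start = time.perf_counter()
    while True:
        for line in lines:  # one full pass; only this loop is timed
            fn(line)
        calls += len(lines)
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return calls / elapsed


def best_of(runs: int, lines: Sequence[str], fn: Callable[[str], str], **kw) -> float:
    """Report the best (highest) throughput over several runs."""
    return max(ops_per_sec(lines, fn, **kw) for _ in range(runs))
```

Calling `best_of(2, lines, normalize)` with the fixture lines already loaded in memory would mirror the "best of two runs" rule, with file reading kept outside the timed region.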

## License

MIT
