{"id":50082289,"url":"https://github.com/seanghay/betterkhmer","last_synced_at":"2026-06-02T00:30:44.966Z","repository":{"id":358173189,"uuid":"1239729011","full_name":"seanghay/betterkhmer","owner":"seanghay","description":"Regex-free, fast Khmer Encoding normalizer ported to 18 languages","archived":false,"fork":false,"pushed_at":"2026-05-16T03:59:38.000Z","size":1234,"stargazers_count":15,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T12:46:01.005Z","etag":null,"topics":["c","cpp","csharp","dart","flutter","go","java","khmer","khmer-normalize","khmer-normalizer","kotlin","perl","php","python","ruby","rust","zig"],"latest_commit_sha":null,"homepage":"","language":"Objective-C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seanghay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-15T11:33:55.000Z","updated_at":"2026-05-18T09:51:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/seanghay/betterkhmer","commit_stats":null,"previous_names":["seanghay/betterkhmer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/seanghay/betterkhmer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fbetterkhmer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fbetterkhmer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fbetterkhmer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fbetterkhmer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seanghay","download_url":"https://codeload.github.com/seanghay/betterkhmer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seanghay%2Fbetterkhmer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33800675,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","cpp","csharp","dart","flutter","go","java","khmer","khmer-normalize","khmer-normalizer","kotlin","perl","php","python","ruby","rust","zig"],"created_at":"2026-05-22T16:00:40.439Z","updated_at":"2026-06-02T00:30:44.956Z","avatar_url":"https://github.com/seanghay.png","language":"Objective-C","funding_links":[],"categories":["Awesome Khmer Language"],"sub_categories":["2. Toolkit"],"readme":"# BetterKhmer\n\nKhmer Unicode normalizer ported to 18 languages. All implementations expose a single `normalize()` function and pass the same 10,085-line fixture suite.\n\nNormalizes Khmer text according to the proposed normal encoding structure at https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf. It does not attempt to identify faulty text — it ensures two strings that would render the same are output as the same string.\n\nBased on the original [khmer-normalizer](https://github.com/seanghay/khmer-normalizer) by [SIL Global](https://software.sil.org/), MIT license.\n\n## Example\n\nខែ្មរ is corrected to ខ្មែរ:\n\n- Input: ខ `U+1781` ែ `U+17C2` ្ `U+17D2` ម `U+1798` រ `U+179A`\n- Output: ខ `U+1781` ្ `U+17D2` ម `U+1798` ែ `U+17C2` រ `U+179A`\n\n## Languages\n\n**This is not published to any package registry.** Each port is one\nself-contained source file — copy it straight into your project.\n\n| Language    | Source file (copy into your project) |\n|-------------|--------------------------------------|\n| Python      | `python/betterkhmer/src/betterkhmer/__init__.py` |\n| Go          | `go/betterkhmer/betterkhmer.go` |\n| Rust        | `rust/betterkhmer/src/lib.rs` |\n| Swift       | `swift/betterkhmer/Sources/BetterKhmer/BetterKhmer.swift` |\n| Dart        | `dart/betterkhmer/lib/betterkhmer.dart` |\n| Ruby        | `ruby/betterkhmer/lib/betterkhmer.rb` |\n| PHP         | `php/betterkhmer/src/BetterKhmer.php` |\n| Java        | `java/betterkhmer/src/main/java/com/betterkhmer/BetterKhmer.java` |\n| Kotlin      | `kotlin/betterkhmer/src/main/kotlin/com/betterkhmer/BetterKhmer.kt` |\n| C#          | `csharp/betterkhmer/src/BetterKhmer.cs` |\n| C           | `c/betterkhmer/src/betterkhmer.c` (+ `.h`) |\n| C++         | `cpp/betterkhmer/src/betterkhmer.cpp` (+ `.hpp`) |\n| TypeScript  | `typescript/betterkhmer/src/index.ts` |\n| Zig         | `zig/betterkhmer/src/betterkhmer.zig` |\n| Perl        | `perl/betterkhmer/lib/BetterKhmer.pm` |\n| Elixir      | `elixir/betterkhmer/lib/betterkhmer.ex` |\n| VB.NET      | `vbnet/betterkhmer/src/BetterKhmer.vb` |\n| Objective-C | `objc/betterkhmer/src/BetterKhmer.m` (+ `.h`) |\n| Lua         | `lua/betterkhmer/betterkhmer.lua` |\n\n## API\n\nEach language exposes one function: **`normalize(input, lang=\"km\")`**.\n\n- `lang = \"km\"` — Modern Khmer (default)\n- `lang = \"xhm\"` — Middle Khmer\n\n```python\n# Python\nfrom betterkhmer import normalize\nresult = normalize(\"ខ្មែរ\")\n```\n\n```go\n// Go\nresult := betterkhmer.Normalize(\"ខ្មែរ\")\n```\n\n```typescript\n// TypeScript / JavaScript\nimport { normalize } from 'betterkhmer';\nconst result = normalize('ខ្មែរ');\n```\n\nSee the per-language `README.md` in each subdirectory for usage and test details.\n\n## Why this exists\n\nKhmer syllables are two-dimensional arrangements of marks surrounding a base consonant. Unicode does not mandate a single encoding order for these marks, so the same rendered word can be stored as multiple distinct byte sequences.\n\nThe word ស្ត្រី (\"woman\") can be encoded at least three ways that look identical on screen:\n\n| Sequence | Codepoints | Sounds like |\n|----------|------------|-------------|\n| ស ្ត ្រ ី | U+179F U+17D2 U+178F U+17D2 U+179A U+17B8 | s-t-r-ī (correct) |\n| ស ្រ ្ត ី | U+179F U+17D2 U+179A U+17D2 U+178F U+17B8 | s-r-t-ī |\n| ស ្រ ី ្ត | U+179F U+17D2 U+179A U+17B8 U+17D2 U+178F | s-r-ī-t |\n\nThis disorder has real consequences:\n\n- **Search breaks** — Google returns completely different results for visually identical queries typed in different apps.\n- **Security spoofing** — `ស្ត្រី.com`, `ស្រ្តី.com`, and `ស្រី្ត.com` look the same in a browser bar but route to different servers.\n- **Code review is unreliable** — variable names that appear identical may differ in encoding, making malicious substitutions invisible.\n- **Rendering artifacts** — some browsers show dotted-circle error markers for out-of-order marks that others silently accept.\n\n`normalize()` collapses all equivalent forms into one canonical byte sequence, so search, comparison, storage, and security checks behave correctly regardless of which keyboard or app produced the text.\n\nFurther reading: [Order and Disorder in Unicode](https://lontar.eu/en/notes/order-and-disorder-in-unicode/) · [Proposed Khmer encoding structure (Unicode L2/22-290)](https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf)\n\n**Talk**: [S3T1 — Discrepancies in Khmer Unicode Character Ordering Rules and a Proposed Solution](https://www.youtube.com/watch?v=mD-nrfvWtgc) — the conference presentation behind the encoding proposal that this library implements.\n\n## What it does\n\n- Sorts character components within each Khmer syllable by Unicode category\n- Canonicalizes compound vowel sequences (e.g. េ + ា → ោ)\n- Applies consonant shifters (TRIISAP / MUUSIKATOAN) correctly\n- Converts lunar date notation to dedicated Unicode symbols\n\n## Fixtures\n\n`fixtures/input.txt` and `fixtures/expected.txt` contain 10,085 test pairs sampled from real Khmer text. Regenerate with:\n\n```sh\npython3 scripts/gen_fixtures.py\n```\n\n## Benchmark\n\nThroughput of the current implementation. One **op** = one `normalize()`\ncall on one line. Corpus: all 10,085 `fixtures/input.txt` lines held in\nmemory; 3 untimed warmup passes, then K timed full passes (timed region\n≥ 5 s), best of two runs; **only the normalize loop is timed** (file IO,\nprocess start and JIT/VM warmup excluded); release/optimized builds.\n\n| Language    |        ops/sec |\n|-------------|---------------:|\n| Java        | 85,888 |\n| Kotlin      | 55,406 |\n| Go          | 53,802 |\n| C#          | 49,693 |\n| C           | 49,120 |\n| VB.NET      | 48,352 |\n| Rust        | 47,181 |\n| TypeScript  | 44,230 |\n| C++         | 43,540 |\n| Objective-C | 42,880 |\n| Dart        | 36,599 |\n| Swift       | 19,749 |\n| Elixir      | 13,613 |\n| PHP         |  7,214 |\n| Zig         |  6,412 |\n| Ruby        |  4,940 |\n| Python      |  4,013 |\n| Perl        |  3,847 |\n| Lua         |  3,591 |\n\nAll ports produce identical normalized output (verified by the 10,085-line\nfixture suite). Absolute numbers are indicative only — they are not strictly\ncomparable across languages because runtimes, GC and per-call memory models\ndiffer (e.g. C/C++/Objective-C and Zig allocate and free a result buffer on\nevery call, which dominates Zig's figure; Ruby/PHP/Perl still use the regex\nimplementation).\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanghay%2Fbetterkhmer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseanghay%2Fbetterkhmer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanghay%2Fbetterkhmer/lists"}