{"id":21281741,"url":"https://github.com/prophetlamb/homoglyph","last_synced_at":"2025-08-26T06:36:13.150Z","repository":{"id":65610455,"uuid":"595571469","full_name":"ProphetLamb/Homoglyph","owner":"ProphetLamb","description":"Determine homoglyphs for utf-32 codepoints","archived":false,"fork":false,"pushed_at":"2023-01-31T20:53:57.000Z","size":150,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-22T04:27:23.192Z","etag":null,"topics":["csharp","hashmap","homoglyph","homoglyphs","library","orthography"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ProphetLamb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-31T11:08:32.000Z","updated_at":"2024-05-26T02:44:56.000Z","dependencies_parsed_at":"2023-02-16T21:46:06.168Z","dependency_job_id":null,"html_url":"https://github.com/ProphetLamb/Homoglyph","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProphetLamb%2FHomoglyph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProphetLamb%2FHomoglyph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProphetLamb%2FHomoglyph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProphetLamb%2FHomoglyph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ProphetLamb","download
_url":"https://codeload.github.com/ProphetLamb/Homoglyph/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243740094,"owners_count":20340203,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csharp","hashmap","homoglyph","homoglyphs","library","orthography"],"created_at":"2024-11-21T10:50:28.083Z","updated_at":"2025-03-15T14:22:01.399Z","avatar_url":"https://github.com/ProphetLamb.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Homoglyph\n\n\u003e In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar.\n[Wikipedia](https://en.wikipedia.org/wiki/Homoglyph)\n\n## Usage\n\nAllows the user to retrieve all homoglyphs for a specific utf-32 code-point as a `ReadOnlySpan\u003cuint\u003e`.\n\n```csharp\nusing Homoglyph;\n\nReadOnlySpan\u003cuint\u003e codepoints = Homoglyphs.GetHomoglyphs('\\u0020'); // \u0026nbsp;\nConsole.WriteLine(codepoints.Contains(' ') ? 
\"Yay\" : \"Nope\");\n// Output: Yay\n```\n\n## [API documentation](./doc/Homoglyph/index.md)\n\n## Continuous Integration\n\n| Build                                                                                                                                                                                           | Test                                                                                                                                                                    | Coverage                                                                                                                                                                                |\n| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| \u003csup\u003eAppveyor\u003c/sup\u003e [![Build status](https://ci.appveyor.com/api/projects/status/8xi6uuuur1y5qup8/branch/master?svg=true)](https://ci.appveyor.com/project/ProphetLamb/homoglyph/branch/master) | \u003csup\u003eAppveyor\u003c/sup\u003e [![AppVeyor tests](https://img.shields.io/appveyor/tests/ProphetLamb/homoglyph)](https://ci.appveyor.com/project/ProphetLamb/homoglyph/build/tests) | \u003csup\u003eCoveralls\u003c/sup\u003e [![Coverage Status](https://coveralls.io/repos/github/ProphetLamb/Homoglyph/badge.svg?branch=HEAD)](https://coveralls.io/github/ProphetLamb/Homoglyph?branch=HEAD) |\n| ![Build history](https://buildstats.info/appveyor/chart/ProphetLamb/Homoglyph/?branch=master)                                                            
                                       |\n\n\n## Behind the scenes\n\nThe library embeds a homoglyph hash table into the .text section of its DLL. It relies on a C# compiler feature for byte-sized data: if a property exposes an array of `byte` or `sbyte` constants as a `ReadOnlySpan\u003cbyte\u003e` or `ReadOnlySpan\u003csbyte\u003e`, the array is not allocated on the heap a second time, but loaded directly from the DLL .text section.\n\nThe [following IL](https://sharplab.io/#v2:C4LglgNgNAJiDUAfAAgJgAwFgBQaCMOOyeAnABQDCAdAAoDa6AugJQDcReAbAARrcXcA3jm6jeAZl5duAJQCmAQxgB5AHYQAngGUADgtUAeAEYbgcgHzca3ALyXVcgO7cTZuoyHd0ADwxRuAL7s2AFAA) is generated for such a property getter.\n```csharp\nusing System;\n\nConsole.WriteLine(C.P[0]);\nstatic class C {\n    public static ReadOnlySpan\u003cbyte\u003e P =\u003e new byte[] { 0x20, };\n}\n```\n\n```asm\nIL_0000: ldsflda uint8 '\u003cPrivateImplementationDetails\u003e'::'36A9E7F1C95B82FFB99743E0C5C4CE95D83C9A430AAC59F84EF3CBFAB6145068'\nIL_0005: ldc.i4.1\nIL_0006: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1\u003cuint8\u003e::.ctor(void*, int32)\nIL_000b: ret\n```\n\nIn this case, no array allocation is performed. If we were to replace `byte` with `nuint` or most other primitives, a separate array would be allocated and filled inline.\n\n```asm\nIL_0000: ldc.i4.1\nIL_0001: newarr [System.Runtime]System.UIntPtr\nIL_0006: dup\nIL_0007: ldc.i4.0\nIL_0008: ldc.i4.s 32\nIL_000a: conv.i\nIL_000b: stelem.i\nIL_000c: call valuetype [System.Runtime]System.ReadOnlySpan`1\u003c!0\u003e valuetype [System.Runtime]System.ReadOnlySpan`1\u003cnative uint\u003e::op_Implicit(!0[])\nIL_0011: ret\n```\n\n### Hash map\n\nThe natural shape of homoglyph groupings would be an array of dynamically sized arrays of codepoints.\n```csharp\nuint[][] homoglyphs = new uint[][] {\n\tnew uint[] { 0x20,0xa0,0x1680,0x2000,0x2001,0x2002,0x2003,0x2004,0x2005,0x2006,0x2007,0x2008,0x2009,0x200a,0x2028,0x2029,0x202f,0x205f },\n\tnew uint[] { ... 
},\n};\n```\nLookup of any given group is `O(n)`, where `n` is the total number of codepoints over all groups. Additionally, this shape requires many separate heap allocations.\n\nThe first step is to flatten this array into a 1-dimensional structure by concatenating all groupings.\nWe note down the start index and length of each group in a separate array.\n\n```csharp\nuint[] homoglyph_flat = new uint[] { 0x20,0xa0,0x1680,0x2000,0x2001,0x2002,0x2003,0x2004,0x2005,0x2006,0x2007,0x2008,0x2009,0x200a,0x2028,0x2029,0x202f,0x205f, ... };\n(ushort Index, byte Length)[] homoglyph_indexer = new[] { ((ushort)0, (byte)18), ...};\n```\n\nThis memory optimization brings us down to two allocations, but lookup remains inefficient.\n\nTo optimize the lookup, we build a hash table of $8419$ slots, inserting each codepoint at an index determined by a hash function. The same index is then used to store the index and length of the codepoint's group in `homoglyph_flat`. The resulting hash table has a fill of $6184 / 8419 \\approx 73\\%$.\n\nOn this table, a hash lookup using the `codepoint` as its own hash value yields the index and length inside the `homoglyph_flat` array. The slice of the array then represents the entire grouping.\n\n```csharp\nuint[] blockValues = new uint[] { 0x20,0xa0, ... };\nuint[] hashValues = new uint[] { 0, 0, 0, ... };\nushort[] hashIndices = new ushort[] { 0, 0, 0, ... };\nbyte[] hashLengths = new byte[] { 0, 0, 0, ... 
};\n\npublic static ReadOnlySpan\u003cuint\u003e GetHomoglyphs(uint codepoint) {\n\tconst int HashCapacity = 8419;\n\t// The codepoint is its own hash value; reduce it to a slot index.\n\tint prediction = (int)(codepoint % HashCapacity);\n\t// Linear probing: visit each slot at most once.\n\tfor (int probes = 0; probes \u003c HashCapacity; probes++) {\n\t\tif (hashValues[prediction] == codepoint) {\n\t\t\tushort index = hashIndices[prediction];\n\t\t\tbyte length = hashLengths[prediction];\n\t\t\t// Slice the flat array without copying.\n\t\t\treturn blockValues.AsSpan(index, length);\n\t\t}\n\t\tprediction = (prediction + 1) % HashCapacity;\n\t}\n\n\t// No group contains this codepoint.\n\treturn default;\n}\n```\n\nFurther optimization involves embedding the arrays into the DLL .text section, optimizing the probing algorithm, optimizing the modulo calculation for 64-bit platforms, and avoiding bounds checks on array access while probing where possible.\n\nThe result of this optimization can be found in [Homoglyph.cs](./src/Homoglyph.cs).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprophetlamb%2Fhomoglyph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprophetlamb%2Fhomoglyph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprophetlamb%2Fhomoglyph/lists"}