{"id":21835064,"url":"https://github.com/simdutf/simdunicode","last_synced_at":"2025-04-14T08:51:52.148Z","repository":{"id":245298739,"uuid":"647496052","full_name":"simdutf/SimdUnicode","owner":"simdutf","description":"Fast SIMD-based UTF-8 Validation in C#","archived":false,"fork":false,"pushed_at":"2025-02-08T17:04:42.000Z","size":1916,"stargazers_count":42,"open_issues_count":4,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-27T22:22:46.962Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simdutf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-30T23:01:56.000Z","updated_at":"2025-03-03T19:37:01.000Z","dependencies_parsed_at":"2024-06-21T07:57:07.037Z","dependency_job_id":"30e29c48-9167-4b9c-82fe-5f6c69bf47b3","html_url":"https://github.com/simdutf/SimdUnicode","commit_stats":null,"previous_names":["simdutf/simdunicode"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simdutf%2FSimdUnicode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simdutf%2FSimdUnicode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simdutf%2FSimdUnicode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simdutf%2FSimdUnicode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simdutf","download_url":"https://codeload.github.com/
simdutf/SimdUnicode/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248852084,"owners_count":21171837,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-27T20:17:17.089Z","updated_at":"2025-04-14T08:51:52.141Z","avatar_url":"https://github.com/simdutf.png","language":"C#","readme":"# SimdUnicode\n[![.NET](https://github.com/simdutf/SimdUnicode/actions/workflows/dotnet.yml/badge.svg)](https://github.com/simdutf/SimdUnicode/actions/workflows/dotnet.yml)\n\nThis is a fast C# library to validate UTF-8 strings.\n\n\n## Motivation\n\nWe seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function from the C# runtime library.\n[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually. The C# runtime \nfunction is well optimized and it makes use of advanced CPU instructions. 
Nevertheless, we propose\nan alternative that can be several times faster.\n\nSpecifically, we provide the function `SimdUnicode.UTF8.GetPointerToFirstInvalidByte`, which is a faster\ndrop-in replacement:\n```cs\n// Returns \u0026inputBuffer[inputLength] if the input buffer is valid.\n/// \u003csummary\u003e\n/// Given an input buffer \u003cparamref name=\"pInputBuffer\"/\u003e of byte length \u003cparamref name=\"inputLength\"/\u003e,\n/// returns a pointer to where the first invalid data appears in \u003cparamref name=\"pInputBuffer\"/\u003e.\n/// The parameter \u003cparamref name=\"Utf16CodeUnitCountAdjustment\"/\u003e is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 2-byte character, -2 for each 3-byte or 4-byte character.\n/// The parameter \u003cparamref name=\"ScalarCodeUnitCountAdjustment\"/\u003e is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 4-byte character.\n/// \u003c/summary\u003e\n/// \u003cremarks\u003e\n/// Returns a pointer to the end of \u003cparamref name=\"pInputBuffer\"/\u003e if the buffer is well-formed.\n/// \u003c/remarks\u003e\npublic unsafe static byte* GetPointerToFirstInvalidByte(byte* pInputBuffer, int inputLength, out int Utf16CodeUnitCountAdjustment, out int ScalarCodeUnitCountAdjustment);\n```\n\nThe function uses advanced instructions (SIMD) on 64-bit ARM and x64 processors, but falls back on a\nconventional implementation on other systems. We provide extensive tests and benchmarks.\n\nWe apply the algorithm used by Node.js, Bun, Oracle GraalVM, the PHP interpreter, and other important systems. 
The algorithm has been described in the following article:\n\n- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021\n\n\n## Requirements\n\nWe recommend you install .NET 8 or better: https://dotnet.microsoft.com/en-us/download/dotnet/8.0\n\n\n## Running tests\n\n```\ndotnet test\n```\n\nTo see which tests are running, we recommend setting the verbosity level:\n\n```\ndotnet test -v=normal\n```\n\nFor more detailed output:\n```\ndotnet test -v d\n```\n\nTo get a list of available tests, enter the command:\n\n```\ndotnet test --list-tests\n```\n\nTo run specific tests, it is helpful to use the filter parameter:\n\n\n```\ndotnet test --filter TooShortErrorAvx2\n```\n\nOr to target specific categories:\n\n```\ndotnet test --filter \"Category=scalar\"\n```\n\n## Running Benchmarks\n\nTo run the benchmarks, run the following command:\n```\ncd benchmark\ndotnet run -c Release\n```\n\nTo run just one benchmark, use a filter:\n\n```\ncd benchmark\ndotnet run --configuration Release --filter \"*Twitter*\"\ndotnet run --configuration Release --filter \"*Lipsum*\"\n```\n\nIf you are on macOS or Linux, you may want to run the benchmarks in privileged mode:\n\n```\ncd benchmark\nsudo dotnet run -c Release\n```\n\n\n## Results (x64)\n\nOn an Intel Ice Lake system, our validation function is up to 13 times\nfaster than the standard library.\nA realistic input is Twitter.json, which is mostly ASCII with some non-ASCII Unicode content,\nwhere we are 2.4 times faster.\n\n| data set        | SimdUnicode AVX-512 (GB/s) | .NET speed (GB/s) | speed up |\n|:----------------|:------------------------|:-------------------|:-------------------|\n| Twitter.json    | 29                      | 12                | 2.4 x |\n| Arabic-Lipsum   | 12                    | 2.3               | 5.2 x |\n| Chinese-Lipsum  | 12                    | 3.9               | 3.0 x |\n| Emoji-Lipsum    | 12   
                  | 0.9               | 13 x |\n| Hebrew-Lipsum   |12                    | 2.3               | 5.2 x |\n| Hindi-Lipsum    | 12                     | 2.1               | 5.7 x |\n| Japanese-Lipsum | 10                     | 3.5               | 2.9 x |\n| Korean-Lipsum   | 10                     | 1.3               | 7.7 x |\n| Latin-Lipsum    | 76                      | 76                | --- |\n| Russian-Lipsum  | 12                    | 1.2               | 10 x |\n\n\n\nOn x64 system, we offer several functions: a fallback function for legacy systems,\na SSE42 function for older CPUs, an AVX2 function for current x64 systems and\nan AVX-512 function for the most recent processors (AMD Zen 4 or better, Intel\nIce Lake, etc.).\n\n## Results (ARM)\n\nOn an Apple M2 system, our validation function is 1.5 to four times\nfaster than the standard library.\n\n| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  25        | 14                        | 1.8 x           |\n| Arabic-Lipsum   |  7.4       | 3.5                       | 2.1 x           |\n| Chinese-Lipsum  |  7.4       | 4.8                       | 1.5 x           |\n| Emoji-Lipsum    |  7.4       | 2.5                       | 3.0 x           |\n| Hebrew-Lipsum   |  7.4       | 3.5                       | 2.1 x           |\n| Hindi-Lipsum    |  7.3       | 3.0                       | 2.4 x           |\n| Japanese-Lipsum |  7.3       | 4.6                       | 1.6 x           |\n| Korean-Lipsum   |  7.4       | 1.8                       | 4.1 x           |\n| Latin-Lipsum    |  87        | 38                        | 2.3 x           |\n| Russian-Lipsum  |  7.4       | 2.7                       | 2.7 x           |\n\nOn a Graviton 3, our validation function is 1.2 to over five times\nfaster than the standard library.\n\n| data set      | SimdUnicode speed (GB/s) | .NET 
speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  19        | 11                        | 1.7 x           |\n| Arabic-Lipsum   |  5.2       | 2.7                       | 1.9 x           |\n| Chinese-Lipsum  |  5.2        | 4.5                       | 1.2 x           |\n| Emoji-Lipsum    |  5.2        | 0.9                       | 5.8 x           |\n| Hebrew-Lipsum   |  5.2        | 2.7                       | 1.9 x           |\n| Hindi-Lipsum    |  5.2        | 2.4                       | 2.2 x           |\n| Japanese-Lipsum | 5.2        |3.9                       | 1.3 x           |\n| Korean-Lipsum   |  5.2        | 1.5                       | 3.5 x           |\n| Latin-Lipsum    |  57        | 26                        | 2.2 x           |\n| Russian-Lipsum  |  5.2        | 2.8                       | 1.9 x           |\n\nOn a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times\nfaster than the standard library.\n\n| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  14        | 8.7                        | 1.4 x           |\n| Arabic-Lipsum   |  4.2       | 2.0                       | 2.1 x           |\n| Chinese-Lipsum  |  4.2        | 2.6                       | 1.6 x           |\n| Emoji-Lipsum    |  4.2        | 0.8                       | 5.3 x           |\n| Hebrew-Lipsum   |  4.2        | 2.0                       | 2.1 x           |\n| Hindi-Lipsum    |  4.2        | 1.6                       | 2.6 x           |\n| Japanese-Lipsum |  4.2        | 2.4                       | 1.8 x           |\n| Korean-Lipsum   |  4.2        | 1.3                       | 3.2 x           |\n| Latin-Lipsum    |  42        | 17                        | 2.5 x           |\n| Russian-Lipsum  |  4.2        | 0.95                       | 4.4 x       
    |\n\n\nOn a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance\nboost as the Neoverse V1.\n\n| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  17        | 10                        | 1.7 x           |\n| Arabic-Lipsum   |  5.0       | 2.3                       | 2.2 x           |\n| Chinese-Lipsum  |  5.0       | 2.9                       | 1.7 x           |\n| Emoji-Lipsum    |  5.0       | 0.9                       | 5.5 x           |\n| Hebrew-Lipsum   |  5.0       | 2.3                       | 2.2 x           |\n| Hindi-Lipsum    |  5.0       | 1.9                       | 2.6 x           |\n| Japanese-Lipsum |  5.0       | 2.7                       | 1.9 x           |\n| Korean-Lipsum   |  5.0       | 1.5                       | 3.3 x           |\n| Latin-Lipsum    |  50        | 20                       | 2.5 x           |\n| Russian-Lipsum  |  5.0       | 1.2                       | 5.2 x           |\n\n\nOn a Neoverse N1 (Graviton 2), our validation function is 1.3 to over four times\nfaster than the standard library.\n\n| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  12        | 8.7                        | 1.4 x           |\n| Arabic-Lipsum   |  3.4       | 2.0                       | 1.7 x           |\n| Chinese-Lipsum  |  3.4       | 2.6                       | 1.3 x           |\n| Emoji-Lipsum    |  3.4       | 0.8                       | 4.3 x           |\n| Hebrew-Lipsum   |  3.4       | 2.0                       | 1.7 x           |\n| Hindi-Lipsum    |  3.4       | 1.6                       | 2.1 x           |\n| Japanese-Lipsum |  3.4       | 2.4                       | 1.4 x           |\n| Korean-Lipsum   |  3.4       | 1.3                       | 
2.6 x           |\n| Latin-Lipsum    |  42        | 17                        | 2.5 x           |\n| Russian-Lipsum  |  3.3       | 0.95                       | 3.5 x           |\n\nOn a Neoverse N1 (Graviton 2), our validation function is up to over three times\nfaster than the standard library.\n\n\n| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |\n|:----------------|:-----------|:--------------------------|:-------------------|\n| Twitter.json    |  7.8        | 5.7                        | 1.4 x           |\n| Arabic-Lipsum   |  2.5       | 0.9                       | 2.8 x           |\n| Chinese-Lipsum  |  2.5       | 1.8                       | 1.4 x           |\n| Emoji-Lipsum    |  2.5       | 0.7                       | 3.6 x           |\n| Hebrew-Lipsum   |  2.5       | 0.9                       | 2.7 x           |\n| Hindi-Lipsum    |  2.3       | 1.0                       | 2.3 x           |\n| Japanese-Lipsum |  2.4       | 1.7                       | 1.4 x           |\n| Korean-Lipsum   |  2.5       | 1.0                       | 2.5 x           |\n| Latin-Lipsum    |  23        | 13                        | 1.8 x           |\n| Russian-Lipsum  |  2.3      | 0.7                       | 3.3 x           |\n\n\n## Building the library\n\n```\ncd src\ndotnet build\n```\n\n## Code format\n\nWe recommend you use `dotnet format`. 
E.g.,\n\n```\ndotnet format\n```\n\n## Programming tips\n\nYou can print the content of a vector register like so:\n\n```C#\n        public static void ToString(Vector256\u003cbyte\u003e v)\n        {\n            Span\u003cbyte\u003e b = stackalloc byte[32];\n            v.CopyTo(b);\n            Console.WriteLine(Convert.ToHexString(b));\n        }\n        public static void ToString(Vector128\u003cbyte\u003e v)\n        {\n            Span\u003cbyte\u003e b = stackalloc byte[16];\n            v.CopyTo(b);\n            Console.WriteLine(Convert.ToHexString(b));\n        }\n```\n\n## Performance tips\n\n- Be careful: `Vector128.Shuffle` is not the same as `Ssse3.Shuffle` nor is  `Vector256.Shuffle` the same as `Avx2.Shuffle`. Prefer the latter.\n- Similarly `Vector128.Shuffle` is not the same as `AdvSimd.Arm64.VectorTableLookup`, use the latter.\n- `stackalloc` arrays should probably not be used in class instances.\n- In C#, `struct` might be preferable to `class` instances as it makes it clear that the data is thread local.\n- You can ask for an asm dump: `DOTNET_JitDisasm=NEON64HTMLScan dotnet run -c Release`. 
See [Viewing JIT disassembly and dumps](https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/viewing-jit-dumps.md).\n- You can get profiling data: `dotnet run -c Release -- -p EP`.\n\n## More reading \n\n- [Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding](https://github.com/dotnet/coreclr/pull/21948/files#diff-2a22774bd6bff8e217ecbb3a41afad033ce0ca0f33645e9d8f5bdf7c9e3ac248)\n- https://github.com/dotnet/runtime/issues/41699\n- https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/\n- https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/coding-conventions\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimdutf%2Fsimdunicode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimdutf%2Fsimdunicode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimdutf%2Fsimdunicode/lists"}