{"id":19938438,"url":"https://github.com/hunyadi/simdparse","last_synced_at":"2025-07-05T00:04:53.581Z","repository":{"id":224337061,"uuid":"762719475","full_name":"hunyadi/simdparse","owner":"hunyadi","description":"High-speed parser with vector instructions","archived":false,"fork":false,"pushed_at":"2024-11-26T23:39:29.000Z","size":119,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-01T12:48:56.537Z","etag":null,"topics":["avx2-instructions","datetime-parser","parser-library","simd-instructions","uuid-parser"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hunyadi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-24T14:00:41.000Z","updated_at":"2024-11-26T23:39:33.000Z","dependencies_parsed_at":"2024-02-25T12:27:58.415Z","dependency_job_id":"e8de9059-2bdb-4298-a725-bc683d60735e","html_url":"https://github.com/hunyadi/simdparse","commit_stats":null,"previous_names":["hunyadi/simdparse"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hunyadi/simdparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Fsimdparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Fsimdparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Fsimdparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Fsimdparse/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hunyadi","download_url":"https://codeload.github.com/hunyadi/simdparse/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hunyadi%2Fsimdparse/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263636790,"owners_count":23492304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avx2-instructions","datetime-parser","parser-library","simd-instructions","uuid-parser"],"created_at":"2024-11-12T23:40:09.734Z","updated_at":"2025-07-05T00:04:53.563Z","avatar_url":"https://github.com/hunyadi.png","language":"C++","readme":"# simdparse: High-speed parser with vector instructions\n\nThis header-only C++ library parses character strings into objects with efficient storage, including\n\n* long integers,\n* date and time objects, consisting of year, month, day, hour, minute, second and fractional parts (millisecond, microsecond and nanosecond),\n* IPv4 and IPv6 addresses,\n* UUIDs,\n* Base64-encoded strings.\n\nParsing employs a single instruction multiple data (SIMD) approach, operating on parts of the input string in parallel.\n\n## Technical background\n\nInternally, the implementation uses AVX2 vector instructions (intrinsics) to parse\n\n* strings of decimal and hexadecimal digits into a C++ `unsigned long long`,\n* RFC 3339 date-time strings into C++ `datetime` objects (consisting of year, month, day, hour, minute, second and fractional part),\n* RFC 4122 UUID strings or 32-digit hexadecimal strings into C++ `uuid` objects (stored 
internally as a 16-byte array), and\n* RFC 4648 Base64 strings encoded with a safe alphabet for URLs and file names into objects of type `vector\u003cbyte\u003e`.\n\nFor parsing IPv4 and IPv6 addresses, the parser calls the C function [inet_pton](https://man7.org/linux/man-pages/man3/inet_pton.3.html) in libc or Windows Sockets (WinSock2).\n\n## Usage\n\nParse an RFC 3339 date-time string into a date-time object:\n\n```cpp\n#include \u003csimdparse/datetime.hpp\u003e\n#include \u003csimdparse/parse.hpp\u003e\n// ...\n\nusing namespace simdparse;\n\nstd::string_view str = \"1984-10-24 23:59:59.123Z\";\ndatetime obj;\nif (parse(obj, str)) {\n   // success\n} else {\n   // handle error\n}\n```\n\nParse a string into an object, triggering an exception on failure:\n\n```cpp\ntry {\n   auto obj = parse\u003cdatetime\u003e(str);\n} catch (parse_error\u0026) {\n   // handle error\n}\n```\n\n## Compiling\n\nThis is a header-only library. C++17 or later is required.\n\nYou should enable the AVX2 instruction set to make full use of the library capabilities:\n\n* `-mavx2` with Clang and GCC\n* `/arch:AVX2` with MSVC\n\nThe code checks whether the macro `__AVX2__` is defined.\n\n## Supported formats\n\n### Integers\n\nStrings of decimal digits (without sign) or hexadecimal digits that fit into a C++ `unsigned long long` (when parsed).\n\n### Date-time format\n\nDate-time strings with `UTC` designator:\n\n```\nYYYY-MM-DDThh:mm:ss UTC\n```\n\nDate-time strings with the UTC suffix `Z` (*Zulu*):\n\n```\nYYYY-MM-DDThh:mm:ssZ\nYYYY-MM-DDThh:mm:ss.fffZ\nYYYY-MM-DDThh:mm:ss.ffffffZ\nYYYY-MM-DDThh:mm:ss.fffffffffZ\n```\n\nDate-time strings with time zone offset:\n\n```\nYYYY-MM-DDThh:mm:ss+hh:mm\nYYYY-MM-DDThh:mm:ss.fff+hh:mm\nYYYY-MM-DDThh:mm:ss.ffffff+hh:mm\nYYYY-MM-DDThh:mm:ss.fffffffff+hh:mm\n```\n\nNaive date-time strings without time zone designator:\n\n```\nYYYY-MM-DDThh:mm:ss\nYYYY-MM-DDThh:mm:ss.fff\nYYYY-MM-DDThh:mm:ss.ffffff\nYYYY-MM-DDThh:mm:ss.fffffffff\n```\n\nThe character 
`T` may be substituted with a space character.\n\nFractional digits usually give millisecond (3-digit), microsecond (6-digit) or nanosecond (9-digit) precision. However, any number of fractional digits between 0 and 9 is supported. The fractional part separator `.` must be omitted when no fractional digits are present.\n\n### Date format\n\n```\nYYYY-MM-DD\n```\n\n### Time format\n\n```\nhh:mm:ssZ\nhh:mm:ss.fffZ\nhh:mm:ss.ffffffZ\nhh:mm:ss.fffffffffZ\n```\n\n### UUID format\n\n```\n{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}\nxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n```\n\n## Implementation\n\n### Decimal strings\n\nTo parse integers, we first copy the string of digits, right-aligned, into an internal buffer of 16 bytes pre-populated with `0` digits. The following ASCII character data is stored:\n\n```\n'0' ... '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'\n```\n\nThe buffer is then read into a `__m128i` register.\n\nNext, each character is checked against a lower bound of `'0'` and an upper bound of `'9'`. 
If any character is outside the bounds, parsing fails.\n\nThen, each character is converted into its numeric 8-bit equivalent:\n\n```\n 0  0  0  0  0  0  0  1  2  3  4  5  6  7  8  9\n```\n\nWith the help of a weighting vector, we multiply each odd position by 10 (and leave each even position as-is), and add members of consecutive 8-bit integer pairs to produce 16-bit integers:\n\n```\n 0  0  0  0  0  0  0  1  2  3  4  5  6  7  8  9\n10  1 10  1 10  1 10  1 10  1 10  1 10  1 10  1\n-----------------------------------------------\n    0     0     0     1    23    45    67    89\n```\n\nNext, we repeat the procedure, merging 16-bit integers into 32-bit integers:\n\n```\n    0     0     0     1    23    45    67    89\n  100     1   100     1   100     1   100     1\n-----------------------------------------------\n          0           1        2345        6789\n```\n\nUnfortunately, there are no AVX2 instructions to multiply and horizontally add 32-bit integers. As a workaround, we pack 32-bit integers into 16-bit slots with saturation. Since all integers are within the 16-bit range, there is no data loss. Finally, we repeat the previous step but with different scale factors, merging 16-bit integers into 32-bit integers:\n\n```\n    0     1  2345  6789     0     0     0     0\n10000     1 10000     1     0     0     0     0\n-----------------------------------------------\n          1    23456789           0           0\n```\n\nLastly, we extract the two 32-bit integers and combine them into an unsigned long integer, scaling each component by the weight appropriate to its ordinal position. 
In our example, we obtain the number `123456789`.\n\n### Date-time strings\n\nParsing date-time strings starts by copying the string into an internal buffer of 32 bytes (the character `_` indicates an unspecified blank value):\n\n```\nYYYY-MM-DD hh:mm:ss.fffffffff___\n1984-10-24 23:59:59.123456789___\n```\n\nIf there are fewer than 9 fractional digits, the extra places are filled with `'0'`.\n\nThe buffer is then read into a `__m256i` AVX2 register.\n\nNext, each character is checked against a lower bound and an upper bound, which depend on the character position:\n\n* `'0'` or `'1'` for first digit of month,\n* `'0'` to `'3'` for first digit of day,\n* `'0'` to `'2'` for first digit of hour,\n* `'0'` to `'5'` for first digit of minute and second,\n* `'0'` to `'9'` for other digits (e.g. year, second digit of hour, or fractional part),\n* exact match for separator characters `'-'`, `':'` and `'.'`\n\nIf any character is outside the bounds, parsing fails.\n\nIf all constraints match, the numeric value of the ASCII character `'0'` is subtracted from each byte. (The same is accomplished with a bitwise *AND* on each byte against `0x0f`.) This makes each position hold the numeric value the digit corresponds to.\n\nNext, digits are shuffled to pack parts together:\n\n```\nYYYY-MM-DD hh:mm | :ss.fffffffff___\nYYYYMMDDhhmm____ | ss_fff_fff_fff__\n```\n\nThe character `_` indicates an unspecified blank value, and `|` indicates a lane boundary across which no shuffling is possible.\n\nThen, each byte in the register is multiplied by a weight, and neighboring values are added to form 16-bit integers. Take the date-time string `1984-10-24 23:59:59.123456789` as an example:\n\n```\n1 9 8 4 1 0 2 4 2 3 5 9 _ _ _ _ | 5 9 _ 1 2 3 _ 4 5 6 _ 7 8 9 _ _\n```\n\nLet's consider the first lane. 
Here, bytes are weighted by 10 and 1:\n\n```\n   1   9   8   4   1   0   2   4   2   3   5   9   _   _   _   _\n  10   1  10   1  10   1  10   1  10   1  10   1   0   0   0   0\n```\n\nWhen multiplied and added, it yields:\n\n```\n   0  19   0  84   0  10   0  24   0  23   0  59   0   0   0   0\n```\n\nWe see that the month, day, hour and minute values can be read directly from the register, and the year value can be obtained by combining two values, with the first scaled by 100 (`19 * 100 + 84 = 1984`).\n\nLet's consider the second lane. Here, bytes are weighted by 100, 10 and 1:\n\n```\n   5   9   _   1   2   3   _   4   5   6   _   7   8   9   _   _\n  10   1   0 100  10   1   0 100  10   1   0 100  10   1   0   0\n```\n\nWhen multiplied and added, it produces the following output:\n\n```\n   0  59   0 100   0  23   0 400   0  56   0 700   0  89   0   0\n```\n\nWe see that the number of seconds can be read from the register, and the millisecond, microsecond and nanosecond parts can each be obtained by adding two numbers (e.g. `100 + 23 = 123`).\n\n### Hexadecimal strings\n\nThe first step in parsing a string of hexadecimal digits is converting each digit from its character representation into its numerical value. Consider the following example with characters right-aligned in a buffer of 16 digits:\n\n```\n'0' '1' '2' '3' '4' '5' '6' '7' '8' '9' 'a' 'b' 'c' 'd' 'e' 'f'\n30  31  32  33  34  35  36  37  38  39  61  62  63  64  65  66\n```\n\nWe create three masks by comparing each digit to a range of permitted values. One mask filters decimal digits `'0'...'9'`, another filters uppercase letters `'A'...'F'` and yet another filters lowercase letters `'a'...'f'`.\n\nWe set a minimum value for the smallest element in each group: 48 (`0x30` = `'0'`) for decimal digits, 65 (`0x41` = `'A'`) for uppercase letters, and 97 (`0x61` = `'a'`) for lowercase letters. 
We subtract the corresponding minimum value from each character as matched by the mask. Letters must additionally map to the range 10...15, so `'A'` and `'a'` correspond to the value 10 rather than 0.\n\nIn our example, this will yield the following, with each number expressed in hexadecimal:\n\n```\n30 31 32 33 34 35 36 37 38 39 61 62 63 64 65 66\n 0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f\n```\n\nNext, we rearrange the numeric values such that they correspond to little-endian byte order, and separate odd and even positions into 32-bit groups:\n\n```\n f  d  b  9 |  e  c  a  8 |  7  5  3  1 |  6  4  2  0\n```\n\nThe reason for this seemingly peculiar arrangement becomes clear when we left-shift every second 32-bit word by 4 bits:\n\n```\n f  d  b  9 | e0 c0 a0 80 | 7  5  3  1 | 60 40 20  0\n```\n\nWe can now see that if we horizontally add (as if with the operator `+`) consecutive 32-bit words, we recover the value represented by the first and second 32 bits of the original 64-bit number:\n\n```\nef cd ab 89 | 67 45 23  1\n```\n\nNote that this is little-endian storage. The integer value is understood as the reverse order of bytes:\n\n```\n 1 23 45 67 | 89 ab cd ef\n```\n\nIn other words, we have obtained the numeric value represented by the original hexadecimal string.\n\n### Base64 with URL-safe alphabet\n\nBase64 decoding with an alphabet safe for both URLs and file names follows the [vector lookup algorithm](http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html#vector-lookup-pshufb-with-bitmask-new) described by Wojciech Muła. The main difference is that while in regular Base64, characters `+` and `/` occupy the same high nibble, in [modified Base64](https://datatracker.ietf.org/doc/html/rfc4648#section-5), character `-` has its own high nibble, whereas `_` shares the high nibble with uppercase letters. As such, SIMD comparison for equality is done on `_` instead of `/`. For extracting bytes, we use the [multiply-add variant](http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html#pack-multiply-add-variant-update). Modified Base64 does not have the padding character `=`. 
As opposed to the algorithms by Wojciech Muła, we use 32-byte AVX2 instructions (`__m256i`) with shuffle on two 16-byte lanes, not their 16-byte variants (`__m128i`).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhunyadi%2Fsimdparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhunyadi%2Fsimdparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhunyadi%2Fsimdparse/lists"}