# Runeset: Sets of UTF-8 Characters

This library offers a compact data structure representing sets of "generalized"[^1] UTF-8 encoded codepoints. The design is based on an [implicit data structure](https://en.wikipedia.org/wiki/Implicit_data_structure)[^2], which uses `@popCount` and bit masking to check membership quickly, with minimal branching, and without decoding the UTF-8 into another format (for instance, a codepoint).
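To make the mechanism concrete, here is a minimal sketch of the popcount-rank trick the design rests on: a `u64` mask records which of 64 values are present, and the count of set bits below a member gives its index into a dense side array. This illustrates the general technique only, not the `RuneSet`'s actual word layout.

```zig
const std = @import("std");

/// Sketch of the popcount-rank trick only; the RuneSet's real layout
/// is more involved. A u64 mask records which of 64 values are
/// present, and the rank of a set bit (the count of set bits below
/// it) indexes a dense side array with no wasted slots.
fn rankOf(mask: u64, bit: u6) ?usize {
    const one: u64 = 1;
    if ((mask & (one << bit)) == 0) return null; // not a member
    return @popCount(mask & ((one << bit) - 1));
}

test "rank indexes a dense array" {
    const mask: u64 = (1 << 3) | (1 << 7) | (1 << 42);
    const data = [_]u8{ 'a', 'b', 'c' }; // one slot per set bit
    try std.testing.expectEqual(@as(?usize, 1), rankOf(mask, 7));
    try std.testing.expectEqual(@as(?usize, null), rankOf(mask, 5));
    try std.testing.expectEqual(@as(u8, 'c'), data[rankOf(mask, 42).?]);
}
```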

This design is original, in the sense that I invented it. There may be prior art; it's remarkably difficult to search for "UTF-8 character sets" and find papers on set data structures, so I can't say with high confidence that it's truly novel. In a sense, it's an obvious extension of the widespread practice of using a pair of `u64` bitmasks to detect a set of ASCII values. What I can say is: it's effective.
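For reference, that ASCII ancestor looks something like the following; this is a sketch of the common two-word idiom, not code from this library.

```zig
const std = @import("std");

/// The widespread ASCII technique that the RuneSet generalizes:
/// two u64 words hold one bit per byte value 0..127. A sketch of
/// the common idiom, not code from this library.
const AsciiSet = struct {
    masks: [2]u64,

    fn init(chars: []const u8) AsciiSet {
        var m = [2]u64{ 0, 0 };
        for (chars) |c| {
            std.debug.assert(c < 128); // ASCII only
            m[c >> 6] |= @as(u64, 1) << @as(u6, @truncate(c));
        }
        return .{ .masks = m };
    }

    fn contains(set: AsciiSet, c: u8) bool {
        if (c > 127) return false;
        const bit = @as(u64, 1) << @as(u6, @truncate(c));
        return (set.masks[c >> 6] & bit) != 0;
    }
};

test "two-word ASCII set" {
    const digits = AsciiSet.init("0123456789");
    try std.testing.expect(digits.contains('7'));
    try std.testing.expect(!digits.contains('x'));
}
```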

The `RuneSet` struct is just a slice of `u64`, offering a few variations on set membership testing, depending on how confident you are that the string under test is valid UTF-8, and what you would like the test to do in the event that it isn't. The fundamental set operations are also supported: equality, subset, union, intersection, and difference. The three combining operations each produce a new `RuneSet`, and must be given an `Allocator`.
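In use, that API looks roughly like the sketch below. Every name in it (`createFromConstString`, `deinit`, `setIntersection`) is a placeholder I've chosen for illustration; consult the library's source for the actual signatures.

```zig
const std = @import("std");
// Placeholder import path; adjust to however you vendor the package.
const RuneSet = @import("runeset").RuneSet;

test "sketch: building and combining sets (placeholder names)" {
    const allocator = std.testing.allocator;
    // NOTE: these function names are illustrative stand-ins, not the
    // library's documented API; consult the source for the real ones.
    const vowels = try RuneSet.createFromConstString("aeiou", allocator);
    defer vowels.deinit(allocator);
    const nordic = try RuneSet.createFromConstString("aeiouæøå", allocator);
    defer nordic.deinit(allocator);

    // Each combining operation allocates, so it takes an Allocator
    // and returns a new, immutable RuneSet.
    const shared = try vowels.setIntersection(nordic, allocator);
    defer shared.deinit(allocator);
}
```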

The `RuneSet` is immutable by design, and as such doesn't support adding codepoints to, or removing them from, an already-created set. This would be possible to implement in principle, but I don't happen to have a use for mutable adds and removes, and therefore didn't write them. If you need to, you can create a `RuneSet` containing a single character and, via union or difference, add or subtract that character: this creates a new `RuneSet` with just that character added or removed.
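Continuing the sketch above (with the same placeholder-name caveat), the add-one-character workaround would look like this; these lines belong inside the same test block.

```zig
// Continuing the sketch above; names are still placeholders.
const just_w = try RuneSet.createFromConstString("w", allocator);
defer just_w.deinit(allocator);

// "Add" w by union, then "remove" it again by difference. Each step
// allocates a fresh set and leaves the originals untouched.
const with_w = try vowels.setUnion(just_w, allocator);
defer with_w.deinit(allocator);
const without_w = try with_w.setDifference(just_w, allocator);
defer without_w.deinit(allocator);
```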

The data structure is very fast. The test suite, with extensions, contains 5MB of sample data, plus a 20MB extra collection, and performs a multitude of operations on that data, on the order of a dozen per string. In ReleaseFast mode, the main 5MB suite concludes in roughly a second on an M1. With the annex, including fuzzing the creation function 2^24 times, a complete run is over in six seconds, give or take a few milliseconds. About two thirds of that is the fuzzing.

While I have yet to set up definitive benchmarks, it's clear that membership testing using `RuneSet` has a throughput of over 100 MB/s on modern machines, and potentially substantially higher. I'll update this with better numbers when I get around to benchmarking; so far the focus has been on implementing the library and assuring good test coverage.

For real character sets, it is also quite sparing of memory. The CJK Ideographs block of Unicode, consisting of 22,992 codepoints, is represented as a `RuneSet` in 338 words: that's 2,704 bytes, or less than one bit per codepoint in the set. Dense ranges such as that block can of course be tested by decoding the sequence into a codepoint and checking whether it falls in the range, but the combination of generality, compactness, and speed offered by the `RuneSet` is unique.

Speaking of test coverage, the library has 100% line coverage, which you may verify by running `zig build cov`, provided you have [kcov](https://github.com/SimonKagstrom/kcov) installed. To run the extended suite, including the fuzzer, use `zig build -Dtest-more test`. This will take an appreciable amount of time in the default debug-mode builds, as it will trigger the 244 assertions which gird the library.

## Roadmap

I intend to add some ability to write out the data structure, whether as binary data, Zig source code, or potentially both. Because the structure allocates, it isn't currently possible to generate a `RuneSet` at comptime, and many applications will use character sets known in advance, which may as well be built into a program's `.rodata`, rather than constructed at runtime from strings.

I also plan to wrap all of the major functionality in C-compatible data structures and functions, for use wherever the C ABI is spoken.

Other than that, I consider the library fully implemented and feature-complete. Odds are good, however, that more open-source code built on the `RuneSet` will be forthcoming.

[^1]: Generalized, here, has a similar meaning to its use in [WTF-8](https://simonsapin.github.io/wtf-8/), except that `RuneSet` can also encode so-called "overlong" encodings, as well as surrogates (paired vs. unpaired is a meaningless distinction for a codepoint set). This falls out of the design: it's capable of encoding those sequences, and I saw no reason to impose a performance burden in order to prevent it. On the contrary, should you want to represent the set of all overlong encodings, perhaps to detect them and raise an error, a `RuneSet` is a fine way to do it. A set which does not contain overlong encodings or surrogates will never match against them, so this seeming laxity comes with no disadvantages. The five- and six-byte sequences from Pike and Thompson's original FSS UTF-8 of 1993 are not supported. As such, what would constitute lead bytes for those sequences are rejected as invalid. It would be possible to extend the `RuneSet` data structure to include these, if there were ever any point in so doing.

[^2]: That the `RuneSet` is 'implicit' doesn't make it the smallest possible encoding for every subset of UTF-8. For example, the CJK Unified Ideographs block could be represented as two numbers bounding a range, and for some crafted inputs, such as a set containing one in every 64 characters throughout the Unicode range, a flat bitmask of one bit per character would be smaller as well. Rather, it means that the structure is implicit in the data, without indirection, and that in full generality, sets of UTF-8 sequences can't be represented more compactly than the information-theoretic lower bound. Our overhead is limited to one word, used to store one of the offsets as an important optimization.