Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hunyadi/murify
Efficient in-memory compression for URLs
- Host: GitHub
- URL: https://github.com/hunyadi/murify
- Owner: hunyadi
- License: mit
- Created: 2024-03-17T17:58:57.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2024-08-21T09:08:08.000Z (4 months ago)
- Last Synced: 2024-08-21T10:31:36.903Z (4 months ago)
- Language: C++
- Size: 27.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# murify: Efficient in-memory compression for URLs
This header-only C++ library converts URLs into a compact, fully reversible representation. General-purpose compression algorithms (e.g. GZIP or ZSTD) may not operate efficiently on individual URLs, whereas URL compaction may reduce URL string length by as much as 75% at low CPU cost. This may save substantial space when millions of URLs have to be kept in memory simultaneously.
Compaction is accomplished with a combination of several techniques:
* Decimal integers are represented as their binary equivalent, packed into minimum width. For example, the character string `123` of length 3 becomes the hexadecimal value `0x7B` and is persisted in a single byte. The character string `4294967295` of length 10 becomes the hexadecimal value `0xFFFFFFFF` and is persisted in 4 bytes.
* Frequently occurring strings (such as components in a path, or keys in a query string) are interned, and only the index in the lookup table is stored, packed into minimum width. A long but frequent path component such as `management` may become an index stored in a single byte.
* UUID strings (typically 36 characters) are parsed into a 16-byte array.
* When Base64-encoded data is encountered (e.g. a JWT or a user identifier), it is decoded and the raw bytes are persisted. Because Base64 encodes every 3 bytes as 4 characters, this saves 25%.
* The type of each value is identified by a control byte. Integer width, string length or lookup table index is packed into the control byte whenever possible.
* Composite types such as a URL path or query string are persisted as a length followed by a series of values. Separators (e.g. `/`, `&` or `=`) are not stored.
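
Two of the techniques above, minimum-width integer packing and UUID parsing, can be illustrated with a minimal sketch. The function names `pack_decimal` and `pack_uuid` are hypothetical and are not part of murify's API. The little-endian byte order, the rejection of leading zeros and the rejection of uppercase hexadecimal digits are assumptions made here so that the original text can be rebuilt exactly. The library's actual encoding, including its control byte layout, may differ.

```cpp
#include <array>
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper (not murify's API): parse a decimal string and return
// its value as little-endian bytes, using as few bytes as possible.
std::optional<std::vector<std::uint8_t>> pack_decimal(std::string_view text)
{
    // Assumption: reject empty input and leading zeros, because "007"
    // could not be restored from the number 7 alone.
    if (text.empty() || (text.size() > 1 && text.front() == '0')) {
        return std::nullopt;
    }

    std::uint64_t value = 0;
    const char* last = text.data() + text.size();
    auto [ptr, ec] = std::from_chars(text.data(), last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;  // not purely decimal, or too large for 64 bits
    }

    // Emit the low-order byte first and stop once no significant bits remain.
    std::vector<std::uint8_t> bytes;
    do {
        bytes.push_back(static_cast<std::uint8_t>(value & 0xFF));
        value >>= 8;
    } while (value != 0);
    return bytes;
}

// Hypothetical helper (not murify's API): parse a UUID in the canonical
// 8-4-4-4-12 form (36 characters) into 16 raw bytes.
std::optional<std::array<std::uint8_t, 16>> pack_uuid(std::string_view text)
{
    if (text.size() != 36) {
        return std::nullopt;
    }

    std::array<std::uint8_t, 16> out{};
    std::size_t nibbles = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (c != '-') {
                return std::nullopt;
            }
            continue;  // separators are implied by position and not stored
        }

        int digit;
        if (c >= '0' && c <= '9') {
            digit = c - '0';
        } else if (c >= 'a' && c <= 'f') {
            digit = c - 'a' + 10;
        } else {
            // Assumption: uppercase digits are rejected so that the
            // lowercase text can be reproduced exactly from the bytes.
            return std::nullopt;
        }

        // Two hexadecimal digits form one byte, high nibble first.
        out[nibbles / 2] = static_cast<std::uint8_t>((out[nibbles / 2] << 4) | digit);
        ++nibbles;
    }
    assert(nibbles == 32);
    return out;
}
```

Under these assumptions, `pack_decimal("123")` yields the single byte `0x7B` and `pack_decimal("4294967295")` yields four `0xFF` bytes, matching the examples in the list above. Inputs that fail either check would simply be stored as ordinary strings in a scheme like this one.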