{"id":24655012,"url":"https://github.com/wrightdylan/codecs","last_synced_at":"2025-03-21T02:16:13.958Z","repository":{"id":274170913,"uuid":"850622059","full_name":"wrightdylan/codecs","owner":"wrightdylan","description":"A collection of encoders and decoders","archived":false,"fork":false,"pushed_at":"2025-01-25T11:50:40.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-25T12:25:39.338Z","etag":null,"topics":["codec","compression","decoder","encoder","huffman","library","rust","rust-lang"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wrightdylan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-01T10:08:10.000Z","updated_at":"2025-01-25T11:50:44.000Z","dependencies_parsed_at":"2025-01-25T12:36:55.767Z","dependency_job_id":null,"html_url":"https://github.com/wrightdylan/codecs","commit_stats":null,"previous_names":["wrightdylan/codecs"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wrightdylan%2Fcodecs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wrightdylan%2Fcodecs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wrightdylan%2Fcodecs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wrightdylan%2Fcodecs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/wrightdylan","download_url":"https://codeload.github.com/wrightdylan/codecs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244722747,"owners_count":20499154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["codec","compression","decoder","encoder","huffman","library","rust","rust-lang"],"created_at":"2025-01-25T22:35:53.991Z","updated_at":"2025-03-21T02:16:13.937Z","avatar_url":"https://github.com/wrightdylan.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A collection of coders and decoders\n\n## Codecs:\n\n- Huffman\n\n## To do:\n\n- Arithmetic\n- LZW\n- RLE\n\n## Huffman\n\nHuffman coding is a greedy algorithm used to compress large text files. It works by building a tree based on the frequency of characters in the text. For more, see the [Wikipedia article](https://en.wikipedia.org/wiki/Huffman_coding). Compression averages about 50% on files, and UTF-8 input is handled without problems.\n\nUpdate: `Serde` serialisation works out to be quite large, and it also includes a lot of empty bytes, most likely used as a fixed-width header to describe the length of the serialised bytes. Preliminary testing with a custom serialisation shows the tree information reduced to a fifth of `Serde`'s output. 
\nThis uses a custom schema as follows:\n┌───┬──╌╌──┬─┬──╌╌┄┄┄┄╌╌──┐\\\n└───┴──╌╌──┴─┴──╌╌┄┄┄┄╌╌──┘\\\n2 or 1-4 bytes: Tree data length, either as two fixed bytes or as a variable-width value.\\\nn bytes: Tree data\\\n1 byte: Number of data packing bits\\\nm bytes: Data (indefinite length)\n\nThis custom serialisation works perfectly for ASCII, or single-byte UTF-8, but it breaks on multi-byte UTF-8. This can be fixed to account for variable-width UTF-8 encoding; however, the resulting tree data would probably not be much smaller than simply sticking to `Serde`, although this depends heavily on which language is being stored in the tree.\n\nUpdate 2:\nThe original tree data length of 1 byte was enough for standard Roman characters and some 2-byte Unicode languages, but 3-byte Unicode caused overflows even with a short sentence. A fixed width of 2 bytes (1 word) was used instead. There did not appear to be much point in having a variable-width header, since an additional byte for Romance languages will not make much of a difference, and it is unlikely that more than 65,535 bytes will be needed unless a very large text is being compressed, for example one in Japanese that uses every known character in the language. Ultimately, compression is still very good with Romance languages, but it suffers to varying degrees with others.\n\nUpdate 3:\nImplementing variable-width headers was just far too tempting. The header is now one to four bytes, which allows 28 bits of tree length information, but if you need anywhere near 268,435,455 bytes for your tree, you're probably doing something very wrong. See the [section below](#variable-width-encoding) for more details on this encoding.\n\nUpdate 4:\nFixed-width or variable-width headers can now be selected as a feature. 
The default is a 2-byte fixed-width header; enable the `vwe_header` feature for the variable-width option.\n\n### Implementations\n- `easy_encode()` provides a simple interface to encode a string to the terminal.\n- `encode_to_bitstream()` provides a more useful interface that packages the encoded data with the tree, so the result can be saved to a file.\n- `decode_from_bitstream()` reverses the above function.\n\n## Variable width encoding\nTwo internal functions handle encoding and decoding of variable-width headers; they are not intended for use outside the codec library. The first takes an unsigned integer, ideally `usize`, and checks that it is less than the maximum value of a 28-bit number. Numbers below 128 are stored in a single byte whose most significant bit is 0, with the remaining bits holding data. Larger numbers use an encoded first byte followed by normal bytes: the first byte starts with a 1 for each trailing byte, then a 0 separator, and its remaining bits hold data. The table below shows the first-byte pattern for each value range.\n\n| Range | First byte |\n|:-|:-|\n| 0 - 127                 | 0XXX_XXXX |\n| 128 - 16_383            | 10XX_XXXX |\n| 16_384 - 2_097_151      | 110X_XXXX |\n| 2_097_152 - 268_435_455 | 1110_XXXX |\n\n\n## License\nThis project is dual-licensed under both the [Apache License](LICENSE-APACHE) (Version 2.0) and [MIT license](LICENSE-MIT).\n\n`SPDX-License-Identifier: Apache-2.0 AND MIT`","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwrightdylan%2Fcodecs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwrightdylan%2Fcodecs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwrightdylan%2Fcodecs/lists"}