{"id":13719138,"url":"https://github.com/Ed-von-Schleck/shoco","last_synced_at":"2025-05-07T11:31:15.604Z","repository":{"id":16098661,"uuid":"18843548","full_name":"Ed-von-Schleck/shoco","owner":"Ed-von-Schleck","description":"shoco is a compressor for small text strings","archived":false,"fork":false,"pushed_at":"2023-09-09T15:36:13.000Z","size":2892,"stargazers_count":364,"open_issues_count":32,"forks_count":65,"subscribers_count":31,"default_branch":"master","last_synced_at":"2024-05-22T00:13:13.822Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://ed-von-schleck.github.io/shoco/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ed-von-Schleck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2014-04-16T14:53:46.000Z","updated_at":"2024-05-14T15:02:18.000Z","dependencies_parsed_at":"2024-01-05T22:05:08.428Z","dependency_job_id":"40d4d0ec-32a5-47bb-9c23-d897cefc9885","html_url":"https://github.com/Ed-von-Schleck/shoco","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ed-von-Schleck%2Fshoco","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ed-von-Schleck%2Fshoco/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ed-von-Schleck%2Fshoco/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ed-von-Schleck%2Fshoco/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ed-von-Schleck","download_url":"https://codeload.gi
thub.com/Ed-von-Schleck/shoco/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252868732,"owners_count":21816917,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:00:43.072Z","updated_at":"2025-05-07T11:31:15.123Z","avatar_url":"https://github.com/Ed-von-Schleck.png","language":"C","funding_links":[],"categories":["Data processing","C"],"sub_categories":["Compression"],"readme":"\n**shoco**: a fast compressor for short strings\n--------------------------------------------\n\n**shoco** is a C library to compress and decompress short strings. It is [very fast](#comparisons-with-other-compressors) and [easy to use](#api). The default compression model is optimized for english words, but you can [generate your own compression model](#generating-compression-models) based on *your specific* input data.\n\n**shoco** is free software, distributed under the [MIT license](https://raw.githubusercontent.com/Ed-von-Schleck/shoco/master/LICENSE).\n\n## Quick Start\n\nCopy [shoco.c](https://raw.githubusercontent.com/Ed-von-Schleck/shoco/master/shoco.c), [shoco.h](https://raw.githubusercontent.com/Ed-von-Schleck/shoco/master/shoco.h) and [shoco_model.h](https://raw.githubusercontent.com/Ed-von-Schleck/shoco/master/shoco_model.h) from [github/shoco](https://github.com/Ed-von-Schleck/shoco) over to your project. 
Include `shoco.h` and you are ready to use the [API](#api)!\n\n### API\n\nHere is all of it:\n\n```C\nsize_t shoco_compress(const char * in, size_t len, char * out, size_t bufsize);\nsize_t shoco_decompress(const char * in, size_t len, char * out, size_t bufsize);\n```\n\nIf the `len` argument for `shoco_compress` is 0, the input string is assumed to be null-terminated. If it’s a positive integer, parsing the input will stop after this length, or at a null-char, whichever comes first. `shoco_decompress`, however, needs a positive integer for `len` (most likely you should pass the return value of `shoco_compress`).\n\nThe return value is the number of bytes written. If it is less than `bufsize`, all is well. In case of decompression, a null-terminator is written. If the return value is exactly `bufsize`, the output is all there, but _not_ null-terminated. It is up to you to decide if that’s an error or not. If the buffer is not large enough for the output, the return value will be `bufsize` + 1. In that case, you might want to allocate a bigger output buffer. The compressed string will never be null-terminated.\n\nIf you are sure that the input data is plain ASCII, your `out` buffer for `shoco_compress` only needs to be as large as the input string. Otherwise, the output buffer may need to be up to 2x as large as the input, if it’s a 1-byte encoding, or even larger for multi-byte or variable-width encodings like UTF-8.\n\nFor the standard values of _shoco_, maximum compression is 50%, so the `out` buffer for `shoco_decompress` needs to be at most twice the size of the compressed string.\n\n## How It Works\n\nHave you ever tried compressing the string “hello world” with **gzip**? Let’s do it now:\n\n```bash\n$ echo \"hello world\" | gzip -c | wc -c\n32\n```\n\nSo the output is actually *larger* than the input string. And **gzip** is quite good with short input: **xz** produces an output size of 68 bytes. 
Of course, compressing short strings is not what they are made for, because you rarely need to make small strings even smaller – except when you do. That’s why **shoco** was written.\n\n**shoco** works best if your input is ASCII. In fact, the most remarkable property of **shoco** is that the compressed size will *never* exceed the size of your input string, provided it is plain ASCII. What is more: An ASCII string is suitable input for the decompressor (which will return the exact same string, of course). That property comes at a cost, however: If your input string is not entirely (or mostly) ASCII, the output may grow. For some inputs, it can grow quite a lot. That is especially true for multibyte encodings such as UTF-8. Latin-1 and comparable encodings fare better, but will still increase your output size, if you don’t happen to hit a common character. Why is that so?\n\nIn every language, some characters are used more often than others. English is no exception to this rule. So if one simply makes a list of the, say, sixteen most common characters, four bits would be sufficient to refer to them (as opposed to eight bits – one byte – used by ASCII). But what if the input string includes an uncommon character, that is not in this list? Here’s the trick: We use the first bit of a `char` to indicate if the following bits refer to a short common character index, or a normal ASCII byte. Since the first bit in plain ASCII is always 0, setting the first bit to 1 says “the next bits represent short indices for common chars”. But what if our character is not ASCII (meaning the first bit of the input `char` is not 0)? Then we insert a marker that says “copy the next byte over as-is”, and we’re done. 
That explains the growth for non-ASCII characters: This marker takes up a byte, doubling the effective size of the character.\n\nHow **shoco** actually marks these packed representations is a bit more complicated than that (e.g., we also need to specify *how many* packed characters follow, so a single leading bit won’t be sufficient), but the principle still holds.\n\nBut **shoco** does more than just abbreviate characters based on absolute frequency – languages have more regularities than that. Some characters are more likely to be encountered next to others; the canonical example would be **q**, which is *almost always* followed by a **u**. In English, *the*, *she*, *he*, *then* are all very common words – and all have an **h** followed by an **e**. So if we assemble a list of common characters *following common characters*, we can make do with even fewer bits to represent these *successor* characters, and still have a good hit rate. That’s the idea of **shoco**: Provide short representations of characters based on the previous character.\n\nThis does not allow for optimal compression – by far. But if one carefully aligns the representation packs to byte boundaries, and uses the ASCII-first-bit-trick above to encode the indices, it works well enough. Moreover, it is blazingly fast. You wouldn’t want to use **shoco** for strings larger than, say, a hundred bytes, because then the overhead of a full-blown compressor like **gzip** begins to be dwarfed by the advantages of the much more efficient algorithms it uses.\n\nIf one wanted to classify **shoco**, it would be an [entropy encoder](http://en.wikipedia.org/wiki/Entropy_encoding), because the length of the representation of a character is determined by the probability of encountering it in a given input string. That’s opposed to [dictionary coders](http://en.wikipedia.org/wiki/Dictionary_coder) that maintain a dictionary of common substrings. 
Optimal compression for short strings could probably be achieved using an [arithmetic coder](http://en.wikipedia.org/wiki/Arithmetic_coding) (also a type of entropy encoder), but most likely one could not achieve the same kind of performance that **shoco** delivers.\n\nHow does **shoco** get the information about character frequencies? They are not pulled out of thin air, but instead generated by analyzing text with a relatively simple script. It counts all *bigrams* – two successive characters – in the text and orders them by frequency. Optionally, it also tests for best encodings (like: Is it better to spend more bits on the leading character or on the successor character?), and then outputs its findings as a header file for `shoco.c` to include. That means the statistical model is compiled in; we simply can’t add it to the compressed string without blowing it out of proportion (and defeating the whole purpose of this exercise). This script is shipped with **shoco**, and the [next section](#generating-compression-models) is about how *you* can use it to generate a model that’s optimized for *your* kind of data. Just remember that, with **shoco**, you need to control both ends of the chain (compression **and** decompression), because you can’t decompress data correctly if you’re not sure that the compressor has used the same model.\n\n## Generating Compression Models\n\nMaybe your typical input isn’t English words. Maybe it’s German or French – or whole sentences. Or file system paths. Or URLs. While the standard compression model of **shoco** should work for all of these, it might be worthwhile to train **shoco** for this specific type of input data.\n\nFortunately, that’s really easy: **shoco** includes a Python script called `generate_compression_model.py` that takes one or more text files and outputs a header file ready for **shoco** to use. 
Here’s an example that trains **shoco** with a dictionary (by the way, not the best kind of training data, because it’s dominated by uncommon words):\n\n```bash\n$ ./generate_compression_model.py /usr/share/dict/words -o shoco_model.h\n```\n\nThere are options on how to chunk and strip the input data – for example, if we want to train **shoco** with the words in a readme file, but without punctuation and whitespace, we could do\n\n```bash\n$ ./generate_compression_model.py --split=whitespace --strip=punctuation README.md\n```\n\nSince we haven’t specified an output file, the resulting table file is printed on stdout.\n\nThis is most likely all you’ll need to generate a good model, but if you are adventurous, you might want to play around with all the options of the script: Type `generate_compression_model.py --help` to get a friendly help message. We won’t dive into the details here, though – just one word of warning: Generating tables can be slow if your input data is large, and _especially_ so if you use the `--optimize-encoding` option. Using [pypy](http://pypy.org/) can significantly speed up the process.\n\n## Comparisons With Other Compressors\n\n### smaz\n\nThere’s another good small string compressor out there: [**smaz**](https://github.com/antirez/smaz). **smaz** seems to be dictionary based, while **shoco** is an entropy encoder. As a result, **smaz** will often do better than **shoco** when compressing common English terms. However, **shoco** typically beats **smaz** for more obscure input, as long as it’s ASCII. **smaz** may enlarge your string for uncommon words (like numbers); **shoco** will never do that for ASCII strings.\n\nPerformance-wise, **shoco** is typically faster by at least a factor of 2. 
As an example, compressing and decompressing all words in `/usr/share/dict/words` with **smaz** takes around 0.325s on my computer and compresses on average by 28%, while **shoco** has a compression average of 33% (with the standard model; an optimized model will be even better) and takes around 0.145s. **shoco** is _especially_ fast at decompression.\n\n**shoco** can be trained with user data, while **smaz**’s dictionary is built-in. That said, the maximum compression rate of **smaz** is hard to reach for **shoco**, so depending on your input type, you might fare better with **smaz** (there’s no way around it: You have to measure it yourself).\n\n### gzip, xz\n\nAs mentioned, **shoco**’s compression ratio can’t (and doesn’t want to) compete with gzip et al. for strings larger than a few bytes. But for very small strings, it will always be better than standard compressors.\n\n**shoco** should always be several times faster than just about any standard compression tool. For testing purposes, there’s a binary included (unsurprisingly called `shoco`) that compresses and decompresses single files. The following timings were made with this command line tool. The data is `/usr/share/dict/words` (size: 4,953,680), compressing it as a whole (not a strong point of **shoco**):\n\ncompressor | compression time | decompression time | compressed size\n-----------|------------------|--------------------|----------------\nshoco      | 0.070s           | 0.010s             | 3,393,975\ngzip       | 0.470s           | 0.048s             | 1,476,083\nxz         | 3.300s           | 0.148s             | 1,229,980\n\nThis demonstrates quite clearly that **shoco**’s compression rate sucks, but also that it’s _very_ fast.\n\n## Javascript Version\n\nFor showing off, **shoco** ships with a Javascript version (`shoco.js`) that’s generated with [emscripten](https://github.com/kripken/emscripten). 
If you change the default compression model, you need to re-generate it by typing `make js`. You do need to have emscripten installed. The output is [asm.js](http://asmjs.org/) with a small shim to provide a convenient API:\n\n```js\ncompressed = shoco.compress(input_string);\noutput_string = shoco.decompress(compressed);\n```\n\nThe compressed string is really a [Uint8Array](https://developer.mozilla.org/en-US/docs/Web/API/Uint8Array), since that resembles a C string more closely. The Javascript version is not as furiously fast as the C version because there’s dynamic (heap) memory allocation involved, but I guess there’s no way around it.\n\n`shoco.js` should be usable as a node.js module.\n\n## Tools And Other Included Extras\n\nMost of them have been mentioned already, but for the sake of completeness – let’s have a quick overview of what you’ll find in the repo:\n\n#### `shoco.c`, `shoco.h`, `shoco_model.h`\n\nThe heart of the project. If you don’t want to bother with nitty-gritty details, and the compression works for you, it’s all you’ll ever need.\n\n#### `models/*`\n\nAs examples, there are more models included. Feel free to use one of them instead of the default model: Just copy it over `shoco_model.h` and you’re all set. Re-build them with `make models`.\n\n#### `training_data/*`\n\nSome books from [Project Gutenberg](http://www.gutenberg.org/ebooks/) used for generating the default model.\n\n#### `shoco.js`\n\nJavascript library, generated by **emscripten**. Also usable as a [node.js](http://nodejs.org/) module (put it in `node_modules` and `require` it). Re-build with `make js`.\n\n#### `shoco.html`\n\nAn example of how to use `shoco.js` in a website.\n\n#### `shoco`\n\nA testing tool for compressing and decompressing files. Build it with `make shoco` or just `make`. 
Use it like this:\n\n```bash\n$ shoco compress file-to-compress.txt compressed-file.shoco\n$ shoco decompress compressed-file.shoco decompressed-file.txt\n```\n\nIt’s not meant for production use, because I can’t imagine why one would want to use **shoco** on entire files.\n\n#### `test_input`\n\nAnother testing tool for compressing and decompressing every line in the input file. Build it with `make test_input`. Usage example:\n\n```bash\n$ time ./test_input \u003c /usr/share/dict/words \nNumber of compressed strings: 479828, average compression ratio: 33%\n\nreal   0m0.158s\nuser   0m0.145s\nsys    0m0.013s\n```\n\nAdding the command line switch `-v` gives line-by-line information about the compression ratios.\n\n#### `Makefile`\n\nIt’s not the cleanest or l33test Makefile ever, but it should give you hints for integrating **shoco** into your project.\n\n#### `tests`\n\nInvoke them with `make check`. They should pass.\n\n## Things Still To Do\n\n**shoco** is stable, and it works well – but I’ve only tested it with gcc/clang on x86_64 Linux. Feedback on how it runs on other OSes, compilers and architectures would be highly appreciated! If it fails, it’s a bug (and given the size of the project, it should be easy to fix). Other than that, there are a few issues that could stand some improvement:\n\n* There should be more tests, because there’s _never_ enough tests. Ever. Patches are very welcome!\n* Tests should include model generation. As that involves re-compilation, these should probably be written as a Makefile, or in bash or Python (maybe using `ctypes` to call the **shoco**-functions directly).\n* The Python script for model generation should see some clean-up, as well as documentation. Also it should utilize all CPU cores (presumably via the `multiprocessing` module). 
This is a good task for new contributors!\n* Again for model generation: Investigate why **pypy** isn’t as fast as would be expected ([jitviewer](https://bitbucket.org/pypy/jitviewer/) might be of help here).\n* Make a real **node.js** module.\n* The current SSE2 optimization is probably not optimal. Anyone who loves to tinker with these kinds of micro-optimizations is invited to try his or her hand here.\n* Publishing/packaging it as a real library probably doesn’t make much sense, as the model is compiled-in, but maybe we should be making it easier to use **shoco** as a git submodule (even if it’s just about adding documentation), or finding other ways to avoid the copy\u0026paste installation.\n\n## Feedback\n\nIf you use **shoco**, or like it for whatever reason, I’d really love to [hear from you](mailto:christian.h.m.schramm at gmail.com - replace the 'at' with @ and delete this sentence-)! If desired, I can provide integration with **shoco** for your commercial services (at a price, of course), or for your totally awesome free and open source software (for free, if I find the time). Also, a nice way of saying thanks is to support me financially via\n[git tip](https://www.gittip.com/Ed-von-Schleck/) or [flattr](https://flattr.com/submit/auto?user_id=Christian.Schramm\u0026url=https://ed-von-schleck.github.io/shoco\u0026language=C\u0026tags=github\u0026category=software).\n\nIf you find a bug, or have a feature request, [file it](https://github.com/Ed-von-Schleck/shoco/issues/new)! 
If you have a question about usage or internals of **shoco**, ask it on [stackoverflow](https://stackoverflow.com/questions/ask) for good exposure – and write me a mail, so that I don’t miss it.\n\n## Authors\n\n**shoco** is written by [Christian Schramm](mailto:christian.h.m.schramm at gmail.com - replace the 'at' with @ and delete this sentence).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEd-von-Schleck%2Fshoco","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEd-von-Schleck%2Fshoco","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEd-von-Schleck%2Fshoco/lists"}