{"id":15689333,"url":"https://github.com/christophevg/huffman","last_synced_at":"2025-10-14T12:30:35.695Z","repository":{"id":137514676,"uuid":"289450885","full_name":"christophevg/huffman","owner":"christophevg","description":"Simple and straigthforward implementation of Huffman coding - as a small exercise","archived":true,"fork":false,"pushed_at":"2020-08-22T12:02:28.000Z","size":316,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-31T06:31:44.602Z","etag":null,"topics":["benchmarking","excercise","huffman","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/christophevg.png","metadata":{"files":{"readme":".github/README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-22T08:33:36.000Z","updated_at":"2023-10-28T08:55:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8088647-1731-4cd7-a6b4-172c12ce83b4","html_url":"https://github.com/christophevg/huffman","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/christophevg/huffman","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophevg%2Fhuffman","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophevg%2Fhuffman/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophevg%2Fhuffman/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophevg%2Fhuffman/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/christophevg","download_url":"https://codeload.github.com/christophevg/huffman/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophevg%2Fhuffman/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279018662,"owners_count":26086576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","excercise","huffman","python"],"created_at":"2024-10-03T18:01:43.930Z","updated_at":"2025-10-14T12:30:35.689Z","avatar_url":"https://github.com/christophevg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Huffman\n\n\u003e Simple and straigthforward implementation of Huffman coding - as a small exercise.\n\nRunning the script as is, it performs all steps in the Huffman coding process on a predefined \"hello world\" string...\n\n```bash\n$ python -m huffman.coding\n88 hello world\n[('h', 1), ('e', 1), ('l', 3), ('o', 2), (' ', 1), ('w', 1), ('r', 1), ('d', 1)]\n((((((('r', 1), ('d', 1)), 2), (((' ', 1), ('w', 1)), 2)), 4), ((('l', 3), ((((('h', 1), ('e', 1)), 2), ('o', 2)), 4)), 7)), 11)\n 11\n   4\n     2\n       ('r', 1)\n       ('d', 1)\n     2\n       (' ', 1)\n       ('w', 1)\n   7\n     ('l', 3)\n     4\n      
 2\n         ('h', 1)\n         ('e', 1)\n       ('o', 2)\n((('r', 'd'), (' ', 'w')), ('l', (('h', 'e'), 'o')))\n{'r': '000', 'd': '001', ' ': '010', 'w': '011', 'l': '10', 'h': '1100', 'e': '1101', 'o': '111'}\n32 0.36363636363636365 11001101101011101001111100010001\n0001r1d01 1w01l001h1e1o\n((('r', 'd'), (' ', 'w')), ('l', (('h', 'e'), 'o')))\n{'000': 'r', '001': 'd', '010': ' ', '011': 'w', '10': 'l', '1100': 'h', '1101': 'e', '111': 'o'}\nhello world\n```\n\nCommand line arguments are considered a string...\n\n```bash\n$ python -m huffman.coding hello world from cli\n160 hello world from cli\n[('h', 1), ('e', 1), ('l', 4), ('o', 3), (' ', 3), ('w', 1), ('r', 2), ('d', 1), ('f', 1), ('m', 1), ('c', 1), ('i', 1)]\n((((((((('c', 1), ('i', 1)), 2), ((('f', 1), ('m', 1)), 2)), 4), ('l', 4)), 8), ((((('r', 2), ('o', 3)), 5), (((' ', 3), ((((('w', 1), ('d', 1)), 2), ((('h', 1), ('e', 1)), 2)), 4)), 7)), 12)), 20)\n 20\n   8\n     4\n       2\n         ('c', 1)\n         ('i', 1)\n       2\n         ('f', 1)\n         ('m', 1)\n     ('l', 4)\n   12\n     5\n       ('r', 2)\n       ('o', 3)\n     7\n       (' ', 3)\n       4\n         2\n           ('w', 1)\n           ('d', 1)\n         2\n           ('h', 1)\n           ('e', 1)\n(((('c', 'i'), ('f', 'm')), 'l'), (('r', 'o'), (' ', (('w', 'd'), ('h', 'e')))))\n{'c': '0000', 'i': '0001', 'f': '0010', 'm': '0011', 'l': '01', 'r': '100', 'o': '101', ' ': '110', 'w': '11100', 'd': '11101', 'h': '11110', 'e': '11111'}\n68 0.425 11110111110101101110111001011000111101110001010010100111100000010001\n00001c1i01f1m1l001r1o01 001w1d01h1e\n(((('c', 'i'), ('f', 'm')), 'l'), (('r', 'o'), (' ', (('w', 'd'), ('h', 'e')))))\n{'0000': 'c', '0001': 'i', '0010': 'f', '0011': 'm', '01': 'l', '100': 'r', '101': 'o', '110': ' ', '11100': 'w', '11101': 'd', '11110': 'h', '11111': 'e'}\nhello world from cli\n```\n\nDownload a large file, e.g. 
from [https://corpus.canterbury.ac.nz/descriptions/](https://corpus.canterbury.ac.nz/descriptions/) and provide it to the script...\n\n```bash\n$ make large\nwget http://corpus.canterbury.ac.nz/resources/large.zip\n--2020-08-22 10:37:09--  http://corpus.canterbury.ac.nz/resources/large.zip\nResolving corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)... 132.181.17.8\nConnecting to corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)|132.181.17.8|:80... connected.\nHTTP request sent, awaiting response... 302 Moved Temporarily\nLocation: https://corpus.canterbury.ac.nz/resources/large.zip [following]\n--2020-08-22 10:37:09--  https://corpus.canterbury.ac.nz/resources/large.zip\nConnecting to corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)|132.181.17.8|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 3256280 (3,1M) [application/zip]\nSaving to: ‘large.zip’\n\nlarge.zip            100%[===================\u003e]   3,10M   366KB/s    in 8,8s    \n\n2020-08-22 10:37:20 (361 KB/s) - ‘large.zip’ saved [3256280/3256280]\n\nunzip large.zip -d large\nArchive:  large.zip\n  inflating: large/bible.txt         \n  inflating: large/E.coli            \n  inflating: large/world192.txt      \n\n$ python -m huffman.coding large/bible.txt\n32379136 bits\n17747595 bits 0.5481182388560337 %\n```\n\n## Decoding Performance\n\nMy initial implementation for decoding used a dictionary with bitstrings as keys and the corresponding characters as values, an approach similar to the encoding logic. Because I wasn't actually decoding the encoded bible, I didn't notice that this took ... like forever ;-)\n\nI changed the implementation to one that takes the encoding tree and traverses it, following the bits in the encoded stream and emitting a character whenever it arrives at a leaf. This proved to be much faster, but still rather slow. 
Nevertheless, it introduced me to pytest-benchmark:\n\n```bash\n$ make test\ntox\nGLOB sdist-make: /Users/xtof/Workspace/huffman/setup.py\npy37 inst-nodeps: /Users/xtof/Workspace/huffman/.tox/.tmp/package/1/huffman-0.0.2.zip\npy37 installed: attrs==20.1.0,certifi==2020.6.20,chardet==3.0.4,coverage==5.2.1,coveralls==2.1.2,docopt==0.6.2,huffman @ file:///Users/xtof/Workspace/huffman/.tox/.tmp/package/1/huffman-0.0.2.zip,idna==2.10,importlib-metadata==1.7.0,iniconfig==1.0.1,more-itertools==8.4.0,packaging==20.4,pluggy==0.13.1,py==1.9.0,py-cpuinfo==7.0.0,pyparsing==2.4.7,pytest==6.0.1,pytest-benchmark==3.2.3,requests==2.24.0,six==1.15.0,toml==0.10.1,urllib3==1.25.10,zipp==3.1.0\npy37 run-test-pre: PYTHONHASHSEED='1916796342'\npy37 run-test: commands[0] | coverage run -m '--omit=*/.tox/*,*/distutils/*,*/tests/*' pytest\n=================================== test session starts ===================================\nplatform darwin -- Python 3.7.7, pytest-6.0.1, py-1.9.0, pluggy-0.13.1\ncachedir: .tox/py37/.pytest_cache\nbenchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)\nrootdir: /Users/xtof/Workspace/huffman, configfile: tox.ini, testpaths: tests\nplugins: benchmark-3.2.3\ncollected 1 item                                                                          \n\ntests/test_coding.py .                                                              
[100%]\n\n\n------------------------------------------- benchmark: 1 tests ------------------------------------------\nName (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations\n---------------------------------------------------------------------------------------------------------\ntest_roundtrip       4.6859  5.0134  4.8253  0.1543  4.7472  0.2770       1;0  0.2072       5           1\n---------------------------------------------------------------------------------------------------------\n\nLegend:\n  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.\n  OPS: Operations Per Second, computed as 1 / Mean\n=================================== 1 passed in 39.77s ====================================\n_________________________________________ summary _________________________________________\n  py37: commands succeeded\n  congratulations :)\n```\n\nNow this is going to prove to be addictive, giving me a very small piece of code to optimize to the moon and back ;-)\n\n### Fixed Key Length Lookup Table\n\nGoing back to the dictionary that maps bitstrings to characters, I extended every bitstring to a fixed key length (that of the longest key), adding duplicate entries for every combination of the missing bits after the key. This proved to be a much faster (though more memory-consuming) approach, roughly halving the decoding time ;-)\n\nSo instead of looking in this table:\n\n```python\n{\n  \"000\": \"w\",\n  \"001\": \" \",\n  \"010\": \"d\",\n  \"011\": \"h\",\n  \"10\": \"l\",\n  \"1100\": \"e\",\n  \"1101\": \"r\",\n  \"111\": \"o\"\n}\n```\n\nI'm now looking in this table:\n\n```python\n{\n  \"0000\": (\"w\", 3), \"0001\": (\"w\", 3),\n  \"0010\": (\" \", 3), \"0011\": (\" \", 3),\n  \"0100\": (\"d\", 3), \"0101\": (\"d\", 3),\n  \"0110\": (\"h\", 3), \"0111\": (\"h\", 3),\n  \"1000\": (\"l\", 2), \"1001\": (\"l\", 2), \"1010\": (\"l\", 2), \"1011\": (\"l\", 2),\n  \"1100\": (\"e\", 4),\n  \"1101\": 
(\"r\", 4),\n  \"1110\": (\"o\", 3), \"1111\": (\"o\", 3)\n}\n```\n\nThis allows us to consume a fixed number of bits from the code each time and ensures that the lookup table always returns a match, which contains both the character and the actual number of bits that have to be consumed from the code.\n\n![Traversal vs Fixed Length Lookup Table](../media/traversal_vs_fixed_lookup.png)\n\nBuilding the lookup table takes little time compared with the gain in decoding time.\n\n## TODO\n- implement it with \"real bits\" ;-) \n\n## References\n\n* [https://en.wikipedia.org/wiki/Huffman_coding](https://en.wikipedia.org/wiki/Huffman_coding)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristophevg%2Fhuffman","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchristophevg%2Fhuffman","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristophevg%2Fhuffman/lists"}