{"id":13670429,"url":"https://github.com/luispedro/diskhash","last_synced_at":"2025-04-09T10:05:05.215Z","repository":{"id":48919876,"uuid":"93728566","full_name":"luispedro/diskhash","owner":"luispedro","description":"Diskbased (persistent) hashtable","archived":false,"fork":false,"pushed_at":"2024-10-01T13:06:00.000Z","size":101,"stargazers_count":161,"open_issues_count":4,"forks_count":28,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-02T08:01:41.148Z","etag":null,"topics":["c","c-plus-plus","hashtable","haskell","persistence","python"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luispedro.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog","contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-08T08:59:50.000Z","updated_at":"2025-03-18T01:36:03.000Z","dependencies_parsed_at":"2024-10-26T20:47:16.441Z","dependency_job_id":"1c499603-8197-4b6c-bb8d-5aa77593b3b4","html_url":"https://github.com/luispedro/diskhash","commit_stats":{"total_commits":87,"total_committers":4,"mean_commits":21.75,"dds":0.03448275862068961,"last_synced_commit":"883f281b9f5b836901e011ec319cb297539e5778"},"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fdiskhash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fdiskhash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fdiskhash/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fdiskhash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luispedro","download_url":"https://codeload.github.com/luispedro/diskhash/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248018060,"owners_count":21034048,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","c-plus-plus","hashtable","haskell","persistence","python"],"created_at":"2024-08-02T09:00:42.268Z","updated_at":"2025-04-09T10:05:05.188Z","avatar_url":"https://github.com/luispedro.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"# Disk-based hashtable\n\n[![Build \u0026 test (Haskell)](https://github.com/luispedro/diskhash/actions/workflows/build_haskell_w_nix.yml/badge.svg)](https://github.com/luispedro/diskhash/actions/workflows/build_haskell_w_nix.yml)\n[![Build \u0026 test (Python)](https://github.com/luispedro/diskhash/actions/workflows/build_python_w_nix.yml/badge.svg)](https://github.com/luispedro/diskhash/actions/workflows/build_python_w_nix.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n\nA simple disk-based hash table (i.e., persistent hash table).\n\nIt is a hashtable implemented on memory-mapped disk, so that it can be loaded\nwith a single `mmap()` system call and used in memory directly (being as fast\nas an in-memory hashtable once it is loaded from disk).\n\nThe code is in C, wrappers are provided for Python, Haskell, and C++. 
The\nwrappers follow similar APIs with variations to accommodate each language's\nspecifics. They all use the same underlying code, so you can open a hashtable\ncreated in C from Haskell, modify it within your Haskell code, and later open\nthe result in Python.\n\nCross-language functionality will only work for simple types where you can\ncontrol their binary representation (64-bit integers, for example).\n\nReading does not touch the disk representation at all and, thus, can be done on\ntop of read-only files or using multiple threads (and different processes will\nshare the memory: the operating system does that for you). Writing or modifying\nvalues is, however, not thread-safe.\n\n## Examples\n\nThe following examples all create a hashtable to store longs (`int64_t`), then\nset the value associated with the key `\"key\"` to 9. In the current API, the\nmaximum size of the keys needs to be pre-specified, which is the value `15`\nbelow.\n\n### Raw C\n\n```c\n#include \u003cstdio.h\u003e\n#include \u003cinttypes.h\u003e\n#include \u003cfcntl.h\u003e  /* O_RDWR, O_CREAT */\n#include \"diskhash.h\"\n\nint main(void) {\n    HashTableOpts opts;\n    opts.key_maxlen = 15;\n    opts.object_datalen = sizeof(int64_t);\n    char* err = NULL;\n    HashTable* ht = dht_open(\"testing.dht\", opts, O_RDWR|O_CREAT, \u0026err);\n    if (!ht) {\n        if (!err) err = \"Unknown error\";\n        fprintf(stderr, \"Failed opening hash table: %s.\\n\", err);\n        return 1;\n    }\n    /* Use int64_t so the stored value matches object_datalen on every platform */\n    int64_t i = 9;\n    dht_insert(ht, \"key\", \u0026i);\n\n    int64_t* val = (int64_t*) dht_lookup(ht, \"key\");\n    printf(\"Looked up value: %\" PRId64 \"\\n\", *val);\n\n    dht_free(ht);\n    return 0;\n}\n```\n\nThe C API relies on error codes and error strings (the `\u0026err` argument above).\nThe header file has [decent\ndocumentation](https://github.com/luispedro/diskhash/blob/master/src/diskhash.h).\n\n### Haskell\n\nIn Haskell, you have different types/functions for read-write and read-only\nhashtables. 
Read-write operations are `IO` operations; read-only hashtables are\npure.\n\nRead-write example:\n\n```haskell\nimport Data.DiskHash\nimport Data.Int\nmain = do\n    ht \u003c- htOpenRW \"testing.dht\" 15\n    htInsertRW ht \"key\" (9 :: Int64)\n    val \u003c- htLookupRW \"key\" ht\n    print val\n```\n\nRead-only example (`htLookupRO` is pure in this case):\n\n```haskell\nimport Data.DiskHash\nimport Data.Int\nmain = do\n    ht \u003c- htOpenRO \"testing.dht\" 15\n    let val :: Int64\n        val = htLookupRO \"key\" ht\n    print val\n```\n\n\n### Python\n\nPython's interface is based on the [struct\nmodule](https://docs.python.org/3/library/struct.html). For example, `'ll'`\nrefers to a pair of 64-bit ints (_longs_):\n\n```python\nimport diskhash\n\ntb = diskhash.StructHash(\n    fname=\"testing.dht\",\n    keysize=15,\n    structformat='ll',  # store pairs of longs\n    mode='rw',\n)\nvalue = [1, 2]  # pair of longs\ntb.insert(\"key\", *value)\nprint(tb.lookup(\"key\"))\n```\n\nThe Python interface is currently Python 3 only. Patches to extend it to 2.7\nare welcome, but it's not a priority.\n\n\n### C++\n\nIn C++, a simple wrapper is defined, which provides a modicum of type-safety.\nYou use the `DiskHash\u003cT\u003e` template. Additionally, errors are reported through\nexceptions (both `std::bad_alloc` and `std::runtime_error` can be thrown)\nrather than return codes.\n\n```c++\n#include \u003ccstdint\u003e\n#include \u003ciostream\u003e\n#include \u003cstring\u003e\n\n#include \u003cdiskhash.hpp\u003e\n\nint main() {\n    const int key_maxlen = 15;\n    dht::DiskHash\u003cuint64_t\u003e ht(\"testing.dht\", key_maxlen, dht::DHOpenRW);\n    std::string line;\n    uint64_t ix = 0;\n    while (std::getline(std::cin, line)) {\n        if (line.length() \u003e key_maxlen) {\n            std::cerr \u003c\u003c \"Key too long: '\" \u003c\u003c line \u003c\u003c \"'. 
Aborting.\\n\";\n            return 2;\n        }\n        const bool inserted = ht.insert(line.c_str(), ix);\n        if (!inserted) {\n            std::cerr \u003c\u003c \"Found repeated key '\" \u003c\u003c line \u003c\u003c \"' (ignored).\\n\";\n        }\n        ++ix;\n    }\n    return 0;\n}\n```\n\n## Stability\n\nThis is _beta_ software. It is good enough that I am using it, but the API can\nchange in the future with little warning. The binary format is versioned (the\nmagic string encodes its version), so changes can be detected and you will get\nan error message in the future rather than some silent misbehaviour.\n\n[Automated unit testing](https://travis-ci.com/luispedro/diskhash) ensures that\nbasic mistakes will not go uncaught.\n\n## Limitations\n\n- You must specify the maximum key size. This can be worked around either by\n  pre-hashing the keys (with a strong hash) or using multiple hash tables for\n  different key sizes. Neither is currently implemented in diskhash.\n\n- You cannot delete objects. This was not a necessity for my uses, so it was\n  not implemented. A simple implementation could be done by marking objects as\n  \"deleted\" in place and recompacting when the hash table size changes or with\n  an explicit `dht_gc()` call. It may also be important to add functionality to\n  shrink hashtables so as not to waste disk space.\n\n- The algorithm is a rather naïve implementation of linear addressing. It would\n  not be hard to switch to [robin hood\n  hashing](https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/)\n  and this may indeed happen in the near future.\n\nLicense: MIT\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluispedro%2Fdiskhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluispedro%2Fdiskhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluispedro%2Fdiskhash/lists"}