{"id":22911991,"url":"https://github.com/opencoff/go-mph","last_synced_at":"2025-04-04T18:24:22.369Z","repository":{"id":41371149,"uuid":"371509548","full_name":"opencoff/go-mph","owner":"opencoff","description":"Minimal perfect hash functions in go-lang","archived":false,"fork":false,"pushed_at":"2025-04-01T03:09:22.000Z","size":1632,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-01T04:22:54.311Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opencoff.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-27T21:34:42.000Z","updated_at":"2025-04-01T03:09:26.000Z","dependencies_parsed_at":"2024-04-10T02:52:44.147Z","dependency_job_id":"2fd11061-390e-4c27-bf88-cbefcd5af122","html_url":"https://github.com/opencoff/go-mph","commit_stats":{"total_commits":4,"total_committers":2,"mean_commits":2.0,"dds":0.25,"last_synced_commit":"e0d51b83a49fbe063b519dda8e885ed70ad1b028"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencoff%2Fgo-mph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencoff%2Fgo-mph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencoff%2Fgo-mph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencoff%2Fgo-mph/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/opencoff","download_url":"https://codeload.github.com/opencoff/go-mph/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247227253,"owners_count":20904656,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-14T04:19:29.412Z","updated_at":"2025-04-04T18:24:22.362Z","avatar_url":"https://github.com/opencoff.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![GoDoc](https://godoc.org/github.com/opencoff/go-mph?status.svg)](https://godoc.org/github.com/opencoff/go-mph)\n[![Go Report Card](https://goreportcard.com/badge/github.com/opencoff/go-mph)](https://goreportcard.com/report/github.com/opencoff/go-mph)\n\n# go-mph - Minimal Perfect Hash Functions with persistence\n\n## What is it?\nA library to create, query and serialize/de-serialize minimal perfect hash functions (\"MPHFs\").\nThere are two implementations of MPHFs for large data sets:\n\n1. [CHD](http://cmph.sourceforge.net/papers/esa09.pdf) -\n   inspired by this [gist](https://gist.github.com/pervognsen/b21f6dd13f4bcb4ff2123f0d78fcfd17).\n\n2. [BBHash](https://arxiv.org/abs/1702.03154). It is in part inspired by\n   Damian Gryski's [Boomphf](https://github.com/dgryski/go-boomphf)\n\nOne can construct an on-disk constant-time lookup DB using `go-mph` and\none of the MPHFs.  Such a DB is useful in situations\nwhere the key/value pairs are NOT changed frequently; i.e.,\nread-dominant workloads. 
The typical pattern in such situations is\nto build the constant-DB _once_ for efficient retrieval and do\nlookups multiple times.\n\n*NOTE* Minimal Perfect Hash functions take a fixed input and\ngenerate a mapping to look up the items in constant time. In\nparticular, they are NOT a replacement for a traditional hash-table;\ni.e., an MPHF may yield false positives when queried using keys not\npresent during construction. In concrete terms:\n\n   Let S = {k0, k1, ... kn}  be your input key set.\n\n   If H: S -\u003e {0, .. n} is a minimal perfect hash function, then\n   H(kx) for kx NOT in S may yield an integer result (indicating\n   that kx was successfully \"looked up\").\n\nThe way one deals with this is to compare the actual key stored\nat that index against the queried key. `DBReader`'s `Find()` method demonstrates how\nthis is done.\n\n`go-mph` uses a cryptographically strong checksum on the entire MPH DB *metadata*.\nAdditionally, it uses siphash-2-4 checksums on each individual key-val \nrecord. This siphash checksum is verified opportunistically when keys\nare looked up in the MPH DB.  The DB reader uses\nan in-memory cache for speeding up lookups.\n\n\n\n## How do I use it?\nLike any other golang library: `go get github.com/opencoff/go-mph`.\nThe library exposes the following types:\n\n* `DBWriter`: Used to construct a constant database of key-value\n  pairs - where the lookup of a given key is done in constant time\n  using CHD or BBHash. This type can be created by one of two\n  functions: `NewChdDBWriter()` or `NewBBHashDBWriter()`.\n\n  Once created, you add keys \u0026 values to it via the `Add()` method.\n  After all the entries are added, you freeze the database by\n  calling the `Freeze()` method.\n\n  `DBWriter` optimizes the database if there are no values present -\n  i.e., keys-only. This optimization significantly reduces the\n  file-size.\n\n* `DBReader`: Used to read a pre-constructed perfect-hash database and\n  use it for constant-time lookups. 
The `DBReader` type comes with its\n  own key/val cache to reduce disk accesses. The number of cache\n  entries is configurable.\n\n  After initializing the DB, key lookups are done primarily with the\n  `Find()` method. A convenience method `Lookup()` elides errors and\n  only returns the value and a boolean.\n\nFirst, let's run the tests to make sure `go-mph` is working:\n\n```sh\n\n  $ git clone https://github.com/opencoff/go-mph\n  $ cd go-mph\n  $ go test .\n\n```\n\n## Example Program\nThere is a working example of the `DBWriter` and `DBReader` APIs in \nthe `example/` subdirectory. This example demonstrates the following\nfunctionality:\n\n- add one or more space delimited key/value files (first field is key, second\n  field is value)\n- add one or more CSV files (first field is key, second field is value)\n- Write the resulting MPH DB to disk\n- Read the DB and verify its integrity\n- Dump the contents of the DB or the DB \"meta data\"\n\nNow, let's build and run the example program:\n```sh\n\n  $ make\n  $ go build -o mphdb ./example\n  $ ./mphdb -V make foo.db -t txt chd /usr/share/dict/words\n  $ ./mphdb -V fsck foo.db\n  $ ./mphdb -V dump -m foo.db\n  $ ./mphdb -V dump -a foo.db\n```\n\nThe example above stores the words in the system dictionary into\na fast-lookup table using the CHD algorithm. `mphdb -h` prints usage information describing what\nelse you can do with the example program.\n\nThere is a helper Python script to generate a very large text file of\nhostnames and IP addresses: `genhosts.py`. 
You can run it like so:\n\n```sh\n\n  $ python ./example/genhosts.py 192.168.0.0/16 \u003e a.txt\n```\n\nThe above example generates 65535 hostnames and corresponding IP addresses; each of the\nIP addresses is sequentially drawn from the given subnet.\n\n**NOTE** If you use a \"/8\" subnet mask you will generate a _lot_ of data (~430MB in size).\n\nOnce you have the input generated, you can feed it to the `example` program above to generate\nan MPH DB:\n```sh\n\n  $ ./mphdb make foo.db chd a.txt\n  $ ./mphdb fsck foo.db\n```\n\nIt is possible that \"mphdb\" fails to construct a DB after trying 1,000,000 times. In that case,\ntry lowering the \"load\" factor (default is 0.85).\n\n```sh\n\n  $ ./mphdb make -l 0.75 foo.db chd a.txt\n```\n\nThe example program in `example/` has helper routines to add entries from a\ntext or CSV file: see `example/text.go`. In fact, it is a more-or-less complete\ndemonstration of the MPH library API.\n\n## Implementation Notes\n\n* *bbhash.go*: Main implementation of the BBHash algorithm. This\n  file implements the `MPHBuilder` and `MPH` interfaces (defined in\n  *mph.go*).\n\n* *bbhash_marshal.go*: Marshaling/Unmarshaling BBHash MPHF tables.\n\n* *bitvector.go*: thread-safe bitvector implementation including a\n  simple rank algorithm.\n\n* *chd.go*: The main implementation of the CHD algorithm. This\n  file implements the `MPHBuilder` and `MPH` interfaces (defined in\n  *mph.go*).\n\n* *chd_marshal.go*: Marshaling/Unmarshaling CHD MPHF tables.\n\n* *dbreader.go*: Provides a constant-time lookup of a previously\n  constructed MPH DB. DB reads use `mmap(2)` for reading the MPH\n  metadata.  For little-endian architectures, there is no data\n  \"parsing\" of the lookup tables, offset tables etc. They are \n  interpreted in-situ from the mmap'd data. To keep the code\n  generic, every multi-byte int is converted to little-endian order\n  before use. 
These conversion routines are in *endian_XX.go*.\n\n* *dbwriter.go*: Create a read-only, constant-time MPH lookup DB. It \n  can store arbitrary byte stream \"values\" - each of which is\n  identified by a unique `uint64` key. The DB structure is optimized\n  for reading on the most common architectures - little-endian:\n  amd64, arm64 etc.\n\n* *slices.go*: Non-copying type conversion to/from byte-slices to\n  uints of different widths.\n\n* *utils.go*: Random number utils and other bits\n\n## License\nGPL v2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencoff%2Fgo-mph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopencoff%2Fgo-mph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencoff%2Fgo-mph/lists"}