{"id":16701470,"url":"https://github.com/ferd/simhash","last_synced_at":"2026-03-06T16:03:26.588Z","repository":{"id":4926310,"uuid":"6082896","full_name":"ferd/simhash","owner":"ferd","description":"Simhashing for Erlang -- hashing algorithm to find near-duplicates in binary data.","archived":false,"fork":false,"pushed_at":"2018-02-09T20:04:57.000Z","size":147,"stargazers_count":43,"open_issues_count":0,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-05T13:35:59.341Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Erlang","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"puppetlabs/puppet-docs","license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ferd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-10-04T22:16:05.000Z","updated_at":"2024-07-31T12:58:13.000Z","dependencies_parsed_at":"2022-07-08T02:29:00.969Z","dependency_job_id":null,"html_url":"https://github.com/ferd/simhash","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/ferd/simhash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferd%2Fsimhash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferd%2Fsimhash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferd%2Fsimhash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferd%2Fsimhash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ferd","download_url":"https://codeload.github.com/ferd/simhash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/ferd%2Fsimhash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30184885,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T14:42:24.748Z","status":"ssl_error","status_checked_at":"2026-03-06T14:42:14.925Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T18:44:12.258Z","updated_at":"2026-03-06T16:03:26.556Z","avatar_url":"https://github.com/ferd.png","language":"Erlang","funding_links":[],"categories":[],"sub_categories":[],"readme":"Simhash\n=======\n\nThis module implements simhashing in Erlang.\n\nWhile hash functions such as MD5 or SHA try to produce a unique value\nfor each unique piece of data, they have no way to represent\nhow similar two pieces of data are -- it's not a design concern for these functions,\nand in fact, it is something they usually want to avoid.\nThe same goes for password-hashing functions like bcrypt or scrypt.\n\nSimhashing, on the other hand, tries to provide a signature for some\npiece of data while allowing different signatures to be similar when\nthe data they hash is similar.\n\nSimhashes are therefore useful for detecting duplicates or near-\nduplicates among different pieces of data, by measuring\nthe distance between two given hashes.\n\nFor more resources on simhashing, you may read the 
following:\n\n- http://matpalm.com/resemblance/simhash/\n- http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf\n- http://irl.cs.tamu.edu/people/sadhan/papers/cikm2011.pdf\n\nHow To Build\n------------\nThe module can be compiled using `./rebar compile` or `make`.\n\nBy default, the simhash library will use MD5 as the function to\nhash the shingles made from the binary structure. It is the second most\naccurate one, but also the second slowest one, striking a decent balance.\n\nBy passing macros, other hashing algorithms can be used:\n- `PHASH` for Erlang's `phash2` (32 bits, fastest, least accurate)\n- `MD5` for MD5 (default) (128 bits, slower, more accurate)\n- `SHA` for SHA-160 (slowest, most accurate).\n\nIf you want to use SHA-160 or phash2 hashing by default, it is recommended you\nprovide the macros in your own `rebar` config or whatever other\ntool that lets you declare them when compiling (`{d,'SHA'}` for\nexample).\n\nTo run tests, call `make test`.\n\nHow To Use It\n-------------\n\nTo hash a binary:\n\n    1\u003e VoiceHash = simhash:hash(\u003c\u003c\"My voice is my password.\"\u003e\u003e).\n    \u003c\u003c194,5,119,237,104,38,63,181,151,39,73,226,19,230,140,89,\n      33,12,178,125\u003e\u003e\n\nTo hash any other Erlang term:\n\n    2\u003e PidHash = simhash:hash(term_to_binary(self())).\n    \u003c\u003c128,255,187,43,142,160,234,204,110,124,209,236,156,227,\n      43,35,236,151,89,57\u003e\u003e\n\nYou can then find the distance between these as follows:\n\n    3\u003e simhash:distance(VoiceHash, PidHash).\n    86\n    4\u003e simhash:distance(simhash:hash(\u003c\u003c\"My voice is my passport.\"\u003e\u003e), VoiceHash).\n    27\n\nThis value is somewhat arbitrary, and can be more useful when you want\nto compare more than two elements to find the closest match:\n\n    5\u003e DB = [{simhash:hash(Txt), Txt}\n    5\u003e      || Txt \u003c- [term_to_binary([a,b,c,d,e,f]),\n    5\u003e                 \u003c\u003c\"a 
b c d e f\"\u003e\u003e, term_to_binary(\"a b c d e f\"),\n    5\u003e                 \u003c\u003c\"My voice is my password.\"\u003e\u003e]].\n    ...\n    6\u003e {Distance1, Hash1} = simhash:closest(\n    6\u003e      simhash:hash(\u003c\u003c\"My voice is my passport.\"\u003e\u003e),\n    6\u003e      [Hash || {Hash,_Txt} \u003c- DB]).\n    ...\n    7\u003e {Distance1, proplists:get_value(Hash1, DB)}.\n    {27, \u003c\u003c\"My voice is my password.\"\u003e\u003e}\n    \n    7\u003e {Distance2, Hash2} = simhash:closest(\n    7\u003e      simhash:hash(\u003c\u003c\"d e f g h i\"\u003e\u003e),\n    7\u003e      [Hash || {Hash,_Txt} \u003c- DB]).\n    ...\n    8\u003e {Distance2, proplists:get_value(Hash2, DB)}.\n    {62, \u003c\u003c\"a b c d e f\"\u003e\u003e}\n    \n    8\u003e {Distance3, Hash3} = simhash:closest(\n    8\u003e      simhash:hash(term_to_binary({a,b,c,d,e,f})),\n    8\u003e      [Hash || {Hash,_Txt} \u003c- DB]).\n    ...\n    9\u003e {Distance3, binary_to_term(proplists:get_value(Hash3, DB))}.\n    {22, [a,b,c,d,e,f]}\n\nThe distance threshold below which you consider two structures\nto be duplicates or near-duplicates is highly\ndependent on the kind (and size) of data you have and the hashing\nalgorithm chosen when compiling.\n\nIf the default shingling mechanism isn't what you need (and it is\nunlikely to be with larger data sets or with particular vocabularies\nyou want to sort by frequency), you can also pass in your own\nweighted features, so that some items are worth more than others:\n\n    10\u003e simhash:distance(\n    10\u003e   simhash:hash([{1,\u003c\u003c\"my\"\u003e\u003e},{1,\u003c\u003c\"car\"\u003e\u003e}, {1,\u003c\u003c\"is\"\u003e\u003e}, {1,\u003c\u003c\"black\"\u003e\u003e}]),\n    10\u003e   simhash:hash([{1,\u003c\u003c\"my\"\u003e\u003e},{1,\u003c\u003c\"car\"\u003e\u003e}, {1,\u003c\u003c\"is\"\u003e\u003e}, {1,\u003c\u003c\"blue\"\u003e\u003e}])).\n    6\n    11\u003e simhash:distance(\n 
   11\u003e   simhash:hash([{1,\u003c\u003c\"my\"\u003e\u003e},{1,\u003c\u003c\"car\"\u003e\u003e}, {1,\u003c\u003c\"is\"\u003e\u003e}, {5,\u003c\u003c\"blue\"\u003e\u003e}]),\n    11\u003e   simhash:hash([{1,\u003c\u003c\"my\"\u003e\u003e},{1,\u003c\u003c\"car\"\u003e\u003e}, {1,\u003c\u003c\"is\"\u003e\u003e}, {5,\u003c\u003c\"black\"\u003e\u003e}])).\n    17\n\nIn the tests above, you can see that giving more weight to the color makes the simhash behave differently for the same original strings.\n\nFinally, it is possible to use the simhash library with your own hash function if you wish to do so. The hash function must accept a binary and return a binary. You will also need to provide an argument stating how many bits are contained in your hashes:\n\n    12\u003e F = fun(X) -\u003e crypto:hash_final(crypto:hash_update(crypto:hash_init(sha512), X)) end.\n    #Fun\u003cerl_eval.6.82930912\u003e\n    13\u003e F(\u003c\u003c\"abc\"\u003e\u003e).\n    \u003c\u003c221,175,53,161,147,...\u003e\u003e\n\nThe function `F` defines a simple way to compute sha512 hashes with the crypto module. It can be used with simhashes as follows:\n\n    14\u003e simhash:hash(\u003c\u003c\"abcdef\"\u003e\u003e, F, 512).\n    \u003c\u003c60,149,116,223,113,...\u003e\u003e\n    15\u003e simhash:hash([{5,\u003c\u003c\"ab\"\u003e\u003e},{2, \u003c\u003c\"cdef\"\u003e\u003e}], F, 512).\n    \u003c\u003c180,232,215,0,38,245,...\u003e\u003e\n\nNotes\n-----\n\nAs of now, this library is rather experimental and hasn't made it\nto production anywhere else. 
Handle with caution.\n\nChangelog\n---------\n\n### 0.3.0: ###\n- MD5 is the default simhashing algorithm, for the accuracy/speed balance\n- Added a way to customize the hashing algorithm at run time.\n- Common Test tests!\n\n### 0.2.0: ###\n- Adding a way to submit a user's own features/shingles with weight.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fferd%2Fsimhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fferd%2Fsimhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fferd%2Fsimhash/lists"}