{"id":15579010,"url":"https://github.com/dapper91/schindel","last_synced_at":"2025-08-09T06:15:42.719Z","repository":{"id":43165433,"uuid":"453174197","full_name":"dapper91/schindel","owner":"dapper91","description":"Rust min-shingle hashing implementation","archived":false,"fork":false,"pushed_at":"2022-08-27T08:46:03.000Z","size":8,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T02:35:40.038Z","etag":null,"topics":["fuzzy-matching","fuzzy-search","minshingle","ngrams","rust","shingles"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dapper91.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-28T18:24:59.000Z","updated_at":"2023-02-27T14:09:32.000Z","dependencies_parsed_at":"2022-09-26T20:31:45.218Z","dependency_job_id":null,"html_url":"https://github.com/dapper91/schindel","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dapper91%2Fschindel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dapper91%2Fschindel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dapper91%2Fschindel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dapper91%2Fschindel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dapper91","download_url":"https://codeload.github.com/dapper91/schindel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":250543107,"owners_count":21447835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fuzzy-matching","fuzzy-search","minshingle","ngrams","rust","shingles"],"created_at":"2024-10-02T19:13:35.422Z","updated_at":"2025-04-24T01:27:24.569Z","avatar_url":"https://github.com/dapper91.png","language":"Rust","readme":"[![Crates.io][crates-badge]][crates-url]\n[![License][licence-badge]][licence-url]\n[![Test Status][test-badge]][test-url]\n[![Documentation][doc-badge]][doc-url]\n\n[crates-badge]: https://img.shields.io/crates/v/schindel.svg\n[crates-url]: https://crates.io/crates/schindel\n[licence-badge]: https://img.shields.io/badge/license-Unlicense-blue.svg\n[licence-url]: https://github.com/dapper91/schindel/blob/master/LICENSE\n[test-badge]: https://github.com/dapper91/schindel/actions/workflows/test.yml/badge.svg?branch=master\n[test-url]: https://github.com/dapper91/schindel/actions/workflows/test.yml\n[doc-badge]: https://docs.rs/schindel/badge.svg\n[doc-url]: https://docs.rs/schindel\n\n\n# Rust min-shingle hashing implementation\n\nThis crate implements a simple min-shingle hashing algorithm.\nFor more information see [W-shingling](https://en.wikipedia.org/wiki/W-shingling).\n\n\n# Algorithm\n\nA shingle hash (or w-shingle) is a set of n-grams, each of which is composed of contiguous tokens within an input sequence,\nwith successive n-grams shifted by one element. 
For example, the document: \n\n`to be or not to be that is the question`\n\nhas the following set of 2-grams (shingles):\n\n`(to, be)`, `(be, or)`, `(or, not)`, `(not, to)`, `(be, that)`, `(that, is)`, `(is, the)`, `(the, question)`\n\n*Note*: the 2-gram `(to, be)` occurs twice in the document but appears only once in the set.\n\nThe 2-gram set is the document's shingle hash.\nThat hash can be used to measure the resemblance of two documents using the Jaccard coefficient:\n\n`R(doc1, doc2) = |H(doc1) ⋂ H(doc2)| / |H(doc1) ⋃ H(doc2)|`\n\nwhere:\n- `R` - resemblance\n- `H` - shingle hash\n- `|S|` - number of elements in the set `S`\n\nThis algorithm does not scale to large documents because the n-gram set can grow very fast.\nFor example, if 3-grams are used and the input sequence alphabet has 255 symbols, the set can contain up to\n`255 ^ 3`, or `~16.6 * 10 ^ 6`, elements in the worst case, which consumes a lot of memory.\n\nTo resolve that problem, the min-shingle algorithm is used. It exploits a special optimisation technique:\ninstead of storing all n-grams of the sequence, a hash of each n-gram is calculated and only the minimal hash value is saved.\nBecause the minimal value of a data stream can be calculated on the fly (without saving all the values),\nmemory consumption is drastically reduced. Repeating that process with several hash functions\n(or several hash function seeds) produces the min-shingle hash.\nLike the shingle hash, the min-shingle hash can be used to measure distance (or resemblance) between documents.\n\n# Basic example\n\nAdd the `schindel` dependency to `Cargo.toml`:\n\n```toml\n[dependencies]\nschindel = \"^0.1.0\"\n```\n\nAdd the following code to your `main.rs`:\n\n``` rust\nuse schindel::shingles::{MinShingleHash, Murmur3Hasher};\n\nfn main() {\n    let original = \"\\\n        “My sight is failing,” she said finally. “Even when I was young I could not have read what was written there. \\\n        But it appears to me that that wall looks different. 
Are the Seven Commandments the same as they used to be, \\\n        Benjamin?” For once Benjamin consented to break his rule, and he read out to her what was written on the wall. \\\n        There was nothing there now except a single Commandment. It ran:\\\n        ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS\";\n\n    let plagiarism = \"\\\n        “My sight is failing,” she said finally. “When I was young I could not have read what was written there. \\\n        But it appears to me that that wall looks different. Are the Seven Commandments the same as they used to be” \\\n        Benjamin read out to her what was written. There was nothing there now except a single Commandment. \\\n        It ran: ALL ANIMALS ARE EQUAL BUT SOME ANIMALS ARE MORE EQUAL THAN OTHERS\";\n\n    let other = \"\\\n        Throughout the spring and summer they worked a sixty-hour week, and in August Napoleon announced that there \\\n        would be work on Sunday afternoons as well. This work was strictly voluntary, but any animal who absented \\\n        himself from it would have his rations reduced by half. Even so, it was found necessary to leave certain \\\n        tasks undone. The harvest was a little less successful than in the previous year, and two fields which \\\n        should have been sown with roots in the early summer were not sown because the ploughing had not been \\\n        completed early enough. 
It was possible to foresee that the coming winter would be a hard one.\";\n\n    const HASH_LEN: usize = 100;\n    const NGRAM_LEN: usize = 5;\n\n    let original_hash = MinShingleHash::\u003cMurmur3Hasher, HASH_LEN, NGRAM_LEN\u003e::new(original.chars());\n\n    let plagiarism_hash = MinShingleHash::\u003cMurmur3Hasher, HASH_LEN, NGRAM_LEN\u003e::new(plagiarism.chars());\n    println!(\"plagiarism similarity: {}\", original_hash.compare(\u0026plagiarism_hash));\n\n    let other_hash = MinShingleHash::\u003cMurmur3Hasher, HASH_LEN, NGRAM_LEN\u003e::new(other.chars());\n    println!(\"other text similarity: {}\", original_hash.compare(\u0026other_hash));\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdapper91%2Fschindel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdapper91%2Fschindel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdapper91%2Fschindel/lists"}