{"id":21961747,"url":"https://github.com/daac-tools/find-simdoc","last_synced_at":"2025-06-22T05:07:11.741Z","repository":{"id":59743935,"uuid":"530928159","full_name":"daac-tools/find-simdoc","owner":"daac-tools","description":"Finding all pairs of similar documents time- and memory-efficiently","archived":false,"fork":false,"pushed_at":"2025-03-13T03:24:22.000Z","size":230,"stargazers_count":60,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-03T07:42:11.768Z","etag":null,"topics":["all-pairs","document-search","rust","similarity-search"],"latest_commit_sha":null,"homepage":"https://docs.rs/find-simdoc","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daac-tools.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-08-31T03:56:10.000Z","updated_at":"2025-03-13T03:24:27.000Z","dependencies_parsed_at":"2025-04-23T20:39:59.373Z","dependency_job_id":null,"html_url":"https://github.com/daac-tools/find-simdoc","commit_stats":null,"previous_names":["legalforce-research/find-simdoc"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/daac-tools/find-simdoc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Ffind-simdoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Ffind-simdoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Ffind-simdoc/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Ffind-simdoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daac-tools","download_url":"https://codeload.github.com/daac-tools/find-simdoc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Ffind-simdoc/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261238911,"owners_count":23128882,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["all-pairs","document-search","rust","similarity-search"],"created_at":"2024-11-29T10:17:47.755Z","updated_at":"2025-06-22T05:07:06.720Z","avatar_url":"https://github.com/daac-tools.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# Finding all pairs of similar documents\n\n[![Crates.io](https://img.shields.io/crates/v/find-simdoc)](https://crates.io/crates/find-simdoc)\n[![Documentation](https://docs.rs/find-simdoc/badge.svg)](https://docs.rs/find-simdoc)\n![Build Status](https://github.com/legalforce-research/find-simdoc/actions/workflows/rust.yml/badge.svg)\n\nThis software provides time- and memory-efficient all pairs similarity searches in documents.\n\n## Problem definition\n\n- Input\n  - List of documents $D = (d_1, d_2, \\dots, d_n)$\n  - Distance function $\\delta: D \\times D \\rightarrow [0,1]$\n  - Radius threshold $r \\in [0,1]$\n- Output\n  - All pairs of similar document ids $R = \\\\{ (i,j): i \u003c j, \\delta(d_i, d_j) \\leq r \\\\}$\n\n## Features\n\n### Easy to use\n\nThis software 
supports all essential steps of document similarity search,\nfrom feature extraction to output of similar pairs.\nTherefore, you can immediately try the fast all pairs similarity search using your document files.\n\n### Flexible tokenization\n\nYou can specify any delimiter when splitting words in tokenization for feature extraction.\nThis can be useful in languages where multiple definitions of words exist, such as Japanese or Chinese.\n\n### Time and memory efficiency\n\nThe time and memory complexities are *linear* in the numbers of input documents and output results,\nbased on the ideas behind locality sensitive hashing (LSH) and the [sketch sorting approach](https://proceedings.mlr.press/v13/tabei10a.html).\n\n### Tunable search performance\n\nLSH allows tuning of accuracy, time, and memory through a manual parameter specifying the number of search dimensions.\nYou can flexibly perform searches depending on your dataset and machine environment.\n  - Specifying lower dimensions allows for faster and rougher searches with less memory usage.\n  - Specifying higher dimensions allows for more accurate searches with more memory usage.\n\n### Pure Rust\n\nThis software is implemented in Rust, achieving safe and fast performance.\n\n## Running example\n\nHere, we describe the basic usage of this software through an example of running the CLI tool.\n\nFirst of all, install `rustc` and `cargo` following the [official instructions](https://www.rust-lang.org/tools/install) since this software is implemented in Rust.\n\n### 1. 
Data preparation\n\nYou have to prepare a text file containing documents line by line (NOT including empty lines).\n\nTo produce an example file used throughout this description, you can use `scripts/load_nltk_dataset.py` that downloads the Reuters Corpus provided by NLTK.\nRun the following command.\n\n```\n$ ./scripts/load_nltk_dataset.py reuters\n```\n\n`reuters.txt` will be output.\n\n```\n$ head reuters.txt\nhre properties \u0026 lt ; hre \u003e 1st qtr jan 31 net shr 38 cts vs 47 cts net 2 , 253 , 664 vs 2 , 806 , 820 gross income 5 , 173 , 318 vs 5 , 873 , 904 note : net includes gains on sale of real estate of 126 , 117 dlrs vs 29 , 812 dlrs .\nthe firm , however , is supplying temporary financing , and sources close to the transaction disputed the claim that the firm will not end up paying for its equity position . \nconoco , which has completed geological prospecting for the tunisian government , has transferred one third of its option rights in the region to ina , it said .\n\" willis faber ' s stake in morgan grenfell has been a very successful investment ,\" it said .\nchina reports 700 mln dlr two - month trade deficit china ' s trade deficit totalled 700 mln dlrs in the first two months of this year , according to figures released by the state statistics bureau .\nthe treasury said baker and stoltenberg \" are consulting with their g - 7 colleagues and are confident that this will enable them to foster exchange rate stability around current levels .\"\nu . s . tariffs are due to take effect on april 17 .\nsome dealers said there were growing signs the united states wanted the dollar to fall further .\nsince last august smart has been leading talks to open up japan to purchases of more u . 
s .- made automotive parts .\nthe resulting association will operate under the name of charter and will be based in bristol .\n```\n\nFully-duplicate documents in `reuters.txt` are removed because they are noisy in evaluation of similarity searches.\nAs part of this step, the output lines are shuffled, so your file will not be identical to the example.\n\n### 2. Finding all pairs of similar documents\n\nThe workspace `find-simdoc-cli` provides CLI tools for fast all pairs similarity searches in documents.\nThe approach consists of three steps:\n\n1. Extract features from documents\n   - Set representation of character or word ngrams\n   - Tfidf-weighted vector representation of character or word ngrams\n2. Convert the features into binary sketches through locality sensitive hashing (LSH)\n   - [1-bit minwise hashing](https://dl.acm.org/doi/abs/10.1145/1772690.1772759) for the Jaccard similarity\n   - [Simplified simhash](https://dl.acm.org/doi/10.1145/1242572.1242592) for the Cosine similarity\n3. 
Search for similar sketches in the Hamming space using a modified variant of the [sketch sorting approach](https://proceedings.mlr.press/v13/tabei10a.html)\n\n#### 2.1 Jaccard space\n\nThe executable `jaccard` provides a similarity search in the [Jaccard space](https://en.wikipedia.org/wiki/Jaccard_index).\nYou can check the arguments with the following command.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin jaccard -- --help\n```\n\nRun the following command if you want to search for `reuters.txt` with\n\n- search radius `0.1`,\n- tokens of character `5`-grams, and\n- `15*64=960` dimensions in the Hamming space.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin jaccard -- -i reuters.txt -r 0.1 -w 5 -c 15 \u003e result-jaccard.csv\n```\n\nArgument `-c` indicates the number of dimensions in the Hamming space,\na trade-off parameter between approximation accuracy and search speed.\nThe larger this value, the higher the accuracy, but the longer the search takes.\n[This section](#4-testing-the-accuracy-of-1-bit-minwise-hashing) describes how to examine the approximation accuracy for the number of dimensions.\n\nPairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.\n\n```\n$ head result-jaccard.csv\ni,j,dist\n191,29637,0.07291666666666667\n199,38690,0.0375\n274,10048,0.07083333333333333\n294,27675,0.04791666666666667\n311,13812,0.04583333333333333\n361,50938,0.08958333333333333\n469,6360,0.035416666666666666\n546,10804,0.0875\n690,28281,0.0875\n```\n\n#### 2.2 Cosine space\n\nThe executable `cosine` provides a similarity search in the [Cosine space](https://en.wikipedia.org/wiki/Cosine_similarity).\nYou can check the arguments with the following command.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin cosine -- --help\n```\n\nRun the following command if you want to search for `reuters.txt` with\n\n- search radius `0.1`,\n- tokens of word `3`-grams,\n- word delimiter `\" \"` (i.e., a space),\n- 
`10*64=640` dimensions in the Hamming space, and\n- weighting using the standard TF and the smoothed IDF.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin cosine -- -i reuters.txt -r 0.1 -d \" \" -w 3 -c 10 -T standard -I smooth \u003e result-cosine.csv\n```\n\nPairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.\n\n```\n$ head result-cosine.csv\ni,j,dist\n542,49001,0.084375\n964,24198,0.09375\n1872,3024,0.0859375\n1872,6823,0.090625\n1872,8462,0.0953125\n1872,11402,0.090625\n1872,18511,0.0859375\n1872,41491,0.0875\n1872,48344,0.0859375\n```\n\n### 3. Printing similar documents\n\nThe executable `dump` prints similar documents from an output CSV file.\n\nIf you want to print similar documents in `reuters.txt` with the result `result-jaccard.csv`,\nrun the following command.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin dump -- -i reuters.txt -s result-jaccard.csv\n[i=191,j=29637,dist=0.07291666666666667]\npending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock to create two classes of shares with different voting rights .\npending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock in order to create two classes of shares with different votinmg rights .\n[i=199,j=38690,dist=0.0375]\ngovernment officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment programme and greater emphasis on the preservation of peru ' s export potential .\ngovernment officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment program and greater emphasis on the preservation of peru ' s export potential 
.\n[i=274,j=10048,dist=0.07083333333333333]\nthe measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves on to the senate .\nthe measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves onto the senate .\n[i=294,j=27675,dist=0.04791666666666667]\nthe company said the start - up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .\nthe company said the start up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .\n[i=311,j=13812,dist=0.04583333333333333]\nhe said in an interview with reuter that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .\nhe said in an interview with reuters that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .\n[i=361,j=50938,dist=0.08958333333333333]\nhog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 305 , 000 head versus 307 , 000 week ago and 311 , 000 a year ago .\nhog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 308 , 000 head versus 305 , 000 week ago and 308 , 000 a year ago .\n[i=469,j=6360,dist=0.035416666666666666]\nthe national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one - third of total exports , or about 1 . 
5 billion dlrs .\nthe national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one third of total exports , or about 1 . 5 billion dlrs .\n...\n```\n\n### 4. Testing the accuracy of 1-bit minwise hashing\n\nLSH is an approximate solution, and you may want to know the accuracy.\nThe executable `minhash_acc` allows you to examine\n- the mean absolute error, i.e., the average gap between the normalized Hamming distance and the actual Jaccard distance; and\n- the number of true results, precisions, recalls, and F1-scores for search radii {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.\n\nTo use this executable, we recommend extracting a small subset from your dataset\nbecause it exactly computes distances for all possible pairs (although the computation is accelerated with parallelization).\n\n```\n$ head -5000 reuters.txt \u003e reuters.5k.txt\n```\n\nYou can test the number of Hamming dimensions from 64 to 6400\n(i.e., the number of chunks from 1 to 100 indicated with `-c`)\nwith the following command.\nThe arguments for feature extraction are the same as those of `jaccard`.\n\n```\n$ cargo run --release -p find-simdoc-cli --bin minhash_acc -- -i reuters.5k.txt -w 5 \u003e acc.csv\n```\n\n## Approximation accuracy of 1-bit minwise hashing\n\nLSH is an approximate solution, and the number of dimensions in the Hamming space\n(indicated with the command line argument `-c`) is related to the approximation accuracy.\nAs a hint for choosing the value of `-c`, we show experimental results obtained from `reuters.txt` of 51,535 documents when setting `-w 5`.\n\n### Mean absolute error (MAE)\n\nThe following figure shows MAEs while varying the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with `-c`).\n\n![](./figures/mae_reuters.svg)\n\nAs expected, the larger the number of dimensions, the higher the accuracy. 
For example, when the number of dimensions is 1024 (setting the argument `-c 16`), the MAE is around 2.5%.\n\n### Recall\n\nOf precision, recall, and F1 score, recall is the most interesting.\nThis is because false positives can be filtered out in post-processing.\n\nThe following figure shows recalls in search with radii 0.05, 0.1, and 0.2 (indicated with the argument `-r`).\n\n![](./figures/recall_reuters.svg)\n\nFor radii 0.1 and 0.2, over 90% recalls are achieved in most cases.\nFor the smaller radius 0.05, 75-90% recalls are obtained because the MAE becomes larger relative to the radius.\n\nFor reference, the numbers of true results are 50, 201, and 626 for radii 0.05, 0.1, and 0.2, respectively.\n\n### F1 score\n\nThe following figure shows F1 scores in search with radii 0.05, 0.1, and 0.2 (indicated with the argument `-r`).\n\n![](./figures/f1_reuters.svg)\n\n- For radius 0.05, over 90% scores are achieved from 3520 dimensions (setting `-c 55`).\n- For radius 0.1, over 90% scores are achieved from 704 dimensions (setting `-c 11`).\n- For radius 0.2, over 90% scores are achieved from 448 dimensions (setting `-c 7`).\n\n## Disclaimer\n\nThis software is developed by LegalForce, Inc.,\nbut is not an officially supported LegalForce product.\n\n## License\n\nLicensed under either of\n\n * Apache License, Version 2.0\n   ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n * MIT license\n   ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\nat your option.\n\n## Contribution\n\nUnless you explicitly state otherwise, any contribution intentionally submitted\nfor inclusion in the work by you, as defined in the Apache-2.0 license, shall be\ndual licensed as above, without any additional terms or 
conditions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Ffind-simdoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaac-tools%2Ffind-simdoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Ffind-simdoc/lists"}