{"id":13563466,"url":"https://github.com/ChenghaoMou/text-dedup","last_synced_at":"2025-04-03T20:30:56.009Z","repository":{"id":37245633,"uuid":"347428086","full_name":"ChenghaoMou/text-dedup","owner":"ChenghaoMou","description":"All-in-one text de-duplication","archived":false,"fork":false,"pushed_at":"2024-05-21T20:22:11.000Z","size":6154,"stargazers_count":664,"open_issues_count":0,"forks_count":73,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-18T13:14:28.556Z","etag":null,"topics":["data-processing","de-duplication","nlp","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ChenghaoMou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-13T17:00:44.000Z","updated_at":"2025-03-17T09:54:20.000Z","dependencies_parsed_at":"2023-10-13T13:26:13.863Z","dependency_job_id":"9404e748-a6f4-42d3-bd76-fcae5c5b8319","html_url":"https://github.com/ChenghaoMou/text-dedup","commit_stats":null,"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenghaoMou%2Ftext-dedup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenghaoMou%2Ftext-dedup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenghaoMou%2Ftext-dedup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenghaoMou%2Ftext-dedup/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/ChenghaoMou","download_url":"https://codeload.github.com/ChenghaoMou/text-dedup/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247074418,"owners_count":20879248,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-processing","de-duplication","nlp","text-processing"],"created_at":"2024-08-01T13:01:19.591Z","updated_at":"2025-04-03T20:30:55.615Z","avatar_url":"https://github.com/ChenghaoMou.png","language":"Python","funding_links":[],"categories":["Python","nlp","Multi-modal Data"],"sub_categories":[],"readme":"\u003ccenter\u003e\u003cimg src=\"./banner.png\"/ style=\"background-color:white;\"\u003e\u003c/center\u003e\n\n![GitHub](https://img.shields.io/github/license/ChenghaoMou/text-dedup) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/cc66178e49d24908ac1fb2b2dbe4e5b3)](https://www.codacy.com/gh/ChenghaoMou/text-dedup/dashboard?utm_source=github.com\u0026utm_medium=referral\u0026utm_content=ChenghaoMou/text-dedup\u0026utm_campaign=Badge_Grade) [![Codacy Badge](https://app.codacy.com/project/badge/Coverage/cc66178e49d24908ac1fb2b2dbe4e5b3)](https://www.codacy.com/gh/ChenghaoMou/text-dedup/dashboard?utm_source=github.com\u0026utm_medium=referral\u0026utm_content=ChenghaoMou/text-dedup\u0026utm_campaign=Badge_Coverage) [![DOI](https://zenodo.org/badge/347428086.svg)](https://zenodo.org/badge/latestdoi/347428086)\n\n## Installation\n\n```bash\npip install text-dedup\n```\n\nor\n\n```bash\npip install git+https://github.com/ChenghaoMou/text-dedup\n```\n\n## 
Documentation\n\n[GitHub Pages](https://chenghaomou.github.io/text-dedup/)\n\n## Features\n\nThis repository contains a collection of text deduplication scripts that are ready to use or to modify for your needs:\n\n- RETSim/UniSim, an embedding-based near-deduplication method (WIP)\n- MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets\n- 64 or 128 bit SimHash\n- SuffixArray Substring\n- Bloom Filter\n- Exact Hash (document-level, line-level/ccnet)\n\nI also have big plans for the future:\n\n- [ ] Memory benchmark for streaming processing\n- [ ] Inter-dataset deduplication\n- [ ] Rewrite suffix array in Python\n- [ ] A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, [Optimal Densification for Fast and Accurate Minwise Hashing](https://arxiv.org/abs/1703.04664), [Fast Similarity Sketching](https://arxiv.org/abs/1704.04370)\n\nHowever, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you can understand what is at stake when using them. You can use them to bootstrap your own script, or just as a reference.\n\n## Acknowledgements\n\nThis repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in [BigScience (Apache 2.0)](https://github.com/bigscience-workshop) and [BigCode (Apache 2.0)](https://github.com/bigcode-project). There is a [blog post](https://publish.obsidian.md/chenghao/posts/20230220150602) about the journey. 
Feedback is welcome!\n\n- [Datasketch](https://github.com/ekzhu/datasketch) (MIT)\n- [simhash-py](https://github.com/seomoz/simhash-py/tree/master/simhash) and [simhash-cpp](https://github.com/seomoz/simhash-cpp) (MIT)\n- [Deduplicating Training Data Makes Language Models Better](https://github.com/google-research/deduplicate-text-datasets) (Apache 2.0)\n- [Gaoya](https://github.com/serega/gaoya) (MIT)\n\n## Quick Examples\n\n\u003cdetails\u003e\n\n\u003csummary\u003eNative PySpark\u003c/summary\u003e\n\n_MODIFY `text_dedup/minhash_spark.py` FOR YOUR OWN PROJECT AND DATASET FIRST!_\n\nAssuming you have a downloaded dataset (in parquet files) under \"./temp-data\", you can process the files with your local compute by running:\n\n```bash\nexport PYSPARK_PYTHON=\"path to your python with scipy, xxhash, and numpy installed\"\nspark-submit --executor-memory 16g \\\n    --driver-memory 20g \\\n    --executor-cores 3 \\\n    --num-executors 2 \\\n    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \\\n    --conf \"spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties\" \\\n    --conf \"spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties\" \\\n    text_dedup/minhash_spark.py \\\n    --input \"./temp-data\" \\\n    --output \"./temp-output\" \\\n    --column \"text\" \\\n    --threshold 0.7\n```\n\n```\nDEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------\nDEBUG __main__ - Using B=25, R=10\nDEBUG __main__ - Loaded documents: 88803\nDEBUG __main__ - args.input='./temp-data'\nDEBUG __main__ - args.output='./temp-output'\nDEBUG __main__ - args.threshold=0.7\nDEBUG __main__ - args.ngram_size=5\nDEBUG __main__ - args.min_length=5\nDEBUG __main__ - args.num_perm=250\nDEBUG __main__ - args.column='text'\nDEBUG __main__ - id                                                              : bigint\nDEBUG __main__ - text                                             
               : string\nDEBUG __main__ - meta                                                            : struct\u003cwarc_headers:struct\u003cwarc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string\u003e,identification:struct\u003clabel:string,prob:float\u003e,annotations:array\u003cstring\u003e,line_identifications:array\u003cstruct\u003clabel:string,prob:float\u003e\u003e\u003e\nDEBUG __main__ - __id__                                                          : bigint\nDEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------\nDEBUG __main__ - Initial edges: 52102\nDEBUG __main__ - Edges DataFrame: 52102\nDEBUG __main__ - Vertices DataFrame: 50206\nDEBUG __main__ - Assignment DataFrame: 50206\nDEBUG __main__ - Merging records: 88803\nINFO  __main__ - Saving with 1 partitions and 44092 rows each\nDEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------\nDEBUG __main__ - Number of rows before:    88803\nDEBUG __main__ - Number of rows after:     44092\nDEBUG __main__ - Percentage of rows kept:  49.65%\nDEBUG __main__ - Output:                   ./temp-output\nDEBUG __main__ - Time:                     68.80s\nDEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------\n\n```\n\nOr take a look at [bigcode-v2/run.sh](https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/bigcode-v2/run.sh) on how to run the job with GCP DataProc.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003eUniSim (WIP)\u003c/summary\u003e\n\nBased on Google's RETSim model([Github](https://github.com/google/unisim), [Arxiv](https://arxiv.org/abs/2311.17264)), it is an 
embedding-based near-deduplication method.\n\nFor a large dataset, it would require GPU(s) for fast inference.\n\n```bash\npython text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question\n```\n\nOutput:\n\n```\nINFO     Load Dataset                    : 5.56s\nINFO     Index Dataset                   : 8.13s\nINFO     Clustering                      : 8.72s\nINFO     Filtering                       : 0.35s\nINFO     Saving                          : 0.01s\nINFO     Cleaning                        : 0.00s\nINFO     Total                           : 22.77s\nINFO     Before                          : 817\nINFO     After                           : 788\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003eSuffix Array Substring Exact Deduplication\u003c/summary\u003e\n\n```bash\n# input\npython -m text_dedup.suffix_array \\\n    --path \"oscar-corpus/OSCAR-2201\" \\\n    --name \"gl\" \\\n    --split \"train\" \\\n    --cache_dir \"./cache\" \\\n    --output \"output/suffix_array/oscar_gl_dedup\" \\\n    --column \"text\" \\\n    --google_repo_path \"/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets\" \\\n    --use_auth_token true\n\n# output\nINFO     Loading                       : 2.75 seconds\nINFO     Preprocessing                 : 4.78 seconds\nINFO     SuffixArray                   : 98.29 seconds\nINFO     SelfSimilar                   : 4.24 seconds\nINFO     Restore                       : 0.25 seconds\nINFO     Deduplicate                   : 6.23 seconds\nINFO     Saving                        : 8.91 seconds\nINFO     Total                         : 125.45 seconds\nINFO     Before                        : 180332342 bytes (88803)\nINFO     After                         : 97646271 bytes (40404)\n```\n\n\u003c/details\u003e\n\u003cdetails\u003e\n\n\u003csummary\u003eMinHash Near Deduplication\u003c/summary\u003e\n\n```bash\n# input\npython -m 
text_dedup.minhash \\\n  --path \"oscar-corpus/OSCAR-2201\" \\\n  --name \"gl\" \\\n  --split \"train\" \\\n  --cache_dir \"./cache\" \\\n  --output \"output/minhash/oscar_gl_dedup\" \\\n  --column \"text\" \\\n  --batch_size 10000 \\\n  --use_auth_token true\n\n# output\nINFO     Loading                         : 2.62 seconds\nINFO     MinHashing                      : 0.08 seconds\nINFO     Clustering                      : 2.20 seconds\nINFO     Filtering                       : 0.53 seconds\nINFO     Saving                          : 9.86 seconds\nINFO     Total                           : 15.29 seconds\nINFO     Data Number (before)            : 88803\nINFO     Data Number (after)             : 44124 (49.69%)\nINFO     Duplicate Number                : 44679 (50.31%)\nINFO     🤗 Happy Deduplicating 🤗\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eSimHash Near Deduplication\u003c/summary\u003e\n\n```bash\n# input\npython -m text_dedup.simhash \\\n  --path \"oscar-corpus/OSCAR-2201\" \\\n  --name \"gl\" \\\n  --split \"train\" \\\n  --cache_dir \"./cache\" \\\n  --output \"output/simhash/oscar_gl_dedup\" \\\n  --column \"text\" \\\n  --batch_size 10000 \\\n  --use_auth_token true\n\n# output\nINFO     Loading                         : 2.60 seconds\nINFO     SimHashing                      : 0.04 seconds\nINFO     Indexing                        : 28.88 seconds\nINFO     Filtering                       : 0.88 seconds\nINFO     Saving                          : 10.41 seconds\nINFO     Total                           : 42.80 seconds\nINFO     Data Number (before)            : 88803\nINFO     Data Number (after)             : 46163 (51.98%)\nINFO     Duplicate Number                : 42640 (48.02%)\nINFO     🤗 Happy Deduplicating 🤗\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExact Hash Exact Deduplication\u003c/summary\u003e\n\n```bash\n# input\npython -m text_dedup.exact_hash \\\n    --path 
\"oscar-corpus/OSCAR-2201\" \\\n    --name \"gl\" \\\n    --split \"train\" \\\n    --cache_dir \"./cache\" \\\n    --output \"output/exact_hash/oscar_gl_dedup\" \\\n    --column \"text\" \\\n    --batch_size 1000 \\\n    --use_auth_token true\n\n# output\nINFO     Loading                       : 2.95s\nINFO     Processing                    : 3.79s\nINFO     Filtering                     : 0.10s\nINFO     Saving                        : 2.89s\nINFO     Total                         : 9.72s\nINFO     Before                        : 88803\nINFO     After                         : 47049\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eBloom Filter Exact Deduplication\u003c/summary\u003e\n\n```bash\n# input\npython -m text_dedup.bloom_filter \\\n    --path \"oscar-corpus/OSCAR-2201\" \\\n    --name \"gl\" \\\n    --split \"train\" \\\n    --cache_dir \"./cache\" \\\n    --output \"output/bloom_filter/oscar_gl_dedup\" \\\n    --error_rate 1e-5 \\\n    --column \"text\" \\\n    --use_auth_token true    --batch_size 1000\n\n# output\nINFO     Loading                       : 2.72s\nINFO     Processing                    : 4.84s\nINFO     Filtering                     : 0.10s\nINFO     Saving                        : 2.88s\nINFO     Total                         : 10.54s\nINFO     Before                        : 88803\nINFO     After                         : 47045\n```\n\n\u003c/details\u003e\n\n## Benchmarks\n\n\u003e [!note]\n\u003e Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.\n\n\u003cdetails\u003e\n\u003csummary\u003epinecone/core-2020-05-10-deduplication\u003c/summary\u003e\n\nSee `tests/benchmark_core.py` for reproduction.\n\n| Algorithm                       | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score |  Accuracy | Time     |\n| :------------------------------ 
| ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | --------: | :------- |\n| UniSim                          |                 0.9307 |              0.8924 |                     0.9055 |                  0.9394 |         0.9181 |    0.9054 | 1305.79s |\n| MinHash Spark                   |                  0.957 |              0.9445 |                     0.9471 |                   0.959 |          0.952 |    0.9202 | 691.77s  |\n| MinHash                         |                 0.9594 |              0.9445 |                     0.9474 |                  0.9616 |     **0.9534** |     0.924 | 18.88s   |\n| SimHash                         |                 0.9042 |               0.721 |                      0.792 |                  0.9329 |         0.8481 |    0.8321 | 644.36s  |\n| Exact Title                     |                 0.8302 |              0.5521 |                     0.7098 |                  0.9065 |           0.77 |    0.7456 | -        |\n| Exact Title Matching [^1]       |                  0.830 |                0.50 |                      0.709 |                   0.992 |          0.757 |     0.746 | -        |\n| Simhash Matching [^1]           |                  0.697 |               0.247 |                      0.598 |                   0.985 |          0.631 |     0.616 | -        |\n| Document Vector Similarity [^1] |                  0.912 |               0.779 |                      0.861 |                   0.986 |          0.885 |     0.883 | -        |\n| Hybrid Method [^1]              |                  0.908 |               0.828 |                      0.899 |                   0.979 |          0.904 |     0.903 | -        |\n| LaBSE[^2]                       |                  0.937 |               0.923 |                      0.930 |                   0.943 |          0.933 |     0.919 | -        |\n| Multilingual USE[^2]            |                  0.917 |     
          0.907 |                      0.918 |                   0.927 |          0.917 |     0.909 | -        |\n| Multilingual E5-Base[^2]        |                  0.931 |               0.908 |                      0.919 |                   0.939 |          0.924 |     0.920 | -        |\n| MinHash + LSH[^2]               |                  0.929 |               0.902 |                      0.915 |                   0.938 |          0.921 |     0.918 | -        |\n| RETSim Partial-Dup[^2]          |                  0.945 |               0.941 |                      0.945 |                   0.949 |          0.945 | **0.928** | -        |\n| RETSim Near-Dup[^2]             |                  0.928 |               0.937 |                      0.942 |                   0.934 |          0.935 | **0.926** | -        |\n\n\u003c/details\u003e\n\u003cdetails\u003e\n\u003csummary\u003eNEWS-COPY\u003c/summary\u003e\n\nSee `tests/benchmark_news.py` for reproduction.\n\nAdjusted Rand Index (ARI) on NEWS-COPY dataset:\n\n| Model/Algorithm          | ARI       |\n| :----------------------- | :-------- |\n| SimHash                  | 0.612     |\n| MinHash (Spark)          | 0.740     |\n| MinHash                  | 0.742     |\n| RETSim Near-Dup + ANN\\*  | _0.051_   |\n| n-gram [^3]              | 0.440     |\n| SimHash[^2]              | 0.695     |\n| MinHash[^3]              | 0.737     |\n| MinHash[^2]              | 0.783     |\n| Multilingual USE[^2]     | 0.730     |\n| Multilingual E5-Base[^2] | 0.742     |\n| S-BERT[^3]               | 0.700     |\n| RETSim Partial-Dup[^2]   | 0.831     |\n| RETSim Near-Dup[^2]      | 0.704     |\n| Re-ranking [^3]          | **0.937** |\n| Bi-encoder [^3]          | 0.915     |\n\n\\*: I can't seem to reproduce the results from the paper.\n\n[^1]: [Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings](https://aclanthology.org/2020.lrec-1.113)\n[^2]: [RETSim: Resilient and Efficient Text 
Similarity](https://arxiv.org/abs/2311.17264)\n[^3]: [Noise-Robust De-Duplication at Scale](https://www.semanticscholar.org/paper/Noise-Robust-De-Duplication-at-Scale-Silcock-D'Amico-Wong/7ca41cc5fc364b713aba5b573ae4ada801fd788a)\n\n\u003c/details\u003e\n\n\u003c!-- ## FAQ\n\n### Why use scripts instead of object-oriented classes and functions?\n\nEarly versions of the code used an object-oriented design for hashing and indexing, which was very difficult because the different methods share little to no abstraction. To compile something useful, a lot of wrapper code was needed, and that actually increased the overhead of using this library. Additionally, deduplication is often a one-time step in a data preprocessing pipeline, so there isn't really a need for inline access. --\u003e\n\n\u003c!-- ### Why the license change?\n\nBecause the Google repo is licensed under Apache 2.0, I had to switch from MIT. Until that part of the code is completely re-implemented, Apache 2.0 will be the license I use. --\u003e\n\n## License\n\n[Apache 2.0](https://duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.apache.org%2Flicenses%2FLICENSE%2D2.0.html\u0026rut=617d395c7a807de85e5707aca1f765e5b69a1627ed84c0aefa950e54e00a3094)\n\n## Citations\n\nGenerally, you can cite this repository as:\n\n```bibtex\n@software{chenghao_mou_2023_8364980,\n  author       = {Chenghao Mou and\n                  Chris Ha and\n                  Kenneth Enevoldsen and\n                  Peiyuan Liu},\n  title        = {ChenghaoMou/text-dedup: Reference Snapshot},\n  month        = sep,\n  year         = 2023,\n  publisher    = {Zenodo},\n  version      = {2023.09.20},\n  doi          = {10.5281/zenodo.8364980},\n  url          = {https://doi.org/10.5281/zenodo.8364980}\n}\n```\n\nThe Spark version was born from [BigCode (Apache 2.0)](https://github.com/bigcode-project) and [BigScience (Apache 2.0)](https://github.com/bigscience-workshop), and you can cite the original paper if you 
want:\n\n```bibtex\n@article{\nkocetkov2023the,\ntitle={The Stack: 3 {TB} of permissively licensed source code},\nauthor={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},\njournal={Transactions on Machine Learning Research},\nissn={2835-8856},\nyear={2023},\nurl={https://openreview.net/forum?id=pxpbTdUEpD},\nnote={}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenghaoMou%2Ftext-dedup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FChenghaoMou%2Ftext-dedup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenghaoMou%2Ftext-dedup/lists"}