{"id":37072150,"url":"https://github.com/nlx-group/overlapy","last_synced_at":"2026-01-14T08:28:08.873Z","repository":{"id":57449948,"uuid":"406493489","full_name":"nlx-group/overlapy","owner":"nlx-group","description":"Python package developed to evaluate textual overlap (N-Grams) between two volumes of text.","archived":false,"fork":false,"pushed_at":"2021-09-23T11:07:19.000Z","size":46,"stargazers_count":9,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-30T16:41:29.367Z","etag":null,"topics":["data-contamination","nlp","textual-analysis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlx-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-14T19:17:27.000Z","updated_at":"2024-10-18T02:08:06.000Z","dependencies_parsed_at":"2022-09-01T19:13:03.888Z","dependency_job_id":null,"html_url":"https://github.com/nlx-group/overlapy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nlx-group/overlapy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlx-group%2Foverlapy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlx-group%2Foverlapy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlx-group%2Foverlapy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlx-group%2Foverlapy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlx-group","download_url":"https://codeload.github.com/nlx-group/
overlapy/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlx-group%2Foverlapy/sbom","scorecard":{"id":690743,"data":{"date":"2025-08-11","repo":{"name":"github.com/nlx-group/overlapy","commit":"b2a7ea472bc6daa32ce3d94c62f2b4efd113e5b0"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/24 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and 
uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not 
fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-22T02:14:54.757Z","repository_id":57449948,"created_at":"2025-08-22T02:14:54.757Z","updated_at":"2025-08-22T02:14:54.757Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414003,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:16:59.381Z","status":"ssl_error","status_checked_at":"2026-01-14T08:13:45.490Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-contamination","nlp","textual-analysis"],"created_at":"2026-01-14T08:28:08.139Z","updated_at":"2026-01-14T08:28:08.854Z","avatar_url":"https://github.com/nlx-group.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"logo-w-text.png\" alt=\"Overlapy Logo\" /\u003e\u003c/p\u003e\n\n--------------------------------------------------------------------------------\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#about\"\u003eAbout\u003c/a\u003e ⚭\n  \u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e ⚭\n  \u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e ⚭\n  \u003ca href=\"#citation\"\u003eCitation\u003c/a\u003e\n\u003c/p\u003e\n\n## About\n\nOverlapy is a Python 
package developed to evaluate textual overlap (N-Grams) between two volumes of text. It grew out of the need to evaluate \"data contamination\" between pre-training datasets for Language Models and testsets of NLP tasks. This problem is becoming relevant: as models become ever larger, rapidly approaching the trillion-parameter mark, they can fit larger pre-training language modelling datasets, which have started to inch closer to the terabyte mark.\n\nThe web is a source of nearly unlimited natural language text, making it one of the favourite sources of unlabelled text. Websites like Reddit (\u003chttps://reddit.com/\u003e) aggregate content and outbound links in inconceivable amounts. However, these resources are not exclusive to the language modelling task, and other tasks use them to construct labelled datasets as well. As web crawlers extend the set of nodes they scrape, the probability of obtaining text that has been used in other tasks grows larger. Because these models can memorize spans of text, specific spans from examples of a task's testset may also be present in the pre-training dataset. The language model could have memorized them, which makes them previously seen data; this is less than ideal, since we want to test our models on unseen (o.o.d.) data. This constitutes a problem for the present and future.\n\nThe methodology followed for this implementation is described in the appendix of the GPT-3 paper (\u003chttps://arxiv.org/abs/2005.14165\u003e). It can be decomposed into three main parts: tokenizing, choosing the N-Gram size, and calculating N-Gram collisions between pre-training datasets and testsets.\n\n1. A token is considered a sequence of alphanumeric characters, delimited by whitespace, and lowercased. In overlapy, the tokenization function is arbitrary (user-defined), and does not need to follow this definition.\n2. N-Gram size is determined to be the 5th percentile of the distribution of testset example lengths. 
The authors set a minimum size of 8 and a maximum size of 13. We follow this definition but allow the user to redefine the percentile, minimum size, and maximum size.\n3. Collisions are calculated by our package using the Aho-Corasick algorithm (\u003chttps://dl.acm.org/doi/10.1145/360825.360855\u003e). The testsets are decomposed into N-Grams. Subsequently, we distribute the pre-training dataset to a pool of workers, which calculate matches between the testset N-Grams and examples from the pre-training dataset.\n\n\n## Installation\n\nThe package was developed to work with Python 3+. Some examples require Python 3.6+ and nltk (\u003chttp://www.nltk.org/\u003e) installed.\n\ntqdm (\u003chttps://github.com/tqdm/tqdm\u003e) is not mandatory but is recommended for tracking progress, especially for jobs with several hundred gigabytes of text.\n\n```bash\npip install overlapy\n```\n\n## Usage\n\nBelow are the contents of a usage example, one of the examples found [here](examples/).\n\n```python\nfrom overlapy import OverlapyTestSet, Overlapy\n\npretraining_dataset = [\n    \"A B A C D E F G\",\n    \"A C F J K H E\",\n    \"V L N M Q\",\n    \"A B A C Ç T Z V E\",\n    \"L M N O P\",\n]\n\ntestset_examples = [\n    \"B A B A C O Q W R\",  # Match A B A C with #1, #4 from pretraining_dataset\n    \"O P Q F J K H\",  # Match F J K H with #2 from pretraining_dataset\n    \"W E R E\",  # No match\n    \"I E T Z V E L\",  # Match T Z V E with #4 from pretraining_dataset\n    \"K E K W\",  # No match\n]\n# Total examples matched: 3\n\n\ndef tokenizer(s):\n    # Simple tokenization by whitespace.\n    return s.split()\n\n\n# We'll override the parameter min_n and set it to 1 as we want the ngram value to be allowed\n# to be less than 8. 
The testset examples were constructed so that it is actually 4.\ntestset = OverlapyTestSet(\n    \"test\", min_n=1, examples=[tokenizer(s) for s in testset_examples]\n)\nprint(f\"N value: {testset.compute_n()}\")\nprint(f\"# NGrams: {len(set(map(tuple, list(testset.ngrams()))))}\")\n\n# We create an Overlapy object, passing three arguments:\n#   * Testsets: A list of OverlapyTestSet objects that we want to study.\n#   * Dataset: Dataset we want to calculate collisions with\n#   * n_workers: Number of worker processes to use\noverlapy = Overlapy(\n    testsets=[testset],\n    dataset=[tokenizer(s) for s in pretraining_dataset],\n    n_workers=2,\n)\n# Let's run and get the matches\nmatches = overlapy.run()\n\n# We should be getting 3 testset examples that have been flagged for matches.\n#    #0 matches on A B A C\n#    #1 matches on F J K H\n#    #3 matches on T Z V E\n# As we had noted above\nprint(f\"Matches: {list(testset.get_matches(matches))}\")\n```\n\n## Citation\n\nA BibTeX citation will be available soon.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlx-group%2Foverlapy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlx-group%2Foverlapy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlx-group%2Foverlapy/lists"}