{"id":19031415,"url":"https://github.com/plandes/spanmatch","last_synced_at":"2026-02-10T20:02:38.470Z","repository":{"id":177736382,"uuid":"652239919","full_name":"plandes/spanmatch","owner":"plandes","description":"Unsupervised Position-Based Semantic Matching","archived":false,"fork":false,"pushed_at":"2025-01-25T03:05:47.000Z","size":432,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-11-29T07:27:36.408Z","etag":null,"topics":["document","information-retrieval","natural-language-processing","nlp","span"],"latest_commit_sha":null,"homepage":"https://plandes.github.io/spanmatch/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plandes.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-11T14:30:31.000Z","updated_at":"2025-01-25T03:05:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"1c305168-0c98-4dd2-901f-e48afa761359","html_url":"https://github.com/plandes/spanmatch","commit_stats":null,"previous_names":["plandes/spanmatch"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/plandes/spanmatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fspanmatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fspanmatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fspanmatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/plandes%2Fspanmatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/plandes","download_url":"https://codeload.github.com/plandes/spanmatch/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fspanmatch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29314703,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-10T17:48:59.043Z","status":"ssl_error","status_checked_at":"2026-02-10T17:45:37.240Z","response_time":65,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document","information-retrieval","natural-language-processing","nlp","span"],"created_at":"2024-11-08T21:23:19.319Z","updated_at":"2026-02-10T20:02:38.451Z","avatar_url":"https://github.com/plandes.png","language":"Python","readme":"# Unsupervised Position-Based Semantic Matching\n\n[![PyPI][pypi-badge]][pypi-link]\n[![Python 3.11][python311-badge]][python311-link]\n[![Build Status][build-badge]][build-link]\n\nAn API to match spans of semantically similar text across documents.  Each\nmatch is a span of text in a source document and another span of text in a\ntarget document that are both tied together.\n\n\u003c!-- markdown-toc start - Don't edit this section. 
Run M-x markdown-toc-refresh-toc --\u003e\n## Table of Contents\n\n- [Introduction](#introduction)\n- [Documentation](#documentation)\n- [Usage](#usage)\n- [Citation](#citation)\n- [Obtaining](#obtaining)\n- [Changelog](#changelog)\n- [License](#license)\n\n\u003c!-- markdown-toc end --\u003e\n\n\n## Introduction\n\nSpans are formed by a weighted combination of the semantic similarity of each\ndocument's text and the token position.  Hyperparameters control which takes\nprecedence (semantic similarity or token position for longer contiguous token\nspans).\n\nThis is done using position embeddings on a third axis (see Figure 1).  The\nfigure shows the data as blue word embeddings moving from cluster 1 to cluster\n2, the cluster spans of the discharge summaries (orange) and the note antecedent\n(green), and arrows connecting the tokens to word points.\n\n![Figure 1](./doc/pos-emb.png)\n\n*Figure 1*\n\nFor more information, see the \"Hybrid Semantic Positional Token Clustering\"\nsection in our paper [Hospital Discharge Summarization Data Provenance].  This\npaper's primary repository is [here](https://github.com/uic-nlp-lab/dsprov).\n\n\n## Documentation\n\nSee the [full documentation](https://plandes.github.io/spanmatch/index.html).\nThe [API reference](https://plandes.github.io/spanmatch/api.html) is also\navailable.\n\n\n## Usage\n\n```python\nfrom zensols.cli import CliHarness\nfrom zensols.nlp import FeatureDocument, FeatureDocumentParser\nfrom zensols.spanmatch import Match, MatchResult, Matcher, ApplicationFactory\n\nSOURCE = \"\"\"\\\nJohannes Gutenberg (1398 – 1468) was a German goldsmith and publisher who\nintroduced printing to Europe. His introduction of mechanical movable type\nprinting to Europe started the Printing Revolution and is widely regarded as the\nmost important event of the modern period. 
It played a key role in the\nscientific revolution and laid the basis for the modern knowledge-based economy\nand the spread of learning to the masses.\n\nGutenberg many contributions to printing are: the invention of a process for\nmass-producing movable type, the use of oil-based ink for printing books,\nadjustable molds, and the use of a wooden printing press. His truly epochal\ninvention was the combination of these elements into a practical system that\nallowed the mass production of printed books and was economically viable for\nprinters and readers alike.\n\"\"\"\n\nSUMMARY = \"\"\"\\\nThe German Johannes Gutenberg introduced printing in Europe. His invention had a\ndecisive contribution in spread of mass-learning and in building the basis of\nthe modern society.\n\"\"\"\n\nharness: CliHarness = ApplicationFactory.create_harness()\ndoc_parser: FeatureDocumentParser = harness['spanmatch_doc_parser']\nmatcher: Matcher = harness['spanmatch_matcher']\nsource: FeatureDocument = doc_parser(SOURCE)\nsummary: FeatureDocument = doc_parser(SUMMARY)\n# shorten source doc span length by scaling up positional importance\nmatcher.hyp.source_position_scale = 2.5\n# elongate summary doc span length by scaling up positional importance\nmatcher.hyp.target_position_scale = 0.9\nres: MatchResult = matcher(source, summary)\nmatch: Match\nfor i, match in enumerate(res.matches[:5]):\n\tmatch.write(include_flow=False)\n```\n\nOutput:\n\n```abnf\n2023-06-11 08:22:38,392 24 matches found\nsource (0, 55):\n    Johannes Gutenberg (1398 – 1468) was a German goldsmith\ntarget (4, 29):\n    German Johannes Gutenberg\nsource (524, 631):\n    type, the use of oil-based ink for printing books, adjustable molds, and the use\n    of a wooden printing press\ntarget (4, 59):\n    German Johannes Gutenberg introduced printing in Europe\nsource (301, 421):\n    scientific revolution and laid the basis for the modern knowledge-based economy\n    and the spread of learning to the masses\ntarget 
(106, 177):\n    spread of mass-learning and in building the basis of the modern society\nsource (516, 585):\n    movable type, the use of oil-based ink for printing books, adjustable\ntarget (116, 169):\n    mass-learning and in building the basis of the modern\nsource (168, 199):\n    started the Printing Revolution\ntarget (106, 145):\n    spread of mass-learning and in building\n```\n\n\n## Obtaining\n\nThe easiest way to install the command line program is via the `pip` installer:\n```bash\npip3 install --use-deprecated=legacy-resolver zensols.spanmatch\n```\n\nBinaries are also available on [pypi].\n\n\n## Citation\n\nIf you use this project in your research please use the following BibTeX entry:\n\n```bibtex\n@inproceedings{landesHospitalDischargeSummarization2023,\n  title = {Hospital {{Discharge Summarization Data Provenance}}},\n  booktitle = {The 22nd {{Workshop}} on {{Biomedical Natural Language Processing}} and {{BioNLP Shared Tasks}}},\n  author = {Landes, Paul and Chaise, Aaron and Patel, Kunal and Huang, Sean and Di Eugenio, Barbara},\n  date = {2023-07},\n  pages = {439--448},\n  publisher = {{Association for Computational Linguistics}},\n  location = {{Toronto, Canada}},\n  url = {https://aclanthology.org/2023.bionlp-1.41},\n  urldate = {2023-07-10},\n  eventtitle = {{{BioNLP}} 2023}\n}\n```\n\n\n## Changelog\n\nAn extensive changelog is available [here](CHANGELOG.md).\n\n\n## License\n\n[MIT License](LICENSE.md)\n\nCopyright (c) 2023 - 2025 Paul Landes\n\n\n\u003c!-- links --\u003e\n[pypi]: https://pypi.org/project/zensols.spanmatch/\n[pypi-link]: https://pypi.python.org/pypi/zensols.spanmatch\n[pypi-badge]: https://img.shields.io/pypi/v/zensols.spanmatch.svg\n[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg\n[python311-link]: https://www.python.org/downloads/release/python-3110\n[build-badge]: https://github.com/plandes/spanmatch/workflows/CI/badge.svg\n[build-link]: https://github.com/plandes/spanmatch/actions\n\n[Hospital 
Discharge Summarization Data Provenance]: https://aclanthology.org/2023.bionlp-1.41/\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fspanmatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplandes%2Fspanmatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fspanmatch/lists"}