https://github.com/plandes/spanmatch

Unsupervised Position-Based Semantic Matching
document information-retrieval natural-language-processing nlp span

Host: GitHub
URL: https://github.com/plandes/spanmatch
Owner: plandes
License: mit
Created: 2023-06-11T14:30:31.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2025-01-25T03:05:47.000Z (over 1 year ago)
Last Synced: 2025-11-29T07:27:36.408Z (7 months ago)
Topics: document, information-retrieval, natural-language-processing, nlp, span
Language: Python
Homepage: https://plandes.github.io/spanmatch/
Size: 422 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md

# Unsupervised Position-Based Semantic Matching

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]
[![Build Status][build-badge]][build-link]

An API to match spans of semantically similar text across documents.  Each
match pairs a span of text in a source document with a span of text in a
target document.

## Table of Contents

- [Introduction](#introduction)
- [Documentation](#documentation)
- [Usage](#usage)
- [Obtaining](#obtaining)
- [Citation](#citation)
- [Changelog](#changelog)
- [License](#license)

## Introduction

Spans are formed by a weighted combination of the semantic similarity of
each document's text and the token position.  Hyperparameters control which
takes precedence (semantic similarity, or token position for longer
contiguous token spans).

This is done by adding position embeddings as a third axis (see Figure 1).
The figure shows word embedding data points (blue) moving from cluster 1 to
cluster 2, the cluster spans of the discharge summary (orange) and the note
antecedent (green), and arrows connecting the tokens to their word points.

![Figure 1](./doc/pos-emb.png)

*Figure 1*

For more information, see the "Hybrid Semantic Positional Token Clustering"
section in our paper [Hospital Discharge Summarization Data Provenance].  This
paper's primary repository is [here](https://github.com/uic-nlp-lab/dsprov).
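
The trade-off between semantic similarity and token position can be
illustrated with a small sketch.  It is not this library's implementation:
`hybrid_points` and its `position_scale` argument are hypothetical names, the
two-dimensional vectors are made-up stand-ins for real word embeddings, and
plain Euclidean distance is assumed as the similarity measure.

```python
import math


def hybrid_points(embeddings, position_scale):
    """Append a scaled, normalized token position as an extra axis.

    ``embeddings`` holds one vector per token in document order.
    ``position_scale`` is a hypothetical hyperparameter for this sketch.
    """
    denom = max(len(embeddings) - 1, 1)
    return [tuple(emb) + (position_scale * i / denom,)
            for i, emb in enumerate(embeddings)]


# toy embeddings: tokens 0 and 3 are identical, token 1 is dissimilar
embeddings = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

for scale in (0.0, 1.0, 3.0):
    pts = hybrid_points(embeddings, scale)
    # distance to the identical token at the far end of the document
    same_far = math.dist(pts[0], pts[3])
    # distance to the dissimilar token right next door
    diff_near = math.dist(pts[0], pts[1])
    print(f'scale={scale}: same/far={same_far:.3f}, diff/near={diff_near:.3f}')
```

With a scale of 0 the two identical tokens coincide regardless of where they
occur.  At a scale of 3 the dissimilar neighbor is closer than the identical
but distant token, so position takes precedence.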

## Documentation

See the [full documentation](https://plandes.github.io/spanmatch/index.html).
The [API reference](https://plandes.github.io/spanmatch/api.html) is also
available.

## Usage

```python
from zensols.cli import CliHarness
from zensols.nlp import FeatureDocument, FeatureDocumentParser
from zensols.spanmatch import Match, MatchResult, Matcher, ApplicationFactory

SOURCE = """\
Johannes Gutenberg (1398 – 1468) was a German goldsmith and publisher who
introduced printing to Europe. His introduction of mechanical movable type
printing to Europe started the Printing Revolution and is widely regarded as the
most important event of the modern period. It played a key role in the
scientific revolution and laid the basis for the modern knowledge-based economy
and the spread of learning to the masses.

Gutenberg many contributions to printing are: the invention of a process for
mass-producing movable type, the use of oil-based ink for printing books,
adjustable molds, and the use of a wooden printing press. His truly epochal
invention was the combination of these elements into a practical system that
allowed the mass production of printed books and was economically viable for
printers and readers alike.
"""

SUMMARY = """\
The German Johannes Gutenberg introduced printing in Europe. His invention had a
decisive contribution in spread of mass-learning and in building the basis of
the modern society.
"""

# create the application harness and fetch the parser and matcher from it
harness: CliHarness = ApplicationFactory.create_harness()
doc_parser: FeatureDocumentParser = harness['spanmatch_doc_parser']
matcher: Matcher = harness['spanmatch_matcher']

# parse both texts into feature documents
source: FeatureDocument = doc_parser(SOURCE)
summary: FeatureDocument = doc_parser(SUMMARY)

# shorten source doc span length by scaling up positional importance
matcher.hyp.source_position_scale = 2.5
# elongate summary doc span length by scaling up positional importance
matcher.hyp.target_position_scale = 0.9

# match spans across the two documents and print the first five matches
res: MatchResult = matcher(source, summary)
match: Match
for match in res.matches[:5]:
    match.write(include_flow=False)
```

Output:

```abnf
2023-06-11 08:22:38,392 24 matches found
source (0, 55):
    Johannes Gutenberg (1398 – 1468) was a German goldsmith
target (4, 29):
    German Johannes Gutenberg
source (524, 631):
    type, the use of oil-based ink for printing books, adjustable molds, and the use
    of a wooden printing press
target (4, 59):
    German Johannes Gutenberg introduced printing in Europe
source (301, 421):
    scientific revolution and laid the basis for the modern knowledge-based economy
    and the spread of learning to the masses
target (106, 177):
    spread of mass-learning and in building the basis of the modern society
source (516, 585):
    movable type, the use of oil-based ink for printing books, adjustable
target (116, 169):
    mass-learning and in building the basis of the modern
source (168, 199):
    started the Printing Revolution
target (106, 145):
    spread of mass-learning and in building
```
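
The number pairs in the output appear to be character offsets into the
original text, judging from the spans printed above.  The sketch below checks
that reading with plain string slicing and no dependency on this library;
`span_text`, `SOURCE_HEAD` and `SUMMARY_HEAD` are hypothetical names used only
for illustration.

```python
# the first three lines of SOURCE and the first line of SUMMARY from above
SOURCE_HEAD = """\
Johannes Gutenberg (1398 – 1468) was a German goldsmith and publisher who
introduced printing to Europe. His introduction of mechanical movable type
printing to Europe started the Printing Revolution and is widely regarded as the
"""
SUMMARY_HEAD = """\
The German Johannes Gutenberg introduced printing in Europe. His invention had a
"""


def span_text(text, span):
    """Return the text covered by a (start, end) character offset pair."""
    start, end = span
    # collapse line breaks so spans that cross lines print on a single line
    return ' '.join(text[start:end].split())


print(span_text(SOURCE_HEAD, (0, 55)))
print(span_text(SUMMARY_HEAD, (4, 29)))
print(span_text(SOURCE_HEAD, (168, 199)))
```

Each printed line reproduces the text shown under the matching `source` or
`target` heading in the output above.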

## Obtaining

The easiest way to install the command line program is via the `pip` installer:

```bash
pip3 install --use-deprecated=legacy-resolver zensols.spanmatch
```

Binaries are also available on [pypi].

## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landesHospitalDischargeSummarization2023,
  title = {Hospital {{Discharge Summarization Data Provenance}}},
  booktitle = {The 22nd {{Workshop}} on {{Biomedical Natural Language Processing}} and {{BioNLP Shared Tasks}}},
  author = {Landes, Paul and Chaise, Aaron and Patel, Kunal and Huang, Sean and Di Eugenio, Barbara},
  date = {2023-07},
  pages = {439--448},
  publisher = {{Association for Computational Linguistics}},
  location = {{Toronto, Canada}},
  url = {https://aclanthology.org/2023.bionlp-1.41},
  urldate = {2023-07-10},
  eventtitle = {{{BioNLP}} 2023}
}
```

## Changelog

An extensive changelog is available [here](CHANGELOG.md).

## License

[MIT License](LICENSE.md)

Copyright (c) 2023 - 2025 Paul Landes

[pypi]: https://pypi.org/project/zensols.spanmatch/
[pypi-link]: https://pypi.python.org/pypi/zensols.spanmatch
[pypi-badge]: https://img.shields.io/pypi/v/zensols.spanmatch.svg
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[build-badge]: https://github.com/plandes/spanmatch/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/spanmatch/actions

[Hospital Discharge Summarization Data Provenance]: https://aclanthology.org/2023.bionlp-1.41/
