https://github.com/philgooch/abbreviation-extraction
Python3 implementation of the Schwartz-Hearst algorithm for extracting abbreviation-definition pairs
abbreviations information-extraction keyword-extraction nlp python3
- Host: GitHub
- URL: https://github.com/philgooch/abbreviation-extraction
- Owner: philgooch
- License: mit
- Created: 2017-10-25T11:09:16.000Z (over 8 years ago)
- Default Branch: develop
- Last Pushed: 2023-10-20T21:33:57.000Z (over 2 years ago)
- Last Synced: 2025-10-20T12:44:46.708Z (8 months ago)
- Topics: abbreviations, information-extraction, keyword-extraction, nlp, python3
- Language: Python
- Size: 57.6 KB
- Stars: 88
- Watchers: 5
- Forks: 20
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Extraction of abbreviation-definition pairs
## Version: 0.2.5
This is a Python3 implementation of the [Schwartz-Hearst algorithm](https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf)
for identifying abbreviations and their corresponding definitions in free text[1].

The [original implementation is in Java](http://biotext.berkeley.edu/software.html). Vincent Van Asch created a Python2 implementation at
http://www.cnts.ua.ac.be/~vincent/scripts/abbreviations.py (NB: as of March 2019 this link appears to be dead).

I have simplified that implementation, refactored it for Python 3, and added some tests.
This version outputs a Python dictionary of abbreviation:definition pairs.
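
For readers unfamiliar with the algorithm, the core matching step from the paper can be sketched roughly as follows. This is an illustrative port of the right-to-left character matching described by Schwartz and Hearst, not the code shipped in this package; the function name `best_long_form` is hypothetical, and the sketch omits the paper's candidate-window and validity checks.

```python
def best_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` that matches `short_form`,
    or None if no match exists.

    Hypothetical helper for illustration only. Characters of the short form
    are matched right to left against the candidate text; the first
    character of the short form must fall at the start of a word.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        # Skip punctuation in the short form (e.g. hyphens).
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; for the first character of
        # the short form, also require that the match begins a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend back to the start of the word that holds the first match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


print(best_long_form("ER", "The emergency room"))  # emergency room
print(best_long_form("XYZ", "The emergency room"))  # None
```

Note that the leading word "The" is dropped: the match stops at the leftmost word needed to account for every character of the short form.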

## Installation for command-line use

    pip install -r requirements.txt

### Usage

From the command line

    python abbreviations/schwartz_hearst.py

## Installation as a module

    python3 setup.py install

or

    pip install abbreviations

### Usage

    from abbreviations import schwartz_hearst

    # By default, the most recently encountered definition for each term is returned
    pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The emergency room (ER) was busy')
    pairs = schwartz_hearst.extract_abbreviation_definition_pairs(file_path='')

    # If multiple definitions are encountered for each term, you might want to return the most common for each
    pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='...', most_common_definition=True)

    # ... or you might want to return the first encountered definition for each
    pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='...', first_definition=True)

    # when using a longer text, the format is line-separated sentences:
    import nltk
    sentences = nltk.sent_tokenize(longer_text)
    pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='\n'.join(sentences))

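
The three ways of choosing among multiple definitions (most recent, most common, first encountered) can be illustrated in plain Python. This is only a sketch of the idea: `select_definitions` and its `strategy` argument are hypothetical names, not part of this package's API, and the package's own handling of ties may differ.

```python
from collections import Counter, defaultdict


def select_definitions(found, strategy="last"):
    """Collapse (abbreviation, definition) tuples, given in document order,
    into one definition per abbreviation.

    Hypothetical helper for illustration; `strategy` is one of
    "last", "first" or "most_common".
    """
    by_abbrev = defaultdict(list)
    for abbrev, definition in found:
        by_abbrev[abbrev].append(definition)

    result = {}
    for abbrev, definitions in by_abbrev.items():
        if strategy == "last":
            result[abbrev] = definitions[-1]
        elif strategy == "first":
            result[abbrev] = definitions[0]
        elif strategy == "most_common":
            result[abbrev] = Counter(definitions).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return result


found = [
    ("ER", "estrogen receptor"),
    ("ER", "emergency room"),
    ("ER", "emergency room"),
    ("ER", "endoplasmic reticulum"),
]
print(select_definitions(found))                 # {'ER': 'endoplasmic reticulum'}
print(select_definitions(found, "first"))        # {'ER': 'estrogen receptor'}
print(select_definitions(found, "most_common"))  # {'ER': 'emergency room'}
```
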
[1] A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.