https://github.com/lhncbc/pymetamaplite
A minimal implementation of the MetaMapLite named entity recognizer in Python,
- Host: GitHub
- URL: https://github.com/lhncbc/pymetamaplite
- Owner: LHNCBC
- License: other
- Created: 2022-04-07T15:47:42.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-29T13:45:28.000Z (9 months ago)
- Last Synced: 2024-05-20T02:20:38.973Z (7 months ago)
- Topics: metamap, named-entity-recognition, umls
- Language: Python
- Homepage:
- Size: 65.4 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Description
A minimal implementation of the MetaMapLite named entity recognizer in
Python.

# Prerequisites

+ Python 3.8
+ NLTK or some other library that supplies a part-of-speech tagger and
  a tokenizer.

# Building and Installing pyMetaMapLite

Installing prerequisites using `pip`:

    python3 -m pip install nltk

See the NLTK documentation for more information on NLTK.

Building the wheel package from sources:

    python3 -m pip install --upgrade pip
    python3 -m pip install wheel
    python3 -m pip install --upgrade build
    python3 -m build

In some environments, you might need to run `build` with the `--no-isolation` option:

    python3 -m build --no-isolation

Installing the wheel package into your virtual environment:

    python3 -m pip install dist/pymetamaplite-{version}-py3-none-any.whl

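
To confirm that the wheel landed in the active environment, a short check with the standard library's `importlib.metadata` may help. This is a minimal sketch: the distribution name `pymetamaplite` is an assumption taken from the wheel filename above, and `installed_version` is a hypothetical helper, not part of pyMetaMapLite.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of dist_name, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# 'pymetamaplite' is assumed from the wheel filename; adjust if it differs.
found = installed_version("pymetamaplite")
print("pymetamaplite:", found if found else "not installed")
```
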
# Usage
This Python implementation of MetaMapLite uses inverted indexes
previously intended for use by the Java implementation of MetaMapLite.
The indexes are available at the MetaMapLite Web Page
(https://metamap.nlm.nih.gov/MetaMapLite.html).

Below is an example of using the MetaMapLite module on the string
"inferior vena cava stent filter" using NLTK to provide part-of-speech
tagging and tokenization:

    import nltk
    from collections import namedtuple
    from metamaplite import MetaMapLite

    ivfdir = '/path/to/public_mm_lite/data/ivf/2020AA/USAbase'
    label = ''
    case_sensitive = False
    use_sources = []
    use_semtypes = []
    postags = set(["CD", "FW", "RB", "IN", "NN", "NNS",
                   "NNP", "NNPS", "JJ", "JJR", "JJS", "LS"])
    stopwords = []
    excludedterms = []
    mminst = MetaMapLite(ivfdir, use_sources, use_semtypes, postags,
                         stopwords, excludedterms)

    # Convert tokens and part of speech tags into named tuples with
    # the following definition:
    Token = namedtuple('Token', ['text', 'tag_', 'idx', 'start'])

    def add_spans(postokenlist):
        """Add spans to part-of-speech tokenlist of tuples of form:
        (tokentext, part-of-speech-tag). """
        tokenlist = []
        start = 0
        idx = 0
        for token in postokenlist:
            tokenlist.append(
                Token(text=token[0], tag_=token[1], idx=idx, start=start))
            # assumes tokens are separated by a single space
            start = start + len(token[0]) + 1
            idx += 1
        return tokenlist

    inputtext = 'inferior vena cava stent filter'
    print('input text: "%s"' % inputtext)
    texttokenlist = inputtext.split(' ')
    postokenlist = nltk.pos_tag(texttokenlist)
    tokenlist = add_spans(postokenlist)

    # pass tokenlist to get_entities to find entities in the input
    # text.
    matches = mminst.get_entities(tokenlist, span_info=True)
    for term in matches:
        print('{}'.format(term.text))
        print(' start: {}'.format(term.start))
        print(' end: {}'.format(term.end))
        print(' postings:')
        for post in term.postings:
            print(' {}'.format(post))

output:

    length of list of tokensublists: 15
    length of list of term_info_list: 7
    inferior vena cava
     start: 0
     end: 18
     postings:
     C0042458|S0002351|4|Inferior vena cava|RCD|PT
     C0042458|S0002351|5|Inferior vena cava|SNM|PT
     C0042458|S0002351|6|Inferior vena cava|SNMI|PT
     C0042458|S0002351|7|Inferior vena cava|UWDA|PT
     C0042458|S0002351|8|Inferior vena cava|FMA|PT
     C0042458|S0002351|9|Inferior vena cava|SNOMEDCT_US|SY
     C0042458|S0906979|10|INFERIOR VENA CAVA|NCI_CDISC|PT
     C0042458|S6146821|13|inferior vena cava|NCI_NCI-GLOSS|PT
     C0042458|S0380063|24|Inferior Vena Cava|NCI_caDSR|SY
     C0042458|S0380063|25|Inferior Vena Cava|NCI|PT
     C0042458|S0380063|26|Inferior Vena Cava|MSH|ET
     C1269024|S0002351|3|Inferior vena cava|SNOMEDCT_US|IS

Use the function __result_utils.add_semantic_types__ to add semantic
types and convert postings to records:

    from metamaplite import result_utils
    matches0 = mminst.get_entities(tokenlist, span_info=True)
    matches = result_utils.add_semantic_types(mminst, matches0)
    for term in matches:
        print('{}'.format(term.text))
        print(' start: {}'.format(term.start))
        print(' end: {}'.format(term.end))
        print(' postings:')
        for post in term.postings:
            print(' {}'.format(post))

output:

    inferior vena cava
     start: 0
     end: 18
     postings:
     PostingSTS(cui='C0042458', sui='S0002351', idx='4',
                str='Inferior vena cava', src='SNM', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0002351', idx='5',
                str='Inferior vena cava', src='SNMI', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0002351', idx='6',
                str='Inferior vena cava', src='UWDA', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0002351', idx='7',
                str='Inferior vena cava', src='FMA', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0002351', idx='8',
                str='Inferior vena cava', src='SNOMEDCT_US', termtype='SY',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0906979', idx='9',
                str='INFERIOR VENA CAVA', src='NCI_CDISC', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S6146821', idx='11',
                str='inferior vena cava', src='CHV', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S6146821', idx='12',
                str='inferior vena cava', src='NCI_NCI-GLOSS', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0380063', idx='23',
                str='Inferior Vena Cava', src='NCI', termtype='SY',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0380063', idx='24',
                str='Inferior Vena Cava', src='NCI', termtype='PT',
                semtypeset=['bpoc'])
     PostingSTS(cui='C0042458', sui='S0380063', idx='25',
                str='Inferior Vena Cava', src='MSH', termtype='ET',
                semtypeset=['bpoc'])
     PostingSTS(cui='C1269024', sui='S0002351', idx='3',
                str='Inferior vena cava', src='SNOMEDCT_US', termtype='IS',
                semtypeset=['bpoc'])

# Excluding Terms by Concept

Each entry in the `excluded_terms` list consists of a concept
identifier (CUI) and the term to be excluded for that concept,
separated by a colon (:):

    excluded_terms = [
        'C0004002:got',
        'C0006104:bra',
        'C0011710:doc',
        'C0012931:construct',
        'C0014522:ever',
        'C0015737:national',
        'C0018081:clap',
        'C0023668:lie',
        'C0025344:period',
        'C0025344:periods',
        'C0029144:optical',
        'C0071973:prime']

The excluded term list is passed as a parameter when instantiating
the MetaMapLite instance:

    mminst = MetaMapLite(ivfdir, use_sources, use_semtypes, postags,
                         stopwords, excludedterms=excluded_terms)

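
Python concatenates adjacent string literals that are not separated by a comma, so a forgotten comma in this list silently produces one malformed entry instead of two. The sketch below checks entries of the form `CUI:term` before they are handed to `MetaMapLite`. `parse_excluded_terms` is a hypothetical helper, not part of pyMetaMapLite, and the CUI pattern (`C` followed by seven digits) is an assumption based on the entries shown above.

```python
import re

# Assumed shape of an entry: a UMLS CUI, a colon, then the term to exclude.
ENTRY_RE = re.compile(r"^(C\d{7}):(.+)$")


def parse_excluded_terms(entries):
    """Split 'CUI:term' entries into (cui, term) pairs, rejecting bad ones."""
    pairs = []
    for entry in entries:
        match = ENTRY_RE.match(entry)
        # A second CUI inside the term usually means a comma was forgotten.
        if match is None or re.search(r"C\d{7}:", match.group(2)):
            raise ValueError("malformed excluded term entry: %r" % entry)
        pairs.append((match.group(1), match.group(2)))
    return pairs


excluded_terms = ["C0004002:got", "C0006104:bra", "C0011710:doc"]
print(parse_excluded_terms(excluded_terms))

# Without commas the literals fuse into one string, which is rejected.
fused = ["C0004002:got" "C0006104:bra"]
try:
    parse_excluded_terms(fused)
except ValueError as err:
    print(err)
```
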
# Building Indexes

## Input Data Tables and Associated Formats

The input tables are placed in a directory (ivfdir) containing four files:

    ivfdir
     |-- tables
          |-- ifconfig
          |-- mrconso.eng
          |-- mrsat.rrf
          |-- mrsty.rrf

### mrconso.eng - Metathesaurus concepts

Each record in this file contains the preferred name and synonyms for
each concept as well as other information including vocabulary source
identifier and any vocabulary specific term identifiers.

    C0000005|ENG|P|L0000005|PF|S0007492|Y|A26634265||M0019694|D012711|MSH|PEP|D012711|(131)I-Macroaggregated Albumin|0|N|256|
    C0000005|ENG|S|L0270109|PF|S0007491|Y|A26634266||M0019694|D012711|MSH|ET|D012711|(131)I-MAA|0|N|256|
    C0000039|ENG|P|L0000039|PF|S0007564|N|A0016515||M0023172|D015060|MSH|MH|D015060|1,2-Dipalmitoylphosphatidylcholine|0|N|256|
    C0000039|ENG|P|L0000039|PF|S0007564|N|A17972823||N0000007747||NDFRT|PT|N0000007747|1,2-Dipalmitoylphosphatidylcholine|0|N|256|
    C0000039|ENG|P|L0000039|PF|S0007564|Y|A8394967||||MTH|PN|NOCODE|1,2-Dipalmitoylphosphatidylcholine|0|N|256|

### mrsat.rrf

### mrsty.rrf
Semantic type identifiers assigned to each concept:
## Table generation from Metathesaurus files

Not implemented in Python; see the Java implementation in MetaMapLite.
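
Although table generation is only available in the Java implementation, the relationship between `mrconso.eng` rows and the pipe-delimited postings shown in the Usage section can be illustrated. The sketch below splits a Metathesaurus concept row using the standard MRCONSO column order and keeps the columns that appear in postings (`cui`, `sui`, `str`, `src`, `tty`). It is an illustration only and not the Java implementation's algorithm: `parse_mrconso_line` and `project_row` are hypothetical helpers, and the posting index column is left out because the source does not say how it is assigned.

```python
# Standard MRCONSO column order from the UMLS Metathesaurus release format.
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_mrconso_line(line):
    """Map one pipe-delimited MRCONSO row onto its column names."""
    values = line.rstrip("\n").split("|")
    # Rows end with a trailing '|', so the last split value is empty;
    # zip() drops it because MRCONSO_FIELDS is one element shorter.
    return dict(zip(MRCONSO_FIELDS, values))


def project_row(record):
    """Keep the columns that appear in postings: cui, sui, str, src, tty."""
    return (record["CUI"], record["SUI"], record["STR"],
            record["SAB"], record["TTY"])


row = ("C0000005|ENG|S|L0270109|PF|S0007491|Y|A26634266||M0019694|D012711|"
       "MSH|ET|D012711|(131)I-MAA|0|N|256|")
print(project_row(parse_mrconso_line(row)))
```
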
### ifconfig
This file contains the schemas for tables used in the later sections:

    cui_st.txt|cuist|2|0|cui|st|TXT|TXT
    cui_sourceinfo.txt|cuisourceinfo|6|0,1,3|cui|sui|i|str|src|tty|TXT|TXT|INT|TXT|TXT|TXT
    cui_concept.txt|cuiconcept|2|0,1|cui|concept|TXT|TXT
    mesh_tc_relaxed.txt|meshtcrelaxed|2|0,1|mesh|tc|TXT|TXT
    vars.txt|vars|7|0,2|term|tcat|word|wcat|varlevel|history||TXT|TXT|TXT|TXT|TXT|TXT|TXT

## Index generation

The program invocation:

    python -m metamaplite.index.build_index ivfdir

generates:

    ivfdir
     |-- tables
     |-- indices

where the directory `indices` contains the inverted index files.

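
The `ifconfig` schema lines shown earlier can also be read programmatically. The sketch below parses one line under a field layout inferred from the examples: file name, table name, column count, comma-separated indexes of key columns, then the column names followed by the column types. `TableSchema` and `parse_ifconfig_line` are hypothetical names, and the meaning of the fourth field is an assumption rather than something the source states.

```python
from collections import namedtuple

# Hypothetical container for one ifconfig entry.
TableSchema = namedtuple(
    "TableSchema", ["filename", "name", "key_columns", "columns", "types"])


def parse_ifconfig_line(line):
    """Parse 'file|table|ncols|keys|names...|types...' into a TableSchema."""
    fields = line.strip().split("|")
    filename, name, ncols = fields[0], fields[1], int(fields[2])
    # Assumed: the fourth field lists the indexes of the key columns.
    key_columns = [int(k) for k in fields[3].split(",")]
    # ncols column names follow, then ncols column types.
    columns = fields[4:4 + ncols]
    types = fields[4 + ncols:4 + 2 * ncols]
    return TableSchema(filename, name, key_columns, columns, types)


schema = parse_ifconfig_line(
    "cui_sourceinfo.txt|cuisourceinfo|6|0,1,3|"
    "cui|sui|i|str|src|tty|TXT|TXT|INT|TXT|TXT|TXT")
print(schema.columns)  # ['cui', 'sui', 'i', 'str', 'src', 'tty']
print(schema.types)    # ['TXT', 'TXT', 'INT', 'TXT', 'TXT', 'TXT']
```
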
# Speeding up pyMetaMapLite
## Entity Lookup Caching
By using the optional parameter `use_cache=True` when instantiating
the MetaMapLite instance, lookups for strings, semantic types, and
preferred names will be cached after the initial lookup. Any
subsequent lookup will use the cache directly instead of accessing the
index on disk. This can result in a significant speedup when
processing large collections, at the expense of using more memory:

    mminst = MetaMapLite(ivfdir, use_sources, use_semtypes, postags,
                         stopwords, excludedterms, use_cache=True)

## Using Pyston

Both NLTK and pyMetaMapLite can be run in the Pyston implementation of
Python. Pyston provides many of the speedups of Cython without
requiring translation of the Python to the C language which can be
problematic on some platforms.

+ Pyston GitHub page:
+ Documentation on installing Pyston in an existing conda installation: