Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/plandes/mimic

MIMIC III Corpus Parsing
https://github.com/plandes/mimic

mimic-iii natural-language-processing parsing-library spacy

Last synced: 9 days ago
JSON representation

MIMIC III Corpus Parsing

Awesome Lists containing this project

README

        

# MIMIC III Corpus Parsing

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.11][python311-badge]][python311-link]
[![Build Status][build-badge]][build-link]

A utility library for parsing the [MIMIC-III] corpus. This uses [spaCy] and
extends the [zensols.mednlp] to parse the [MIMIC-III] medical note dataset.
Features include:

* Creates both natural language and medical features from medical notes. The
latter is generated using linked entity concepts parsed with [MedCAT] via
[zensols.mednlp].
* Modifies the [spaCy] tokenizer to chunk masked tokens. For example, `[`,
`**`, `First`, `Name` `**` `]` becomes `[**First Name**]`.
* Provides a clean Pythonic object oriented representation of MIMIC-III
admissions and medical notes.
* Interfaces MIMIC-III data as a relational database (either PostgreSQL or
SQLite).
* Paragraph chunking using the most common syntax/physician templates provided
in the MIMIC-III dataset.

## Documentation

See the [full documentation](https://plandes.github.io/mimic/index.html).
The [API reference](https://plandes.github.io/mimic/api.html) is also
available.

## Obtaining

The easiest way to install the command line program is via the `pip` installer:
```bash
pip3 install zensols.mimic
```

Binaries are also available on [pypi].

## Installation

1. Install the package: `pip3 install zensols.mimic`
2. Install the database (either PostgreSQL or SQLite).

## Configuration

After a database is installed it must be configured in a new file `~/.mimicrc`
that you create. This INI formatted file also specifies where to cache data:
```ini
[default]
# the directory where cached data is stored
data_dir = ~/directory/to/cached/data
```
If this file doesn't exist, it must be specified with the `--config` option.

### SQLite

SQLite is the default database used for MIMIC-III access, but, it is slower and
not as well tested compared to the [PostgreSQL](PostgreSQL) driver. See the
[SQLite database file] using the [SQLite instructions] to create the SQLite
file from MIMIC-III if you need database access.

Once you create the file, configure it with the API using the following
additional configuration in the `--config` specified file is also necessary (or in
`~/.mimicrc`):
```ini
[mimic_sqlite_conn_manager]
db_file = path: /mimic3.sqlite3
```

### PostgreSQL

PostgreSQL is the preferred way to access MIMIC-II for this API. The MIMIC-III
database can be loaded by following the [PostgreSQL instructions], or consider
the [PostgreSQL Docker image]. Then configure the database by adding the
following to `~/.mimicrc`:
```ini
[mimic_default]
resources_dir = resource(zensols.mimic): resources
sql_resources = ${resources_dir}/postgres
conn_manager = mimic_postgres_conn_manager

[mimic_db]
database =
host =
port =
user =
password =
```

The Python PostgreSQL client package is also needed (not needed for the
[SQLite](#sqlite-configuration) installs), which can be installed with:
```bash
pip3 install zensols.dbpg
```

## Usage

The [Corpus] class is the data access object used to read and parse the corpus:

```python
# get the MIMIC-III corpus data acceess object
>>> from zensols.mimic import ApplicationFactory
>>> corpus = ApplicationFactory.get_corpus()

# get an admission by hadm_id
>>> adm = corpus.hospital_adm_stash['165315']

# get the first discharge note (some have admissions have addendums)
>>> from zensols.mimic.regexnote import DischargeSummaryNote
>>> ds = adm.notes_by_category[DischargeSummaryNote.CATEGORY][0]

# dump the note as a human readable section-by-section
>>> ds.write()
row_id: 12144
category: Discharge summary
description: Report
annotator: regular_expression
----------------------0:chief-complaint (CHIEF COMPLAINT)-----------------------
Unresponsiveness
-----------1:history-of-present-illness (HISTORY OF PRESENT ILLNESS)------------
The patient is a ...

# get features of the note useful in ML models as a Pandas dataframe
>>> df = ds.feature_dataframe

# get only medical features (CUI, entity, NER and POS tag) for the HPI section
>>> df[(df['section'] == 'history-of-present-illness') & (df['cui_'] != '--')]['norm cui_ detected_name_ ent_ tag_'.split()]
norm cui_ detected_name_ ent_ tag_
15 history C0455527 history~of~hypertension concept NN
```

See the [application example], which gives a fine grain way of configuring the
API.

## Medical Note Segmentation

This package uses regular expressions to segment notes. However, the
[zensols.mimicsid] uses annotations and a model trained by clinical informatics
physicians. Using this package gives this enhanced segmentation without any
API changes.

## Citation

If you use this project in your research please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
```

## Changelog

An extensive changelog is available [here](CHANGELOG.md).

## Community

Please star this repository and let me know how and where you use this API.
Contributions as pull requests, feedback and any input is welcome.

## License

[MIT License](LICENSE.md)

Copyright (c) 2022 - 2025 Paul Landes

[pypi]: https://pypi.org/project/zensols.mimic/
[pypi-link]: https://pypi.python.org/pypi/zensols.mimic
[pypi-badge]: https://img.shields.io/pypi/v/zensols.mimic.svg
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[build-badge]: https://github.com/plandes/mimic/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/mimic/actions

[MIMIC-III]: https://physionet.org/content/mimiciii-demo/1.4/
[MedCAT]: https://github.com/CogStack/MedCAT
[spaCy]: https://spacy.io
[zensols.mednlp]: https://github.com/plandes/mednlp

[SQLite instructions]: https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/sqlite
[PostgreSQL instructions]: https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iii/buildmimic/postgres/README.md
[PostgreSQL Docker image]: https://github.com/plandes/mimicdb
[SQLite database file]: https://github.com/plandes/mimicdbsqlite
[Corpus]: https://plandes.github.io/mimic/api/zensols.mimic.html#zensols.mimic.corpus.Corpus
[application example]: https://github.com/plandes/mimic/blob/master/example/shownote.py
[zensols.mimicsid]: https://github.com/plandes/mimicsid