https://github.com/scai-bio/datastew

Python library for intelligent data stewardship using Large Language Model (LLM) embeddings

Host: GitHub
URL: https://github.com/scai-bio/datastew
Owner: SCAI-BIO
License: apache-2.0
Created: 2024-07-01T11:54:23.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-04-29T13:47:46.000Z (3 months ago)
Last Synced: 2025-04-29T14:50:48.623Z (3 months ago)
Topics: data-harmonization, data-stewardship, large-language-models
Language: Python
Homepage: https://pypi.org/project/datastew/
Size: 1.36 MB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff


# datastew

![tests](https://github.com/SCAI-BIO/datastew/actions/workflows/tests.yml/badge.svg) ![GitHub Release](https://img.shields.io/github/v/release/SCAI-BIO/datastew)

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

## Installation

```bash

pip install datastew

```

## Usage

### Harmonizing Excel/CSV resources

You can import common data models, terminology sources or data dictionaries for harmonization directly from a CSV, TSV or Excel file. An example of how to match two separate variable descriptions is shown in [datastew/scripts/mapping_excel_example.py](datastew/scripts/mapping_excel_example.py):

```python

from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# variable_field and description_field refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

# Compute the closest target variable for each source variable
df = map_dictionary_to_dictionary(source, target)

# Write the mapping table to an Excel file
df.to_excel("result.xlsx")

```

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches, as well as a similarity measure per row.
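
To make "closest similarity" concrete, here is a minimal sketch of the idea in plain Python. It assumes cosine similarity as the measure (the README does not state which metric datastew uses internally), and the helper names `cosine_similarity` and `closest_matches` as well as the toy vectors are hypothetical, not part of the datastew API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, return (source_var, best_target_var, similarity)."""
    rows = []
    for s_name, s_vec in source.items():
        # Score the source vector against every target vector and keep the best one
        best_name, best_sim = max(
            ((t_name, cosine_similarity(s_vec, t_vec)) for t_name, t_vec in target.items()),
            key=lambda pair: pair[1],
        )
        rows.append((s_name, best_name, best_sim))
    return rows


# Toy 3-dimensional "embeddings", made up for illustration only
source = {"age_years": [0.9, 0.1, 0.0], "sys_bp": [0.1, 0.8, 0.3]}
target = {"AGE": [1.0, 0.0, 0.0], "BLOOD_PRESSURE": [0.0, 1.0, 0.2]}

for row in closest_matches(source, target):
    print(row)
```

Real embeddings have hundreds of dimensions, but the matching step stays the same: one best target per source variable, plus the score that justified the choice.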

By default this uses the local MiniLM model, which may not yield optimal results. If you have an OpenAI API key, you can use their embedding API instead. To use your key, create a Vectorizer model and pass it to the function:

```python

from datastew.embedding import Vectorizer

from datastew.process.mapping import map_dictionary_to_dictionary

vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")

# source and target are the DataDictionarySource objects from the previous example
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)

```

---

### Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in [datastew/scripts/mapping_db_example.py](datastew/scripts/mapping_db_example.py):

1) Initialize the repository and embedding model:

    ```python

    from datastew.embedding import Vectorizer

    from datastew.repository import WeaviateRepository

    from datastew.repository.model import Terminology, Concept, Mapping

    repository = WeaviateRepository(mode='remote', path='localhost', port=8080)

    vectorizer = Vectorizer()

    # vectorizer = Vectorizer("text-embedding-ada-002", key="your_key") # Use this line for higher accuracy if you have an OpenAI API key

    ```

2) Create a baseline of data to map to in the initialized repository. Text is attached to a unique concept of an existing or custom vocabulary or terminology namespace in the form of a Mapping object containing the text, its embedding, and the name of the sentence embedder used. Multiple Mapping objects with textually different but semantically equal descriptions can point to the same Concept.

    ```python

    terminology = Terminology("snomed CT", "SNOMED")

    text1 = "Diabetes mellitus (disorder)"

    concept1 = Concept(terminology, text1, "Concept ID: 11893007")

    mapping1 = Mapping(concept1, text1, vectorizer.get_embedding(text1), vectorizer.model_name)

    text2 = "Hypertension (disorder)"

    concept2 = Concept(terminology, text2, "Concept ID: 73211009")

    mapping2 = Mapping(concept2, text2, vectorizer.get_embedding(text2), vectorizer.model_name)

    repository.store_all([terminology, concept1, mapping1, concept2, mapping2])

    ```

3) Retrieve the closest mappings and their similarities for a given text:

```python

text_to_map = "Sugar sickness" # Semantically similar to "Diabetes mellitus (disorder)"

embedding = vectorizer.get_embedding(text_to_map)

results = repository.get_closest_mappings(embedding, similarities=True, limit=2)

for result in results:

    print(result)

```

Output:

```text

snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281

snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045

```
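
The three steps above can be sketched without a running Weaviate instance. The following is a toy stand-in, not the datastew implementation: `ToyConcept`, `ToyMapping` and `ToyRepository` are hypothetical names, the identifiers and 2-dimensional embeddings are made up, and cosine similarity is an assumed metric. It illustrates how several mappings with different wording can point to the same concept and how the closest mappings are ranked.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConcept:
    terminology: str
    identifier: str  # hypothetical identifier, not a real SNOMED code
    label: str


@dataclass(frozen=True)
class ToyMapping:
    concept: ToyConcept
    text: str
    embedding: tuple


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyRepository:
    """In-memory list of mappings with brute-force nearest-neighbour search."""

    def __init__(self):
        self.mappings = []

    def store_all(self, mappings):
        self.mappings.extend(mappings)

    def get_closest_mappings(self, embedding, limit=2):
        # Score every stored mapping, then return the `limit` most similar ones
        scored = [(m, cosine(embedding, m.embedding)) for m in self.mappings]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


diabetes = ToyConcept("example terminology", "EX:001", "Diabetes mellitus (disorder)")
hypertension = ToyConcept("example terminology", "EX:002", "Hypertension (disorder)")

repo = ToyRepository()
repo.store_all([
    ToyMapping(diabetes, "Diabetes mellitus (disorder)", (0.9, 0.1)),
    ToyMapping(diabetes, "Sugar disease", (0.8, 0.3)),  # second wording, same concept
    ToyMapping(hypertension, "Hypertension (disorder)", (0.1, 0.9)),
])

results = repo.get_closest_mappings((1.0, 0.2), limit=2)
for mapping, similarity in results:
    print(mapping.concept.identifier, "|", mapping.text, "|", round(similarity, 3))
```

A vector database such as Weaviate replaces the brute-force loop with an index, which matters once the terminology contains many thousands of concepts.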

You can also import data from file sources (CSV, TSV, XLSX) or from a public API like OLS. An example script to download and compute embeddings for SNOMED from EBI OLS can be found in [datastew/scripts/ols_snomed_retrieval.py](datastew/scripts/ols_snomed_retrieval.py).

---

### Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots using different language models. An example of how to generate a t-SNE plot is shown in [datastew/scripts/tsne_visualization.py](datastew/scripts/tsne_visualization.py):

```python

from datastew.embedding import Vectorizer

from datastew.process.parsing import DataDictionarySource

from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet

data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")

data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()

plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)

```

![t-SNE plot](./docs/tsne_plot.png)
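
For readers who want to see the projection step in isolation, here is a minimal sketch that uses scikit-learn's `TSNE` directly. This is an assumption for illustration only: the README does not say how `plot_embeddings` computes its layout, and the random vectors below merely stand in for real description embeddings from two sources.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two hypothetical sources with 15 variables each and 32-dimensional embeddings
source_1 = rng.normal(loc=0.0, scale=1.0, size=(15, 32))
source_2 = rng.normal(loc=3.0, scale=1.0, size=(15, 32))

embeddings = np.vstack([source_1, source_2])
labels = ["source1"] * 15 + ["source2"] * 15

# perplexity must be smaller than the number of samples (30 here)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

# Each row of coords is the 2-D position of one variable; labels gives its source
print(coords.shape)
```

The resulting coordinates can be passed to any plotting library, coloring each point by its entry in `labels` to compare how the sources overlap.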
