https://github.com/scai-bio/datastew
Python library for intelligent data stewardship using Large Language Model (LLM) embeddings
data-harmonization data-stewardship large-language-models
- Host: GitHub
- URL: https://github.com/scai-bio/datastew
- Owner: SCAI-BIO
- License: apache-2.0
- Created: 2024-07-01T11:54:23.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-04-29T13:47:46.000Z (25 days ago)
- Last Synced: 2025-04-29T14:50:48.623Z (25 days ago)
- Topics: data-harmonization, data-stewardship, large-language-models
- Language: Python
- Homepage: https://pypi.org/project/datastew/
- Size: 1.36 MB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# datastew
 
Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.
## Installation
```bash
pip install datastew
```

## Usage
### Harmonizing Excel/CSV resources
You can import common data models, terminology sources, or data dictionaries for harmonization directly from a
CSV, TSV, or Excel file. An example of how to match two separate variable descriptions is shown in
[datastew/scripts/mapping_excel_example.py](datastew/scripts/mapping_excel_example.py):

```python
from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")
```

The resulting file contains the pairwise variable mapping, based on the closest similarity across all possible matches,
as well as a similarity measure per row.

By default this uses the local MiniLM model, which may not yield optimal performance. If you have an OpenAI API
key, you can use their embedding API instead. To use your key, create a Vectorizer model and pass it to the
function:

```python
from datastew.embedding import Vectorizer
from datastew.process.mapping import map_dictionary_to_dictionary

vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)
```

---
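To make the "closest similarity" matching in the Excel/CSV harmonization workflow concrete, here is a minimal, standard-library-only sketch. It does not use datastew: `read_dictionary`, `toy_embedding`, and `closest_matches` are hypothetical helpers, and the bag-of-words vector is only a crude stand-in for a real LLM embedding. The idea it illustrates is the same, though: embed every description, then pair each source variable with the target variable whose embedding has the highest cosine similarity.

```python
import csv
import io
import math
from collections import Counter


def read_dictionary(csv_text, variable_field, description_field):
    """Read variable -> description pairs from CSV text using the given column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[variable_field]: row[description_field] for row in reader}


def toy_embedding(text):
    """Assumption: a bag-of-words count vector stands in for an LLM embedding."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, return (source, best target, similarity)."""
    target_vectors = {var: toy_embedding(desc) for var, desc in target.items()}
    rows = []
    for s_var, s_desc in source.items():
        s_vec = toy_embedding(s_desc)
        best_var, best_sim = max(
            ((t_var, cosine_similarity(s_vec, t_vec)) for t_var, t_vec in target_vectors.items()),
            key=lambda pair: pair[1],
        )
        rows.append((s_var, best_var, round(best_sim, 3)))
    return rows


# Made-up example data dictionaries with "var" and "desc" columns.
SOURCE_CSV = "var,desc\nAGE,age of the participant in years\nSEX,biological sex of the participant\n"
TARGET_CSV = "var,desc\npt_age,participant age in years\npt_sex,sex of participant\nbmi,body mass index\n"

source = read_dictionary(SOURCE_CSV, variable_field="var", description_field="desc")
target = read_dictionary(TARGET_CSV, variable_field="var", description_field="desc")

for row in closest_matches(source, target):
    print(row)
```

With a real embedding model, descriptions that share no words can still end up close together, which word counts cannot capture. That is the main reason to prefer sentence embeddings for this task.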
### Creating and using stored mappings
A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in
[datastew/scripts/mapping_db_example.py](datastew/scripts/mapping_db_example.py):

1) Initialize the repository and embedding model:
```python
from datastew.embedding import Vectorizer
from datastew.repository import WeaviateRepository
from datastew.repository.model import Terminology, Concept, Mapping

repository = WeaviateRepository(mode='remote', path='localhost', port=8080)
vectorizer = Vectorizer()
# vectorizer = Vectorizer("text-embedding-ada-002", key="your_key") # Use this line for higher accuracy if you have an OpenAI API key
```

2) Create a baseline of data to map to in the initialized repository. Text is attached to any unique concept of an
existing or custom vocabulary or terminology namespace in the form of a Mapping object containing the text, its embedding,
and the name of the sentence embedder used. Multiple Mapping objects with textually different but semantically equal
descriptions can point to the same Concept.

```python
terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, vectorizer.get_embedding(text1), vectorizer.model_name)

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, vectorizer.get_embedding(text2), vectorizer.model_name)

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])
```

3) Retrieve the closest mappings and their similarities for a given text:
```python
text_to_map = "Sugar sickness" # Semantically similar to "Diabetes mellitus (disorder)"
embedding = vectorizer.get_embedding(text_to_map)

results = repository.get_closest_mappings(embedding, similarities=True, limit=2)
for result in results:
    print(result)
```

Output:
```python
snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045
```

You can also import data from file sources (CSV, TSV, XLSX) or from a public API like OLS. An example script to
download and compute embeddings for SNOMED from the EBI OLS can be found in
[datastew/scripts/ols_snomed_retrieval.py](datastew/scripts/ols_snomed_retrieval.py).

---
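The stored-mapping workflow can also be illustrated without a vector database. The sketch below is not datastew code: `ToyConcept`, `ToyMapping`, and `closest_mappings` are hypothetical names, the identifiers `C1` and `C2` are placeholders, and the three-dimensional vectors are hand-written stand-ins for real embeddings. It shows the two ideas from the steps above: several differently worded mappings can point to the same concept, and retrieval ranks stored mappings by similarity to a query embedding.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConcept:
    """Hypothetical stand-in for a concept within a terminology."""
    terminology: str
    identifier: str
    label: str


@dataclass(frozen=True)
class ToyMapping:
    """A text and its embedding, attached to exactly one concept."""
    concept: ToyConcept
    text: str
    embedding: tuple


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_mappings(store, query_embedding, limit=2):
    """Rank stored mappings by similarity to the query, best first."""
    scored = [(m, cosine_similarity(m.embedding, query_embedding)) for m in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


diabetes = ToyConcept("SNOMED CT", "C1", "Diabetes mellitus (disorder)")
hypertension = ToyConcept("SNOMED CT", "C2", "Hypertension (disorder)")

# Assumption: hand-written vectors; a real embedding model would produce these.
store = [
    ToyMapping(diabetes, "Diabetes mellitus (disorder)", (0.9, 0.1, 0.0)),
    ToyMapping(diabetes, "Chronically elevated blood sugar", (0.8, 0.2, 0.1)),
    ToyMapping(hypertension, "Hypertension (disorder)", (0.1, 0.9, 0.0)),
]

query = (0.85, 0.15, 0.05)  # pretend embedding of "Sugar sickness"
for mapping, similarity in closest_mappings(store, query, limit=2):
    print(f"{mapping.concept.label} | {mapping.text} | Similarity: {similarity:.3f}")
```

Here both of the top two results resolve to the same diabetes concept even though their texts differ. Keeping several mappings per concept gives a query more chances to land near a stored phrasing.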
### Embedding visualization
You can visualize the embedding space of multiple data dictionary sources with t-SNE plots using different
language models. An example of how to generate a t-SNE plot is shown in
[datastew/scripts/tsne_visualization.py](datastew/scripts/tsne_visualization.py):

```python
from datastew.embedding import Vectorizer
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your Excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)
```