https://github.com/scai-bio/kitsune
Kitsune is a next-generation data steward and harmonization tool.
data-harmonization data-stewardship embeddings large-language-models semantic-mapping
- Host: GitHub
- URL: https://github.com/scai-bio/kitsune
- Owner: SCAI-BIO
- License: apache-2.0
- Created: 2023-11-24T08:33:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-29T08:19:37.000Z (24 days ago)
- Last Synced: 2025-04-30T06:07:33.493Z (23 days ago)
- Topics: data-harmonization, data-stewardship, embeddings, large-language-models, semantic-mapping
- Language: TypeScript
- Homepage: https://kitsune.scai.fraunhofer.de
- Size: 11.2 MB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# Kitsune
*Kitsune* is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune uses LLM embeddings to map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.
(Formerly: INDEX – the Intelligent Data Steward Toolbox)
## Features
- **LLM Embeddings:** Uses state-of-the-art language models to capture semantic similarity.
- **Intelligent Mapping:** Improves over traditional string matching with context-aware comparisons.
- **Extensible:** Designed for integration into modern data harmonization pipelines.

## Installation
Run the frontend client, API, vector database, and local embedding model using the local docker-compose file:
```bash
docker-compose -f docker-compose.local.yaml up
```

You can access the frontend at [localhost:4200](http://localhost:4200).
## Ontology Import via API
The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and needs, you can choose from the following options:
1. Importing from OLS (Pre-integrated):
This is the most straightforward method. The API is integrated with the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4/ontologies), allowing you to import any ontology available in their catalog.
```bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
-H 'accept: application/json'
```

- `terminology_id` (required): The ID of the ontology you want to import (e.g., `hp`, `efo`, `chebi`).
- `vectorizer_model` (optional): The vectorizer model used to generate embeddings.
- Example:

```bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id=hp' \
-H 'accept: application/json'
```

2. Importing SNOMED CT:
SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with `terminology_id=snomed`, but provides a cleaner interface.
```bash
curl -X 'PUT' \
'{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
-H 'accept: application/json'
```

- `vectorizer_model` (optional): The vectorizer model used to generate embeddings.
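
The OLS and SNOMED CT requests above differ only in their path and query string. The following Python sketch shows one way to assemble those URLs. The helper name `build_import_url` and the base URL `http://localhost:8000` are hypothetical and stand in for your own `{api_url}`. The model name is borrowed from the Mapping examples in this README, and whether the API accepts it as a `model` value is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode


def build_import_url(
    api_url: str,
    terminology_id: Optional[str] = None,
    model: Optional[str] = None,
    snomed: bool = False,
) -> str:
    """Build the PUT URL for an OLS or SNOMED CT import (illustrative helper)."""
    params = {}
    if snomed:
        # Shortcut endpoint: no terminology_id needed.
        path = "/imports/terminology/snomed"
    else:
        if not terminology_id:
            raise ValueError("terminology_id is required for OLS imports")
        path = "/imports/terminology"
        params["terminology_id"] = terminology_id
    if model:
        # Optional vectorizer model, passed as the `model` query parameter.
        params["model"] = model
    query = urlencode(params)
    return api_url.rstrip("/") + path + (f"?{query}" if query else "")


# Hypothetical base URL; replace with your deployment's {api_url}.
API_URL = "http://localhost:8000"
print(build_import_url(API_URL, terminology_id="hp"))
print(build_import_url(API_URL, snomed=True, model="nomic-embed-text"))
```

The resulting URL can be passed to `curl -X PUT` as in the commands above, or to any HTTP client.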
3. Importing Your Own Ontology (JSONL Files):
For full flexibility, you can upload your own ontology using `.jsonl` (JSON Lines) files. This allows you to import:
- Terminologies (namespaces)
- Concepts (terms within the terminology)
- Mappings (links between embeddings and existing concepts)

> ⚠️ The objects should be imported in the following order:
>
> 1. "Terminology"
> 2. "Concepts"
> 3. "Mappings"

```bash
curl -X 'PUT' \
'{api_url}/imports/jsonl?object_type={object_type}' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@{your_file}.jsonl'
```

- `object_type` (required): One of `terminology`, `concept`, or `mapping`.
- `file` (required): The `.jsonl` file to be uploaded (`multipart/form-data`).

## JSONL File Structure
Each line in your `.jsonl` file must represent a single object with the following structure:
```json
{
"class": "",
"id": "",
"properties": { ... },
"references": { ... }, // for Concept and Mapping
"vectors": { ... } // optional, for Mapping only
}
```

- `class`: The collection the object belongs to (`Terminology`, `Concept`, or `Mapping`).
- `id`: A unique ID per object, generated as a UUID.
- `properties`: A dictionary containing the properties of the object.

In addition, an object can contain the following if applicable:

- `references`: A dictionary linking objects of different collections by their `id`. Not applicable to the Terminology collection.
- `vectors`: A dictionary containing the sentence embedding. Only applicable to the Mapping collection.

### Example JSONL Structures
#### Terminology
Terminology has one attribute in its properties called `name` referring to the name of the terminology being imported.
```json
{
"class": "Terminology",
"id": "6c7b7146-5895-5097-a84e-df41b520c936",
"properties": {
"name": "OHDSI"
}
}
```

#### Concept
Concept has two attributes in its properties, `conceptID` and `prefLabel`, referring to the concept entry ID within the terminology and the preferred label for the entry, respectively.
A concept object also contains a reference attribute `hasTerminology` pointing to the terminology it belongs to.
```json
{
"class": "Concept",
"id": "818fc18f-77ff-5889-9a23-51d1e85c368e",
"properties": {
"conceptID": "37523947",
"prefLabel": "Body Fat Percentage"
},
"references": {
"hasTerminology": "6c7b7146-5895-5097-a84e-df41b520c936"
}
}
```

#### Mapping
A mapping object may or may not contain vectors, and the structure of the JSONL file changes accordingly. The structure also depends on whether you are using Weaviate vectorizers.
Regardless, a mapping object always contains a reference attribute `hasConcept` pointing to the concept it belongs to.
##### Mapping Object without Utilizing Weaviate Vectorizers
A mapping object that does not use Weaviate vectorizers has two attributes in its properties, `text` and `hasSentenceEmbedder`, referring to the description of its corresponding concept and the vectorizer model used to embed the description, respectively.
A pre-computed vector can be stored in the `vectors` dictionary under the key `default`.
```json
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage",
"hasSentenceEmbedder": "nomic-embed-text"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
},
"vectors": {
"default": [0.1, 0.2, 0.3]
}
}
```

The vectors do not have to be pre-computed; if they are not supplied, they are computed during the import process. The structure of a JSONL entry without vectors is shown below.
```json
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage",
"hasSentenceEmbedder": "nomic-embed-text"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
}
}
```

##### Mapping Object Utilizing Weaviate Vectorizers
Weaviate vectorizers use named vectors and compute the embeddings during the import process, which eliminates the need for the `hasSentenceEmbedder` and `vectors` attributes.
```json
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
}
}
```
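
The three object types can also be generated programmatically. The Python sketch below builds one Terminology, one Concept, and one Mapping record, wires up the `hasTerminology` and `hasConcept` references, and serializes each record as a JSON line. It is an illustration under assumptions: the helper names are hypothetical, and so is the UUID namespace. The example ids in this README carry the UUID version 5 marker, but the namespace and naming scheme Kitsune uses are not documented here, so the generated ids will differ from the ones shown above.

```python
import json
import uuid
from typing import Optional, Sequence

# Hypothetical namespace for deterministic ids; not Kitsune's actual scheme.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/kitsune-import")


def make_id(*parts: str) -> str:
    """Derive a stable version-5 UUID from the given name parts."""
    return str(uuid.uuid5(NAMESPACE, "/".join(parts)))


def terminology(name: str) -> dict:
    return {
        "class": "Terminology",
        "id": make_id("terminology", name),
        "properties": {"name": name},
    }


def concept(term: dict, concept_id: str, pref_label: str) -> dict:
    return {
        "class": "Concept",
        "id": make_id("concept", term["properties"]["name"], concept_id),
        "properties": {"conceptID": concept_id, "prefLabel": pref_label},
        "references": {"hasTerminology": term["id"]},
    }


def mapping(
    con: dict,
    text: str,
    embedder: Optional[str] = None,
    vector: Optional[Sequence[float]] = None,
) -> dict:
    record = {
        "class": "Mapping",
        "id": make_id("mapping", con["id"], text),
        "properties": {"text": text},
        "references": {"hasConcept": con["id"]},
    }
    if embedder is not None:
        # Omit when Weaviate vectorizers compute the embedding.
        record["properties"]["hasSentenceEmbedder"] = embedder
    if vector is not None:
        # Pre-computed embedding, stored under the key "default".
        record["vectors"] = {"default": list(vector)}
    return record


def to_jsonl(records: Sequence[dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


term = terminology("OHDSI")
con = concept(term, "37523947", "Body Fat Percentage")
mp = mapping(con, "Body Fat Percentage", embedder="nomic-embed-text")

# One payload per object type, listed in the required import order.
files = {"terminology": [term], "concept": [con], "mapping": [mp]}
for object_type, records in files.items():
    print(f"# {object_type}.jsonl")
    print(to_jsonl(records), end="")
```

Each payload can be written to its own `.jsonl` file and uploaded through the `/imports/jsonl` endpoint with the matching `object_type`, in the order terminology, concept, mapping.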