https://github.com/scai-bio/kitsune

Kitsune is a next-generation data steward and harmonization tool.

data-harmonization data-stewardship embeddings large-language-models semantic-mapping



Host: GitHub
URL: https://github.com/scai-bio/kitsune
Owner: SCAI-BIO
License: apache-2.0
Created: 2023-11-24T08:33:06.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-29T08:19:37.000Z (24 days ago)
Last Synced: 2025-04-30T06:07:33.493Z (23 days ago)
Topics: data-harmonization, data-stewardship, embeddings, large-language-models, semantic-mapping
Language: TypeScript
Homepage: https://kitsune.scai.fraunhofer.de
Size: 11.2 MB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff


README

# Kitsune ![GitHub Release](https://img.shields.io/github/v/release/SCAI-BIO/kitsune)

*Kitsune* is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.

(Formerly: INDEX – the Intelligent Data Steward Toolbox)

## Features

- **LLM Embeddings:** Uses state-of-the-art language models to capture semantic similarity.
- **Intelligent Mapping:** Improves over traditional string matching with context-aware comparisons.
- **Extensible:** Designed for integration into modern data harmonization pipelines.

## Installation

Run the frontend client, API, vector database, and local embedding model using the local Docker Compose file:

```bash
docker-compose -f docker-compose.local.yaml up
```

You can then access the frontend at [localhost:4200](http://localhost:4200).

## Ontology Import via API

The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and needs, you can choose from the following options:

1. Importing from OLS (Pre-integrated):

This is the most straightforward method. The API is integrated with the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4/ontologies), allowing you to import any ontology available in their catalog.

```bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
-H 'accept: application/json'
```

- `terminology_id` (required): The ID of the ontology you want to import (e.g., `hp`, `efo`, `chebi`, etc.).
- `vectorizer_model` (optional): The vectorizer model used to generate embeddings.
- Example:

```bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id=hp' \
-H 'accept: application/json'
```
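
The same request can be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_import_url` and `import_terminology` and the host `http://api.example` are placeholders chosen for illustration and are not part of Kitsune; only the path and the `terminology_id` and `model` query parameters come from the curl examples above.

```python
from typing import Optional
from urllib.parse import urlencode
import urllib.request


def build_import_url(api_url: str, terminology_id: str, model: Optional[str] = None) -> str:
    """Build the OLS import URL; `model` is omitted when not given."""
    params = {"terminology_id": terminology_id}
    if model is not None:
        params["model"] = model
    return f"{api_url.rstrip('/')}/imports/terminology?{urlencode(params)}"


def import_terminology(api_url: str, terminology_id: str, model: Optional[str] = None) -> int:
    """Send the PUT request and return the HTTP status code."""
    request = urllib.request.Request(
        build_import_url(api_url, terminology_id, model),
        method="PUT",
        headers={"accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder host; replace with your own {api_url}.
    print(build_import_url("http://api.example", "hp"))
```

Calling `import_terminology(api_url, "hp")` against a running instance would perform the import itself; the sketch only prints the URL so it can be run without a server.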

2. Importing SNOMED CT:

- SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with `terminology_id=snomed`, but provides a cleaner interface.

```bash
curl -X 'PUT' \
'{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
-H 'accept: application/json'
```

- `vectorizer_model` (optional): The vectorizer model used to generate embeddings.

3. Importing Your Own Ontology (JSONL Files):

For full flexibility, you can upload your own ontology using `.jsonl` (JSON Lines) files. This allows you to import:
- Terminologies (namespaces)
- Concepts (terms within the terminology)
- Mappings (links between embeddings and existing concepts)

> ⚠️ The objects should be imported in the following order:
>
> 1. "Terminology"
> 2. "Concepts"
> 3. "Mappings"

```bash
curl -X 'PUT' \
'{api_url}/imports/jsonl?object_type={object_type}' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@{your_file}.jsonl'
```

- `object_type` (required): One of `terminology`, `concept`, or `mapping`.
- `file` (required): The `.jsonl` file to be uploaded (`multipart/form-data`).
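
Since the import order matters, it can help to script the three uploads. The sketch below is one possible way to do this with the Python standard library, encoding the multipart body by hand. The function names, the `IMPORT_ORDER` constant, and the `application/octet-stream` part type are assumptions made for this example; only the endpoint, the `object_type` values, and the `file` field name are taken from the curl command above.

```python
import urllib.request
import uuid
from pathlib import Path

# Order stated above: terminologies first, then concepts, then mappings.
IMPORT_ORDER = ("terminology", "concept", "mapping")


def encode_multipart(filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode one file as the `file` field of a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def upload_jsonl(api_url: str, object_type: str, path: Path) -> int:
    """PUT one .jsonl file to the import endpoint and return the status code."""
    body, content_type = encode_multipart(path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{api_url.rstrip('/')}/imports/jsonl?object_type={object_type}",
        data=body,
        method="PUT",
        headers={"accept": "application/json", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def upload_all(api_url: str, files: dict[str, Path]) -> None:
    """Upload whichever files are present, always in IMPORT_ORDER."""
    for object_type in IMPORT_ORDER:
        if object_type in files:
            upload_jsonl(api_url, object_type, files[object_type])
```

A call such as `upload_all(api_url, {"mapping": Path("m.jsonl"), "terminology": Path("t.jsonl")})` would upload the terminology file before the mapping file regardless of the dictionary order.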

## JSONL File Structure

Each line in your `.jsonl` file must represent a single object with the following structure:

```json
{
"class": "",
"id": "",
"properties": { ... },
"references": { ... }, // for Concept and Mapping
"vectors": { ... } // optional, for Mapping only
}
```

- `class`: The collection the object belongs to (`Terminology`, `Concept`, or `Mapping`).
- `id`: A unique ID per object, generated as a UUID.
- `properties`: A dictionary containing the properties of the object.

In addition, an object can contain the following if applicable:

- `references`: A dictionary specifying references between objects of different collections by their `id`. Not applicable to the Terminology collection.
- `vectors`: A dictionary containing the sentence embedding. Only applicable to the Mapping collection.
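
Before uploading, each line can be checked against the rules listed above. The following Python sketch is a hypothetical pre-flight check, not the validation the API itself performs; the function name `validate_line` and the wording of the messages are illustrative only.

```python
import json

ALLOWED_CLASSES = ("Terminology", "Concept", "Mapping")


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["line is not a JSON object"]

    problems = []
    # Keys every object carries according to the structure above.
    for key in ("class", "id", "properties"):
        if key not in obj:
            problems.append(f"missing '{key}'")

    cls = obj.get("class")
    if "class" in obj and cls not in ALLOWED_CLASSES:
        problems.append(f"unknown class {cls!r}")
    if cls == "Terminology" and "references" in obj:
        problems.append("'references' is not applicable to Terminology")
    if cls in ("Concept", "Mapping") and "references" not in obj:
        problems.append(f"'references' is expected for {cls}")
    if cls != "Mapping" and "vectors" in obj:
        problems.append("'vectors' only applies to Mapping")
    return problems
```

Running `validate_line` over every line of a file and stopping at the first non-empty result gives a quick check that a file is structurally plausible before it is sent to the import endpoint.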

### Example JSONL Structures

#### Terminology

Terminology has one attribute in its properties called `name` referring to the name of the terminology being imported.

```json
{
"class": "Terminology",
"id": "6c7b7146-5895-5097-a84e-df41b520c936",
"properties": {
"name": "OHDSI"
}
}
```

#### Concept

Concept has two attributes in its properties called `conceptID` and `prefLabel` referring to the concept entry ID within the terminology and preferred label for the entry, respectively.

A concept object also contains a reference attribute `hasTerminology` pointing to the terminology it belongs to.

```json
{
"class": "Concept",
"id": "818fc18f-77ff-5889-9a23-51d1e85c368e",
"properties": {
"conceptID": "37523947",
"prefLabel": "Body Fat Percentage"
},
"references": {
"hasTerminology": "6c7b7146-5895-5097-a84e-df41b520c936"
}
}
```
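
Terminology and Concept lines like the ones above can be generated programmatically. The Python sketch below is one way to do so. The ids in the examples carry the version digit 5, which is consistent with name-based UUIDs, but the namespace and the naming scheme used here (`uuid.NAMESPACE_URL` and `"<terminology name>:<conceptID>"`) are assumptions for illustration, so the generated ids will not match the ones shown in this README.

```python
import json
import uuid

# Assumed namespace; the one behind the example ids is not documented here.
NAMESPACE = uuid.NAMESPACE_URL


def terminology_record(name: str) -> dict:
    """Build a Terminology object with a deterministic, name-based id."""
    return {
        "class": "Terminology",
        "id": str(uuid.uuid5(NAMESPACE, name)),
        "properties": {"name": name},
    }


def concept_record(terminology: dict, concept_id: str, pref_label: str) -> dict:
    """Build a Concept object that references its terminology by id."""
    name = terminology["properties"]["name"]
    return {
        "class": "Concept",
        "id": str(uuid.uuid5(NAMESPACE, f"{name}:{concept_id}")),
        "properties": {"conceptID": concept_id, "prefLabel": pref_label},
        "references": {"hasTerminology": terminology["id"]},
    }


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one compact object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


if __name__ == "__main__":
    ohdsi = terminology_record("OHDSI")
    concept = concept_record(ohdsi, "37523947", "Body Fat Percentage")
    print(to_jsonl([ohdsi]), end="")
    print(to_jsonl([concept]), end="")
```

Deriving ids from names keeps repeated exports stable, so the `hasTerminology` reference can be recomputed rather than looked up. Terminology and Concept objects would be written to separate files, since each upload takes a single `object_type`.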

#### Mapping

A mapping object may or may not contain the vectors and the structure of the JSONL file will change accordingly. The structure of the file also depends on whether you are utilizing Weaviate vectorizers or not.

Regardless, a mapping object will always contain a reference attribute `hasConcept` pointing to the concept it belongs to.

##### Mapping Object without Utilizing Weaviate Vectorizers

A mapping object without utilizing Weaviate vectorizers will have two attributes in its properties called `text` and `hasSentenceEmbedder` referring to the description of its corresponding concept and the vectorizer model used to embed the description, respectively.

A pre-computed vector can be stored in the `vectors` dictionary under the key `default`.

```json
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage",
"hasSentenceEmbedder": "nomic-embed-text"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
},
"vectors": {
"default": [0.1, 0.2, 0.3]
}
}
```

Vectors do not have to be pre-computed: if they are not supplied, they are computed during the import process. The structure of a JSONL object without vectors is shown below.

##### Mapping Object Utilizing Weaviate Vectorizers

Weaviate vectorizers use named vectors and compute the embeddings during the import process, which eliminates the need for the `hasSentenceEmbedder` and `vectors` attributes.

```json
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
}
}
```
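
The two mapping variants differ only in the optional `hasSentenceEmbedder` property and `vectors` entry, so a single helper can produce both. The Python sketch below illustrates this under stated assumptions: the function name `mapping_record`, the uuid5 naming scheme for the id, and the decision to reject a vector given without an embedder name are choices made for this example, not behavior documented for Kitsune.

```python
import json
import uuid
from typing import Optional, Sequence


def mapping_record(
    concept_uuid: str,
    text: str,
    embedder: Optional[str] = None,
    vector: Optional[Sequence[float]] = None,
) -> dict:
    """Build a Mapping object; embedder and vector are added only if given."""
    if vector is not None and embedder is None:
        # Choice made in this sketch: a pre-computed vector should name its model.
        raise ValueError("a pre-computed vector requires an embedder name")
    record = {
        "class": "Mapping",
        # Assumed id scheme: name-based UUID over the concept id and text.
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{concept_uuid}:{text}")),
        "properties": {"text": text},
        "references": {"hasConcept": concept_uuid},
    }
    if embedder is not None:
        record["properties"]["hasSentenceEmbedder"] = embedder
    if vector is not None:
        record["vectors"] = {"default": list(vector)}
    return record


if __name__ == "__main__":
    concept = "818fc18f-77ff-5889-9a23-51d1e85c368e"
    # Variant for Weaviate vectorizers: text and reference only.
    print(json.dumps(mapping_record(concept, "Body Fat Percentage")))
    # Variant with a pre-computed vector stored under the key "default".
    print(json.dumps(mapping_record(
        concept, "Body Fat Percentage", "nomic-embed-text", [0.1, 0.2, 0.3]
    )))
```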
