https://github.com/makefinks/rag-annotator

A PySide6 GUI for curating ground truth data for retrieval systems

Host: GitHub
URL: https://github.com/makefinks/rag-annotator
Owner: makefinks
License: mit
Created: 2025-04-13T21:41:52.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-06-04T07:12:34.000Z (10 months ago)
Last Synced: 2025-06-04T13:59:07.754Z (10 months ago)
Language: Python
Size: 3.12 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# RAG Annotation Tool for Ground Truth Creation

The annotation tool can be used to simplify the ground truth creation process for the evaluation of retrieval systems.

It allows for selecting a set of texts that are relevant to a given text / description and supports the following features:
- **Interactive selection** of relevant texts
- Metadata display and **highlighting of keywords and phrases**
- Built-in **BM25 search capabilities** to locate and include additional texts from the corpus

## Demo
![Gif demo](app/images/demo.gif)

## Installation
### Clone the repository

```bash
git clone https://github.com/makefinks/rag-annotator.git
cd rag-annotator
```

### Install requirements (uv is recommended)
```bash
uv sync
```
> Alternatively, use `pip install -r requirements.txt`.
### Activate the virtual environment
```bash
source .venv/bin/activate
```

## How it works

### Input preparation
The tool expects its input as a JSON file in a specific format.
The full format is defined in `app/utils/ground_truth_schema.json`.

#### Ground Truth Schema Explanation
The input JSON must contain the following main fields:

- **points**: An array of objects, each representing an evaluation point. Each point contains:
  - `id`: Integer identifier for the point.
  - `title`: Title string for the point.
  - `description`: Description or query string.
  - `keywords` (optional): Array of keywords relevant to the point's description. They are highlighted inside the retrieved texts.
  - `fetched_texts`: Array of text objects fetched for this point. Each has:
    - `id`: Integer identifier for the text.
    - `text`: The text content.
    - `source`: The source string.
    - `metadata` (optional): Additional metadata (e.g., description).
    - `highlights` (optional): Array of strings to highlight in the text.
  - `selected_texts`: Array of selected text objects (same structure as `fetched_texts`).
  - `evaluated`: Boolean indicating if the point has been evaluated.
- **all_texts**: An array of all possible text objects in the dataset, each with:
  - `id`: Integer identifier.
  - `text`: The text content.

Refer to `app/utils/ground_truth_schema.json` for the complete and up-to-date schema.
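
As a rough illustration of the fields described above, the following sketch builds a minimal input file with one point and two texts. The `build_example` helper, the output filename `example_ground_truth.json`, and all values are made up for illustration. Treating `metadata` as an object with a `description` key is also an assumption. The schema file remains the authoritative reference and may require fields or constraints not shown here.

```python
import json


def build_example() -> dict:
    """Return a minimal ground-truth document (hypothetical values)."""
    texts = [
        {"id": 1, "text": "BM25 ranks documents by term frequency and rarity.", "source": "doc_a.txt"},
        {"id": 2, "text": "PySide6 provides Python bindings for Qt.", "source": "doc_b.txt"},
    ]
    point = {
        "id": 1,
        "title": "Ranking functions",
        "description": "How does BM25 rank documents?",
        "keywords": ["BM25", "rank"],  # optional
        "fetched_texts": [
            {
                **texts[0],
                # Optional fields; the object shape of `metadata` is an assumption.
                "metadata": {"description": "intro paragraph"},
                "highlights": ["term frequency"],
            },
            texts[1],
        ],
        "selected_texts": [],  # starts empty; filled in during annotation
        "evaluated": False,
    }
    # `all_texts` only carries `id` and `text` per the field list above.
    all_texts = [{"id": t["id"], "text": t["text"]} for t in texts]
    return {"points": [point], "all_texts": all_texts}


if __name__ == "__main__":
    with open("example_ground_truth.json", "w", encoding="utf-8") as f:
        json.dump(build_example(), f, indent=2)
```

Running the script writes a file that could then be checked against the schema before loading it in the tool.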

### Usage of the tool
```bash
python annotation_tool.py
```
After you select a file in the prepared format, the tool loads the data and displays the GUI:

- **Top Panel**: The current index of the evaluation object / point, a title dropdown, and the description / text of the object.
- **Left Panel**: The list of fetched texts, from which you select the ones that are relevant.
- **Right Panel**: A BM25 search bar and result display that lets you search across all texts in the dataset.
- **Bottom Panel**: Buttons for navigation and for saving the current state of the annotation.
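
The README does not say which BM25 variant or library the tool uses. To give an idea of what a BM25 search over `all_texts` does, here is a minimal sketch that assumes the common Okapi BM25 formula, plain whitespace tokenization, and the typical defaults `k1 = 1.5` and `b = 0.75`. The function names `tokenize` and `bm25_scores` are hypothetical and not taken from the repository.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per document in `corpus` for `query`."""
    docs = [tokenize(d) for d in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "dogs chase cats",
        "bm25 ranks documents by relevance",
    ]
    scores = bm25_scores("bm25 relevance", corpus)
    # Print documents from best to worst match.
    for score, doc in sorted(zip(scores, corpus), reverse=True):
        print(f"{score:.3f}  {doc}")
```

Documents that share no term with the query score zero, so a search result list would typically show only the texts with a positive score, ordered from highest to lowest.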
