https://github.com/makefinks/rag-annotator
A PySide6 GUI for curating ground truth data for retrieval systems
- Host: GitHub
- URL: https://github.com/makefinks/rag-annotator
- Owner: makefinks
- License: mit
- Created: 2025-04-13T21:41:52.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-06-04T07:12:34.000Z (10 months ago)
- Last Synced: 2025-06-04T13:59:07.754Z (10 months ago)
- Language: Python
- Size: 3.12 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# RAG Annotation Tool for Ground Truth Creation
The annotation tool can be used to simplify the ground truth creation process for the evaluation of retrieval systems.
It allows for selecting a set of texts that are relevant to a given text / description and supports the following features:
- **Interactive selection** of relevant texts
- Metadata display and **highlighting of keywords and phrases**
- Built-in **BM25 search capabilities** to locate and include additional texts from the corpus
## Demo

## Installation
### Clone the repository
```bash
git clone https://github.com/makefinks/rag-annotator.git
cd rag-annotator
```
### Install requirements (uv is recommended)
```bash
uv sync
```
> Alternatively, use `pip install -r requirements.txt`.
### Activate the virtual environment
```bash
source .venv/bin/activate
```
## How it works
### Input preparation
The tool requires its input as a JSON file in a specific format.
The detailed format can be seen in `utils/ground_truth_schema.json`.
#### Ground Truth Schema Explanation
The input JSON must contain the following main fields:
- **points**: An array of objects, each representing an evaluation point. Each point contains:
  - `id`: Integer identifier for the point.
  - `title`: Title string for the point.
  - `description`: Description or query string.
  - `keywords` (optional): Array of keywords relevant to the point's description. These are highlighted inside the retrieved texts.
  - `fetched_texts`: Array of text objects fetched for this point. Each has:
    - `id`: Integer identifier for the text.
    - `text`: The text content.
    - `source`: The source string.
    - `metadata` (optional): Additional metadata (e.g., description).
    - `highlights` (optional): Array of strings to highlight in the text.
  - `selected_texts`: Array of selected text objects (same structure as `fetched_texts`).
  - `evaluated`: Boolean indicating if the point has been evaluated.
- **all_texts**: An array of all possible text objects in the dataset, each with:
  - `id`: Integer identifier.
  - `text`: The text content.
Refer to `app/utils/ground_truth_schema.json` for the complete and up-to-date schema.
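As a rough illustration of the fields listed above, the following sketch builds a minimal input document with one point and two corpus texts. All values, the helper names `make_text` and `build_minimal_input`, and the output file name are made up for the example. It also assumes that a fresh, un-annotated point starts with an empty `selected_texts` list and `evaluated` set to `false`, and it omits the optional `metadata` field because its exact shape is not described here. The schema file remains the authoritative reference.

```python
import json


def make_text(text_id, text, source, highlights=None):
    """Build one text object with the fields listed for `fetched_texts`."""
    obj = {"id": text_id, "text": text, "source": source}
    if highlights is not None:
        # Optional: strings to highlight inside this text.
        obj["highlights"] = highlights
    return obj


def build_minimal_input():
    """Return a dict with the two top-level fields: `points` and `all_texts`."""
    fetched = [
        make_text(
            1,
            "BM25 ranks documents by term frequency and document length.",
            "notes.md",  # illustrative source string
            highlights=["BM25"],
        )
    ]
    return {
        "points": [
            {
                "id": 0,
                "title": "Ranking functions",
                "description": "Which texts explain how BM25 ranks documents?",
                "keywords": ["BM25", "ranking"],  # optional field
                "fetched_texts": fetched,
                "selected_texts": [],  # assumption: empty before annotation
                "evaluated": False,    # assumption: not yet evaluated
            }
        ],
        "all_texts": [
            {"id": 1, "text": "BM25 ranks documents by term frequency and document length."},
            {"id": 2, "text": "Dense retrievers embed queries and documents as vectors."},
        ],
    }


def save_input(path):
    """Write the example document to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_minimal_input(), f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(json.dumps(build_minimal_input(), indent=2))
```

Calling `save_input("ground_truth_input.json")` would write a file that can then be checked against the schema before opening it in the tool.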
### Usage of the tool
```bash
python annotation_tool.py
```
After you select a file in the prepared format, the tool loads the data and displays the GUI:
- **Top Panel**: The current index of the evaluation object / point, a title dropdown, and the description / text of the object.
- **Left Panel**: The list of fetched texts, from which the relevant ones are selected.
- **Right Panel**: A BM25 search bar and result display that lets you search for specific texts across all texts in the dataset.
- **Bottom Panel**: Buttons for navigation and for saving the current state of the annotation.
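To give an idea of what the BM25 search in the right panel does conceptually, here is a minimal, self-contained sketch of Okapi BM25 scoring over entries shaped like `all_texts`. It is not the tool's implementation: the README does not state which BM25 variant or library is used, so the function name `bm25_search`, the lowercase whitespace tokenizer, and the parameters `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_search(query, all_texts, k1=1.5, b=0.75, top_k=5):
    """Rank entries of `all_texts` against `query` with Okapi BM25.

    Returns up to `top_k` (text id, score) pairs, best first,
    skipping texts that share no term with the query.
    """
    docs = [tokenize(entry["text"]) for entry in all_texts]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: number of texts containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    query_terms = set(tokenize(query))
    scored = []
    for entry, doc in zip(all_texts, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        if score > 0:
            scored.append((entry["id"], score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, `bm25_search("sunny weather", all_texts)` returns the ids of the texts that mention either term, ordered by score, which mirrors how the search panel can surface corpus texts that were not among the fetched ones.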