https://github.com/radlab-dev-group/qa-model
A Polish extractive Question‑Answering (QA) solution with utilities for mapping its token space to generative LLMs (Gemma, LLaMA, GPT‑OSS). The repo contains scripts for training the QA model, inference pipelines, a CLI to pre‑compute token‑mapping tables, and a generation helper that injects static bias from QA logits into a downstream generator.
- Host: GitHub
- URL: https://github.com/radlab-dev-group/qa-model
- Owner: radlab-dev-group
- License: apache-2.0
- Created: 2025-10-03T09:27:14.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-12-27T17:26:20.000Z (6 months ago)
- Last Synced: 2025-12-29T15:39:06.572Z (6 months ago)
- Language: Python
- Size: 43.9 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# qa‑model
**Polish Extractive Question‑Answering + Token‑Mapping for Generative LLMs**
This repository provides a complete workflow for a Polish extractive QA model based on a fine‑tuned RoBERTa checkpoint,
together with tooling to bridge its token vocabulary to a set of modern generative models (Gemma‑3, LLaMA‑3, GPT‑OSS).
The token‑mapping enables static‑bias generation, where QA‑derived logits steer the output of a generator.
---
## Features
- **Polish extractive QA** fine‑tuned from `sdadas/polish-roberta-large-v2`.
- **Pre‑computed token maps** from the QA tokenizer to various generative tokenizers.
- **Static bias processor** (`QAStaticBiasProcessor`) that adds a weighted bias derived from QA logits to a generator’s
logits.
- End‑to‑end training script with optional Weights & Biases logging.
- Simple inference script using Hugging Face `pipeline`.
- Reproducible environment (Python 3.10 in a `venv` virtual environment, dependencies from `requirements.txt`).
---
## Installation
1. **Clone the repository**
```bash
git clone https://github.com/radlab-dev-group/qa-model.git
cd qa-model
```
2. **Create a virtual environment and install dependencies**
```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
3. **(Optional) Install additional Python packages**
```bash
pip install transformers tqdm click
```
---
## Training the QA Model
The training script `qa-model-train.py` fine‑tunes the RoBERTa checkpoint on the Polish **PoQuAD** dataset.
The model card is available in [HF_MODEL_CARD](qa-model/HF_MODEL_CARD.md).
```bash
python qa-model-train.py \
--model-path sdadas/polish-roberta-large-v2 \
--dataset-path clarin-pl/poquad \
--output-dir ./models \
--out-model-name polish-roberta-large-v2-qa \
--max-context-size 512 \
--store-artifacts-to-wb # optional: push dataset & model to Weights & Biases
```
Key arguments:
| Argument | Description |
|---------------------------|--------------------------------------------------------------------|
| `--model-path` | Base model checkpoint (default: `sdadas/polish-roberta-large-v2`). |
| `--dataset-path` | HF dataset identifier (default: `clarin-pl/poquad`). |
| `--output-dir` | Directory where training artifacts will be saved. |
| `--out-model-name` | Name of the final QA model. |
| `--max-context-size` | Maximum token length for the context (default: 512). |
| `--store-artifacts-to-wb` | Store dataset and model in a W&B run. |
Training uses the `Trainer` class from 🤗 Transformers, with mixed‑precision support,
checkpointing, and early stopping driven by periodic evaluation.
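To illustrate what early stopping on evaluation results means, here is a minimal pure-Python sketch of a patience-based rule. It is an assumption that the training script behaves this way; 🤗 Transformers offers comparable behaviour through `EarlyStoppingCallback` and its `early_stopping_patience` parameter. The function name and the loss values below are made up for the example.

```python
from typing import List, Optional


def early_stop_step(
    eval_losses: List[float], patience: int, min_delta: float = 0.0
) -> Optional[int]:
    """Return the index of the evaluation at which training would stop.

    Training stops once `patience` consecutive evaluations fail to improve
    on the best loss seen so far by more than `min_delta`.
    Returns None if that never happens.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss          # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1       # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Made-up evaluation losses, one value per evaluation step.
losses = [1.00, 0.80, 0.75, 0.76, 0.77, 0.74]
stop_at = early_stop_step(losses, patience=2)
print(stop_at)
```

With a larger patience the same run would continue, because the last evaluation improves on the best loss again.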
---
## Running Inference
A quick inference example is provided in `qa-model-inference.py`:
```bash
python qa-model-inference.py
```
The script loads the model from the Hugging Face Hub
[radlab/polish-qa-v2](https://huggingface.co/radlab/polish-qa-v2)
and runs a single QA pair:
```python
from transformers import pipeline
model_path = "radlab/polish-qa-v2"
qa = pipeline("question-answering", model=model_path)
question = "Co będzie w budowanym obiekcie?"
context = """..."""
print(qa(question=question, context=context.replace("\n", " ")))
```
The output is a Python dict with the keys `answer`, `score`, `start`, and `end`.
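In the Hugging Face question-answering pipeline, `start` and `end` are character offsets into the context string that was passed in, so they refer to the string after the `replace("\n", " ")` call. The sketch below shows how the fields relate. The `result` dict is written by hand to mimic the output shape: the context, the answer, and the score are placeholders, not predictions of `radlab/polish-qa-v2`.

```python
# Illustrative only: a hand-written dict in the shape the pipeline returns.
context = "W budowanym obiekcie\nbędzie biblioteka miejska."
flat_context = context.replace("\n", " ")  # same normalisation as in the script

answer = "biblioteka miejska"
start = flat_context.index(answer)
result = {
    "answer": answer,
    "score": 0.9,                 # placeholder value, not a real model score
    "start": start,
    "end": start + len(answer),
}

# Slicing the string given to the pipeline recovers the answer text.
span = flat_context[result["start"]:result["end"]]
print(span)
```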
---
## Building Token‑Mapping Tables
The script `qa-utils/build_all_maps_tokenizer.py` creates JSON maps that translate each QA token ID to one or more
token IDs of a target generative model.
```bash
python qa-utils/build_all_maps_tokenizer.py
```
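The idea of mapping one QA token ID to one or more generator token IDs can be sketched as follows. This is not the repository's implementation: the vocabularies are toy stand-ins, and `build_token_map` and `toy_encode` are hypothetical names. The assumed approach is to take the text of each QA token and re-encode it with the target tokenizer.

```python
from typing import Callable, Dict, List

# Toy stand-ins for real tokenizers (hypothetical vocabularies).
QA_VOCAB: Dict[int, str] = {0: "War", 1: "szawa", 2: "Warszawa"}
TARGET_VOCAB: Dict[str, int] = {"War": 10, "sza": 11, "wa": 12, "szawa": 13}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoding over TARGET_VOCAB.

    Returns an empty list if the text cannot be fully covered.
    """
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TARGET_VOCAB:
                ids.append(TARGET_VOCAB[piece])
                pos = end
                break
        else:
            return []  # no vocabulary entry matches at this position
    return ids


def build_token_map(
    qa_vocab: Dict[int, str], encode_target: Callable[[str], List[int]]
) -> Dict[int, List[int]]:
    """Map each QA token ID to the target IDs that spell the same text."""
    token_map: Dict[int, List[int]] = {}
    for qa_id, text in qa_vocab.items():
        target_ids = encode_target(text)
        if target_ids:  # skip tokens the target tokenizer cannot represent
            token_map[qa_id] = target_ids
    return token_map


token_map = build_token_map(QA_VOCAB, toy_encode)
print(token_map)
```

With real Hugging Face tokenizers, the encoding step would more likely be a call such as `tokenizer.encode(text, add_special_tokens=False)`. Handling of subword markers and special tokens is left out of this sketch.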
- **Configuration** – edit the `ROBERTA_MODELS` and `GENAI_MODELS` lists inside the script to add or remove models.
- **Cache** – generated maps are stored under `cache/maps/`
and metadata (hashes) under `cache/meta/maps_metadata.json`.
- **Force rebuild** – use `--force-rebuild` to ignore existing caches.
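The cache behaviour described above could work roughly like the sketch below. Only the directory layout (`cache/maps/`, `cache/meta/maps_metadata.json`) comes from this README. What exactly the script hashes, the map file name, and all function names are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(qa_model: str, gen_model: str) -> str:
    """Hash the model pair (assumed input; the real script may hash more)."""
    return hashlib.sha256(f"{qa_model}->{gen_model}".encode("utf-8")).hexdigest()


def needs_rebuild(meta_path: Path, map_path: Path, digest: str, force: bool = False) -> bool:
    """Rebuild when forced, when files are missing, or when the stored hash differs."""
    if force or not map_path.exists() or not meta_path.exists():
        return True
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return meta.get(map_path.name) != digest


def save_map(meta_path: Path, map_path: Path, digest: str, token_map: dict) -> None:
    """Write the map and record its hash in the shared metadata file."""
    map_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    map_path.write_text(json.dumps(token_map), encoding="utf-8")
    meta = json.loads(meta_path.read_text(encoding="utf-8")) if meta_path.exists() else {}
    meta[map_path.name] = digest
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")


# Demo in a temporary directory that mirrors the cache layout.
root = Path(tempfile.mkdtemp())
meta_path = root / "cache" / "meta" / "maps_metadata.json"
map_path = root / "cache" / "maps" / "qa_to_gen.json"  # hypothetical file name
digest = config_hash("radlab/polish-qa-v2", "some-generative-model")  # placeholder target

first = needs_rebuild(meta_path, map_path, digest)               # nothing cached yet
save_map(meta_path, map_path, digest, {"0": [10]})
second = needs_rebuild(meta_path, map_path, digest)              # cache is up to date
forced = needs_rebuild(meta_path, map_path, digest, force=True)  # like --force-rebuild
print(first, second, forced)
```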
These maps are later consumed by the bias processor.
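As a rough picture of how a map might be consumed, the sketch below builds a static bias vector over the generator vocabulary and adds it to the generator's logits. It is not the code of `QAStaticBiasProcessor`. The function names, the per-token QA scores, and the choice to keep the maximum weighted score per generator token are all assumptions.

```python
from typing import Dict, List


def build_static_bias(
    qa_scores: Dict[int, float],
    token_map: Dict[int, List[int]],
    vocab_size: int,
    weight: float,
) -> List[float]:
    """Project QA-side scores onto generator token IDs.

    Each generator token keeps the largest weighted score of any QA token
    mapped to it. Because the bias starts at 0.0, negative scores have no
    effect in this sketch.
    """
    bias = [0.0] * vocab_size
    for qa_id, score in qa_scores.items():
        for gen_id in token_map.get(qa_id, []):
            bias[gen_id] = max(bias[gen_id], weight * score)
    return bias


def apply_bias(gen_logits: List[float], bias: List[float]) -> List[float]:
    """Add the precomputed bias to one step of generator logits."""
    return [logit + b for logit, b in zip(gen_logits, bias)]


# Made-up inputs: scores per QA token ID and a tiny token map.
qa_scores = {0: 2.0, 2: 4.0}
token_map = {0: [1], 2: [1, 3]}

bias = build_static_bias(qa_scores, token_map, vocab_size=5, weight=0.5)
biased = apply_bias([0.1, 0.2, 0.3, 0.4, 0.5], bias)
print(bias)
```

The bias is "static" in the sense that it is computed once from the QA output and then reused at every decoding step. In 🤗 Transformers, this kind of adjustment is typically placed in a `LogitsProcessor` subclass.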
---
## License
This project is licensed under the Apache 2.0 License – see the [LICENSE](LICENSE) file for details.