{"id":44350973,"url":"https://github.com/radlab-dev-group/qa-model","last_synced_at":"2026-02-11T15:05:57.519Z","repository":{"id":317841201,"uuid":"1069021992","full_name":"radlab-dev-group/qa-model","owner":"radlab-dev-group","description":"A Polish extractive Question‑Answering (QA) solution with utilities for mapping its token space to  generative LLMs (Gemma, LLaMA, GPT‑OSS). The repo contains scripts for training the QA model, inference pipelines, a CLI to pre‑compute token‑mapping tables, and a generation helper that injects static bias from QA logits into a downstream generator.","archived":false,"fork":false,"pushed_at":"2025-12-27T17:26:20.000Z","size":45,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-29T15:39:06.572Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/radlab-dev-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-03T09:27:14.000Z","updated_at":"2025-12-27T17:24:50.000Z","dependencies_parsed_at":"2025-10-03T12:24:17.987Z","dependency_job_id":"1fe9ae5c-cc8b-48f2-8f07-a30bcddfdf09","html_url":"https://github.com/radlab-dev-group/qa-model","commit_stats":null,"previous_names":["radlab-dev-group/qa-model"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/radlab-dev-group/qa-model","repository_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radlab-dev-group%2Fqa-model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radlab-dev-group%2Fqa-model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radlab-dev-group%2Fqa-model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radlab-dev-group%2Fqa-model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/radlab-dev-group","download_url":"https://codeload.github.com/radlab-dev-group/qa-model/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radlab-dev-group%2Fqa-model/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29336093,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-11T14:34:07.188Z","status":"ssl_error","status_checked_at":"2026-02-11T14:34:06.809Z","response_time":97,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-11T15:05:56.953Z","updated_at":"2026-02-11T15:05:57.515Z","avatar_url":"https://github.com/radlab-dev-group.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# qa‑model\n\n**Polish Extractive Question‑Answering + Token‑Mapping for Generative LLMs**\n\nThis repository provides a 
complete workflow for a Polish extractive QA model based on a fine‑tuned RoBERTa checkpoint,\ntogether with tooling to bridge its token vocabulary to a set of modern generative models (Gemma‑3, LLaMA‑3, GPT‑OSS).\nThe token mapping enables static‑bias generation, where QA‑derived logits steer the output of a generator.\n\n---\n\n## Features\n\n- **Polish extractive QA** fine‑tuned from `sdadas/polish-roberta-large-v2`.\n- **Pre‑computed token maps** from the QA tokenizer to various generative tokenizers.\n- **Static bias processor** (`QAStaticBiasProcessor`) that adds a weighted bias derived from QA logits to a generator’s\n  logits.\n- End‑to‑end training script with optional Weights \u0026 Biases logging.\n- Simple inference script using Hugging Face `pipeline`.\n- Fully reproducible environment (Python 3.10, `venv`).\n\n---\n\n## Installation\n\n1. **Clone the repository**\n\n```bash\ngit clone https://github.com/radlab-dev-group/qa-model.git\ncd qa-model\n```\n\n2. **Create a virtual environment and install dependencies**\n\n```bash\npython3.10 -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n```\n\n3. 
**(Optional) Install additional Python packages**\n\n```bash\npip install transformers tqdm click\n```\n\n---\n\n## Training the QA Model\n\nThe training script `qa-model-train.py` fine‑tunes the RoBERTa checkpoint on the Polish **PoQuAD** dataset.\nThe model card is available in [HF_MODEL_CARD](qa-model/HF_MODEL_CARD.md).\n\n```bash\npython qa-model-train.py \\\n  --model-path sdadas/polish-roberta-large-v2 \\\n  --dataset-path clarin-pl/poquad \\\n  --output-dir ./models \\\n  --out-model-name polish-roberta-large-v2-qa \\\n  --max-context-size 512 \\\n  --store-artifacts-to-wb   # optional: push dataset \u0026 model to Weights \u0026 Biases\n```\n\nKey arguments:\n\n| Argument                  | Description                                                        |\n|---------------------------|--------------------------------------------------------------------|\n| `--model-path`            | Base model checkpoint (default: `sdadas/polish-roberta-large-v2`). |\n| `--dataset-path`          | HF dataset identifier (default: `clarin-pl/poquad`).               |\n| `--output-dir`            | Directory where training artifacts will be saved.                  |\n| `--out-model-name`        | Name of the final QA model.                                        |\n| `--max-context-size`      | Maximum token length for the context (default: 512).               |\n| `--store-artifacts-to-wb` | Store dataset and model in a W\u0026B run.                              
|\n\nTraining uses `Trainer` from 🤗 Transformers with mixed‑precision support,\ncheckpointing, and early stopping based on evaluation steps.\n\n---\n\n## Running Inference\n\nA quick inference example is provided in `qa-model-inference.py`:\n\n```bash\npython qa-model-inference.py\n```\n\nThe script loads the model from the Hugging Face Hub\n[radlab/polish-qa-v2](https://huggingface.co/radlab/polish-qa-v2)\nand runs it on a single question/context pair:\n\n```python\nfrom transformers import pipeline\n\nmodel_path = \"radlab/polish-qa-v2\"\nqa = pipeline(\"question-answering\", model=model_path)\n\nquestion = \"Co będzie w budowanym obiekcie?\"\ncontext = \"\"\"...\"\"\"\nprint(qa(question=question, context=context.replace(\"\\n\", \" \")))\n```\n\nThe output is a JSON‑like dict with `answer`, `score`, `start`, and `end`.\n\n---\n\n## Building Token‑Mapping Tables\n\nThe script `qa-utils/build_all_maps_tokenizer.py` creates JSON maps that translate each QA token ID to one or more token IDs of a target generative\nmodel.\n\n```bash\npython qa-utils/build_all_maps_tokenizer.py\n```\n\n- **Configuration** – edit the `ROBERTA_MODELS` and `GENAI_MODELS` lists inside the script to add or remove models.\n- **Cache** – generated maps are stored under `cache/maps/`\n  and metadata (hashes) under `cache/meta/maps_metadata.json`.\n- **Force rebuild** – use `--force-rebuild` to ignore existing caches.\n\nThese maps are later consumed by the bias processor.\n\n---\n\n## License\n\nThis project is licensed under the Apache 2.0 License – see the [LICENSE](LICENSE) file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradlab-dev-group%2Fqa-model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fradlab-dev-group%2Fqa-model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradlab-dev-group%2Fqa-model/lists"}