https://github.com/simula/pointdetectcount
# PointDetectCount: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models
This repository contains the code and data generation scripts used in the paper:
**[Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models](https://arxiv.org/html/2505.16647v1)**
`Sushant Gautam, Michael A. Riegler, Pål Halvorsen`
*arXiv preprint, May 2025*
---
## Overview
PointDetectCount is a unified multi-task framework for fine-tuning instruction-tuned vision-language models (VLMs) on three fundamental medical imaging tasks:
- **Pointing (Localization)**
- **Bounding Box Detection**
- **Counting (Object Enumeration)**
The model is trained and evaluated on the [MedMultiPoints](https://huggingface.co/datasets/SimulaMet/MedMultiPoints) dataset, a multimodal dataset comprising diverse clinical annotations.
---
## Dataset
The dataset is available on Hugging Face:
[`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints)
All raw images should be stored locally in the `MedMultiPoints-images/` directory.
### Download Images Locally
You can download the image files directly from the Hugging Face dataset using the
[`datasets`](https://github.com/huggingface/datasets) library:
```python
import os

from datasets import load_dataset

# Load the dataset
ds = load_dataset("SimulaMet/MedMultiPoints")

# Directory where the images are saved
output_dir = "MedMultiPoints-images"
os.makedirs(output_dir, exist_ok=True)

# Save one image per unique SHA-256 hash
seen = set()
for row in ds["train"]:
    sha = row["image_sha256"]
    if sha in seen:
        continue
    seen.add(sha)
    # `image` is the column name listed in the table below; convert to RGB
    # so that images in other modes can still be written as JPEG
    row["image"].convert("RGB").save(os.path.join(output_dir, f"{sha}.jpg"))
```
This snippet creates the `MedMultiPoints-images/` folder (if it doesn't already
exist) and writes one image per unique SHA-256 hash to that directory, using the
hash as the filename.
| Column              | Type                 | Description                                      |
|---------------------|----------------------|--------------------------------------------------|
| `image`             | Image                | Raw medical image                                |
| `image_sha256`      | string               | SHA-256 checksum for integrity                   |
| `img_size`          | `[int, int]`         | Image dimensions: `[width, height]`              |
| `points`            | `[[x, y]]`           | List of point annotations                        |
| `bbox`              | `[[x1, y1, x2, y2]]` | List of bounding boxes                           |
| `count`             | int                  | Number of annotated objects                      |
| `label`             | string               | Object class (e.g., polyp, sperm, cluster)       |
| `collection_method` | string               | Task relevance (e.g., detection, counting)       |
| `classification`    | string               | Free-form annotation description                 |
| `organ`             | string               | Organ or modality type (e.g., GI tract, sperm)   |
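As a rough illustration of how these columns relate, the sketch below checks that every point and box of a single row lies within the image bounds given by `img_size`. The `check_record` helper and the sample row are hypothetical, and the assumption that coordinates are absolute pixel values is not confirmed by the dataset card.

```python
from typing import Any


def check_record(row: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one dataset row.

    Assumes (not confirmed by the dataset card) that coordinates are
    absolute pixel values and that img_size is [width, height].
    """
    width, height = row["img_size"]
    problems = []
    for i, (x, y) in enumerate(row["points"]):
        if not (0 <= x <= width and 0 <= y <= height):
            problems.append(f"point {i} lies outside the image")
    for i, (x1, y1, x2, y2) in enumerate(row["bbox"]):
        if not (0 <= x1 <= x2 <= width and 0 <= y1 <= y2 <= height):
            problems.append(f"bbox {i} is malformed or outside the image")
    return problems


# Hypothetical row with made-up values, shaped like the columns above.
sample = {
    "img_size": [640, 480],
    "points": [[100, 120], [700, 50]],
    "bbox": [[80, 100, 130, 150]],
    "count": 2,
}
print(check_record(sample))
```

A check like this can be run over a few rows before training to catch coordinate-convention mismatches early.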
**Instruction-Fused JSONL Files**:
- [`multi-task-train.jsonl`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints/resolve/main/instruction_dataset/multi-task-train.jsonl)
- [`multi-task-test.jsonl`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints/resolve/main/instruction_dataset/multi-task-test.jsonl)
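
These files use the JSON Lines layout, with one JSON object per line. A minimal sketch for loading such a file with the standard library follows. The `parse_jsonl` and `read_jsonl` helpers are hypothetical, and no field names are assumed, since the record schema is not described here.

```python
import json
from pathlib import Path


def parse_jsonl(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def read_jsonl(path):
    """Read a JSON Lines file from disk (the file must be downloaded first)."""
    return parse_jsonl(Path(path).read_text(encoding="utf-8"))
```

After downloading a file, `records = read_jsonl("multi-task-train.jsonl")` returns one Python object per line, which makes it easy to inspect the first few records before training.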
---
## Fine-Tuned Model
Model weights are available on Hugging Face:
[`SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA`](https://huggingface.co/SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA)
---
## Repository Structure
| File/Folder | Description |
|-----------------------|--------------------------------------------------------------------------|
| `create_datasetJSON.py` | Generates instruction-formatted JSONL files for multi-task fine-tuning |
| `evaluate_qwen.py` | Evaluates VLM outputs against structured annotations (bbox, point, count) |
| `MedMultiPoints-images/` | Directory to store dataset images locally |
---
## Usage
### Create Instruction Dataset
Run the conversion script to produce an instruction-formatted dataset. Adjust the image directory or output path if needed:
```bash
python create_datasetJSON.py --image-dir MedMultiPoints-images --output kvasir_valid.jsonl
```
### Evaluate Predictions
Compare your model's predictions with the provided ground truth using:
```bash
python evaluate_qwen.py --dataset kvasir_valid-qwen-6task-test.jsonl --results kvasir_valid-qwen-6task-test-result.jsonl
```
### Fine-Tune Qwen (LoRA)
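Training uses the instruction-fused training file
[`multi-task-train.jsonl`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints/resolve/main/instruction_dataset/multi-task-train.jsonl).
The `--dataset` and `--output_dir` values below are local paths from the original training run; replace them with your own: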
```bash
swift sft --model Qwen/Qwen2.5-VL-7B-Instruct \
--train_type lora \
--dataset /home/sushant/D1/MIUA/kvasir-format/multi-task-train.jsonl \
--output_dir /home/sushant/D1/MIUA/kvasir-format/training2 \
--num_train_epochs 5 \
--eval_steps 200 \
--save_total_limit 3 \
--report_to wandb \
--per_device_train_batch_size 4
```
### Inference
Run inference with either the fine-tuned checkpoint or the original model:
```bash
# Fine-tuned model
swift infer --model SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA \
--val_dataset https://huggingface.co/datasets/SimulaMet/MedMultiPoints/resolve/main/instruction_dataset/multi-task-test.jsonl \
--result_path qwen_outputs/qwen-finetuned-6task-test500-result.jsonl \
--use_hf true
# Public checkpoint
swift infer --model Qwen/Qwen2.5-VL-7B-Instruct \
--val_dataset https://huggingface.co/datasets/SimulaMet/MedMultiPoints/resolve/main/instruction_dataset/multi-task-test.jsonl \
--result_path qwen_outputs/qwen-public-6task-test500-result.jsonl \
--use_hf true
```
---
## Methodology Summary
We fine-tune [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using [LoRA](https://arxiv.org/abs/2106.09685) for instruction-based multi-task image understanding.
- Each image is associated with 5 instruction-response pairs.
- Responses are expected to be JSON-formatted predictions.
- All tasks are trained jointly with the standard language-modeling loss.
For more details, see [Section IV of the paper](https://arxiv.org/html/2505.16647v1#S4).
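
Because responses are expected to be JSON, a downstream script has to cope with outputs that wrap the JSON in extra text or a Markdown code fence. The sketch below shows one tolerant way to do that. `extract_json` is a hypothetical helper, not the parsing used in `evaluate_qwen.py`, and the example response keys are made up.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(response):
    """Return the first JSON object or array found in a model response, or None."""
    # Prefer the contents of a fenced block if the model produced one.
    fenced = FENCED_BLOCK.search(response)
    candidate = fenced.group(1) if fenced else response
    decoder = json.JSONDecoder()
    # Try to decode starting at each opening bracket until one succeeds.
    for match in re.finditer(r"[\[{]", candidate):
        try:
            value, _ = decoder.raw_decode(candidate, match.start())
            return value
        except json.JSONDecodeError:
            continue
    return None


# Hypothetical response text; the "count" key is illustrative only.
print(extract_json('The answer is {"count": 3}.'))
```

Returning `None` instead of raising keeps unparseable responses visible, so an evaluation script can decide for itself how to handle them.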
---
## Evaluation Metrics
| Task | Metrics (Key) |
|------------------|------------------------------------------------|
| **Counting** | MAE, MSE |
| **Pointing** | Point MAE, RMSE, Matching Accuracy, Zero-cases |
| **Bounding Box** | mAP, mAP@50, mAP@75, IoU |
The evaluation code is provided in `evaluate_qwen.py`.
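
For orientation, two of the simpler quantities in the table can be sketched as follows: counting MAE and MSE over paired counts, and IoU between two `[x1, y1, x2, y2]` boxes. These are textbook definitions written for illustration. The function names are hypothetical, and `evaluate_qwen.py` may compute or aggregate the metrics differently.

```python
def count_errors(predicted, actual):
    """Mean absolute error and mean squared error over paired counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("expected two non-empty sequences of equal length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse


def box_iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up values for illustration.
print(count_errors([3, 5], [2, 7]))
print(box_iou([0, 0, 2, 2], [1, 1, 3, 3]))
```

The mAP variants in the table build on IoU by thresholding it (for example at 0.5 for mAP@50) to decide which predicted boxes count as matches.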
---
## Citation
If you use this work, please cite:
```bibtex
@inproceedings{Gautam,
  author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, P{\aa}l},
  title = {{Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models}},
  booktitle = {{2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS)}},
  year = {2025},
  pages = {18--20},
  publisher = {IEEE},
  doi = {10.1109/CBMS65348.2025.00090}
}
```
---
## Contact
For questions or collaboration inquiries, reach out to:
sushant@simula.no