{"id":31944416,"url":"https://github.com/yueyang1996/knobo","last_synced_at":"2026-04-02T18:38:53.894Z","repository":{"id":243150127,"uuid":"804675141","full_name":"YueYANG1996/KnoBo","owner":"YueYANG1996","description":"NeurIPS 2024 (spotlight): A Textbook Remedy for Domain Shifts Knowledge Priors for Medical Image Analysis ","archived":false,"fork":false,"pushed_at":"2024-10-15T18:54:20.000Z","size":1025,"stargazers_count":26,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-14T10:25:11.336Z","etag":null,"topics":["confounding","domain-shift","interpretable-machine-learning","medical","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://yueyang1996.github.io/knobo/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YueYANG1996.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-05-23T03:43:37.000Z","updated_at":"2025-03-12T18:36:43.000Z","dependencies_parsed_at":"2024-10-17T00:39:42.707Z","dependency_job_id":null,"html_url":"https://github.com/YueYANG1996/KnoBo","commit_stats":null,"previous_names":["yueyang1996/knobo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/YueYANG1996/KnoBo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YueYANG1996%2FKnoBo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YueYANG1996%2FKnoBo/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YueYANG1996%2FKnoBo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YueYANG1996%2FKnoBo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YueYANG1996","download_url":"https://codeload.github.com/YueYANG1996/KnoBo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YueYANG1996%2FKnoBo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31313091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["confounding","domain-shift","interpretable-machine-learning","medical","retrieval-augmented-generation"],"created_at":"2025-10-14T10:24:18.447Z","updated_at":"2026-04-02T18:38:53.875Z","avatar_url":"https://github.com/YueYANG1996.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch2 align=\"center\" style=\"line-height: 50px;\"\u003e\n    \u003cimg src=\"https://yueyang1996.github.io/knobo/static/images/knobo_logo.png\" style=\"vertical-align: middle;\" width=\"50px\"/\u003e\n    A Textbook Remedy for Domain Shifts \u003cbr\u003e\n    
Knowledge Priors for Medical Image Analysis\n\u003c/h2\u003e\n\n\n\u003ch4 align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2405.14839\"\u003ePaper\u003c/a\u003e | \u003ca href=\"https://yueyang1996.github.io/knobo/\"\u003eProject Page\u003c/a\u003e\n\u003c/h4\u003e\n\n## Table of Contents\n1. [CLIP Models](#clip-models)\n2. [Installation](#installation)\n3. [Quick Start](#quick-start)\n4. [Directories](#directories)\n5. [Extract Features](#extract-features)\n6. [Generate Bottlenecks from Medical Documents](#generate-bottlenecks-from-medical-documents)\n7. [Train Grounding Functions](#train-grounding-functions)\n8. [Baselines](#baselines)\n\n\n## CLIP Models\nWe release the two CLIP models we trained for X-ray and skin lesion images on Hugging Face.\n* **WhyXrayCLIP** 🩻 : https://huggingface.co/yyupenn/whyxrayclip\n* **WhyLesionCLIP** 👍🏽 : https://huggingface.co/yyupenn/whylesionclip\n\n\n## Installation\nClone the repo, then install the required dependencies and download the data by running the following commands:\n```bash\ngit clone https://github.com/YueYANG1996/KnoBo.git\ncd KnoBo\nsh setup.sh\n```\n\n## Quick Start\nTo get the results of KnoBo on X-ray datasets, you can run the following command:\n```bash\npython modules/cbm.py \\\n    --mode binary \\\n    --bottleneck PubMed \\\n    --number_of_features 150 \\\n    --add_prior True \\\n    --modality xray \\\n    --model_name whyxrayclip\n```\nThe output will be saved to `./data/results/`. To get the results on the skin lesion datasets, change `--modality` to `skin` and `--model_name` to `whylesionclip`.\n\n\n## Directories\n* `data/`: Contains the data for all experiments.\n  - `data/bottlenecks/`: Contains the concept bottleneck created using medical documents.\n  - `data/datasets/`: Contains the splits for all datasets. You may need to download the images of each dataset from its original sources. 
Please refer to [DATASETS.md](DATASETS.md) for more details.\n  - `data/features/`: Contains the features extracted from different models.\n  - `data/grounding_functions/`: Contains the grounding functions for each concept in the bottleneck.\n  - `data/results/`: Contains the results of all experiments.\n\n* `modules/`: Contains the scripts for all experiments.\n  - [`modules/cbm.py`](modules/cbm.py): Contains the script for running the linear-based models, including KnoBo, linear probing, and PCBM.\n  - [`modules/extract_features.py`](modules/extract_features.py): Contains the script for extracting image features using different models.\n  - [`modules/train_grounding.py`](modules/train_grounding.py): Contains the script for training the grounding functions for each concept in the bottleneck.\n  - [`modules/end2end.py`](modules/end2end.py): Contains the script for training the end-to-end models, including ViT and DenseNet.\n  - [`modules/LSL.py`](modules/LSL.py): Contains the script for fine-tuning CLIP with knowledge (Language-shaped Learning).\n  - [`modules/models.py`](modules/models.py): Contains the models used in the experiments.\n  - [`modules/utils.py`](modules/utils.py): Contains the utility functions.\n\n\n## Extract Features\nAfter running [`setup.sh`](setup.sh), you should have the features extracted from the two CLIP models we trained in the `data/features/` directory. If you want to extract features using other models, you can run the following command:\n```bash\npython modules/extract_features.py \\\n    --dataset_name \u003cNAME OF THE DATASET\u003e \\\n    --model_name \u003cNAME OF THE MODEL\u003e \\\n    --image_dir \u003cPATH TO THE IMAGE DIRECTORY\u003e\n```\nThe supported models are listed [here](https://github.com/YueYANG1996/KnoBo/blob/e3e3171b74b6c8f42046676aa6c6ae21a034deba/modules/extract_features.py#L141). 
We provide a bash script [`extract_features.sh`](extract_features.sh) to extract features for all datasets using the two CLIP models we trained.\n\n\n## Generate Bottlenecks from Medical Documents\nWe built the retrieval-based concept bottleneck generation pipeline on [MedRAG](https://arxiv.org/pdf/2402.13178). You need to first clone our [forked version](https://github.com/YueYANG1996/MedRAG/tree/main) and set up the environment by running the following commands:\n```bash\ngit clone https://github.com/YueYANG1996/MedRAG.git\ncd MedRAG\nsh setup.sh\n```\nSetup may take a while because it downloads the 5M PubMed documents (29.5 GB). After setting up the environment, you can test the RAG system by running [`test.py`](https://github.com/YueYANG1996/MedRAG/blob/main/test.py).\n\nTo generate the concept bottleneck from medical documents, you can run the following command:\n```bash\npython concept_generation.py \\\n    --modality \u003cxray or skin\u003e \\\n    --corpus_name \u003cNAME OF THE CORPUS\u003e \\\n    --number_of_concepts \u003cNUMBER OF CONCEPTS\u003e \\\n    --openai_key \u003cOPENAI API KEY\u003e\n```\nFor `--corpus_name`, you can choose from `PubMed_all` (our version of PubMed with all paragraphs), `PubMed` (MedRAG's original version of PubMed, which only has abstracts), `Textbooks`, `StatPearls`, and `Wikipedia`. 
The generated bottleneck will be saved to `./data/bottlenecks/\u003cmodality\u003e_\u003ccorpus\u003e_\u003cnumber_of_concepts\u003e.txt`.\n\n**Annotate concepts:** You can annotate clinical reports for each concept in the bottleneck by running the following command:\n```bash\npython annotate_question.py \\\n    --annotator \u003ct5 or gpt4\u003e \\\n    --modality \u003cxray or skin\u003e \\\n    --bottleneck \u003cNAME OF THE BOTTLENECK\u003e \\\n    --number_of_reports \u003cNUMBER OF REPORTS TO ANNOTATE\u003e \\\n    --openai_key \u003cOPENAI API KEY\u003e\n```\nThe default LLM for annotation is [Flan-T5-XXL](https://huggingface.co/google/flan-t5-xxl). You can change it to GPT-4 by setting `--annotator gpt4` (warning: this may cost a lot of money). The default number of reports to annotate is 1000. The annotated reports will be saved to `./data/concept_annotation_\u003cmodality\u003e/annotations_\u003cannotator\u003e/`.\n\n\n## Train Grounding Functions\nTo train the grounding functions for each concept in the bottleneck, you can run the following command:\n```bash\npython modules/train_grounding.py \\\n    --modality \u003cxray or skin\u003e \\\n    --bottleneck \u003cNAME OF THE BOTTLENECK\u003e\n```\nEach grounding function is a binary classifier that predicts whether the concept is present in the image. 
The output will be saved to `./data/grounding_functions/\u003cmodality\u003e/\u003cconcept\u003e/`.\n\n\n## Baselines\n* **Linear Probing**: `python modules/cbm.py --mode linear_probe --modality \u003cxray or skin\u003e --model_name \u003cvision backbone\u003e`.\n\n* **PCBM-h**: `python modules/cbm.py --mode pcbm --bottleneck PubMed --number_of_features 150 --modality \u003cxray or skin\u003e --model_name \u003cvision backbone\u003e`.\n\n* **End-to-End**: `python modules/end2end.py --modality \u003cxray or skin\u003e --model_name \u003cvit or densenet\u003e`.\n\n* **LSL**: First fine-tune the CLIP model with knowledge using the following command:\n  ```bash\n  python modules/LSL.py \\\n      --modality \u003cxray or skin\u003e \\\n      --clip_model_name \u003cbase model, e.g., whyxrayclip\u003e \\\n      --bottleneck \u003cNAME OF THE BOTTLENECK\u003e \\\n      --image_dir \u003cPATH TO THE IMAGE DIRECTORY\u003e\n  ```\n  Then, extract the features using the fine-tuned CLIP model and get the final results in the same way as linear probing: `python modules/cbm.py --mode linear_probe --modality \u003cxray or skin\u003e --model_name \u003cfine-tuned vision backbone\u003e`. We provide the models we fine-tuned on PubMed in the `data/model_weights/` directory.\n\n\n## Citation\nPlease cite our paper if you find our work useful!\n```bibtex\n@article{yang2024textbook,\n      title={A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis}, \n      author={Yue Yang and Mona Gandhi and Yufei Wang and Yifan Wu and Michael S. Yao and Chris Callison-Burch and James C. Gee and Mark Yatskar},\n      journal={arXiv preprint arXiv:2405.14839},\n      year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyueyang1996%2Fknobo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyueyang1996%2Fknobo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyueyang1996%2Fknobo/lists"}