{"id":17300713,"url":"https://github.com/michiyasunaga/dragon","last_synced_at":"2025-06-17T11:34:07.395Z","repository":{"id":108128202,"uuid":"547587897","full_name":"michiyasunaga/dragon","owner":"michiyasunaga","description":"[NeurIPS 2022] DRAGON 🐲: Deep Bidirectional Language-Knowledge Graph Pretraining ","archived":false,"fork":false,"pushed_at":"2023-05-10T15:53:32.000Z","size":617,"stargazers_count":322,"open_issues_count":1,"forks_count":46,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-31T11:03:29.847Z","etag":null,"topics":["graph-neural-networks","knowledge-graph","language-model","pretraining","question-answering","reasoning","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michiyasunaga.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-08T00:01:52.000Z","updated_at":"2025-03-16T21:04:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"8e5ce2f4-436f-4d04-a8b1-b7f65782133f","html_url":"https://github.com/michiyasunaga/dragon","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2Fdragon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2Fdragon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2Fdragon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mi
chiyasunaga%2Fdragon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michiyasunaga","download_url":"https://codeload.github.com/michiyasunaga/dragon/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247648977,"owners_count":20972945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["graph-neural-networks","knowledge-graph","language-model","pretraining","question-answering","reasoning","transformer"],"created_at":"2024-10-15T11:29:55.644Z","updated_at":"2025-04-07T12:10:20.710Z","avatar_url":"https://github.com/michiyasunaga.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining\n\nThis repo provides the source code \u0026 data of our paper \"[DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining](https://arxiv.org/abs/2210.09338)\" (NeurIPS 2022).\n\n\n### Overview\nDRAGON is a new foundation model (improvement of BERT) that is **pre-trained jointly from text and knowledge graphs** for improved language, knowledge and reasoning capabilities. Specifically, it was trained with two simultaneous self-supervised objectives, language modeling and link prediction, that encourage deep bidirectional reasoning over text and knowledge graphs.\n\nDRAGON can be used as a drop-in replacement for BERT. 
It achieves better performance in various NLP tasks, and is particularly effective for **knowledge and reasoning-intensive** tasks such as multi-step reasoning and low-resource QA.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figs/dragon.png\" width=\"1000\" title=\"DRAGON model overview\" alt=\"\"\u003e\n\u003c/p\u003e\n\n\n\n## 0. Dependencies\n\nRun the following commands to create a conda environment:\n```bash\nconda create -y -n dragon python=3.8\nconda activate dragon\npip install torch==1.10.1+cu113 torchvision -f https://download.pytorch.org/whl/cu113/torch_stable.html\npip install transformers==4.9.1 wandb nltk spacy==2.1.6\npython -m spacy download en\npip install scispacy==0.3.0\npip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz\npip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-geometric==2.0.0 -f https://pytorch-geometric.com/whl/torch-1.10.1+cu113.html\n```\n\n## 1. Download pretrained models\nYou can download pretrained DRAGON models below. Place the downloaded model files under `./models`\n\n| Model | Domain | Size | Pretraining Text | Pretraining Knowledge Graph | Download Link |\n| ------------- | ------------- | --------- | ---- | ---- | ---- |\n| DRAGON   | General     | 360M parameters | BookCorpus | ConceptNet | [general_model](https://nlp.stanford.edu/projects/myasu/DRAGON/models/general_model.pt) |\n| DRAGON   | Biomedicine | 360M parameters | PubMed | UMLS | [biomed_model](https://nlp.stanford.edu/projects/myasu/DRAGON/models/biomed_model.pt) |\n\n\n## 2. Download data\n### Commonsense domain\nYou can download all the preprocessed data from [**[here]**](https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip). This includes the ConceptNet knowledge graph as well as CommonsenseQA, OpenBookQA and RiddleSense datasets. 
Specifically, run:\n```\nwget https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip\nunzip data_preprocessed.zip\nmv data_preprocessed data\n```\n\n\n**(Optional)** If you would like to preprocess the raw data from scratch, you can download the raw data (the ConceptNet knowledge graph, CommonsenseQA, and OpenBookQA) by running:\n```\n./download_raw_data.sh\n```\nTo preprocess the raw data, run:\n```\nCUDA_VISIBLE_DEVICES=0 python preprocess.py -p \u003cnum_processes\u003e --run common csqa obqa\n```\nYou can select the GPU to use by setting `CUDA_VISIBLE_DEVICES=...` at the beginning of the command. The script will:\n* Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)\n* Convert the QA datasets into .jsonl files (e.g., stored in `data/csqa/statement/`)\n* Identify all mentioned concepts in the questions and answers\n* Extract subgraphs for each question-answer pair\n\n\n\n### Biomedical domain\nYou can download all the preprocessed data from [**[here]**](https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip). 
This includes the UMLS biomedical knowledge graph and MedQA dataset.\n\n**(Optional)** If you would like to preprocess MedQA from scratch, follow `utils_biomed/preprocess_medqa.ipynb` and then run\n```\nCUDA_VISIBLE_DEVICES=0 python preprocess.py -p \u003cnum_processes\u003e --run medqa\n```\n\n\u003cbr\u003e\nThe resulting file structure should look like this:\n\n```plain\n.\n├── README.md\n├── models/\n    ├── general_model.pt\n    ├── biomed_model.pt\n\n└── data/\n    ├── cpnet/                 (preprocessed ConceptNet KG)\n    └── csqa/\n        ├── train_rand_split.jsonl\n        ├── dev_rand_split.jsonl\n        ├── test_rand_split_no_answers.jsonl\n        ├── statement/             (converted statements)\n        ├── grounded/              (grounded entities)\n        ├── graphs/                (extracted subgraphs)\n        ├── ...\n    ├── obqa/\n    ├── umls/                  (preprocessed UMLS KG)\n    └── medqa/\n```\n\n## 3. Train DRAGON\nTo train DRAGON on CommonsenseQA, OpenBookQA, RiddleSense, MedQA, run:\n```\nscripts/run_train__csqa.sh\nscripts/run_train__obqa.sh\nscripts/run_train__riddle.sh\nscripts/run_train__medqa.sh\n```\n\n\n**(Optional)** If you would like to pretrain DRAGON (i.e. self-supervised pretraining), run\n```\nscripts/run_pretrain.sh\n```\nAs a quick demo, this script uses sentences from CommonsenseQA as training data.\nIf you wish to use a larger, general corpus like BookCorpus, follow Section 5 (Use your own dataset) to prepare the training data.\n\n\n## 4. 
Evaluate trained models\nFor CommonsenseQA, OpenBookQA, RiddleSense, MedQA, run:\n```\nscripts/run_eval__csqa.sh\nscripts/run_eval__obqa.sh\nscripts/run_eval__riddle.sh\nscripts/run_eval__medqa.sh\n```\nYou can download trained model checkpoints in the next section.\n\n\n### Trained model examples\nCommonsenseQA\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eTrained model\u003c/th\u003e\n    \u003cth\u003eIn-house Dev acc.\u003c/th\u003e\n    \u003cth\u003eIn-house Test acc.\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003eDRAGON \u003ca href=\"https://nlp.stanford.edu/projects/myasu/DRAGON/models/csqa_model.pt\"\u003e[link]\u003c/a\u003e\u003c/th\u003e\n    \u003cth\u003e0.7928\u003c/th\u003e\n    \u003cth\u003e0.7615\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nOpenBookQA\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eTrained model\u003c/th\u003e\n    \u003cth\u003eDev acc.\u003c/th\u003e\n    \u003cth\u003eTest acc.\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003eDRAGON \u003ca href=\"https://nlp.stanford.edu/projects/myasu/DRAGON/models/obqa_model.pt\"\u003e[link]\u003c/a\u003e\u003c/th\u003e\n    \u003cth\u003e0.7080\u003c/th\u003e\n    \u003cth\u003e0.7280\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nRiddleSense\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eTrained model\u003c/th\u003e\n    \u003cth\u003eIn-house Dev acc.\u003c/th\u003e\n    \u003cth\u003eIn-house Test acc.\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003eDRAGON \u003ca href=\"https://nlp.stanford.edu/projects/myasu/DRAGON/models/riddle_model.pt\"\u003e[link]\u003c/a\u003e\u003c/th\u003e\n    \u003cth\u003e0.6869\u003c/th\u003e\n    \u003cth\u003e0.7157\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nMedQA\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eTrained model\u003c/th\u003e\n    \u003cth\u003eDev acc.\u003c/th\u003e\n    \u003cth\u003eTest 
acc.\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003ca href=\"https://github.com/michiyasunaga/LinkBERT\"\u003eBioLinkBERT\u003c/a\u003e + DRAGON \u003ca href=\"https://nlp.stanford.edu/projects/myasu/DRAGON/models/medqa_model.pt\"\u003e[link]\u003c/a\u003e\u003c/th\u003e\n    \u003cth\u003e0.4308\u003c/th\u003e\n    \u003cth\u003e0.4768\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n**Note**: The models were trained and tested with HuggingFace transformers==4.9.1.\n\n\n\n## 5. Use your own dataset\n- Convert your dataset to  `{train,dev,test}.statement.jsonl`  in .jsonl format (see `data/csqa/statement/train.statement.jsonl`)\n- Create a directory in `data/{yourdataset}/` to store the .jsonl files\n- Modify `preprocess.py` and perform subgraph extraction for your data\n- Modify `utils/parser_utils.py` to support your own dataset\n\n\n\n## Citation\nIf you find our work helpful, please cite the following:\n```bib\n@InProceedings{yasunaga2022dragon,\n  author =  {Michihiro Yasunaga and Antoine Bosselut and Hongyu Ren and Xikun Zhang and Christopher D. Manning and Percy Liang and Jure Leskovec},\n  title =   {Deep Bidirectional Language-Knowledge Graph Pretraining},\n  year =    {2022},  \n  booktitle = {Neural Information Processing Systems (NeurIPS)},  \n}\n```\n\n## Acknowledgment\nThis repo is built upon the following works:\n```\nGreaseLM: Graph REASoning Enhanced Language Models for Question Answering\nhttps://github.com/snap-stanford/GreaseLM\n\nQA-GNN: Question Answering using Language Models and Knowledge Graphs\nhttps://github.com/michiyasunaga/qagnn\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichiyasunaga%2Fdragon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichiyasunaga%2Fdragon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichiyasunaga%2Fdragon/lists"}