# Multi-Hop Graph Relation Networks (EMNLP 2020)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

This is the repo of our EMNLP'20 [paper](https://arxiv.org/abs/2005.00646):

```
Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering
Yanlin Feng*, Xinyue Chen*, Bill Yuchen Lin, Peifeng Wang, Jun Yan and Xiang Ren.
EMNLP 2020.
*=equal contribution
```

This repository also implements the following graph encoding models for question answering (including vanilla LM fine-tuning):

- **RelationNet**
- **R-GCN**
- **KagNet**
- **GConAttn**
- **KVMem**
- **MHGRN (a.k.a. MultiGRN)**

Each model supports the following text encoders:

- **LSTM**
- **GPT**
- **BERT**
- **XLNet**
- **RoBERTa**

## Resources

We provide the preprocessed ConceptNet and pretrained entity embeddings for your own use. These resources are independent of the source code.

***Note that the following resources can be downloaded [here](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws).***

### ConceptNet (5.6.0)

| Description | Downloads | Notes |
| ---------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Entity Vocab | [entity-vocab](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws) | one entity per line, space replaced by '_' |
| Relation Vocab | [relation-vocab](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws) | one relation per line, merged |
| ConceptNet (CSV format) | [conceptnet-5.6.0-csv](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws) | English tuples extracted from the full ConceptNet with merged relations |
| ConceptNet (NetworkX format) | [conceptnet-5.6.0-networkx](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws) | NetworkX pickled format, pruned by filtering out stop words |
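
The NetworkX version is a pickled graph, so it can be loaded directly with the standard `pickle` module (a minimal sketch; the file name below is a placeholder, substitute the name of the file you downloaded):

```python
import pickle

# Load the pruned English ConceptNet graph stored as a pickled NetworkX graph.
# "conceptnet.en.pruned.graph" is a placeholder name for the downloaded file.
with open("conceptnet.en.pruned.graph", "rb") as f:
    cpnet = pickle.load(f)

print(type(cpnet), cpnet.number_of_nodes(), cpnet.number_of_edges())
```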

### Entity Embeddings (Node Features)

Entity embeddings are packed into a matrix of shape (#entities, dim) and stored in NumPy format. Use `np.load` to read the file (see the loading sketch after the table below). You may need to download the vocabulary files first.

| Embedding Model | Dimensionality | Description | Downloads |
| --------------- | -------------- | --------------------------------------------------------- | ------------------------------------------------------------ |
| TransE | 100 | Obtained using OpenKE with optim=sgd, lr=1e-3, epoch=1000 | [entities]() [relations]() |
| NumberBatch | 300 | | [entities]() |
| BERT-based | 1024 | Provided by Zhengwei | [entities](https://drive.google.com/drive/folders/155codqEnsKazO8-BchF3rO_cP3EyYdws) |
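
To use the embeddings, load the matrix with NumPy and pair each row with the entity vocabulary (a minimal sketch; both file names are placeholders for the files you downloaded):

```python
import numpy as np

# Entity embedding matrix of shape (#entities, dim), stored in NumPy format.
ent_emb = np.load("ent_emb.npy")              # placeholder file name
print(ent_emb.shape)

# Matching entity vocabulary: one entity per line, spaces replaced by '_'.
with open("concept.txt") as f:                # placeholder file name
    id2entity = [line.strip() for line in f]

assert len(id2entity) == ent_emb.shape[0]     # row i of ent_emb is the embedding of id2entity[i]
```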

## Dependencies

- [Python]() >= 3.6
- [PyTorch]() == 1.1.0
- [transformers]() == 2.0.0
- [tqdm]()
- [dgl]() == 0.3.1 (GPU version)
- [networkx]() == 2.3

Run the following commands to create a conda environment (assuming CUDA 10):

```bash
conda create -n krqa python=3.6 numpy matplotlib ipython
source activate krqa
conda install pytorch=1.1.0 torchvision cudatoolkit=10.0 -c pytorch
pip install dgl-cu100==0.3.1
pip install transformers==2.0.0 tqdm networkx==2.3 nltk spacy==2.1.6
python -m spacy download en
```

## Usage

### 1. Download Data

First, download all the data needed to train the models:

```bash
git clone https://github.com/INK-USC/MHGRN.git
cd MHGRN
bash scripts/download.sh
```

The script will:

- Download the [CommonsenseQA]() dataset
- Download [ConceptNet]()
- Download pretrained TransE embeddings

### 2. Preprocess

To preprocess the data, run:

```bash
python preprocess.py
```

By default, all available CPU cores are used for multiprocessing to speed up preprocessing. Alternatively, use `-p` to specify the number of processes:

```bash
python preprocess.py -p 20
```

The script will:

- Convert the original datasets into .jsonl files (stored in `data/csqa/statement/`)
- Extract English relations from ConceptNet and merge the original 42 relation types into 17 types
- Identify all mentioned concepts in the questions and answers
- Extract subgraphs for each q-a pair

The preprocessing procedure takes approximately 3 hours on a 40-core CPU server. Most intermediate files are in .jsonl or .pk format and stored in various folders. The resulting file structure will look like:

```plain
.
├── README.md
└── data/
    ├── cpnet/ (preprocessed ConceptNet)
    ├── glove/ (pretrained GloVe embeddings)
    ├── transe/ (pretrained TransE embeddings)
    └── csqa/
        ├── train_rand_split.jsonl
        ├── dev_rand_split.jsonl
        ├── test_rand_split_no_answers.jsonl
        ├── statement/ (converted statements)
        ├── grounded/ (grounded entities)
        ├── paths/ (unpruned/pruned paths)
        ├── graphs/ (extracted subgraphs)
        └── ...
```
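
Once preprocessing has finished, you can sanity-check the output by counting the converted statements (a minimal sketch; it assumes the statement files follow the `*.statement.jsonl` naming used by `data/csqa/statement/train.statement.jsonl`):

```python
from pathlib import Path

# Count the converted statements produced by preprocess.py for each split.
statement_dir = Path("data/csqa/statement")
for split_file in sorted(statement_dir.glob("*.statement.jsonl")):
    with split_file.open() as f:
        num_statements = sum(1 for _ in f)
    print(f"{split_file.name}: {num_statements} statements")
```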

### 3. Hyperparameter Search (optional)

To search the parameters for RoBERTa-Large on CommonsenseQA:

```bash
bash scripts/param_search_lm.sh csqa roberta-large
```

To search the parameters for BERT+RelationNet on CommonsenseQA:

```bash
bash scripts/param_search_rn.sh csqa bert-large-uncased
```

### 4. Training

Each graph encoding model is implemented in a single script:

| Graph Encoder | Script | Description |
| ------------------------------------------------------------ | ----------- | ------------------------------------------------------------ |
| None | lm.py | w/o knowledge graph |
| [Relation Network]() | rn.py | |
| [R-GCN]() | rgcn.py | Use `--gnn_layer_num` and `--num_basis` to specify the number of layers and bases |
| [KagNet](https://arxiv.org/abs/1909.02151) | kagnet.py | Adapted from the original KagNet implementation; still being tuned |
| Gcon-Attn | gconattn.py | |
| KV-Memory | kvmem.py | |
| MHGRN | grn.py | |

Some important command line arguments are listed as follows (run `python {lm,rn,rgcn,...}.py -h` for a complete list):

| Arg | Values | Description | Notes |
| ------------------------------- | ---------------------------------------------------------- | -------------------------------- | ------------------------------------------------------------ |
| `--mode` | {train, eval, ...} | Training or Evaluation | default=train |
| `-enc, --encoder` | {lstm, openai-gpt, bert-large-uncased, roberta-large, ....} | Text Encoder | Model names (except for lstm) are the ones used by [huggingface-transformers](), default=bert-large-uncased |
| `--optim` | {adam, adamw, radam} | Optimizer | default=radam |
| `-ds, --dataset` | {csqa, obqa} | Dataset | default=csqa |
| `-ih, --inhouse` | {0, 1} | Run In-house Split | default=1, only applicable to CSQA |
| `--ent_emb` | {transe, numberbatch, tzw} | Entity Embeddings | default=tzw (BERT-based node features) |
| `-sl, --max_seq_len` | {32, 64, 128, 256} | Maximum Sequence Length | Use 128 or 256 for datasets that contain long sentences! default=64 |
| `-elr, --encoder_lr` | {1e-5, 2e-5, 3e-5, 6e-5, 1e-4} | Text Encoder LR | dataset specific and text encoder specific, default values in `utils/parser_utils.py` |
| `-dlr, --decoder_lr` | {1e-4, 3e-4, 1e-3, 3e-3} | Graph Encoder LR | dataset specific and model specific, default values in `{model}.py` |
| `--lr_schedule` | {fixed, warmup_linear, warmup_constant} | Learning Rate Schedule | default=fixed |
| `-me, --max_epochs_before_stop` | {2, 4, 6} | Early Stopping Patience | default=2 |
| `--unfreeze_epoch` | {0, 3} | Freeze Text Encoder for N epochs | model specific |
| `-bs, --batch_size` | {16, 32, 64} | Batch Size | default=32 |
| `--save_dir` | str | Checkpoint Directory | model specific |
| `--seed` | {0, 1, 2, 3} | Random Seed | default=0 |

For example, run the following command to train a RoBERTa-Large model on CommonsenseQA:

```bash
python lm.py --encoder roberta-large --dataset csqa
```

To train a RelationNet with BERT-Large-Uncased as the encoder:

```bash
python rn.py --encoder bert-large-uncased
```

To **reproduce the reported results of MultiGRN** on the CommonsenseQA official split:

```bash
bash scripts/run_grn_csqa.sh
```

### 5. Evaluation

To evaluate a trained model (you need to specify `--save_dir` if the checkpoint is not stored in the default directory):

```bash
python {lm,rn,rgcn,...}.py --mode eval [ --save_dir path/to/directory/ ]
```

## Use Your Own Dataset

- Convert your dataset to `{train,dev,test}.statement.jsonl` in .jsonl format (see `data/csqa/statement/train.statement.jsonl` and the schema-inspection sketch below)
- Create a directory in `data/{yourdataset}/` to store the .jsonl files
- Modify `preprocess.py` and perform subgraph extraction for your data
- Modify `utils/parser_utils.py` to support your own dataset
- Tune `encoder_lr`, `decoder_lr`, and other important hyperparameters; modify `utils/parser_utils.py` and `{model}.py` to record the tuned hyperparameters
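
Since the statement schema is defined by the existing CommonsenseQA files rather than documented here, the safest way to match it is to inspect one converted example and mirror its fields when writing your own splits (a minimal sketch; `data/mydataset/` is a hypothetical dataset directory):

```python
import json
from pathlib import Path

# Inspect one CommonsenseQA statement to see the schema your own files must follow.
reference = Path("data/csqa/statement/train.statement.jsonl")
with reference.open() as f:
    example = json.loads(f.readline())
print(json.dumps(example, indent=2))

# Hypothetical target layout for a new dataset; write {train,dev,test}.statement.jsonl here,
# one JSON object per line, with the same fields as the example printed above.
target_dir = Path("data/mydataset/statement")
target_dir.mkdir(parents=True, exist_ok=True)
```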