https://github.com/thunlp/multird

Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model"

nlp reverse-dictionary

Host: GitHub
URL: https://github.com/thunlp/multird
Owner: thunlp
License: mit
Created: 2019-11-15T13:31:02.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-10-06T13:51:40.000Z (over 5 years ago)
Last Synced: 2025-04-03T20:37:09.441Z (about 1 year ago)
Topics: nlp, reverse-dictionary
Language: Python
Homepage: https://wantwords.thunlp.org/
Size: 138 KB
Stars: 110
Watchers: 7
Forks: 23
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

# MultiRD

Code and data of the AAAI-20 paper "**Multi-channel Reverse Dictionary Model**" [[pdf](https://arxiv.org/pdf/1912.08441.pdf)]

## Requirements

* Python 3.x

* PyTorch 1.x

* Other requirements: numpy, tqdm, nltk, gensim, thulac

## Quick Start

Download the code and data from [Google Drive](https://drive.google.com/drive/folders/1jeyPE8iGdGUSVJe_6Smr_NzoWfR52f4g?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/d/ec29131d38fd4ca2a6ca/). The code there is identical to the code in this repository.

Unzip `data.zip` (once under the English path and once under the Chinese path). The files under `EnglishReverseDictionary` and `ChineseReverseDictionary` should then be laid out as follows:

```
ReverseDictionary
|- EnglishReverseDictionary
|  |- data
|  |  |- data_train.json
|  |  |- data_dev.json
|  |  |- data_test_500_rand1_seen.json
|  |  |- data_test_500_rand1_unseen.json
|  |  |- data_defi_c.json           [definitions of the target words in 200 descriptions]
|  |  |- data_desc_c.json           [testset of 200 descriptions]
|  |  |- vec_inuse.json             [Only embeddings used in this model are included.]
|  |  |- lexname_all.txt
|  |  |- root_affix_freq.txt
|  |  |- sememes_all.txt
|  |  |- target_words.txt
|  |- code
|     |- main.py
|     |- model.py
|     |- data.py
|     |- evaluate.py
|     |- evaluate_result.py
|     |- analyse_result.py
|     |- result_analysis_En_1200.py
|- ChineseReverseDictionary
|  |- data
|  |  |- Cx.json                    [x=1,2,3,4]
|  |  |- description_sense.json     [train & dev dataset]
|  |  |- description_idio_locu.json [testset of Question]
|  |  |- description_byHand.json    [testset of description]
|  |  |- hownet.json
|  |  |- sememe.json
|  |  |- word_cilinClass.json
|  |  |- word_index.json
|  |  |- word_vector.npy            [Only embeddings used in this model are included.]
|  |- code
|     |- main.py
|     |- model.py
|     |- data.py
|     |- evaluate.py
|     |- evaluate_result.py
|- PrepareYourOwnDataset
   |- 
```

### Train English Model

Run this command from the `code` directory:

```bash
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -f [freq_mor] -m [rsl, r, s, l, b] -v
```

In `-m [rsl, r, s, l, b]`:

- `-m r` uses morpheme information (roots and affixes). You can filter morphemes with `-f`, usually set to a value from 15 to 35;

- `-m s` uses the sememe predictor;

- `-m l` uses WordNet lexnames, which carry word category information (lexical name and POS tag);

- `-m b` uses no additional information, only the basic BiLSTM model;

- `-m rsl` uses all of the above information, which is our multi-channel model.

`-e` is usually set to a value from 10 to 20;

`-g` indicates which GPU to use;

`-v` shows a progress bar.

After training, you will get two new files, `xxx_label_list.json` and `xxx_pred_list.json`, where `xxx` is the mode you set with `-m`. For example, with `-m rsl` the files are `rsl_label_list.json` and `rsl_pred_list.json`.

#### Evaluation

Run this command from the `code` directory:

```bash
python evaluate_result.py -m [mode]
```

Here, `mode` is the same as above.

You will then get the `median rank`, `accuracy@1/10/100` and `rank variance` results on 3 test sets: **seen**, **unseen** and **description**.
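The three metrics can be sketched as follows. This is a minimal illustration, not the code in `evaluate_result.py`: the function name `rank_metrics`, the input layout (one gold word index per example and one ranked list of candidate indices per example), the use of population variance, and the fallback rank `miss_rank` for a gold word missing from the candidate list are all assumptions.

```python
from statistics import median, pvariance


def rank_metrics(labels, ranked_preds, ks=(1, 10, 100), miss_rank=1000):
    """Compute median rank, rank variance and accuracy@k (hypothetical helper).

    labels:       list of gold word indices, one per test example
    ranked_preds: list of candidate index lists, best candidate first
    miss_rank:    rank assigned when the gold word is not among the candidates
    """
    if not labels or len(labels) != len(ranked_preds):
        raise ValueError("labels and ranked_preds must be non-empty and aligned")

    # 0-based rank of the gold word within each ranked candidate list.
    ranks = [
        preds.index(gold) if gold in preds else miss_rank
        for gold, preds in zip(labels, ranked_preds)
    ]

    metrics = {
        "median_rank": median(ranks),
        "rank_variance": pvariance(ranks),  # take the square root for a std. dev.
    }
    for k in ks:
        # A prediction counts as correct at k if the gold word is in the top k.
        metrics[f"acc@{k}"] = sum(rank < k for rank in ranks) / len(ranks)
    return metrics


toy_metrics = rank_metrics(labels=[3, 5], ranked_preds=[[3, 1, 2], [1, 2, 5]])
```

Lower median rank and rank variance are better, and higher accuracy is better.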

You can evaluate model performance with prior knowledge:

```bash
python analyse_result.py
python result_analysis_En_1200.py -m [mode]
```

### Train Chinese Model

Run this command from the `code` directory:

```bash
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -u/-s -m [CPsc, C, P, s, c, b] -v
```

Unlike English model training, `-u` or `-s` selects the **Unseen** or **Seen** test mode. In fact, there is no need to use the test mode on the Seen Definition test set.

In `-m [CPsc, C, P, s, c, b]`:

- `-m C` uses Cilin word category information (we use 4 word classes in Cilin);

- `-m P` uses the POS predictor;

- `-m s` uses the sememe predictor;

- `-m c` uses the morpheme predictor, where morphemes are Chinese characters;

- `-m b` uses no additional information, only the basic BiLSTM model;

- `-m CPsc` uses all of the above information, which is our multi-channel model.

`-e`, `-g` and `-v` are the same as in English model training.

#### Evaluation

```bash
python evaluate_result.py -m [mode]
```

Here, `mode` is the prefix of `xxx_label_list.json`.

You will then get the `median rank`, `accuracy@1/10/100` and `rank variance` results on 4 test sets: **seen**, **unseen**, **Description** and **Question**.

You can evaluate model performance with prior knowledge:

```bash
python result_analysis_Ch.py -m [mode]
```

## Prepare Your Own Data

The `PrepareYourOwnDataset` directory contains some code for reference. The data formats are described below, so you can build your own dataset.

```
ReverseDictionary
|- EnglishReverseDictionary
|- ChineseReverseDictionary
|- PrepareYourOwnDataset
   |- proc_allFeatures.py
   |- get_wordnet_lexname.py
   |- get_wordnet_500sample.py
   |- process_googleVec_checkAllData.py
   |- readHowNet_to_word_sememe.py
   |- wordnik_get_defi.py
   |- check_root_affix.py
```

### Data Formats

The `data_xxx.json` files are in JSON format. An entry looks like this:

```
{
     "word": "fatalism",
     "lexnames": [
         "noun.cognition"
     ],
     "root_affix": [
         "fatal",
         "ism"
     ],
     "sememes": [
         "knowledge",
         "believe",
         "experience",
         "Fate"
     ],
     "definitions": "the doctrine that all events are predetermined by fate and are therefore unalterable"
}
```
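When building your own dataset, it can help to check each entry against the format above before training. The following is a minimal sketch: the helper name `validate_entry` and the idea of treating all five fields as required are assumptions, not part of the released code.

```python
import json

# Field names and types taken from the example entry above.
REQUIRED_FIELDS = {
    "word": str,
    "lexnames": list,
    "root_affix": list,
    "sememes": list,
    "definitions": str,
}


def validate_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


sample_entry = json.loads("""
{
  "word": "fatalism",
  "lexnames": ["noun.cognition"],
  "root_affix": ["fatal", "ism"],
  "sememes": ["knowledge", "believe", "experience", "Fate"],
  "definitions": "the doctrine that all events are predetermined by fate and are therefore unalterable"
}
""")

print(validate_entry(sample_entry))  # an empty list means no problems were found
```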

Word embeddings are in `vec_inuse.json`, which covers all target words and all words that appear in definitions. Only the words actually used are included. The format is `{word: [vector], ...}`.
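A `{word: [vector]}` mapping is usually turned into a vocabulary index plus an embedding matrix before it is fed to a model. The sketch below shows one way to do that in plain Python. The function name `build_embedding_table`, the alphabetical index order, and the toy vectors are assumptions for illustration and do not reflect how `data.py` loads the file.

```python
import json


def build_embedding_table(vec_by_word):
    """Turn a {word: [vector]} mapping into (word2index, matrix).

    matrix[i] is the vector of the word whose index is i.
    """
    words = sorted(vec_by_word)  # fixed order so indices are reproducible
    dims = {len(vec_by_word[word]) for word in words}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dims)}")
    word2index = {word: index for index, word in enumerate(words)}
    matrix = [list(vec_by_word[word]) for word in words]
    return word2index, matrix


# Toy stand-in for the contents of vec_inuse.json (real vectors are much longer).
toy_vectors = json.loads('{"fate": [0.1, 0.2, 0.3], "doctrine": [0.4, 0.5, 0.6]}')
word2index, matrix = build_embedding_table(toy_vectors)
```

The resulting `matrix` can be passed to `numpy.array` or `torch.tensor` if a tensor is needed.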

`lexname_all.txt` contains all 45 lexnames from WordNet.

`sememes_all.txt` contains 1400 sememes from HowNet.

Morphemes (roots and affixes) are in `root_affix_freq.txt`. Each line contains a morpheme and its count, separated by a space.
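A file in this shape can be read and filtered by count with a few lines of Python. This is a sketch under assumptions: one `morpheme count` pair per line, and a minimum-count threshold `min_freq` that plays a role similar to the `-f` option. How `main.py` actually applies `-f` is not shown here.

```python
def load_morphemes(lines, min_freq=1):
    """Parse 'morpheme count' lines and keep morphemes seen at least min_freq times."""
    kept = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        morpheme, count = parts[0], int(parts[1])
        if count >= min_freq:
            kept[morpheme] = count
    return kept


# Toy lines standing in for root_affix_freq.txt; the counts are made up.
toy_lines = ["ism 120", "fatal 3", "", "un 450"]
frequent = load_morphemes(toy_lines, min_freq=15)
```

With a real file, pass an open file handle instead of `toy_lines`, for example `load_morphemes(open("root_affix_freq.txt", encoding="utf-8"), min_freq=20)`.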

### Download and Process Data

In English experiments, we use the Description dataset from [(Hill et al. 2016)](https://arxiv.org/pdf/1504.00548.pdf). 

Word embeddings are from [GoogleNews-vectors-negative300](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). 

Sememes can be obtained using [OpenHowNet](https://github.com/thunlp/OpenHowNet). 

Lexnames come from WordNet, which you can access easily through NLTK.

We obtain morphemes with the [Morfessor tool](https://morfessor.readthedocs.io/en/latest/). The dataset used is from [morpho.aalto.fi](http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml). Train a Morfessor model first, then use it to segment the target words into their roots and affixes.
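As a rough illustration of the NLTK route, the sketch below collects the lexnames of every WordNet synset of a word. The helper name `lexnames_for` and the injectable `synsets_fn` parameter are assumptions added so the function can be tried without the WordNet corpus. The repository's own `get_wordnet_lexname.py` may work differently.

```python
def lexnames_for(word, synsets_fn=None):
    """Return the sorted, de-duplicated WordNet lexnames of all synsets of `word`.

    synsets_fn is a hypothetical hook: any callable that maps a word to objects
    with a .lexname() method. By default it uses NLTK's WordNet reader, which
    needs the corpus to be installed first via nltk.download("wordnet").
    """
    if synsets_fn is None:
        from nltk.corpus import wordnet  # imported lazily so NLTK stays optional
        synsets_fn = wordnet.synsets
    return sorted({synset.lexname() for synset in synsets_fn(word)})


class FakeSynset:
    """Tiny stand-in for an NLTK Synset, used only for the offline demo below."""

    def __init__(self, lexname):
        self._lexname = lexname

    def lexname(self):
        return self._lexname


demo = lexnames_for(
    "bank",
    synsets_fn=lambda word: [
        FakeSynset("noun.group"),
        FakeSynset("verb.motion"),
        FakeSynset("noun.group"),
    ],
)
```

With NLTK and the WordNet corpus installed, `lexnames_for("fatalism")` queries WordNet directly.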

```bash
morfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones wordlist-2010.eng
morfessor-segment -l ../morfessor_data/model.bin target_words.txt -o word_root_affix.txt
```

Unfortunately, the morphemes obtained this way are not very accurate. We recommend using a standard root-affix dictionary instead.
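If you do use the Morfessor output, the segmented lines need to be matched back to the target words to fill the `root_affix` field. The sketch below assumes that `word_root_affix.txt` holds one line per input word, in the same order as `target_words.txt`, with morphs separated by spaces. The helper name `pair_words_with_morphs` is hypothetical.

```python
def pair_words_with_morphs(words, segmented_lines):
    """Map each target word to its list of morphs.

    words:           target words, one per line of target_words.txt
    segmented_lines: lines of the segmentation output, in the same order
    """
    if len(words) != len(segmented_lines):
        raise ValueError("word list and segmentation output differ in length")

    root_affix = {}
    for word, line in zip(words, segmented_lines):
        morphs = line.split()
        # Sanity check: the morphs should concatenate back to the original word.
        if "".join(morphs) != word:
            raise ValueError(f"segmentation {morphs} does not match word {word!r}")
        root_affix[word] = morphs
    return root_affix


toy_pairs = pair_words_with_morphs(
    ["fatalism", "unalterable"],
    ["fatal ism", "un alter able"],
)
```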

## Cite

If you use any code or data, please cite this paper:

```
@article{zhang2019multi,
    title={Multi-channel Reverse Dictionary Model},
    author={Zhang, Lei and Qi, Fanchao and Liu, Zhiyuan and Wang, Yasheng and Liu, Qun and Sun, Maosong},
    journal={arXiv preprint arXiv:1912.08441},
    year={2019}
}
```

## Contact

You can visit our [online reverse dictionary website](https://wantwords.thunlp.org/), where we have optimized our methods and datasets. Its code is on GitHub at [WantWords](https://github.com/thunlp/WantWords). Feel free to open an issue if you have any questions.
