{"id":19467707,"url":"https://github.com/thunlp/multird","last_synced_at":"2025-04-25T11:31:54.072Z","repository":{"id":89615983,"uuid":"221930088","full_name":"thunlp/MultiRD","owner":"thunlp","description":"Code and data of the AAAI-20 paper \"Multi-channel Reverse Dictionary Model\"","archived":false,"fork":false,"pushed_at":"2020-10-06T13:51:40.000Z","size":141,"stargazers_count":110,"open_issues_count":3,"forks_count":23,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-03T20:37:09.441Z","etag":null,"topics":["nlp","reverse-dictionary"],"latest_commit_sha":null,"homepage":"https://wantwords.thunlp.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-15T13:31:02.000Z","updated_at":"2025-01-24T10:22:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"b4b6d4d2-0a19-441d-abd4-59d3f0071867","html_url":"https://github.com/thunlp/MultiRD","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FMultiRD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FMultiRD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FMultiRD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FMultiRD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thunlp","download_url":"https://c
odeload.github.com/thunlp/MultiRD/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250808275,"owners_count":21490638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","reverse-dictionary"],"created_at":"2024-11-10T18:36:33.277Z","updated_at":"2025-04-25T11:31:54.034Z","avatar_url":"https://github.com/thunlp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MultiRD\nCode and data of the AAAI-20 paper \"**Multi-channel Reverse Dictionary Model**\" [[pdf](https://arxiv.org/pdf/1912.08441.pdf)]\n\n## Requirements\n* Python 3.x\n* Pytorch 1.x\n* Other requirements: numpy, tqdm, nltk, gensim, thulac\n\n## Quick Start\nDownload the code and data from [Google Drive](https://drive.google.com/drive/folders/1jeyPE8iGdGUSVJe_6Smr_NzoWfR52f4g?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/d/ec29131d38fd4ca2a6ca/), where the code is the same as that here.\n\nUnzip the data.zip (under English and Chinese paths respectively), and all files under `EnglishReverseDictionary` and `ChineseReverseDictionary` should be prepared as follows:\n\n```\nReverseDictionary\n|- EnglishReverseDictionary\n|  |- data\n|  |  |- data_train.json\n|  |  |- data_dev.json\n|  |  |- data_test_500_rand1_seen.json\n|  |  |- data_test_500_rand1_unseen.json\n|  |  |- data_defi_c.json           [definitions of the target words in 200 descriptions]\n|  |  |- data_desc_c.json           [testset of 200 descriptions]\n|  |  |- vec_inuse.json             [Only embeddings used in this model are 
included.]\n|  |  |- lexname_all.txt\n|  |  |- root_affix_freq.txt\n|  |  |- sememes_all.txt\n|  |  |- target_words.txt\n|  |- code\n|     |- main.py\n|     |- model.py\n|     |- data.py\n|     |- evaluate.py\n|     |- evaluate_result.py\n|     |- analyse_result.py\n|     |- result_analysis_En_1200.py\n|- ChineseReverseDictionary\n|  |- data\n|  |  |- Cx.json                    [x=1,2,3,4]\n|  |  |- description_sense.json     [train \u0026 dev dataset]\n|  |  |- description_idio_locu.json [testset of Question]\n|  |  |- description_byHand.json    [testset of description]\n|  |  |- hownet.json\n|  |  |- sememe.json\n|  |  |- word_cilinClass.json\n|  |  |- word_index.json\n|  |  |- word_vector.npy            [Only embeddings used in this model are included.]\n|  |- code\n|     |- main.py\n|     |- model.py\n|     |- data.py\n|     |- evaluate.py\n|     |- evaluate_result.py\n|- PrepareYourOwnDataset\n   |- \u003cSee below.\u003e\n```\n\n### Train English Model\nRun this command under the code path:\n```bash\npython main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -f [freq_mor] -m [rsl, r, s, l, b] -v\n```\nThe `-m [rsl, r, s, l, b]` option selects which information the model uses:\n\n- `-m r` uses morpheme information, including roots and affixes. You can filter morphemes with `-f`, usually set to 15~35;\n- `-m s` uses the sememe predictor;\n- `-m l` uses WordNet lexnames, which provide word category information (lexical name and POS tag);\n- `-m b` uses no additional information, just the basic BiLSTM model;\n- `-m rsl` uses all of the above, which is our Multi-channel model.\n\n`-e` is usually set to 10~20.\n\n`-g` indicates which GPU to use.\n\n`-v` shows a progress bar.\n\n\nAfter training, you will get two new files, `xxx_label_list.json` and `xxx_pred_list.json`. Here \"xxx\" is the mode you set with `-m`; for example, with `-m rsl` the file will be `rsl_label_list.json`. 
\n\n#### Evaluation\nRun this command under the code path:\n```bash\npython evaluate_result.py -m [mode]\n```\nHere, `mode` is the same as above.\n\nThis reports `median rank`, `accuracy@1/10/100` and `rank variance` on 3 test sets: **seen**, **unseen** and **description**. \n\n\n\nYou can evaluate model performance with prior knowledge:\n\n```bash\npython analyse_result.py\npython result_analysis_En_1200.py -m [mode]\n```\n\n### Train Chinese Model\nRun this command under the code path:\n```bash\npython main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -u/-s -m [CPsc, C, P, s, c, b] -v\n```\nUnlike English model training, `-u` or `-s` selects the **Unseen** or **Seen** test mode. In fact, there is no need to use the test mode on the Seen Definition test set. \n\nThe `-m [CPsc, C, P, s, c, b]` option selects which information the model uses:\n\n- `-m C` uses Cilin word category information (we use 4 word classes in Cilin);\n- `-m P` uses the POS predictor;\n- `-m s` uses the sememe predictor;\n- `-m c` uses the morpheme predictor, where morphemes are Chinese characters;\n- `-m b` uses no additional information, just the basic BiLSTM model;\n- `-m CPsc` uses all of the above, which is our Multi-channel model.\n\n`-e`, `-g` and `-v` are the same as in English model training. \n\n#### Evaluation\n\n```bash\npython evaluate_result.py -m [mode]\n```\nHere, `mode` is the prefix of `xxx_label_list.json`. \nThis reports `median rank`, `accuracy@1/10/100` and `rank variance` on 4 test sets: **seen**, **unseen**, **Description** and **Question**. \n\n\n\nYou can evaluate model performance with prior knowledge:\n```bash\npython result_analysis_Ch.py -m [mode]\n```\n\n## Prepare Your Own Data\n\nHere is some code for reference. 
The scripts and the data format are shown below, so you can build your own dataset.\n```\nReverseDictionary\n|- EnglishReverseDictionary\n|- ChineseReverseDictionary\n|- PrepareYourOwnDataset\n   |- proc_allFeatures.py\n   |- get_wordnet_lexname.py\n   |- get_wordnet_500sample.py\n   |- process_googleVec_checkAllData.py\n   |- readHowNet_to_word_sememe.py\n   |- wordnik_get_defi.py\n   |- check_root_affix.py\n```\n### Data Formats\nThe `data_xxx.json` files are in JSON format, for example:\n```\n{\n     \"word\": \"fatalism\",\n     \"lexnames\": [\n         \"noun.cognition\"\n     ],\n     \"root_affix\": [\n         \"fatal\",\n         \"ism\"\n     ],\n     \"sememes\": [\n         \"knowledge\",\n         \"believe\",\n         \"experience\",\n         \"Fate\"\n     ],\n     \"definitions\": \"the doctrine that all events are predetermined by fate and are therefore unalterable\"\n}\n```\nWord embeddings are in `vec_inuse.json`, which covers all target words and the words in definitions. Only words that are actually used are included. The format is `{word: [vector], ...}`.\n\n`lexname_all.txt` contains all 45 lexnames from WordNet.\n\n`sememes_all.txt` contains 1400 sememes from HowNet.\n\nMorphemes (roots and affixes) are in `root_affix_freq.txt`, which lists the morphemes and their counts, separated by spaces.\n\n### Download and Process Data\nIn English experiments, we use the Description dataset from [(Hill et al. 2016)](https://arxiv.org/pdf/1504.00548.pdf). \n\nWord embeddings are from [GoogleNews-vectors-negative300](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). \n\nSememes can be obtained using [OpenHowNet](https://github.com/thunlp/OpenHowNet). \n\nLexnames are from WordNet and can be obtained easily with NLTK.\n\nWe obtain morphemes with the [Morfessor tool](https://morfessor.readthedocs.io/en/latest/). The dataset used is from [morpho.aalto.fi](http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml). 
You should train a Morfessor model first, and then use it to segment the target words into the corresponding roots and affixes.\n\n```bash\nmorfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones wordlist-2010.eng\nmorfessor-segment -l ../morfessor_data/model.bin target_words.txt -o word_root_affix.txt\n```\nUnfortunately, the morphemes obtained by this method are not accurate. We recommend using a standard root-affix dictionary instead.\n\n\n\n## Cite\nIf you use any code or data, please cite this paper:\n\n```\n@article{zhang2019multi,\n    title={Multi-channel Reverse Dictionary Model},\n    author={Zhang, Lei and Qi, Fanchao and Liu, Zhiyuan and Wang, Yasheng and Liu, Qun and Sun, Maosong},\n    journal={arXiv preprint arXiv:1912.08441},\n    year={2019}\n}\n```\n\n## Contact\nYou can visit our [online reverse dictionary website](https://wantwords.thunlp.org/), where we have optimized our methods and datasets. Its GitHub repository is [WantWords](https://github.com/thunlp/WantWords). You can post issues if you have any questions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2Fmultird","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthunlp%2Fmultird","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2Fmultird/lists"}