Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/HKUST-KnowComp/ComHyper
Code for EMNLP'20 paper "When Hearst Is not Enough: Improving Hypernymy Detection from Corpus with Distributional Models"
conceptualization hypernymy-detection lexical-semantics semantic-relations
- Host: GitHub
- URL: https://github.com/HKUST-KnowComp/ComHyper
- Owner: HKUST-KnowComp
- License: mit
- Created: 2020-10-09T14:41:55.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2020-11-10T06:42:19.000Z (about 4 years ago)
- Last Synced: 2024-04-28T05:06:28.007Z (8 months ago)
- Topics: conceptualization, hypernymy-detection, lexical-semantics, semantic-relations
- Language: Python
- Homepage:
- Size: 313 KB
- Stars: 11
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-taxonomy - https://github.com/HKUST-KnowComp/ComHyper
README
# ComHyper [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
Code for EMNLP'20 paper "When Hearst Is not Enough: Improving Hypernymy Detection from Corpus with Distributional Models" ([arXiv](https://arxiv.org/abs/2010.04941v1))
In a nutshell, ComHyper is a complementary framework that tackles hypernymy detection from the perspective of the blind spots of Hearst pattern-based methods. As shown in the left figure, long-tailed nouns are not well covered by Hearst patterns and thus form non-negligible sparsity cases. For such cases, we propose supervised distributional models to complement pattern-based models, as shown in the right figure.
## Use ComHyper
### 1. Download Hearst pattern files and corpus.
First prepare the extracted Hearst pattern pairs such as `hearst_counts.txt.gz` from the repo [hypernymysuite](https://github.com/facebookresearch/hypernymysuite) or `data-concept.zip` from Microsoft Concept Graph (Also known as [Probase](https://concept.research.microsoft.com/Home/Download)). Specify the parameter `pattern_filename` in the `config` as the file location.
```console
wget https://raw.githubusercontent.com/facebookresearch/hypernymysuite/master/hearst_counts.txt.gz
curl -L "https://concept.research.microsoft.com/Home/StartDownload" > data-concept.zip
```

Then extract the contexts for words from a large-scale corpus such as Wiki + Gigaword or ukWaC. All the contexts for one word should be collected into a single `txt` file, one context per line.
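As a concrete illustration (not part of this repo), the per-word context files could be produced with a script along the following lines, assuming a plain-text corpus with one sentence per line; `collect_contexts`, `corpus.txt`, and the output directory are hypothetical names:

```python
import os
from collections import defaultdict

def collect_contexts(corpus_path, targets, out_dir, max_per_word=1000):
    """Scan a one-sentence-per-line corpus and write one txt file per
    target word, containing one context (sentence) per line."""
    os.makedirs(out_dir, exist_ok=True)
    contexts = defaultdict(list)
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            tokens = set(line.lower().split())
            for word in targets:
                if word in tokens and len(contexts[word]) < max_per_word:
                    contexts[word].append(line.strip())
    for word, lines in contexts.items():
        path = os.path.join(out_dir, word + ".txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(lines) + "\n")
```

The same script can be run twice, once with the IP vocabulary and once with the OOP vocabulary, to populate the two context directories.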
For words that appear in the Hearst patterns (**IP words**), place their context files in the directory given by `context` in the `config`. For out-of-pattern (**OOP**) words, place their context files in the directory given by `context_oov` in the `config`.
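Putting these pieces together, the relevant part of a config file might look roughly like this (an illustrative sketch only; the section name and exact layout should be checked against the `config/*.cfg` files shipped with the repo):

```ini
[data]
pattern_filename = data/hearst_counts.txt.gz
context = data/context/
context_oov = data/context_oov/
```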
### 2. Train and evaluate the ComHyper.
For training the distributional models supervised by the output of pattern-based models, several context encoders are provided:
```console
python train_word2score.py config/word.cfg
python train_context2score.py config/context.cfg
python train_bert2score.py config/bert.cfg
```

The same evaluation scripts work for all settings. To reproduce the results, run:
```console
python evaluation/evaluation_all_context.py ../config/context.cfg
```

Note that we chose not to report the `BERT` encoder results in our original paper for efficiency reasons, but we release the relevant code for incorporating effective pre-trained contextualized encoders to further improve performance. PRs are welcome, or contact cyuaq # cse.ust.hk!
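Conceptually, the supervised distributional models boil down to regressing pattern-based scores from distributional representations of word pairs. A minimal sketch of that idea (with random stand-in embeddings and scores, not the repo's actual code or data):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_pairs = 50, 256

# Hypothetical embeddings for (hyponym, hypernym) pairs, concatenated.
pairs = rng.standard_normal((n_pairs, 2 * emb_dim))
# Supervision signal: scores from a pattern-based model (random here).
pattern_scores = rng.random(n_pairs)

# A linear scorer trained by gradient descent to regress the pattern scores;
# the trained scorer can then generalize to pairs the patterns never covered.
w = np.zeros(2 * emb_dim)
b = 0.0
lr = 0.1
for _ in range(500):
    pred = pairs @ w + b
    err = pred - pattern_scores
    w -= lr * (pairs.T @ err) / n_pairs
    b -= lr * err.mean()

mse = float(np.mean((pairs @ w + b - pattern_scores) ** 2))
```

The actual encoders in this repo (word-, context-, and BERT-based) replace the fixed pair embeddings with learned representations, but the supervision scheme is the same.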
## Citation
Please cite the following paper if you find our method helpful. Thanks!
```
@inproceedings{yu-etal-2020-hearst,
title = "When Hearst Is Not Enough: Improving Hypernymy Detection from Corpus with Distributional Models",
author = "Yu, Changlong and Han, Jialong and Wang, Peifeng and Song, Yangqiu and Zhang, Hongming and Ng, Wilfred and Shi, Shuming",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = "nov",
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.502",
pages = "6208--6217",
}
```