Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vered1986/HypeNET
Integrated path-based and distributional method for hypernymy detection
- Host: GitHub
- URL: https://github.com/vered1986/HypeNET
- Owner: vered1986
- License: other
- Created: 2016-05-03T10:00:47.000Z
- Default Branch: v2
- Last Pushed: 2017-03-21T09:15:39.000Z
- Last Synced: 2024-02-15T16:34:16.283Z
- Language: Python
- Size: 756 KB
- Stars: 86
- Watchers: 5
- Forks: 13
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-taxonomy - https://github.com/vered1986/HypeNET
README
# HypeNET: Integrated Path-based and Distributional Method for Hypernymy Detection
This is the code used in the paper:

"Improving Hypernymy Detection with an Integrated Path-based and Distributional Method"
Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. [link](http://arxiv.org/abs/1603.06076)

It is used to classify hypernymy relations between term-pairs, using distributional information on each term, and path-based information, encoded using an LSTM.
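At a high level, the integrated model represents a term-pair (x, y) by concatenating the terms' distributional embeddings with a vector summarizing the LSTM-encoded dependency paths connecting them (the paper averages the path vectors, weighted by path frequency). A conceptual sketch, not the repository's code, with the path vectors taken as given:

```
# Conceptual sketch of the integrated pair representation (not this repo's
# code): concat(embedding of x, averaged path vector, embedding of y).
import numpy as np

def integrated_representation(x_vec, y_vec, path_vecs):
    # path_vecs: one vector per dependency path between x and y; in HypeNET
    # these come from an LSTM over the path's edges, here they are given.
    if path_vecs:
        avg_path = np.mean(path_vecs, axis=0)  # unweighted average for brevity
    else:
        avg_path = np.zeros_like(x_vec)        # no connecting paths found
    return np.concatenate([x_vec, avg_path, y_vec])
```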
***
## Version 2:
### Major features and improvements:
* Using dynet instead of pycnn (thanks @srajana!)
* Automating corpus processing with a single bash script, which is more time- and memory-efficient

### Bug fixes:
* Too many paths in parse_wikipedia (see issue [#2](https://github.com/vered1986/HypeNET/issues/2))

To reproduce the results reported in the paper, please use [V1](https://github.com/vered1986/HypeNET/tree/master).
The current version achieves similar results - the integrated model's performance on the randomly split dataset is:

Precision: 0.918, Recall: 0.907, F1: 0.912

***
Consider using our new project, [LexNET](https://github.com/vered1986/LexNET)! It supports classification of multiple semantic relations, and contains several model enhancements and detailed documentation.
***
Prerequisites:
* Python 2.7
* numpy
* scikit-learn
* [bsddb](https://docs.python.org/2/library/bsddb.html)
* [dynet](https://github.com/clab/dynet/)
* [spacy](https://spacy.io/docs/)

Quick Start:
The repository contains the following directories:
* common - the knowledge resource class, which is used by other models to save the path data from the corpus.
* corpus - code for parsing the corpus and extracting paths, including the generalizations made for the baseline method.
* dataset - code for creating the dataset used in the paper, and the dataset itself.
* train - code for training and testing both variants of our model (path-based and integrated).

To create a processed corpus, download a Wikipedia dump, and run:
```
bash create_resource_from_corpus.sh [wiki_dump_file] [resource_prefix]
```

Where `resource_prefix` is the file path and prefix of the corpus files, e.g. `corpus/wiki`, such that the directory `corpus` will eventually contain the `wiki_*.db` files created by this script.
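Once the script has finished, the resource can be sanity-checked with the [bsddb](https://docs.python.org/2/library/bsddb.html) module from the prerequisites. A minimal sketch, assuming the `wiki_*.db` files are plain bsddb key-value stores; the `wiki_term_to_id.db` file name here is a hypothetical example, not necessarily one of the actual suffixes:

```
# Hedged sketch: peek into one of the generated wiki_*.db files.
# Assumes Python 2.7 and that the file is a bsddb B-tree store of
# string keys/values; 'wiki_term_to_id.db' is an illustrative name.
import bsddb

db = bsddb.btopen('corpus/wiki_term_to_id.db', 'r')  # open read-only
for key in db.keys()[:5]:                            # first few entries
    print key, '->', db[key]
db.close()
```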
To train the integrated model, run:
```
train_integrated.py [resource_prefix] [dataset_prefix] [model_prefix_file] [embeddings_file] [alpha] [word_dropout_rate]
```

Where:
* `resource_prefix` is the file path and prefix of the corpus files, e.g. `corpus/wiki`, such that the directory `corpus` contains the `wiki_*.db` files created by `create_resource_from_corpus.sh`.
* `dataset_prefix` is the file path of the dataset files, e.g. `dataset/rnd`, such that this directory contains 3 files: `train.tsv`, `test.tsv` and `val.tsv`.
* `model_prefix_file` is the output directory and prefix for the model files. The model is saved in 3 files: `.model`, `.params` and `.dict`.
In addition, the test set predictions are saved in `.predictions`, and the prominent paths are saved to `.paths`.
* `embeddings_file` is the pre-trained word embeddings file, in txt format (i.e., every line consists of the word, followed by a space, and its vector; see [GloVe](http://nlp.stanford.edu/data/glove.6B.zip) for an example, and the loading sketch after this list).
* `alpha` is the learning rate (default=0.001).
* `word_dropout_rate` is the... word dropout rate.

Similarly, you can train the path-based model with `train_path_based.py`, and test either pre-trained model using `test_integrated.py` and `test_path_based.py`, respectively.
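For reference, a minimal sketch of reading an embeddings file in the txt format described above (one word per line, followed by its space-separated vector); `glove.6B.50d.txt` is one of the files in the linked GloVe zip:

```
# Hedged sketch: load word -> vector pairs from a GloVe-style txt file.
import numpy as np

def load_embeddings(path):
    word_to_vec = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip().split(' ')
            word_to_vec[fields[0]] = np.array(fields[1:], dtype=np.float32)
    return word_to_vec

# word_to_vec = load_embeddings('glove.6B.50d.txt')
```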