https://github.com/DFKI-NLP/TRE
[AKBC 19] Improving Relation Extraction by Pre-trained Language Representations
information-extraction machine-learning multi-task-learning nlp relation-extraction transformer
- Host: GitHub
- URL: https://github.com/DFKI-NLP/TRE
- Owner: DFKI-NLP
- License: mit
- Created: 2019-04-08T08:38:16.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-10-01T09:27:30.000Z (over 3 years ago)
- Last Synced: 2024-08-02T04:02:27.316Z (5 months ago)
- Topics: information-extraction, machine-learning, multi-task-learning, nlp, relation-extraction, transformer
- Language: Python
- Homepage: https://arxiv.org/abs/1906.03088
- Size: 43.9 KB
- Stars: 108
- Watchers: 8
- Forks: 12
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Improving Relation Extraction by Pre-trained Language Representations
This repository contains the code of our paper:
[Improving Relation Extraction by Pre-trained Language Representations.](https://openreview.net/forum?id=BJgrxbqp67)
Christoph Alt*, Marc Hübner*, Leonhard Hennig

We fine-tune the pre-trained OpenAI GPT [1] on the task of relation extraction and show that it achieves state-of-the-art results on the SemEval 2010 Task 8 and TACRED relation extraction datasets.
Our code depends on huggingface's PyTorch reimplementation of the OpenAI GPT [2] - so thanks to them.
## Installation
First, clone the repository to your machine and install the requirements with the following command:
```bash
pip install -r requirements.txt
```

We also need the weights of the pre-trained Transformer, which can be downloaded with the following command:

```bash
./download-model.sh
```

The English spaCy model is required for sentence segmentation:

```bash
python -m spacy download en
```

## Prepare the data
We evaluate our model on [SemEval 2010 Task 8](https://drive.google.com/file/d/0B_jQiLugGTAkMDQ5ZjZiMTUtMzQ1Yy00YWNmLWJlZDYtOWY1ZDMwY2U4YjFk) and [TACRED](https://catalog.ldc.upenn.edu/LDC2018T24); TACRED is available through LDC.
Our model expects the input dataset to be in JSONL format. To convert a dataset, run the following command:
```bash
python dataset_converter.py --dataset=
```

## Training
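Training reads the converted JSONL files described in the previous section, so it can be worth inspecting a file before starting a run. The following is a minimal sketch, not part of this repository: `read_jsonl` and `summarize_jsonl` are hypothetical helper names, and the only assumption about the format is that each non-empty line holds one JSON object. No field names are assumed; the summary simply reports which top-level keys occur.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator, Tuple


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line so a broken conversion is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({error})") from error


def summarize_jsonl(path: Path) -> Tuple[int, Counter]:
    """Return the number of records and how often each top-level key occurs."""
    key_counts: Counter = Counter()
    record_count = 0
    for record in read_jsonl(path):
        record_count += 1
        key_counts.update(record.keys())
    return record_count, key_counts
```

Calling `summarize_jsonl(Path("./data/tacred/test.jsonl"))` on a converted file returns the record count together with a `Counter` of keys; a key that appears in fewer records than the total count hints at an incomplete conversion.
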
For example, to train on the TACRED dataset, run the following command:

```bash
CUDA_VISIBLE_DEVICES=0 python relation_extraction.py train \
--write-model True \
--masking-mode grammar_and_ner \
--batch-size 8 \
--max-epochs 3 \
--lm-coef 0.5 \
--learning-rate 5.25e-5 \
--learning-rate-warmup 0.002 \
--clf-pdrop 0.1 \
--attn-pdrop 0.1 \
--word-pdrop 0.0 \
--dataset tacred \
--data-dir \
--seed=0 \
--log-dir ./logs/
```

## Evaluation
```bash
CUDA_VISIBLE_DEVICES=0 python relation_extraction.py evaluate \
--dataset tacred \
--masking_mode grammar_and_ner \
--test_file ./data/tacred/test.jsonl \
--save_dir ./logs/ \
--model_file \
--batch_size 8 \
--log_dir ./logs/
```

## Trained Models
The models we trained on SemEval and TACRED to produce our paper results can be found here:
| Dataset | Masking Mode | P | R | F1 | Download |
| -------- | --------------- | ---- | ---- | ---- | --------------------------------------------------------------------------- |
| TACRED | grammar_and_ner | 70.0 | 65.0 | 67.4 | [Link](https://dfkide-my.sharepoint.com/:u:/g/personal/lehe02_dfki_de/EQ0zv4QicbVMoW6nBhnOkA8BV4Yzt7agM0hAfX07VQez4w?Web=0&Download=1) |
| SemEval | None | 87.6 | 86.8 | 87.1 | [Link](https://dfkide-my.sharepoint.com/:u:/g/personal/lehe02_dfki_de/EZBQNU99Uz9MiOuP-RHAdaYB7w8jIW0mADM4f05tGbhwZg?Web=0&Download=1) |

### Download and extract model files
First, download the archive corresponding to the model you want to evaluate (links in the table above).
```bash
wget --content-disposition
```

Extract the model archive containing model.pt, text_encoder.pkl, and label_encoder.pkl.
```bash
tar -xvzf
```

### Run evaluation
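Evaluation expects the save directory to hold the three extracted files (model.pt, text_encoder.pkl, and label_encoder.pkl) and the masking mode to be one of the values listed below. A quick check before launching the run can catch a wrong path or a mistyped mode early. This is a minimal sketch, not part of this repository: `preflight_problems` is a hypothetical helper, and it only checks that the files exist and that the mode name is known. It cannot tell whether the mode matches the one used for training.

```python
from pathlib import Path
from typing import List

# File names and masking modes as listed in this README.
REQUIRED_FILES = ("model.pt", "text_encoder.pkl", "label_encoder.pkl")
MASKING_MODES = ("None", "grammar_and_ner", "grammar", "ner", "unk")


def preflight_problems(save_dir: str, masking_mode: str) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    directory = Path(save_dir)
    for name in REQUIRED_FILES:
        if not (directory / name).is_file():
            problems.append(f"missing file: {directory / name}")
    if masking_mode not in MASKING_MODES:
        problems.append(f"unknown masking mode: {masking_mode!r}")
    return problems
```

An empty result from `preflight_problems(save_dir, "grammar_and_ner")` means the directory and mode look plausible for the TACRED model in the table above.
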
- `dataset`: dataset to evaluate, can be one of "semeval" or "tacred".
- `test-file`: path to the JSONL test file used during evaluation
- `log-dir`: directory to store the evaluation results and predictions
- `save-dir`: directory containing the downloaded model files (model.pt, text_encoder.pkl, and label_encoder.pkl)
- `masking-mode`: masking mode to use during evaluation, can be one of "None", "grammar_and_ner", "grammar", "ner" or "unk" (**caution:** must match the mode for training)

For example, to evaluate the TACRED model with "grammar_and_ner" masking, run the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python relation_extraction.py evaluate \
--dataset tacred \
--test-file .//test.jsonl \
--log-dir \
--save-dir \
--masking-mode grammar_and_ner
```

## Citations
If you use our code in your research or find our repository useful, please consider citing our work.

```
@InProceedings{alt_improving_2019,
author = {Alt, Christoph and H\"{u}bner, Marc and Hennig, Leonhard},
title = {Improving Relation Extraction by Pre-trained Language Representations},
booktitle = {Proceedings of AKBC 2019},
year = {2019},
url = {https://openreview.net/forum?id=BJgrxbqp67},
}
```

## License
lm-transformer-re is released under the MIT license. See [LICENSE](LICENSE) for additional details.

## References
1. [Improving language understanding by generative pre-training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf). Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
2. [PyTorch implementation of OpenAI's Finetuned Transformer Language Model](https://github.com/huggingface/pytorch-openai-transformer-lm)