https://github.com/izuna385/entity-linking-tutorial
Bi-encoder based entity linking tutorial. You can run an experiment in only 5 minutes. Experiments on a Colab Pro GPU are also supported!
- Host: GitHub
- URL: https://github.com/izuna385/entity-linking-tutorial
- Owner: izuna385
- Created: 2021-01-30T14:44:27.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-05-03T15:11:31.000Z (over 4 years ago)
- Last Synced: 2025-04-22T10:38:02.873Z (6 months ago)
- Topics: allennlp, approximate-nearest-neighbor-search, bert, entity-linking, named-entity-disambiguation, natural-language-processing
- Language: Python
- Homepage: https://medium.com/nerd-for-tech/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500
- Size: 2.7 MB
- Stars: 34
- Watchers: 2
- Forks: 4
- Open Issues: 4
Metadata Files:
- Readme: README.md
# Entity-Linking-Tutorial
* In this tutorial, we implement a Bi-encoder based entity disambiguation system using the BC5CDR dataset and data from the MeSH knowledge base.
* We compare surface-form based candidate generation with Bi-encoder based candidate generation, to understand the power of the Bi-encoder model in entity linking.
## Docs for English
* https://izuna385.medium.com/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500

## Docs for Japanese
* [Part 1: History](https://qiita.com/izuna385/items/9d658620b9b96b0b4ec9)
* [Part 2: Preprocessing](https://qiita.com/izuna385/items/c2918874fbb564acf1e0)
* [Part 3: Model and Evaluation](https://qiita.com/izuna385/items/367b7b365a2791ee4f8e)
* [Part 4: ANN-search with Faiss](https://qiita.com/izuna385/items/bce14031e8a443a0db44)
* [Sub Contents: Reproduction of experimental results using Colab-Pro](https://qiita.com/izuna385/items/bbac95594e20e6990189)

## Tutorial with Colab-Pro
See [here](./docs/Colab_Pro_Tutorial.md).

## Environment Setup
* First, create a base environment with conda.
```
# If you don't use colab-pro, create environment from conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
```

## Preprocessing
* First, download the preprocessed files from [here](https://drive.google.com/drive/folders/1P-iXskc-hbqXateWh3wRknni_knqsagN?usp=sharing), then unzip them.
* Second, download [BC5CDR dataset](https://biocreative.bioinformatics.udel.edu/resources/corpora/biocreative-v-cdr-corpus/) to `./dataset/` and unzip.
* You have to place `CDR_DevelopmentSet.PubTator.txt`, `CDR_TestSet.PubTator.txt` and `CDR_TrainingSet.PubTator.txt` under `./dataset/`.
* Then, run `python3 BC5CDRpreprocess.py` and `python3 preprocess_mesh.py`.
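For orientation, the BC5CDR files use the PubTator format: a title line (`PMID|t|…`), an abstract line (`PMID|a|…`), and tab-separated annotation lines carrying each mention span and its MeSH concept ID. A minimal, illustrative parser sketch (this is *not* the repository's `BC5CDRpreprocess.py`; the sample document below is abbreviated):

```python
# Illustrative sketch of parsing one PubTator-format document.
# The repository's BC5CDRpreprocess.py performs the real preprocessing.

def parse_pubtator(doc: str):
    """Split a PubTator document into text fields and mention annotations."""
    record = {"title": "", "abstract": "", "mentions": []}
    for line in doc.strip().splitlines():
        if "|t|" in line:
            record["title"] = line.split("|t|", 1)[1]
        elif "|a|" in line:
            record["abstract"] = line.split("|a|", 1)[1]
        elif "\t" in line:
            fields = line.split("\t")
            if len(fields) >= 6:  # pmid, start, end, surface, type, concept ID
                record["mentions"].append({
                    "start": int(fields[1]),
                    "end": int(fields[2]),
                    "surface": fields[3],
                    "type": fields[4],
                    "mesh_id": fields[5],
                })
    return record

sample = (
    "227508|t|Naloxone reverses the antihypertensive effect of clonidine.\n"
    "227508|a|(abstract text abbreviated for this sketch)\n"
    "227508\t0\t8\tNaloxone\tChemical\tD009270\n"
)
parsed = parse_pubtator(sample)
print(parsed["mentions"][0]["mesh_id"])  # D009270
```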
## Models and Scoring
### Models
* Surface-Candidate based

* ANN-search based
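The difference between the two: surface-form candidate generation retrieves entities whose names resemble the mention string, while ANN search retrieves nearest neighbors in the encoder's embedding space. A hypothetical sketch of the surface-form side (the toy knowledge base below is illustrative; the repository's actual candidate generator may differ):

```python
# Hypothetical surface-form candidate generation: rank knowledge-base
# entries by string similarity to the mention's surface form.
from difflib import SequenceMatcher

def surface_candidates(mention: str, kb: dict, max_candidates_num: int = 5):
    """Return up to max_candidates_num KB IDs ranked by name similarity."""
    scored = [
        (SequenceMatcher(None, mention.lower(), name.lower()).ratio(), kb_id)
        for kb_id, name in kb.items()
    ]
    scored.sort(reverse=True)
    return [kb_id for _, kb_id in scored[:max_candidates_num]]

# Toy MeSH-like knowledge base (entries are illustrative).
kb = {
    "D009270": "Naloxone",
    "D003000": "Clonidine",
    "D006973": "Hypertension",
}
print(surface_candidates("naloxon", kb, max_candidates_num=2))
```

Surface-form matching like this cannot recover candidates whose names share no characters with the mention, which is exactly the gap the Bi-encoder's semantic search is meant to close.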
### Scoring
* Default: dot product between the mention embedding and the candidate entity embedding.
* Derived from [[Logeswaran et al., '19]](https://arxiv.org/abs/1906.07348)
* L2-distance and cosine similarity are also supported.
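The three scores can be sketched as follows (pure-Python toy vectors for illustration; the actual model scores BERT encoder outputs, and the `search_method_for_faiss` parameter selects the matching metric on the Faiss side):

```python
# Sketch of the three supported scores between a mention embedding m
# and an entity embedding e. Vectors are toy stand-ins for BERT outputs.
import math

def dot(m, e):            # default score; corresponds to Faiss IndexFlatIP
    return sum(mi * ei for mi, ei in zip(m, e))

def l2_distance(m, e):    # corresponds to Faiss IndexFlatL2 (lower is better)
    return math.sqrt(sum((mi - ei) ** 2 for mi, ei in zip(m, e)))

def cosine(m, e):         # dot product on L2-normalized vectors
    norm = math.sqrt(sum(x * x for x in m)) * math.sqrt(sum(x * x for x in e))
    return dot(m, e) / norm

m = [0.2, 0.4, 0.4]
e = [0.1, 0.5, 0.3]
print(round(dot(m, e), 3))  # 0.34
```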
## Experiment and Evaluation
```
$ rm -r serialization_dir  # Remove a previous run's results, e.g. after `python3 main.py -debug` for debugging.
$ python3 main.py
```

## Parameters
Here we note only the critical parameters for training and evaluation. For further detail, see `parameters.py`.

| Parameter Name | Description | Default |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| `batch_size_for_train` | Batch size during training. Larger batches give the encoder more in-batch negative examples to discriminate the correct entity from. | `16` |
| `lr` | Learning rate. | `1e-5` |
| `max_candidates_num` | How many candidates to generate for each mention from its surface form. | `5` |
| `search_method_for_faiss` | Whether to use cosine similarity (`cossim`), inner product (`indexflatip`), or L2 distance (`indexflatl2`) for approximate nearest-neighbor search. | `indexflatip` |

## Result
* Surface-Candidate based recall
| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_recall | 76.80 | 79.91 | 80.92 |
| test_recall | 74.35 | 77.14 | 78.25 |

### `batch_size_for_train: 16`
* Surface-Candidate based acc.
| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_acc | 59.85 | 52.56 | 47.23 |
| test_acc | 58.51 | 51.38 | 45.69 |

* ANN-search based
(Generated Candidates Num: 50 (Fixed))
| Recall@X | 1 (Acc.) | 5 | 10 | 50 |
|------------|----------|-------|-------|-------|
| dev_recall | 21.58 | 42.28 | 50.48 | 67.11 |
| test_recall | 21.50 | 40.29 | 47.95 | 64.52 |

### `batch_size_for_train: 48`
* Surface-Candidate based acc.
| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_acc | 72.39 | 68.21 | 65.40 |
| test_acc | 70.95 | 66.87 | 63.72 |

* ANN-search based
(Generated Candidates Num: 50 (Fixed))
| Recall@X | 1 (Acc.) | 5 | 10 | 50 |
|------------|----------|-------|-------|-------|
| dev_recall | 58.86 | 74.33 | 78.14 | 83.10 |
| test_recall | 57.66 | 73.14 | 76.73 | 81.39 |

## LICENSE
MIT