# Entity-Linking-Tutorial
* In this tutorial, we implement a Bi-encoder-based entity disambiguation system using the BC5CDR dataset and the MeSH knowledge base.

* We compare surface-form-based candidate generation with Bi-encoder-based candidate generation to see the strength of the Bi-encoder model in entity linking.
## Docs in English
* https://izuna385.medium.com/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500

## Docs in Japanese
* [Part 1: History](https://qiita.com/izuna385/items/9d658620b9b96b0b4ec9)
* [Part 2: Preprocessing](https://qiita.com/izuna385/items/c2918874fbb564acf1e0)
* [Part 3: Model and Evaluation](https://qiita.com/izuna385/items/367b7b365a2791ee4f8e)
* [Part 4: ANN-search with Faiss](https://qiita.com/izuna385/items/bce14031e8a443a0db44)
* [Sub Contents: Reproduction of experimental results using Colab-Pro](https://qiita.com/izuna385/items/bbac95594e20e6990189)

## Tutorial with Colab-Pro
See [here](./docs/Colab_Pro_Tutorial.md).

## Environment Setup
* First, create the base environment with conda.
```
# If you are not using Colab Pro, create the environment with conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
```

## Preprocessing

* First, download the preprocessed files from [here](https://drive.google.com/drive/folders/1P-iXskc-hbqXateWh3wRknni_knqsagN?usp=sharing) and unzip them.

* Second, download the [BC5CDR dataset](https://biocreative.bioinformatics.udel.edu/resources/corpora/biocreative-v-cdr-corpus/) to `./dataset/` and unzip it.

* Make sure `CDR_TrainingSet.PubTator.txt`, `CDR_DevelopmentSet.PubTator.txt`, and `CDR_TestSet.PubTator.txt` are placed directly under `./dataset/`.

* Then, run `python3 BC5CDRpreprocess.py` and `python3 preprocess_mesh.py` (a minimal check-and-run sketch follows this list).
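
If you want to sanity-check the layout before preprocessing, the sketch below (a hypothetical helper, not part of the repository) verifies that the three PubTator files are in place and then runs the two scripts:

```
# Hypothetical helper: check dataset placement, then run the preprocessing scripts.
import subprocess
from pathlib import Path

required = [
    "CDR_TrainingSet.PubTator.txt",
    "CDR_DevelopmentSet.PubTator.txt",
    "CDR_TestSet.PubTator.txt",
]

missing = [name for name in required if not (Path("./dataset") / name).exists()]
if missing:
    raise FileNotFoundError(f"Place these files under ./dataset/ first: {missing}")

subprocess.run(["python3", "BC5CDRpreprocess.py"], check=True)  # BC5CDR mention preprocessing
subprocess.run(["python3", "preprocess_mesh.py"], check=True)   # MeSH knowledge-base preprocessing
```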

## Models and Scoring
### Models
* Surface-Candidate based

![biencoder](./docs/candidate_biencoder.png)

* ANN-search based

![entire_biencoder](./docs/biencoder.png)
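
As a rough illustration of the idea shared by both variants, here is a minimal bi-encoder sketch (not the repository's actual model; the `bert-base-uncased` checkpoint is only a placeholder): a mention encoder and an entity encoder map text into the same vector space, and candidate entities are scored against the mention embedding.

```
# Minimal bi-encoder sketch: two independent BERT encoders for mentions and entities.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")    # placeholder checkpoint
mention_encoder = AutoModel.from_pretrained("bert-base-uncased")
entity_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]    # use the [CLS] embedding as the representation

mention_vec = encode(mention_encoder, ["naloxone reverses the antihypertensive effect of ..."])
entity_vecs = encode(entity_encoder, ["Naloxone: an opioid antagonist ...",
                                      "Hypertension: persistently elevated blood pressure ..."])
scores = mention_vec @ entity_vecs.T          # one score per candidate entity
```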

### Scoring
* Default: dot product between the mention embedding and the candidate entity embedding.

![scoring](./docs/scoring.png)

* Derived from [[Logeswaran et al., '19]](https://arxiv.org/abs/1906.07348)

* L2 distance and cosine similarity are also supported (see the sketch below).
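
For concreteness, a minimal sketch of the three scoring options applied to precomputed embeddings (dummy vectors, not the repository's code):

```
# Scoring a mention embedding m against candidate entity embeddings E (one row per candidate).
import numpy as np

m = np.random.rand(768)        # dummy mention embedding
E = np.random.rand(5, 768)     # dummy embeddings for 5 candidate entities

dot = E @ m                                                        # default: dot product (higher is better)
l2 = np.linalg.norm(E - m, axis=1)                                 # L2 distance (lower is better)
cos = (E @ m) / (np.linalg.norm(E, axis=1) * np.linalg.norm(m))    # cosine similarity (higher is better)

predicted_entity = int(np.argmax(dot))                             # index of the predicted entity
```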

## Experiment and Evaluation
```
$ rm -r serialization_dir  # Remove the previous run's results, e.g., after debugging with `python3 main.py -debug`.
$ python3 main.py
```

## Parameters
Only the critical parameters for training and evaluation are noted here. For further details, see `parameters.py`.

| Parameter Name | Description | Default |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| `batch_size_for_train` | Batch size during training. Larger batches give the encoder more in-batch negatives to distinguish the correct entity from. | `16` |
| `lr` | Learning rate. | `1e-5` |
| `max_candidates_num` | How many candidates to generate for each mention from its surface form. | `5` |
| `search_method_for_faiss` | Which metric to use for approximate nearest-neighbor search: cosine similarity (`cossim`), inner product (`indexflatip`), or L2 distance (`indexflatl2`). | `indexflatip`|
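
As a rough guide, the sketch below shows how the three `search_method_for_faiss` options could map onto Faiss indexes (an illustrative assumption, not necessarily the repository's exact code):

```
# Illustrative mapping of search_method_for_faiss onto Faiss index types.
import faiss
import numpy as np

dim = 768                                                    # BERT-base embedding size
entity_embs = np.random.rand(1000, dim).astype("float32")    # dummy KB entity embeddings
mention_embs = np.random.rand(4, dim).astype("float32")      # dummy mention embeddings

method = "indexflatip"                                       # or "indexflatl2" / "cossim"

if method == "indexflatl2":
    index = faiss.IndexFlatL2(dim)           # exact L2-distance search
elif method == "cossim":
    faiss.normalize_L2(entity_embs)          # cosine similarity = inner product of unit vectors
    faiss.normalize_L2(mention_embs)
    index = faiss.IndexFlatIP(dim)
else:                                        # "indexflatip": raw inner product (the default)
    index = faiss.IndexFlatIP(dim)

index.add(entity_embs)                                       # register all KB entities
scores, candidate_ids = index.search(mention_embs, 50)       # top-50 candidates per mention
```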

## Result

* Surface-Candidate based recall

| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_recall | 76.80 | 79.91 | 80.92 |
| test_recall | 74.35 | 77.14 | 78.25 |

### `batch_size_for_train: 16`

* Surface-Candidate based acc.

| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_acc | 59.85 | 52.56 | 47.23 |
| test_acc | 58.51 | 51.38 | 45.69 |

* ANN-search Based

(Generated Candidates Num: 50 (Fixed))

| Recall@X | 1 (Acc.) | 5 | 10 | 50 |
|------------|----------|-------|-------|-------|
| dev_recall | 21.58 | 42.28 | 50.48 | 67.11 |
| test_recall| 21.50 | 40.29 | 47.95 | 64.52 |

### `batch_size_for_train: 48`

* Surface-Candidate based acc.

| Generated Candidates Num | 5 | 10 | 20 |
|--------------------------|-------|-------|-------|
| dev_acc | 72.39 | 68.21 | 65.40 |
| test_acc | 70.95 | 66.87 | 63.72 |

* ANN-search Based

(Generated Candidates Num: 50 (Fixed))

| Recall@X | 1 (Acc.) | 5 | 10 | 50 |
|------------|----------|-------|-------|-------|
| dev_recall | 58.86 | 74.33 | 78.14 | 83.10 |
| test_recall| 57.66 | 73.14 | 76.73 | 81.39 |

## LICENSE
MIT