{"id":15460472,"url":"https://github.com/izuna385/entity-linking-tutorial","last_synced_at":"2025-04-22T10:38:29.170Z","repository":{"id":37736395,"uuid":"334435568","full_name":"izuna385/Entity-Linking-Tutorial","owner":"izuna385","description":"Bi-encoder Based Entity Linking Tutorial. You can run experiment only in 5 minutes. Experiments on Co-lab pro GPU are also supported!","archived":false,"fork":false,"pushed_at":"2021-05-03T15:11:31.000Z","size":2831,"stargazers_count":34,"open_issues_count":4,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-22T10:38:02.873Z","etag":null,"topics":["allennlp","approximate-nearest-neighbor-search","bert","entity-linking","named-entity-disambiguation","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://medium.com/nerd-for-tech/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/izuna385.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-30T14:44:27.000Z","updated_at":"2024-12-24T09:57:16.000Z","dependencies_parsed_at":"2022-09-13T19:31:06.553Z","dependency_job_id":null,"html_url":"https://github.com/izuna385/Entity-Linking-Tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FEntity-Linking-Tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FEntity-Linking-Tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385
%2FEntity-Linking-Tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/izuna385%2FEntity-Linking-Tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/izuna385","download_url":"https://codeload.github.com/izuna385/Entity-Linking-Tutorial/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250222048,"owners_count":21394807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allennlp","approximate-nearest-neighbor-search","bert","entity-linking","named-entity-disambiguation","natural-language-processing"],"created_at":"2024-10-01T23:22:01.892Z","updated_at":"2025-04-22T10:38:29.089Z","avatar_url":"https://github.com/izuna385.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Entity-Linking-Tutorial\n* In this tutorial, we will implement a Bi-encoder based entity disambiguation system using the BC5CDR dataset and data from the MeSH knowledge base.\n\n* We will compare surface-form based candidate generation with Bi-encoder based candidate generation to understand the power of the Bi-encoder model in entity linking.\n## Docs in English\n* https://izuna385.medium.com/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500\n\n## Docs in Japanese\n* [Part 1: History](https://qiita.com/izuna385/items/9d658620b9b96b0b4ec9)\n* [Part 2: Preprocessing](https://qiita.com/izuna385/items/c2918874fbb564acf1e0)\n* [Part 3: Model and Evaluation](https://qiita.com/izuna385/items/367b7b365a2791ee4f8e)\n* [Part 4: 
ANN-search with Faiss](https://qiita.com/izuna385/items/bce14031e8a443a0db44)\n* [Sub Contents: Reproduction of experimental results using Colab-Pro](https://qiita.com/izuna385/items/bbac95594e20e6990189)\n\n## Tutorial with Colab-Pro.\nSee [here](./docs/Colab_Pro_Tutorial.md).\n\n## Environment Setup\n* First, create a base environment with conda.\n```\n# If you are not using Colab Pro, create the environment with conda.\n$ conda create -n allennlp python=3.7\n$ conda activate allennlp\n$ pip install -r requirements.txt\n```\n\n## Preprocessing\n\n* First, download the preprocessed files from [here](https://drive.google.com/drive/folders/1P-iXskc-hbqXateWh3wRknni_knqsagN?usp=sharing), then unzip them.\n\n* Second, download the [BC5CDR dataset](https://biocreative.bioinformatics.udel.edu/resources/corpora/biocreative-v-cdr-corpus/) to `./dataset/` and unzip it.\n\n* Place `CDR_DevelopmentSet.PubTator.txt`, `CDR_TestSet.PubTator.txt` and `CDR_TrainingSet.PubTator.txt` under `./dataset/`.\n\n* Then, run `python3 BC5CDRpreprocess.py` and `python3 preprocess_mesh.py`.\n\n## Models and Scoring\n### Models\n* Surface-Candidate based\n  \n  ![biencoder](./docs/candidate_biencoder.png)\n  \n* ANN-search based\n  \n  ![entire_biencoder](./docs/biencoder.png)\n\n### Scoring\n* Default: Dot product between mention and predicted entity.\n\n  ![scoring](./docs/scoring.png)\n\n  * Derived from [[Logeswaran et al., '19]](https://arxiv.org/abs/1906.07348)\n\n* L2-distance and cosine similarity are also supported.\n\n## Experiment and Evaluation\n```\n$ rm -r serialization_dir # Remove the previous results if you already ran `python3 main.py -debug` for debugging.\n$ python3 main.py\n```\n\n## Parameters\nOnly the parameters critical for training and evaluation are listed here. 
For further details, see `parameters.py`.\n\n| Parameter Name            | Description                                                                                                                                                                  | Default      |\n|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|\n| `batch_size_for_train`    | Batch size during training. With a larger batch, the encoder learns to choose the correct answer from more negative examples.                                                | `16`         |\n| `lr`                      | Learning rate.                                                                                                                                                               | `1e-5`       |\n| `max_candidates_num`      | Number of candidates generated for each mention from its surface form.                                                                                                       | `5`          |\n| `search_method_for_faiss` | Specifies whether to use cosine similarity (`cossim`), inner product (`indexflatip`), or L2 distance (`indexflatl2`) when performing approximate nearest neighbor search. 
| `indexflatip`|\n\n\n## Result\n\n* Surface-Candidate based recall\n\n  | Generated Candidates Num | 5     | 10    | 20    |\n  |--------------------------|-------|-------|-------|\n  | dev_recall               | 76.80 | 79.91 | 80.92 |\n  | test_recall              | 74.35 | 77.14 | 78.25 |\n\n### `batch_size_for_train: 16`\n\n* Surface-Candidate based acc.\n  \n  | Generated Candidates Num | 5     | 10    | 20    |\n  |--------------------------|-------|-------|-------|\n  | dev_acc                  | 59.85 | 52.56 | 47.23 |\n  | test_acc                 | 58.51 | 51.38 | 45.69 |\n\n* ANN-search Based \n\n  (Generated Candidates Num: 50 (Fixed))\n  \n  | Recall@X   | 1 (Acc.) | 5     | 10    | 50    |\n  |------------|----------|-------|-------|-------|\n  | dev_recall | 21.58    | 42.28 | 50.48 | 67.11 |\n  | test_recall| 21.50    | 40.29 | 47.95 | 64.52 |\n\n### `batch_size_for_train: 48`\n\n* Surface-Candidate based acc.\n  \n  | Generated Candidates Num | 5     | 10    | 20    |\n  |--------------------------|-------|-------|-------|\n  | dev_acc                  | 72.39 | 68.21 | 65.40 |\n  | test_acc                 | 70.95 | 66.87 | 63.72 |\n\n* ANN-search Based \n\n  (Generated Candidates Num: 50 (Fixed))\n  \n  | Recall@X   | 1 (Acc.) | 5     | 10    | 50    |\n  |------------|----------|-------|-------|-------|\n  | dev_recall | 58.86    | 74.33 | 78.14 | 83.10 |\n  | test_recall| 57.66    | 73.14 | 76.73 | 81.39 |\n\n## LICENSE\nMIT","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fizuna385%2Fentity-linking-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fizuna385%2Fentity-linking-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fizuna385%2Fentity-linking-tutorial/lists"}