{"id":20650497,"url":"https://github.com/hazyresearch/bootleg","last_synced_at":"2025-04-06T09:06:42.517Z","repository":{"id":39671745,"uuid":"286367631","full_name":"HazyResearch/bootleg","owner":"HazyResearch","description":"Self-Supervision for Named Entity Disambiguation at the Tail","archived":false,"fork":false,"pushed_at":"2022-06-14T02:45:59.000Z","size":7962,"stargazers_count":215,"open_issues_count":5,"forks_count":27,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-30T08:09:23.066Z","etag":null,"topics":["ai","machine-learning","named-entity-disambiguation","self-supervision"],"latest_commit_sha":null,"homepage":"http://hazyresearch.stanford.edu/bootleg","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HazyResearch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-10T03:34:16.000Z","updated_at":"2025-02-09T02:14:43.000Z","dependencies_parsed_at":"2022-09-05T04:41:23.605Z","dependency_job_id":null,"html_url":"https://github.com/HazyResearch/bootleg","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HazyResearch%2Fbootleg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HazyResearch%2Fbootleg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HazyResearch%2Fbootleg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HazyResearch%2Fbootleg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HazyResearch","download_
url":"https://codeload.github.com/HazyResearch/bootleg/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247457799,"owners_count":20941906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","machine-learning","named-entity-disambiguation","self-supervision"],"created_at":"2024-11-16T17:20:27.749Z","updated_at":"2025-04-06T09:06:42.491Z","avatar_url":"https://github.com/HazyResearch.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"web/images/full_logo.png\" width=\"150\" class=\"center\"/\u003e\n\u003c/p\u003e\n\n![GitHub Workflow Status](https://img.shields.io/github/workflow/status/HazyResearch/bootleg/CI)\n[![codecov](https://codecov.io/gh/HazyResearch/bootleg/branch/master/graph/badge.svg)](https://codecov.io/gh/HazyResearch/bootleg)\n[![Documentation Status](https://readthedocs.org/projects/bootleg/badge/?version=latest)](https://bootleg.readthedocs.io/en/latest/?badge=latest)\n[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n# Self-Supervision for Named Entity Disambiguation at the Tail\nBootleg is a self-supervised named entity disambiguation (NED) system for English built to improve disambiguation of entities that occur infrequently, or not at all, in training data. We call these entities *tail* entities. This is a critical task as the majority of entities are rare. 
The core insight behind Bootleg is that these tail entities can be disambiguated by reasoning over entity types and relations. We give an [overview](#bootleg-overview) of how Bootleg achieves this below. For details, please see our [blog post](https://hazyresearch.stanford.edu/bootleg_blog) and [paper](http://arxiv.org/abs/2010.10363).\n\nNote that Bootleg is *actively under development* and feedback is welcome. Submit bugs on the Issues page or feel free to submit your contributions as a pull request.\n\n**Update 9-25-2021**: We changed our architecture to be a biencoder. Our entity textual input still has all the goodness of types and KG relations, but our model now requires less storage space and has improved performance. A secret to getting the biencoder to work over the tail was heavy masking of the mention in the context encoder and entity title in the entity encoder.\n\n**Update 2-15-2021**: We made a major rewrite of the codebase and moved to using Emmental for training. Check out the [changelog](CHANGELOG.rst) for details.\n\n# Getting Started\n\nInstall via\n\n```\ngit clone git@github.com:HazyResearch/bootleg bootleg\ncd bootleg\npython3 setup.py install\n```\n\nCheck out our installation and quickstart guide [here](https://bootleg.readthedocs.io/en/latest/gettingstarted/install.html).\n\n## Using a Trained Model\n### Models\nBelow is the link to download the English Bootleg model. The download comes with the saved model and config to run the model. We show in our [quickstart guide](https://bootleg.readthedocs.io/en/latest/gettingstarted/quickstart.html) and [end-to-end](tutorials/end2end_ned_tutorial.ipynb) tutorial how to load a config and run a model.\n\n| Model               | Description                     | Number Parameters | Link     |\n|-------------------  |---------------------------------|-------------------|----------|\n| BootlegUncased      | Uses titles, descriptions, types, and KG relations. Trained on uncased data. 
| 110M | [Download](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/latest/bootleg_uncased.tar.gz) |\n\n### Embeddings\nBelow is the link to download a dump of all entity embeddings from our entity encoder. Follow our entity profile tutorial [here](https://github.com/HazyResearch/bootleg/blob/master/tutorials/entity_profile_tutorial.ipynb) to load our EntityProfile. From there, you can use our ```get_eid``` [method](https://bootleg.readthedocs.io/en/latest/apidocs/bootleg.symbols.html#bootleg.symbols.entity_profile.EntityProfile.get_eid) to access the row id for an entity.\n\n| Embeddings               | Description                     | Number Parameters | Link     |\n|-------------------  |---------------------------------|-------------------|----------|\n| 5.8M Wikipedia Entities      | Embeddings from BootlegUncased. | 1.2B | [Download](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/latest/bootleg_uncased_entity_embeddings.npy.tar.gz) |\n\n### Metadata\nBelow is the link to download a dump of all entity metadata to use in our entity profile tutorial [here](https://github.com/HazyResearch/bootleg/blob/master/tutorials/entity_profile_tutorial.ipynb).\n\n| Metadata               | Description                    | Link     |\n|-------------------  |---------------------------------|----------|\n| 5.8M Wikipedia Entities      | Wikidata metadata for entities. | [Download](https://bootleg-data.s3.us-west-2.amazonaws.com/data/latest/entity_db.tar.gz) |\n\n## Training\nWe provide detailed training instructions [here](https://bootleg.readthedocs.io/en/latest/gettingstarted/training.html). We provide a starter config [here](configs/standard/train.yaml). You only need to adjust `data_config.data_dir` and `data_config.entity_dir` to point to your local data. You may need to shrink the model size to fit on your available hardware. 
Then use the training zsh script [here](scripts/train.zsh).\n\n## Tutorials\nWe provide tutorials to help users get familiar with Bootleg [here](tutorials/).\n\n# Bootleg Overview\nGiven an input sentence, Bootleg outputs a predicted entity for each detected mention. Bootleg first extracts mentions in the\nsentence, and for each mention, we extract its set of possible candidate entities\nand any structural information about that entity, e.g., type information or knowledge graph (KG) information. Bootleg leverages this information to generate an entity embedding through a Transformer entity encoder. The mention and its surrounding context are encoded by a context encoder. The entity with the highest dot product with the context is selected for each mention.\n\n![Dataflow](web/images/bootleg_dataflow.png \"Bootleg Dataflow\")\n\nMore details can be found [here](https://bootleg.readthedocs.io/en/latest/gettingstarted/input_data.html).\n\n## Inference\nGiven a pretrained model, we support three types of inference: `--mode eval`, `--mode dump_preds`, and `--mode dump_embs`. `Eval` mode is the fastest option and will run the test files through the model and output aggregated quality metrics to the log. `Dump_preds` mode will write the individual predictions and corresponding probabilities to a jsonlines file. This is useful for error analysis. `Dump_embs` mode is the same as `dump_preds`, but will additionally output entity embeddings. These can then be read and processed in a downstream system. See this [notebook](tutorials/end2end_ned_tutorial.ipynb) for how to do this with a downloaded Bootleg model.\n\n## Entity Embedding Extraction\nAs we have a separate encoder for generating an entity representation, we also support the ability to dump all entities to create a single entity embedding matrix for use downstream. This is done through the ```bootleg.extract_all_entities``` script. 
See this [notebook](tutorials/entity_embedding_tutorial.ipynb) for how to do this with a downloaded Bootleg model.\n\n## Training\nWe recommend using GPUs for training Bootleg models. For large datasets, we support distributed training with PyTorch's DistributedDataParallel framework to distribute batches across multiple GPUs. Check out the [Basic Training](https://bootleg.readthedocs.io/en/latest/gettingstarted/training.html) and [Advanced Training](https://bootleg.readthedocs.io/en/latest/advanced/distributed_training.html) tutorials for more information and sample data!\n\n## Downstream Tasks\nBootleg produces contextual entity embeddings (as well as learned static embeddings) that can be used in downstream tasks, such as relation extraction and question answering. Check out the [tutorial](tutorials) to see how this is done.\n\n## Other Languages\nThe released Bootleg model only supports English, but we have trained multi-lingual models using Wikipedia and Wikidata. If you are interested in doing this, please let us know by opening an issue or emailing lorr1@cs.stanford.edu. We have data prep code to help prepare multi-lingual data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhazyresearch%2Fbootleg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhazyresearch%2Fbootleg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhazyresearch%2Fbootleg/lists"}