{"id":20071224,"url":"https://github.com/fedenanni/reimplementing-tagme","last_synced_at":"2026-06-07T02:31:19.609Z","repository":{"id":91496109,"uuid":"196396423","full_name":"fedenanni/Reimplementing-TagMe","owner":"fedenanni","description":"A few scripts for using the Entity Linker TagMe, starting from a Wikipedia Dump.","archived":false,"fork":false,"pushed_at":"2020-02-17T08:35:10.000Z","size":4830,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-01-12T23:47:57.635Z","etag":null,"topics":["entity-aspects","entity-linking","tagme"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fedenanni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-11T13:06:52.000Z","updated_at":"2022-11-02T13:36:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"f35df89d-88c5-4b08-8124-9fbea3521474","html_url":"https://github.com/fedenanni/Reimplementing-TagMe","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fedenanni%2FReimplementing-TagMe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fedenanni%2FReimplementing-TagMe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fedenanni%2FReimplementing-TagMe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fedenanni%2FReimplementing-T
agMe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fedenanni","download_url":"https://codeload.github.com/fedenanni/Reimplementing-TagMe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241502792,"owners_count":19972956,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["entity-aspects","entity-linking","tagme"],"created_at":"2024-11-13T14:28:12.992Z","updated_at":"2025-11-28T02:08:09.509Z","avatar_url":"https://github.com/fedenanni.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Reimplementing-TagMe\nHow to rebuild the Entity Linker [TagMe](http://pages.di.unipi.it/ferragina/cikm2010.pdf) in a few scripts, starting from a Wikipedia Dump. 
This work is inspired by the paper \"On the Reproducibility of the TAGME Entity Linking System\" [[paper](http://hasibi.com/files/ecir2016-tagme.pdf), [code](https://github.com/hasibi/TAGME-Reproducibility)].\n\n## Pre-Processing Procedure\n\n### Process a Wikipedia Dump\n\nDownload a Wikipedia dump from [here](https://dumps.wikimedia.org/enwiki/) and process it with the [WikiExtractor](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) using the following command:\n\n```\npython WikiExtractor.py -l -s -o output/ [here put the path to the Wikipedia Dump .xml.bz2]\n```\nNote that the flag -s keeps the sections and the flag -l keeps the links.\n\n### Extract entity, mention and ngram frequencies + entity aspects\n\nOnce the WikiExtractor has processed the Wiki dump into the \"output/\" folder, the first step is to collect the set of all entity mentions in Wikipedia, so that you can later count their frequencies as ngrams. You can do this by using \n```\n1-CollectAllMentions.ipynb \n```\nwhich will produce an all_mentions.pickle file. Note that I am using an English word tokenizer, which is the only language-dependent component of the pipeline.\n\nThe second step extracts mention, ngram and entity counts as well as mention_to_entities statistics (e.g., how many times the mention \"Obama\" points to \"Barack_Obama\" and how many times to \"Michelle_Obama\"). The statistics are still split across the n folders that constitute the output of the WikiExtractor and are saved in the \"Store-Counts/\" folder as json files. The script also stores a .json file for each entity, with all its aspects (see [here](https://madoc.bib.uni-mannheim.de/49596/1/EAL.pdf) to learn more about Entity-Aspect Linking). \n```\n2-ExtractingFreqAndAspects.ipynb\n```\n\nThe final pre-processing script aggregates all counts needed for using TagMe into single .pickle files and saves them in the \"Resources/\" folder. 
You can do this by running:\n```\n3-AggregateCounts.ipynb\n```\nNote that after processing each json from \"Store-Counts/\", the script saves an intermediate count in \"Resources/\". This way you can already start using TagMe with partial statistics.\n\n### Download resources\n\nTo use my reimplementation of TagMe directly, without processing a Wikipedia dump, you can download all the resources needed from [here](https://drive.google.com/open?id=1lcq0PRRq8o_G-L-pQrV7GG-Btn-xPFlr). You will find the five .pickle files containing the needed statistics plus a tfidf_asps.pkl file with TF-IDF statistics for the final aspect linking step. As before, these statistics are computed on English text, but they are straightforward to produce for another language by changing the word tokenizer.\n\n## Using TagMe\n\nThe notebook\n```\nPresentation-TagMe.ipynb \n```\ngives a step-by-step overview of the TagMe algorithm for entity linking. It is designed to work with [RISE](https://github.com/damianavila/RISE). The last cell in the notebook shows the potential of using aspects to add further semantics to the linking process. If you'd like to know more about this, check out the [original TagMe](https://tagme.d4science.org/tagme/), the [work](http://hasibi.com/files/ecir2016-tagme.pdf) done by Hasibi et al. in assessing its reproducibility, and our recent [dataset and demo](https://federiconanni.com/eal-d/) of entity-aspect links.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffedenanni%2Freimplementing-tagme","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffedenanni%2Freimplementing-tagme","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffedenanni%2Freimplementing-tagme/lists"}