https://github.com/ub-mannheim/spacyopentapioca
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata
- Host: GitHub
- URL: https://github.com/ub-mannheim/spacyopentapioca
- Owner: UB-Mannheim
- License: mit
- Created: 2021-09-09T09:57:45.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-04-01T09:04:57.000Z (over 1 year ago)
- Last Synced: 2024-12-19T09:07:07.022Z (6 days ago)
- Topics: entity-linking, named-entity-linking, spacy, spacy-extensions, spacy-pipeline, wikidata
- Language: Python
- Homepage: https://ub-mannheim.github.io/spacyopentapioca
- Size: 1.54 MB
- Stars: 92
- Watchers: 8
- Forks: 8
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spaCyOpenTapioca
[![PyPI version](https://badge.fury.io/py/spacyopentapioca.svg)](https://badge.fury.io/py/spacyopentapioca)
A [spaCy](https://spacy.io) wrapper of [OpenTapioca](https://opentapioca.org) for named entity linking on Wikidata.
## Table of contents
* [Installation](#installation)
* [How to use](#how-to-use)
* [Local OpenTapioca](#local-opentapioca)
* [Vizualization](#vizualization)
## Installation
```shell
pip install spacyopentapioca
```
or
```shell
git clone https://github.com/UB-Mannheim/spacyopentapioca
cd spacyopentapioca/
pip install .
```
## How to use
After installation the OpenTapioca pipeline can be used without any other pipelines:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
```
```shell
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)
```
The types and aliases are also available:
```python
for span in doc.ents:
    print((span._.types, span._.aliases[0:5]))
```
```shell
({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}, ['IJalimani', 'R. F. A.', 'Alemania', '도이칠란트', 'Germaniya'])
```
The Wikidata QIDs are attached to tokens:
```python
for token in doc:
    print((token.text, token.ent_kb_id_))
```
```shell
('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Germany', 'Q183')
('.', '')
```
The raw response of the OpenTapioca API can be accessed through the doc and span objects:
```python
raw_annotations1 = doc._.annotations
raw_annotations2 = [span._.annotations for span in doc.ents]
```
The partial metadata for the response returned by the OpenTapioca API is available as:
```python
doc._.metadata
```
The available span extensions are:
```python
span._.annotations
span._.description
span._.aliases
span._.rank
span._.score
span._.types
span._.label
span._.extra_aliases
span._.nb_sitelinks
span._.nb_statements
```
Note that spaCyOpenTapioca applies only minimal processing to the entities that appear in `doc.ents`. All entities returned by OpenTapioca are available in `doc.spans['all_entities_opentapioca']`.
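The `span._.types` dictionary printed earlier maps Wikidata identifiers to booleans. A minimal sketch of picking out the identifiers flagged `True` follows. It uses the printed output for "Christian Drosten" as plain data, so it runs without calling the API. The helper name `matching_types` is an illustration and is not part of spaCyOpenTapioca.

```python
def matching_types(types):
    """Return the identifiers whose flag is True, in dictionary order."""
    return [type_id for type_id, matched in types.items() if matched]


# Copied from the example output above for "Christian Drosten".
drosten_types = {'Q43229': False, 'Q618123': False, 'Q5': True,
                 'P2427': False, 'P1566': False, 'P496': True}

print(matching_types(drosten_types))  # ['Q5', 'P496']
```

With a processed document, the same helper could presumably be applied as `matching_types(span._.types)` for each span in `doc.ents`.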
### Batching
Batched asynchronous requests to the OpenTapioca API are made via `nlp.pipe(List[str])`:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
docs = nlp.pipe(
    [
        "Christian Drosten works in Germany.",
        "Momofuku Ando was born in Japan.",
    ]
)
for doc in docs:
    for span in doc.ents:
        print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
```
```shell
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)
('Momofuku Ando', 'Q317858', 'PERSON', 'Taiwanese-Japanese businessman', 3.6012208212234302)
('Japan', 'Q17', 'LOC', 'sovereign state in East Asia, situated on an archipelago of five main and over 6,800 smaller islands', 2.349944834167907)
```
## Local OpenTapioca
If OpenTapioca is deployed locally, specify the URL of the new OpenTapioca API in the config:
```python
import spacy
nlp = spacy.blank("en")
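# Hypothetical placeholder: replace with the URL of your local OpenTapioca API.
OpenTapiocaAPI = "http://localhost:8457/api/annotate"
nlp.add_pipe('opentapioca', config={"url": OpenTapiocaAPI})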
doc = nlp("Christian Drosten works in Germany.")
```
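When the API URL differs between machines, one option is to read it from the environment instead of hard-coding it. The sketch below only builds the `config` dictionary passed to `nlp.add_pipe`. The environment variable name `OPENTAPIOCA_URL` and the fallback URL are assumptions made for illustration; neither comes from spaCyOpenTapioca.

```python
import os

# Hypothetical names: adjust both to your own deployment.
ENV_VAR = "OPENTAPIOCA_URL"
FALLBACK_URL = "http://localhost:8457/api/annotate"


def opentapioca_config(environ=os.environ):
    """Build the config dict for nlp.add_pipe('opentapioca', config=...)."""
    return {"url": environ.get(ENV_VAR, FALLBACK_URL)}


# A plain dict stands in for os.environ so the sketch is reproducible.
print(opentapioca_config({"OPENTAPIOCA_URL": "http://tapioca.example/api/annotate"}))
```

The result would then be passed as `nlp.add_pipe('opentapioca', config=opentapioca_config())`.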
## Vizualization
NEL visualization was added to spaCy via [pull request 9199](https://github.com/explosion/spaCy/pull/9199) for [issue 9129](https://github.com/explosion/spaCy/issues/9129). It is supported by spaCy >= 3.1.4.
Use the `manual` option in displaCy:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works\n in Charité, Germany.")
params = {"text": doc.text,
          "ents": [{"start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "kb_id": ent.kb_id_,
                    "kb_url": "https://www.wikidata.org/entity/" + ent.kb_id_}
                   for ent in doc.ents],
          "title": None}
spacy.displacy.serve(params, style="ent", manual=True)
```
The visualizer is served at http://0.0.0.0:5000
![NEL visualization in displaCy](https://github.com/UB-Mannheim/spacyopentapioca/blob/main/images/nel_vizualization.png)
In a Jupyter notebook, replace `spacy.displacy.serve` with `spacy.displacy.render`.
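The dictionary passed to displaCy above can be factored into a small helper, which makes it easy to reuse across documents. The following is a sketch: the function name `displacy_params` is hypothetical, and `SimpleNamespace` objects stand in for spaCy spans so the example runs offline. Any object exposing `start_char`, `end_char`, `label_` and `kb_id_` should work the same way.

```python
from types import SimpleNamespace

WIKIDATA_ENTITY_URL = "https://www.wikidata.org/entity/"


def displacy_params(text, ents, title=None):
    """Build the manual-mode input for displaCy's "ent" style."""
    return {
        "text": text,
        "ents": [
            {
                "start": ent.start_char,
                "end": ent.end_char,
                "label": ent.label_,
                "kb_id": ent.kb_id_,
                "kb_url": WIKIDATA_ENTITY_URL + ent.kb_id_,
            }
            for ent in ents
        ],
        "title": title,
    }


# Stand-ins for the spans of "Christian Drosten works in Germany."
text = "Christian Drosten works in Germany."
ents = [
    SimpleNamespace(start_char=0, end_char=17, label_="PERSON", kb_id_="Q1079331"),
    SimpleNamespace(start_char=27, end_char=34, label_="LOC", kb_id_="Q183"),
]
params = displacy_params(text, ents)
print(params["ents"][1]["kb_url"])  # https://www.wikidata.org/entity/Q183
```

With a processed document, the call would be `displacy_params(doc.text, doc.ents)`, and the result can be handed to `spacy.displacy.serve` or `spacy.displacy.render` with `style="ent", manual=True`.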