Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/erre-quadro/spikex
SpikeX - SpaCy Pipes for Knowledge Extraction
https://github.com/erre-quadro/spikex
abbreviations-detection acronym-recognition clustering entity-linking named-entity-recognition nlp noun-phrase-extract sentence-splitting spacy spacy-pipes verb-phrase-extract wikigraph wikipedia wikipedia-graph
Last synced: 5 days ago
JSON representation
SpikeX - SpaCy Pipes for Knowledge Extraction
- Host: GitHub
- URL: https://github.com/erre-quadro/spikex
- Owner: erre-quadro
- License: apache-2.0
- Created: 2020-07-09T09:58:16.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-30T07:49:16.000Z (over 3 years ago)
- Last Synced: 2025-01-11T17:11:38.024Z (12 days ago)
- Topics: abbreviations-detection, acronym-recognition, clustering, entity-linking, named-entity-recognition, nlp, noun-phrase-extract, sentence-splitting, spacy, spacy-pipes, verb-phrase-extract, wikigraph, wikipedia, wikipedia-graph
- Language: Python
- Homepage:
- Size: 3.43 MB
- Stars: 398
- Watchers: 9
- Forks: 28
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# SpikeX - SpaCy Pipes for Knowledge Extraction
SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline.
It aims to help in building knowledge extraction tools with almost-zero effort.[![Build Status](https://img.shields.io/azure-devops/build/erre-quadro/spikex/3/master?label=build&logo=azure-pipelines&style=flat-square)](https://dev.azure.com/erre-quadro/spikex/_build/latest?definitionId=3&branchName=master)
[![pypi Version](https://img.shields.io/pypi/v/spikex.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spikex/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)## What's new in SpikeX 0.5.0
**WikiGraph** has never been so lightning fast:
- ๐ **Performance mooning**, thanks to the adoption of a *sparse adjacency matrix* to handle pages graph, instead of using *igraph*
- ๐ **Memory optimization**, with a consumption cut by ~40% and a compressed size cut by ~20%, introducing new *bidirectional dictionaries* to manage data
- ๐ **New APIs** for a faster and easier usage and interaction
- ๐ **Overall fixes**, for a better graph and a better pages matching
## Pipes- **WikiPageX** links Wikipedia pages to chunks in text
- **ClusterX** picks noun chunks in a text and clusters them based on a revisiting of the [Ball Mapper](https://arxiv.org/abs/1901.07410) algorithm, Radial Ball Mapper
- **AbbrX** detects abbreviations and acronyms, linking them to their long form. It is based on [scispacy](https://github.com/allenai/scispacy/blob/master/scispacy/abbreviation.py)'s one with improvements
- **LabelX** takes labelings of pattern matching expressions and catches them in a text, solving overlappings, abbreviations and acronyms
- **PhraseX** creates a `Doc`'s underscore extension based on a custom attribute name and phrase patterns. Examples are **NounPhraseX** and **VerbPhraseX**, which extract noun phrases and verb phrases, respectively
- **SentX** detects sentences in a text, based on [Splitta](https://github.com/dgillick/splitta) with refinements## Tools
- **WikiGraph** with pages as leaves linked to categories as nodes
- **Matcher** that inherits its interface from the [spaCy](https://github.com/explosion/spaCy/blob/master/spacy/matcher/matcher.pyx)'s one, but built using an engine made of RegEx which boosts its performance## Install SpikeX
Some requirements are inherited from spaCy:
- **spaCy version**: 2.3+
- **Operating system**: macOS / OS X ยท Linux ยท Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python 3.6+ (only 64 bit)
- **Package managers**: [pip](https://pypi.org/project/spikex/)Some dependencies use **Cython** and it needs to be installed before SpikeX:
```bash
pip install cython
```Remember that a virtual environment is always recommended, in order to avoid modifying system state.
### pipAt this point, installing SpikeX via pip is a one line command:
```bash
pip install spikex
```## Usage
### Prerequirements
SpikeX pipes work with spaCy, hence a model its needed to be installed. Follow official instructions [here](https://spacy.io/usage/models#download). The brand new spaCy 3.0 is supported!
### WikiGraph
A `WikiGraph` is built starting from some key components of Wikipedia: *pages*, *categories* and *relations* between them.
#### Auto
Creating a `WikiGraph` can take time, depending on how large is its Wikipedia dump. For this reason, we provide wikigraphs ready to be used:
| Date | WikiGraph | Lang | Size (compressed) | Size (memory) | |
| --- | --- | --- | --- | --- | --- |
| 2021-05-20 | enwiki_core | EN | 1.3GB | 8GB | [![][dl]][enwiki_core_20210520] |
| 2021-05-20 | simplewiki_core | EN | 20MB | 130MB | [![][dl]][simplewiki_core_20210520] |
| 2021-05-20 | itwiki_core | IT | 208MB | 1.2GB | [![][dl]][itwiki_core_20210520] |
| More coming... |[enwiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EeIb238HAmtCruMvhzZdOl8BIEBU_09XV5FnHE4SVmYzBQ?Download=1
[simplewiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EWdpEV_R4JVEk_ZwvJTrAEUBsLpmJMxyWDa13sFOzQAo3Q?Download=1
[itwiki_core_20210520]: https://errequadrosrl-my.sharepoint.com/:u:/g/personal/paolo_arduin_errequadrosrl_onmicrosoft_com/EcWYGXp5SUdGvFTHN9KQ_zkBW8Zu9p0hiwpC3oKyhibXtQ?Download=1[dl]: http://i.imgur.com/gQvPgr0.png
SpikeX provides a command to shortcut downloading and installing a `WikiGraph` (Linux or macOS, Windows not supported yet):
```bash
spikex download-wikigraph simplewiki_core
```#### Manual
A `WikiGraph` can be created from command line, specifying which Wikipedia dump to take and where to save it:
```bash
spikex create-wikigraph \
\
--wiki \
--version \
--dumps-path \
```Then it needs to be packed and installed:
```bash
spikex package-wikigraph \
\
```Follow the instructions at the end of the packing process and install the distribution package in your virtual environment.
Now your are ready to use your WikiGraph as you wish:```python
from spikex.wikigraph import load as wg_loadwg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
print(category)>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics```
### MatcherThe **Matcher** is identical to the spaCy's one, but faster when it comes to handle many patterns at once (order of thousands), so follow official usage instructions [here](https://spacy.io/usage/rule-based-matching#matcher).
A trivial example:
```python
from spikex.matcher import Matcher
from spacy import load as spacy_loadnlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
print(doc[s: e])>>> NLP
```### WikiPageX
The `WikiPageX` pipe uses a `WikiGraph` in order to find chunks in a text that match Wikipedia page titles.
``` python
from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageXnlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
print(span._.wiki_pages)>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)')]
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']
```### ClusterX
The `ClusterX` pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.
``` python
from spacy import load as spacy_load
from spikex.pipes import ClusterXnlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
print(cluster)>>> [this juicy orange]
>>> [a cat, a dog]
```### AbbrX
The **AbbrX** pipe finds abbreviations and acronyms in the text, linking short and long forms together:
```python
from spacy import load as spacy_load
from spikex.pipes import AbbrXnlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
print(abbr, "->", abbr._.long_form)>>> abbr -> abbreviation
```### LabelX
The `LabelX` pipe matches and labels patterns in text, solving overlappings, abbreviations and acronyms.
```python
from spacy import load as spacy_load
from spikex.pipes import LabelXnlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
[{"LOWER": "computer"}, {"LOWER": "system"}],
[{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
print(labeling, f"[{labeling.label_}]")>>> computer system engineer [TEST]
```### PhraseX
The `PhraseX` pipe creates a custom `Doc`'s underscore extension which fulfills with matches from phrase patterns.
```python
from spacy import load as spacy_load
from spikex.pipes import PhraseXnlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
[{"LOWER": "mcintosh"}],
[{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
print(apple)>>> Melrose
>>> McIntosh
```
### SentXThe **SentX** pipe splits sentences in a text. It modifies tokens' *is_sent_start* attribute, so it's mandatory to add it before *parser* pipe in the spaCy pipeline:
```python
from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_versionif spacy_version >= 3:
from spacy.language import Language@Language.factory("sentx")
def create_sentx(nlp, name):
return SentX()nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
print(sent)>>> A little sentence.
>>> Followed by another one.
```## That's all folks
Feel free to contribute and have fun!