https://github.com/ljvmiranda921/spacy-span-analyzer
Simple tool to analyze spans in your dataset. Implementation of Papay et al.'s work (EMNLP 2020) on span performance prediction.
- Host: GitHub
- URL: https://github.com/ljvmiranda921/spacy-span-analyzer
- Owner: ljvmiranda921
- License: mit
- Archived: true
- Created: 2022-02-08T09:05:35.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2022-05-27T07:02:46.000Z (over 2 years ago)
- Last Synced: 2024-09-14T12:57:01.948Z (4 months ago)
- Topics: machine-learning, natural-language-processing, nlp, spacy
- Language: Python
- Homepage:
- Size: 6.31 MB
- Stars: 5
- Watchers: 2
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
> 💫 This library is now integrated into spaCy v3.4 as [`debug data`](https://spacy.io/api/cli#debug-data)!
# spacy-span-analyzer
A simple tool to analyze the [Spans](https://spacy.io/api/span) in your
dataset. It's tightly integrated with
[spaCy](https://github.com/explosion/spaCy), so you can easily incorporate it
into existing NLP pipelines. This is also a reproduction of Papay et al.'s work on [*Dissecting Span
Identification Tasks with Performance
Prediction*](https://aclanthology.org/2020.emnlp-main.396.pdf) (EMNLP 2020).

## ⏳ Install

Using
[pip](https://packaging.python.org/en/latest/tutorials/installing-packages/):

```sh
pip install spacy-span-analyzer
```

Directly from source (I highly recommend running this within a [virtual
environment](https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments)):

```sh
git clone git@github.com:ljvmiranda921/spacy-span-analyzer.git
cd spacy-span-analyzer
pip install .
```

## ⏯ Usage
You can use the Span Analyzer as a command-line tool:
```sh
spacy-span-analyzer ./path/to/dataset.spacy
```

Or as an imported library:

```python
import spacy
from spacy.tokens import DocBin
from spacy_span_analyzer import SpanAnalyzer

nlp = spacy.blank("en")  # or any Language model

# Ensure that your dataset is a DocBin
doc_bin = DocBin().from_disk("./path/to/data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

# Run SpanAnalyzer and get span characteristics
analyze = SpanAnalyzer(docs)
analyze.frequency
analyze.length
analyze.span_distinctiveness
analyze.boundary_distinctiveness
```

Inputs are expected to be a list of spaCy [Docs](https://spacy.io/api/doc) or a [DocBin](https://spacy.io/api/docbin) (if you're using
the command-line tool).

### Working with Spans
In spaCy, you'd want to store your Spans in the
[`doc.spans`](https://spacy.io/api/doc#spans) property, under a particular
`spans_key` (`sc` by default). Unlike the
[`doc.ents`](https://spacy.io/api/doc#ents) property, `doc.spans` allows
overlapping entities. This is useful especially for downstream tasks like [Span
Categorization](https://spacy.io/api/spancategorizer).

A common way to do this is to use
[`char_span`](https://spacy.io/api/doc#char_span) to define a slice from your
Doc:

```python
doc = nlp(text)
spans = []
for annotation in annotations:
    span = doc.char_span(
        annotation["start"],
        annotation["end"],
        annotation["label"],
    )
    # char_span returns None when the offsets don't align with token boundaries
    if span is not None:
        spans.append(span)

# Put all spans under a spans_key
doc.spans["sc"] = spans
```

You can also achieve the same thing by using
[`set_ents`](https://spacy.io/api/doc#set_ents) or by creating a
[SpanGroup](https://spacy.io/api/spangroup).
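
Since `doc.spans` accepts overlapping spans while `doc.ents` does not, it can help to check whether your annotations overlap before deciding where to store them. The sketch below is plain Python and not part of this library: `find_overlaps` is a hypothetical helper, and it assumes each annotation is a dict with `start`, `end`, and `label` keys holding character offsets, as in the snippet above.

```python
from itertools import combinations


def find_overlaps(annotations):
    """Return pairs of annotations whose character ranges overlap.

    Offsets are treated as half-open [start, end), the same way
    `doc.char_span` interprets its start and end arguments.
    """
    overlaps = []
    for a, b in combinations(annotations, 2):
        # Two half-open ranges overlap when each starts before the other ends
        if a["start"] < b["end"] and b["start"] < a["end"]:
            overlaps.append((a, b))
    return overlaps


# Made-up annotations for the text "New York City Council"
annotations = [
    {"start": 0, "end": 13, "label": "GPE"},  # "New York City"
    {"start": 0, "end": 21, "label": "ORG"},  # "New York City Council"
]

overlapping = find_overlaps(annotations)
print(f"{len(overlapping)} overlapping pair(s) found")
```

If the helper returns any pairs, the annotations likely belong in `doc.spans` rather than `doc.ents`.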