Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/kennethenevoldsen/spacy-wrap

spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing you to include existing fine-tuned models within your SpaCy workflow.

Host: GitHub
URL: https://github.com/kennethenevoldsen/spacy-wrap
Owner: KennethEnevoldsen
License: mit
Created: 2022-01-30T18:49:52.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-04-15T16:37:36.000Z (10 months ago)
Last Synced: 2024-09-29T13:01:17.538Z (4 months ago)
Topics: deep-learning, huggingface, huggingface-transformers, language-model, machine-learning, natural-language-processing, nlp, pytorch, spacy, spacy-extension, spacy-extensions, spacy-models, spacy-nlp, spacy-pipeline, spacy-transformers, text-classification, transformers
Language: Python
Homepage: https://KennethEnevoldsen.github.io/spacy-wrap/
Size: 2.2 MB
Stars: 46
Watchers: 3
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: readme.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: citation.cff

README

        

# spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

[![PyPI version](https://badge.fury.io/py/spacy-wrap.svg)](https://pypi.org/project/spacy-wrap/)

[![python version](https://img.shields.io/badge/Python-%3E=3.8-blue)](https://github.com/kennethenevoldsen/spacy-wrap)

[![Code style: black](https://img.shields.io/badge/Code%20Style-Black-black)](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html)

[![github actions pytest](https://github.com/kennethenevoldsen/spacy-wrap/actions/workflows/tests.yml/badge.svg)](https://github.com/kennethenevoldsen/spacy-wrap/actions)

[![github actions docs](https://github.com/kennethenevoldsen/spacy-wrap/actions/workflows/documentation.yml/badge.svg)](https://kennethenevoldsen.github.io/spacy-wrap/)

![github coverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/KennethEnevoldsen/33fb85a2c440013df494c1fce884633c/raw/3813a0369fdd61b39a806b7b91839ff405ef809a/badge-spacy-wrap-coverage.json)

spaCy-wrap is a minimal library for wrapping fine-tuned transformers from the [Huggingface model hub](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) in your spaCy pipeline, so that existing models can be used within [spaCy](https://spacy.io) workflows.

As far as possible, it follows an API similar to that of [spacy-transformers](https://github.com/explosion/spacy-transformers).

**NOTE**: Since the release of spaCy-wrap, Explosion has released [spacy-huggingface-pipelines](https://github.com/explosion/spacy-huggingface-pipelines), which wraps the Huggingface pipeline rather than the transformer itself. That means token aggregation and conversion into spans happen in the Huggingface pipeline, whereas in spaCy-wrap they happen at the logits of the model, which can sometimes lead to unfortunate differences in results. I generally recommend spacy-huggingface-pipelines for most use cases, but if you need to use the transformer output more directly, spaCy-wrap can have its uses.
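
To make the distinction concrete, here is a minimal sketch of what aggregating at the logits can look like: the logits of all wordpieces belonging to a token are averaged before a label is picked. The function name, the alignment format, and the choice of mean pooling are assumptions made for illustration; this is not spaCy-wrap's actual implementation.

```python
from typing import List, Sequence


def token_labels_from_wordpiece_logits(
    logits: Sequence[Sequence[float]],
    alignment: Sequence[Sequence[int]],
    labels: Sequence[str],
) -> List[str]:
    """Average wordpiece logits per token, then pick the highest-scoring label.

    logits: one row of per-label scores for each wordpiece.
    alignment: for each token, the indices of its wordpieces (hypothetical format).
    """
    predictions = []
    for piece_ids in alignment:
        rows = [logits[i] for i in piece_ids]
        # mean over the token's wordpieces, computed separately for each label
        mean = [sum(column) / len(rows) for column in zip(*rows)]
        best = max(range(len(labels)), key=lambda j: mean[j])
        predictions.append(labels[best])
    return predictions


# Toy input: the first token is split into two wordpieces, the second into one.
labels = ["O", "B-PER"]
logits = [[0.2, 1.5], [0.9, 0.4], [2.0, -1.0]]
alignment = [[0, 1], [2]]
print(token_labels_from_wordpiece_logits(logits, alignment, labels))
# ['B-PER', 'O']
```

In this toy input, the second wordpiece on its own would be labelled `O`. An approach that labels each wordpiece first and merges the labels afterwards has to choose a merge rule for the disagreeing pieces, which is one way the two approaches can end up with different results.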

## Installation

Installing spacy-wrap is simple using pip:

```
pip install spacy_wrap
```

## Examples

The following examples show how you can quickly add a fine-tuned transformer model from the Huggingface model hub for either [text classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads), [named entity recognition](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads) or [token classification](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads).

### Sequence Classification

In this example, we will use a [model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) fine-tuned for sentiment classification on SST2. This model classifies whether a text is positive or negative. We will add this model to a blank English pipeline:

```python
import spacy
import spacy_wrap

nlp = spacy.blank("en")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "sentiment",  # document extension for the prediction
    "model": {
        # the model name or path of the Huggingface model
        "name": "distilbert-base-uncased-finetuned-sst-2-english",
    },
}

transformer = nlp.add_pipe("sequence_classification_transformer", config=config)

doc = nlp("spaCy is a wonderful tool")

print(doc.cats)
# {'NEGATIVE': 0.001, 'POSITIVE': 0.999}
print(doc._.sentiment)
# 'POSITIVE'
print(doc._.clf_trf_data)
# TransformerData(wordpieces=...
```
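
Downstream code often only wants to act on confident predictions. The following is a small sketch of such a filter, assuming a label-to-probability dictionary shaped like the `doc.cats` output above; the helper name `confident_label` and the default threshold are hypothetical and not part of spaCy-wrap.

```python
from typing import Dict, Optional


def confident_label(cats: Dict[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the top-scoring label, or None if its score is below `threshold`."""
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# A dictionary shaped like the doc.cats output shown above
cats = {"NEGATIVE": 0.001, "POSITIVE": 0.999}
print(confident_label(cats))
# POSITIVE

# A borderline prediction is rejected
print(confident_label({"NEGATIVE": 0.45, "POSITIVE": 0.55}))
# None
```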

These pipelines can also easily be applied to multiple documents using `nlp.pipe`, as one would expect from a spaCy component:

```python
docs = nlp.pipe(
    [
        "I hate wrapping my own models",
        "Isn't there a tool for this?!",
        "spacy-wrap is great for wrapping models",
    ]
)

for doc in docs:
    print(doc._.sentiment)
# 'NEGATIVE'
# 'NEGATIVE'
# 'POSITIVE'
```

 
 

#### More Examples

It is always nice to have more than one example. Here is another one, where we add a hate speech detection model for Danish to a blank Danish pipeline:

```python
import spacy
import spacy_wrap

nlp = spacy.blank("da")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "hate_speech",  # document extension for the prediction
    # choose custom labels
    "labels": ["Not hate Speech", "Hate speech"],
    "model": {
        "name": "DaNLP/da-bert-hatespeech-detection",  # the model name or path of the Huggingface model
    },
}

# same component name as in the sentiment example above
transformer = nlp.add_pipe("sequence_classification_transformer", config=config)

doc = nlp("Senile gamle idiot")  # old senile idiot

doc._.clf_trf_data
# TransformerData(wordpieces=...
doc._.hate_speech
# "Hate speech"
doc._.hate_speech_prob
# {'prob': array([0.013, 0.987], dtype=float32), 'labels': ['Not hate Speech', 'Hate speech']}
```
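
The probability output above stores the labels and the probabilities as two parallel sequences. If a label-to-probability mapping is more convenient, the two can be zipped together. This is a minimal sketch that assumes the structure printed above; `probabilities_by_label` is a hypothetical helper and not part of spaCy-wrap.

```python
from typing import Dict, Mapping, Sequence


def probabilities_by_label(prediction: Mapping[str, Sequence]) -> Dict[str, float]:
    """Pair each label with its probability, converting scores to plain floats."""
    return {
        label: float(prob)
        for label, prob in zip(prediction["labels"], prediction["prob"])
    }


# Plain lists stand in for the numpy array shown in the output above
prediction = {"prob": [0.013, 0.987], "labels": ["Not hate Speech", "Hate speech"]}
print(probabilities_by_label(prediction))
# {'Not hate Speech': 0.013, 'Hate speech': 0.987}
```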


 

### Token Classification

spaCy-wrap can also wrap models fine-tuned for token classification, such as this part-of-speech tagger:

```python
import spacy
import spacy_wrap

nlp = spacy.blank("en")

config = {
    "model": {"name": "vblagoje/bert-english-uncased-finetuned-pos"},
    # "predictions_to": ["pos"]  # optional, can be "pos", "tag" or "ents"
}

nlp.add_pipe("token_classification_transformer", config=config)

text = "My name is Wolfgang and I live in Berlin"
doc = nlp(text)

print(doc._.tok_clf_predictions)
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
```

By default, spacy-wrap automatically detects whether the labels follow the universal POS tags. If so, it also assigns the predictions to `token.pos`, similar to regular spaCy pipelines:

```python
print(doc[0].pos_)
# 'PRON'
```
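
A check of this kind can be sketched as a simple set comparison against the 17 Universal Dependencies POS tags. This only illustrates the idea and is not spaCy-wrap's actual detection logic; `UPOS_TAGS` and `labels_are_upos` are hypothetical names.

```python
from typing import Iterable

# The 17 universal POS tags defined by Universal Dependencies
UPOS_TAGS = frozenset(
    {
        "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
    }
)


def labels_are_upos(labels: Iterable[str]) -> bool:
    """Return True if the label set is non-empty and contains only universal POS tags."""
    label_set = set(labels)
    return bool(label_set) and label_set <= UPOS_TAGS


# The predictions printed in the example above
print(labels_are_upos(["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "VERB", "ADP", "PROPN"]))
# True

# IOB-style entity labels are not universal POS tags
print(labels_are_upos(["O", "B-PER", "I-PER"]))
# False
```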

### Named Entity Recognition

In this example, we use a model fine-tuned for named entity recognition. spacy-wrap will in this case infer from the IOB tags that the model is intended for named entity recognition and assign it to `doc.ents`.

```python
import spacy
import spacy_wrap

nlp = spacy.blank("en")

# specify model from the hub
config = {
    "model": {"name": "dslim/bert-base-NER"},
    # forces named entity recognition; if left out, it is estimated from the labels
    "predictions_to": ["ents"],
}

# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)

doc = nlp("My name is Wolfgang and I live in Berlin.")

print(doc.ents)
# (Wolfgang, Berlin)
```
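
To illustrate how IOB tags map to entities, here is a minimal sketch that converts a sequence of per-token IOB tags into `(start, end, label)` spans with an exclusive end index. The function `iob_to_spans` and the example tags are assumptions for illustration; spaCy-wrap's own conversion may handle edge cases differently.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-token IOB tags to (start, end, label) spans; end is exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if start is not None and not continues:
            # the open span ends before this token
            spans.append((start, i, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # a stray I- tag without a matching B- tag opens a new span
            start, label = i, tag_label
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Illustrative tags for: My name is Wolfgang and I live in Berlin .
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tags))
# [(3, 4, 'PER'), (8, 9, 'LOC')]
```

The resulting token offsets are the kind of information needed to build spaCy `Span` objects for `doc.ents`.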

# 📖 Documentation

| Documentation              |                                             |
| -------------------------- | ------------------------------------------- |
| 🔧 **[Installation]**       | Installation instructions for spacy-wrap.   |
| 📰 **[News and changelog]** | New additions, changes and version history. |
| 🎛 **[Documentation]**      | The reference for spacy-wrap's API.         |

[Documentation]: https://kennethenevoldsen.github.io/spacy-wrap/index.html

[Installation]: https://kennethenevoldsen.github.io/spacy-wrap/installation.html

[News and changelog]: https://kennethenevoldsen.github.io/spacy-wrap/news.html

# 💬 Where to ask questions

| Type                           |                        |
| ------------------------------ | ---------------------- |
| 🚨 **FAQ**                      | [FAQ]                  |
| 🚨 **Bug Reports**              | [GitHub Issue Tracker] |
| 🎁 **Feature Requests & Ideas** | [GitHub Issue Tracker] |
| 👩‍💻 **Usage Questions**          | [GitHub Discussions]   |
| 🗯 **General Discussion**       | [GitHub Discussions]   |

[FAQ]: https://kennethenevoldsen.github.io/spacy-wrap/faq.html

[github issue tracker]: https://github.com/kennethenevoldsen/spacy-wrap/issues

[github discussions]: https://github.com/kennethenevoldsen/spacy-wrap/discussions