# spacybert: Bert inference for spaCy
[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component for loading BERT sentence / document embeddings onto `Doc`, `Span` and `Token` objects. The BERT backend itself is powered by the [Hugging Face transformers](https://github.com/huggingface/transformers) library.

## Installation
`spacybert` requires `spacy` v2.0.0 or higher.
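Assuming the package is published on PyPI under the same name, it can be installed with `pip install spacybert`.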

## Usage
### Getting BERT embeddings for a single-language dataset
```
import spacy
from spacybert import BertInference
nlp = spacy.load('en')
```

Then either use `BertInference` as part of a pipeline:
```
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=False)
nlp.add_pipe(bert, last=True)
```
or use it standalone, outside of a pipeline:
```
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=True)
```
The difference is that with `set_extension=True`, `bert_repr` is registered as a property extension on the `Doc`, `Span` and `Token` spaCy objects, and the embedding is computed whenever `doc._.bert_repr` is accessed. With `set_extension=False`, `bert_repr` is registered as an attribute extension with a default value of `None`, which gets filled in with the correct value when the document runs through the pipeline.
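Roughly, these two modes correspond to the two ways of registering custom extensions in spaCy v2. The sketch below illustrates the idea (it is not the library's internal code, and `compute_bert_repr` is a hypothetical helper):
```
from spacy.tokens import Doc

# Hypothetical helper that would run BERT over the document's text
def compute_bert_repr(doc):
    ...

# set_extension=True: a property extension; the embedding is computed
# each time doc._.bert_repr is accessed
Doc.set_extension('bert_repr', getter=compute_bert_repr, force=True)

# set_extension=False: an attribute extension with a None default; the
# pipeline component fills it in when it processes the document
Doc.set_extension('bert_repr', default=None, force=True)
```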

Either way, get the BERT representation / embedding:
```
doc = nlp("This is a test")
print(doc._.bert_repr) # <-- torch.Tensor
```

### Getting BERT embeddings for a multi-language dataset
```
import spacy
from spacy_langdetect import LanguageDetector
from spacybert import MultiLangBertInference

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
bert = MultiLangBertInference(
    from_pretrained={
        'en': 'path/to/en_pretrained_bert_weights_dir',
        'nl': 'path/to/nl_pretrained_bert_weights_dir'
    },
    set_extension=False)
nlp.add_pipe(bert, after='language_detector')

texts = [
    "This is a test",  # English
    "Dit is een test"  # Dutch
]
for doc in nlp.pipe(texts):
    print(doc._.bert_repr)  # <-- torch.Tensor
```
When the language detector detects a language other than those for which pre-trained weights are specified, `doc._.bert_repr` defaults to `None`.
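Downstream code should therefore be prepared for missing embeddings. A minimal sketch, reusing the pipeline above:
```
for doc in nlp.pipe(["Das ist ein Test"]):  # German: no weights configured above
    if doc._.bert_repr is None:
        print(f"No embedding for {doc.text!r} (unsupported language)")
    else:
        print(doc._.bert_repr.shape)
```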

## Available attributes
The extension sets attributes on the `Doc`, `Span` and `Token` objects. You can change the attribute name when initializing the extension.
| attribute | type | description |
|-|-|-|
| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
| `Span._.bert_repr` | `torch.Tensor` | Span BERT embedding |
| `Token._.bert_repr` | `torch.Tensor` | Token BERT embedding |
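For example (a sketch; exact tensor shapes depend on the model and pooling strategy):
```
doc = nlp("This is a test")
print(doc._.bert_repr)       # embedding for the whole document
print(doc[1:3]._.bert_repr)  # embedding for the span "is a"
print(doc[0]._.bert_repr)    # embedding for the token "This"
```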

## Settings
On initialization of `BertInference`, you can define the following:

| name | type | default | description |
|-|-|-|-|
| `from_pretrained` | `str` | `None` | Path to a BERT model directory or the name of pre-trained BERT weights from HuggingFace transformers, e.g., `bert-base-uncased` |
| `attr_name` | `str` | `'bert_repr'` | Name of the BERT embedding attribute to set on the `._` property |
| `max_seq_len` | `int` | `512` | Max sequence length for input to BERT |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Strategy for generating a single sentence embedding from multiple word embeddings. See below for the available pooling strategies. |
| `set_extension` | `bool` | `True` | If `True`, `'bert_repr'` is set as a property extension on the `Doc`, `Span` and `Token` spaCy objects. If `False`, `'bert_repr'` is set as an attribute extension with a default value (`None`) which gets filled in correctly when run in a pipeline. Set it to `False` if you want to use this extension in a spaCy pipeline. |
| `force_extension` | `bool` | `True` | If `True`, recreate the extension attribute even if an extension with the same name is already registered (avoids conflicts when the component is initialized more than once) |
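
For illustration, an initialization that spells out every setting (the values are the defaults from the table, except `set_extension=False` for pipeline use; the weights name is just an example):
```
bert = BertInference(
    from_pretrained='bert-base-uncased',  # or a path to a local weights directory
    attr_name='bert_repr',
    max_seq_len=512,
    pooling_strategy='REDUCE_MEAN',
    set_extension=False,
    force_extension=True)
nlp.add_pipe(bert, last=True)
```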

On initialization of `MultiLangBertInference`, you can define the following:

| name | type | default | description |
|-|-|-|-|
| `from_pretrained` | `Dict[LANG_ISO_639_1, str]` | `None` | Mapping from two-letter (ISO 639-1) language codes to paths to model directories or HuggingFace transformers pre-trained BERT weights |
| `attr_name` | `str` | `'bert_repr'` | Same as in `BertInference` |
| `max_seq_len` | `int` | `512` | Same as in `BertInference` |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Same as in `BertInference` |
| `set_extension` | `bool` | `True` | Same as in `BertInference` |
| `force_extension` | `bool` | `True` | Same as in `BertInference` |

## Pooling strategies
| strategy | description |
|-|-|
| `REDUCE_MEAN` | Element-wise average the word embeddings |
| `REDUCE_MAX` | Element-wise maximum of the word embeddings |
| `REDUCE_MEAN_MAX` | Apply both `'REDUCE_MEAN'` and `'REDUCE_MAX'` and concatenate. So if the original word embedding is of dimensions `(768,)`, then the output will have shape `(1536,)` |
| `CLS_TOKEN`, `FIRST_TOKEN` | Take the embedding of only the first `[CLS]` token |
| `SEP_TOKEN`, `LAST_TOKEN` | Take the embedding of only the last `[SEP]` token |
| `None` | No reduction is applied and a matrix of embeddings per word in the sentence is returned |
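
For intuition, here is roughly what these reductions look like on a `(seq_len, hidden)` matrix of token embeddings (a sketch using plain `torch` operations, not the library's internal code):
```
import torch

token_embs = torch.randn(6, 768)  # (seq_len, hidden): one row per wordpiece token

reduce_mean = token_embs.mean(dim=0)                    # REDUCE_MEAN -> (768,)
reduce_max = token_embs.max(dim=0).values               # REDUCE_MAX -> (768,)
reduce_mean_max = torch.cat([reduce_mean, reduce_max])  # REDUCE_MEAN_MAX -> (1536,)
cls_token = token_embs[0]                               # CLS_TOKEN / FIRST_TOKEN
sep_token = token_embs[-1]                              # SEP_TOKEN / LAST_TOKEN
no_reduction = token_embs                               # None -> full (6, 768) matrix
```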

## Roadmap
This extension is still experimental. Possible future updates include:
* Getting document representations from state-of-the-art NLP models other than Google's BERT.
* A method for computing similarity between `Doc`, `Span` and `Token` objects using the `bert_repr` tensor.
* Getting representations from multiple / other layers of the models.