https://github.com/explosion/spacy-experimental

🧪 Cutting-edge experimental spaCy components and features
https://github.com/explosion/spacy-experimental
lemmatizer machine-learning natural-language-processing nlp spacy spacy-extension spacy-pipeline tokenizer
Last synced: about 1 month ago
JSON representation
🧪 Cutting-edge experimental spaCy components and features
Host: GitHub
URL: https://github.com/explosion/spacy-experimental
Owner: explosion
License: mit
Created: 2021-11-18T11:51:52.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2024-04-23T19:54:23.000Z (about 1 year ago)
Last Synced: 2025-03-30T21:09:50.378Z (about 1 month ago)
Topics: lemmatizer, machine-learning, natural-language-processing, nlp, spacy, spacy-extension, spacy-pipeline, tokenizer
Language: Python
Homepage: https://spacy.io
Size: 1.33 MB
Stars: 98
Watchers: 10
Forks: 19
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        

# spacy-experimental: Cutting-edge experimental spaCy components and features

This package includes experimental components and features for

[spaCy](https://spacy.io) v3.x, for example model architectures, pipeline

components and utilities.

[![tests](https://github.com/explosion/spacy-experimental/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-experimental/actions/workflows/tests.yml)

[![pypi Version](https://img.shields.io/pypi/v/spacy-experimental.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy-experimental/)

## Installation

Install with `pip`:

```bash

python -m pip install -U pip setuptools wheel

python -m pip install spacy-experimental

```

## Using spacy-experimental

Components and features may be modified or removed in any release, so always

specify the exact version as a package requirement if you're experimenting with

a particular component, e.g.:

```

spacy-experimental==0.147.0

```

Then you can add the experimental components to your config or import from

`spacy_experimental`:

```ini

[components.experimental_char_ner_tokenizer]

factory = "experimental_char_ner_tokenizer"

```

## Components

### Trainable character-based tokenizers

Two trainable tokenizers represent tokenization as a sequence tagging problem

over individual characters and use the existing spaCy tagger and NER

architectures to perform the tagging.

In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline

tokenizer to split each doc into individual characters and the trainable

tokenizer is a pipeline component that retokenizes the doc. The pretokenizer

needs to be configured manually in the config or with `spacy.blank()`:

```python

nlp = spacy.blank(

    "en",

    config={

        "nlp": {

            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}

        }

    },

)

```

The two tokenizers currently reset any existing tag or entity annotation

respectively in the process of retokenizing.

#### Character-based tagger tokenizer

In the tagger version `experimental_char_tagger_tokenizer`, the tagging problem

is represented internally with character-level tags for token start (`T`), token

internal (`I`), and outside a token (`O`). This representation comes from

[Elephant: Sequence Labeling for Word and Sentence Segmentation](https://aclanthology.org/D13-1146/)

(Evang et al., 2013).

```none

This is a sentence.

TIIIOTIOTOTIIIIIIIT

```

With the option `annotate_sents`, `S` replaces `T` for the first token in each

sentence and the component predicts both token and sentence boundaries.

```none

This is a sentence.

SIIIOTIOTOTIIIIIIIT

```

A config excerpt for `experimental_char_tagger_tokenizer`:

```ini

[nlp]

pipeline = ["experimental_char_tagger_tokenizer"]

tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_tagger_tokenizer]

factory = "experimental_char_tagger_tokenizer"

annotate_sents = true

scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}

[components.experimental_char_tagger_tokenizer.model]

@architectures = "spacy.Tagger.v1"

nO = null

[components.experimental_char_tagger_tokenizer.model.tok2vec]

@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]

@architectures = "spacy.MultiHashEmbed.v2"

width = 128

attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]

rows = [1000,500,50,50,50,50]

include_static_vectors = false

[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]

@architectures = "spacy.MaxoutWindowEncoder.v2"

width = 128

depth = 4

window_size = 4

maxout_pieces = 2

```

#### Character-based NER tokenizer

In the NER version, each character in a token is part of an entity:

```none

T	B-TOKEN

h	I-TOKEN

i	I-TOKEN

s	I-TOKEN

 	O

i	B-TOKEN

s	I-TOKEN

	O

a	B-TOKEN

 	O

s	B-TOKEN

e	I-TOKEN

n	I-TOKEN

t	I-TOKEN

e	I-TOKEN

n	I-TOKEN

c	I-TOKEN

e	I-TOKEN

.	B-TOKEN

```

A config excerpt for `experimental_char_ner_tokenizer`:

```ini

[nlp]

pipeline = ["experimental_char_ner_tokenizer"]

tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_ner_tokenizer]

factory = "experimental_char_ner_tokenizer"

scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}

[components.experimental_char_ner_tokenizer.model]

@architectures = "spacy.TransitionBasedParser.v2"

state_type = "ner"

extra_state_tokens = false

hidden_width = 64

maxout_pieces = 2

use_upper = true

nO = null

[components.experimental_char_ner_tokenizer.model.tok2vec]

@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_ner_tokenizer.model.tok2vec.embed]

@architectures = "spacy.MultiHashEmbed.v2"

width = 128

attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]

rows = [1000,500,50,50,50,50]

include_static_vectors = false

[components.experimental_char_ner_tokenizer.model.tok2vec.encode]

@architectures = "spacy.MaxoutWindowEncoder.v2"

width = 128

depth = 4

window_size = 4

maxout_pieces = 2

```

The NER version does not currently support sentence boundaries, but it would be

easy to extend using a `B-SENT` entity type.

### Biaffine parser

A biaffine dependency parser, similar to that proposed in [Deep Biaffine

Attention for Neural Dependency Parsing](Deep Biaffine Attention for Neural

Dependency Parsing) (Dozat & Manning, 2016). The parser consists of two parts:

an edge predicter and an edge labeler. For example:

```ini

[components.experimental_arc_predicter]

factory = "experimental_arc_predicter"

[components.experimental_arc_labeler]

factory = "experimental_arc_labeler"

```

The arc predicter requires that a previous component (such as `senter`) sets

sentence boundaries during training. Therefore, such a component must be added

to `annotating_components`:

```ini

[training]

annotating_components = ["senter"]

```

The [biaffine parser sample project](projects/biaffine_parser) provides an

example biaffine parser pipeline.

### Span Finder

The SpanFinder is a new experimental component that identifies span boundaries

by tagging potential start and end tokens. It's an ML approach to suggest

candidate spans with higher precision.

`SpanFinder` uses the following parameters:

- `threshold`: Probability threshold for predicted spans.

- `predicted_key`: Name of the [SpanGroup](https://spacy.io/api/spangroup) the

  predicted spans are saved to.

- `training_key`: Name of the [SpanGroup](https://spacy.io/api/spangroup) the

  training spans are read from.

- `max_length`: Max length of the predicted spans. No limit when set to `0`.

  Defaults to `0`.

- `min_length`: Min length of the predicted spans. No limit when set to `0`.

  Defaults to `0`.

Here is a config excerpt for the `SpanFinder` together with a `SpanCategorizer`:

```ini

[nlp]

lang = "en"

pipeline = ["tok2vec","span_finder","spancat"]

batch_size = 128

disabled = []

before_creation = null

after_creation = null

after_pipeline_creation = null

tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.tok2vec]

factory = "tok2vec"

[components.tok2vec.model]

@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]

@architectures = "spacy.MultiHashEmbed.v2"

width = ${components.tok2vec.model.encode.width}

attrs = ["ORTH", "SHAPE"]

rows = [5000, 2500]

include_static_vectors = false

[components.tok2vec.model.encode]

@architectures = "spacy.MaxoutWindowEncoder.v2"

width = 96

depth = 4

window_size = 1

maxout_pieces = 3

[components.span_finder]

factory = "experimental_span_finder"

threshold = 0.35

predicted_key = "span_candidates"

training_key = ${vars.spans_key}

min_length = 0

max_length = 0

[components.span_finder.scorer]

@scorers = "spacy-experimental.span_finder_scorer.v1"

predicted_key = ${components.span_finder.predicted_key}

training_key = ${vars.spans_key}

[components.span_finder.model]

@architectures = "spacy-experimental.SpanFinder.v1"

[components.span_finder.model.scorer]

@layers = "spacy.LinearLogistic.v1"

nO=2

[components.span_finder.model.tok2vec]

@architectures = "spacy.Tok2VecListener.v1"

width = ${components.tok2vec.model.encode.width}

[components.spancat]

factory = "spancat"

max_positive = null

spans_key = ${vars.spans_key}

threshold = 0.5

[components.spancat.model]

@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]

@layers = "spacy.mean_max_reducer.v1"

hidden_size = 128

[components.spancat.model.scorer]

@layers = "spacy.LinearLogistic.v1"

nO = null

nI = null

[components.spancat.model.tok2vec]

@architectures = "spacy.Tok2VecListener.v1"

width = ${components.tok2vec.model.encode.width}

[components.spancat.suggester]

@misc = "spacy-experimental.span_finder_suggester.v1"

predicted_key = ${components.span_finder.predicted_key}

```

This package includes a [spaCy project](./projects/span_finder) which shows how

to train and use the `SpanFinder` together with `SpanCategorizer`.

### Coreference Components

The [CoreferenceResolver](https://spacy.io/api/coref) and

[SpanResolver](https://spacy.io/api/span-resolver) are designed to be used

together to build a corerefence pipeline, which allows you to identify which

spans in a document refer to the same thing. Each component also includes an

architecture and scorer. For more details, see their pages in the main spaCy

docs.

For an example of how to build a pipeline with the components, see the

[example coref project](https://github.com/explosion/projects/tree/v3/experimental/coref).

## Architectures

None currently.

## Other

### Tokenizers

- `spacy-experimental.char_pretokenizer.v1`: Tokenize a text into individual

  characters.

### Scorers

- `spacy-experimental.tokenizer_scorer.v1`: Score tokenization.

- `spacy-experimental.tokenizer_senter_scorer.v1`: Score tokenization and

  sentence segmentation.

### Misc

Suggester functions for spancat:

**Subtree suggester**: Uses dependency annotation to suggest tokens with their

syntactic descendants.

- `spacy-experimental.subtree_suggester.v1`

- `spacy-experimental.ngram_subtree_suggester.v1`

**Chunk suggester**: Suggests noun chunks using the noun chunk iterator, which

requires POS and dependency annotation.

- `spacy-experimental.chunk_suggester.v1`

- `spacy-experimental.ngram_chunk_suggester.v1`

**Sentence suggester**: Uses sentence boundaries to suggest sentence spans.

- `spacy-experimental.sentence_suggester.v1`

- `spacy-experimental.ngram_sentence_suggester.v1`

The package also contains a

[`merge_suggesters`](spacy_experimental/span_suggesters/merge_suggesters.py)

function which can be used to combine suggestions from multiple suggesters.

Here are two config excerpts for using the `subtree suggester` with and without

the ngram functionality:

```

[components.spancat.suggester]

@misc = "spacy-experimental.subtree_suggester.v1"

```

```

[components.spancat.suggester]

@misc = "spacy-experimental.ngram_subtree_suggester.v1"

sizes = [1, 2, 3]

```

Note that all the suggester functions are registered in `@misc`.

## Bug reports and issues

Please report bugs in the

[spaCy issue tracker](https://github.com/explosion/spaCy/issues) or open a new

thread on the [discussion board](https://github.com/explosion/spaCy/discussions)

for other issues.

## Older documentation

See the READMEs in earlier

[tagged versions](https://github.com/explosion/spacy-experimental/tags) for

details about components in earlier releases.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/explosion/spacy-experimental

Awesome Lists containing this project

README