{"id":15014043,"url":"https://github.com/explosion/spacy-experimental","last_synced_at":"2025-04-07T05:12:23.917Z","repository":{"id":38097727,"uuid":"429410014","full_name":"explosion/spacy-experimental","owner":"explosion","description":"🧪 Cutting-edge experimental spaCy components and features","archived":false,"fork":false,"pushed_at":"2024-04-23T19:54:23.000Z","size":1396,"stargazers_count":98,"open_issues_count":0,"forks_count":19,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-30T21:09:50.378Z","etag":null,"topics":["lemmatizer","machine-learning","natural-language-processing","nlp","spacy","spacy-extension","spacy-pipeline","tokenizer"],"latest_commit_sha":null,"homepage":"https://spacy.io","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-18T11:51:52.000Z","updated_at":"2025-03-27T11:31:07.000Z","dependencies_parsed_at":"2024-01-18T10:08:25.463Z","dependency_job_id":"154314cd-2ad8-49d8-9ea2-b6ad45235843","html_url":"https://github.com/explosion/spacy-experimental","commit_stats":{"total_commits":183,"total_committers":10,"mean_commits":18.3,"dds":0.644808743169399,"last_synced_commit":"66c4be536d4e69b8f6366ceb2c0ebb39079d2c89"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-experimental","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-experimental/
tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-experimental/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-experimental/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacy-experimental/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247595335,"owners_count":20963943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lemmatizer","machine-learning","natural-language-processing","nlp","spacy","spacy-extension","spacy-pipeline","tokenizer"],"created_at":"2024-09-24T19:45:06.528Z","updated_at":"2025-04-07T05:12:23.892Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# spacy-experimental: Cutting-edge experimental spaCy components and features\n\nThis package includes experimental components and features for\n[spaCy](https://spacy.io) v3.x, for example model architectures, pipeline\ncomponents and utilities.\n\n[![tests](https://github.com/explosion/spacy-experimental/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-experimental/actions/workflows/tests.yml)\n[![pypi 
Version](https://img.shields.io/pypi/v/spacy-experimental.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/spacy-experimental/)\n\n## Installation\n\nInstall with `pip`:\n\n```bash\npython -m pip install -U pip setuptools wheel\npython -m pip install spacy-experimental\n```\n\n## Using spacy-experimental\n\nComponents and features may be modified or removed in any release, so always\nspecify the exact version as a package requirement if you're experimenting with\na particular component, e.g.:\n\n```\nspacy-experimental==0.147.0\n```\n\nThen you can add the experimental components to your config or import from\n`spacy_experimental`:\n\n```ini\n[components.experimental_char_ner_tokenizer]\nfactory = \"experimental_char_ner_tokenizer\"\n```\n\n## Components\n\n### Trainable character-based tokenizers\n\nTwo trainable tokenizers represent tokenization as a sequence tagging problem\nover individual characters and use the existing spaCy tagger and NER\narchitectures to perform the tagging.\n\nIn the spaCy pipeline, a simple \"pretokenizer\" is applied as the pipeline\ntokenizer to split each doc into individual characters and the trainable\ntokenizer is a pipeline component that retokenizes the doc. The pretokenizer\nneeds to be configured manually in the config or with `spacy.blank()`:\n\n```python\nimport spacy\n\nnlp = spacy.blank(\n    \"en\",\n    config={\n        \"nlp\": {\n            \"tokenizer\": {\"@tokenizers\": \"spacy-experimental.char_pretokenizer.v1\"}\n        }\n    },\n)\n```\n\nWhen retokenizing, the tagger and NER versions currently reset any existing\ntag or entity annotation, respectively.\n\n#### Character-based tagger tokenizer\n\nIn the tagger version `experimental_char_tagger_tokenizer`, the tagging problem\nis represented internally with character-level tags for token start (`T`), token\ninternal (`I`), and outside a token (`O`). 
This representation comes from\n[Elephant: Sequence Labeling for Word and Sentence Segmentation](https://aclanthology.org/D13-1146/)\n(Evang et al., 2013).\n\n```none\nThis is a sentence.\nTIIIOTIOTOTIIIIIIIT\n```\n\nWith the option `annotate_sents`, `S` replaces `T` for the first token in each\nsentence and the component predicts both token and sentence boundaries.\n\n```none\nThis is a sentence.\nSIIIOTIOTOTIIIIIIIT\n```\n\nA config excerpt for `experimental_char_tagger_tokenizer`:\n\n```ini\n[nlp]\npipeline = [\"experimental_char_tagger_tokenizer\"]\ntokenizer = {\"@tokenizers\":\"spacy-experimental.char_pretokenizer.v1\"}\n\n[components]\n\n[components.experimental_char_tagger_tokenizer]\nfactory = \"experimental_char_tagger_tokenizer\"\nannotate_sents = true\nscorer = {\"@scorers\":\"spacy-experimental.tokenizer_senter_scorer.v1\"}\n\n[components.experimental_char_tagger_tokenizer.model]\n@architectures = \"spacy.Tagger.v1\"\nnO = null\n\n[components.experimental_char_tagger_tokenizer.model.tok2vec]\n@architectures = \"spacy.Tok2Vec.v2\"\n\n[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]\n@architectures = \"spacy.MultiHashEmbed.v2\"\nwidth = 128\nattrs = [\"ORTH\",\"LOWER\",\"IS_DIGIT\",\"IS_ALPHA\",\"IS_SPACE\",\"IS_PUNCT\"]\nrows = [1000,500,50,50,50,50]\ninclude_static_vectors = false\n\n[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]\n@architectures = \"spacy.MaxoutWindowEncoder.v2\"\nwidth = 128\ndepth = 4\nwindow_size = 4\nmaxout_pieces = 2\n```\n\n#### Character-based NER tokenizer\n\nIn the NER version, each character in a token is part of an entity:\n\n```none\nT\tB-TOKEN\nh\tI-TOKEN\ni\tI-TOKEN\ns\tI-TOKEN\n \tO\ni\tB-TOKEN\ns\tI-TOKEN\n \tO\na\tB-TOKEN\n \tO\ns\tB-TOKEN\ne\tI-TOKEN\nn\tI-TOKEN\nt\tI-TOKEN\ne\tI-TOKEN\nn\tI-TOKEN\nc\tI-TOKEN\ne\tI-TOKEN\n.\tB-TOKEN\n```\n\nA config excerpt for `experimental_char_ner_tokenizer`:\n\n```ini\n[nlp]\npipeline = [\"experimental_char_ner_tokenizer\"]\ntokenizer = 
{\"@tokenizers\":\"spacy-experimental.char_pretokenizer.v1\"}\n\n[components]\n\n[components.experimental_char_ner_tokenizer]\nfactory = \"experimental_char_ner_tokenizer\"\nscorer = {\"@scorers\":\"spacy-experimental.tokenizer_scorer.v1\"}\n\n[components.experimental_char_ner_tokenizer.model]\n@architectures = \"spacy.TransitionBasedParser.v2\"\nstate_type = \"ner\"\nextra_state_tokens = false\nhidden_width = 64\nmaxout_pieces = 2\nuse_upper = true\nnO = null\n\n[components.experimental_char_ner_tokenizer.model.tok2vec]\n@architectures = \"spacy.Tok2Vec.v2\"\n\n[components.experimental_char_ner_tokenizer.model.tok2vec.embed]\n@architectures = \"spacy.MultiHashEmbed.v2\"\nwidth = 128\nattrs = [\"ORTH\",\"LOWER\",\"IS_DIGIT\",\"IS_ALPHA\",\"IS_SPACE\",\"IS_PUNCT\"]\nrows = [1000,500,50,50,50,50]\ninclude_static_vectors = false\n\n[components.experimental_char_ner_tokenizer.model.tok2vec.encode]\n@architectures = \"spacy.MaxoutWindowEncoder.v2\"\nwidth = 128\ndepth = 4\nwindow_size = 4\nmaxout_pieces = 2\n```\n\nThe NER version does not currently support sentence boundaries, but it would be\neasy to extend using a `B-SENT` entity type.\n\n### Biaffine parser\n\nA biaffine dependency parser, similar to that proposed in [Deep Biaffine\nAttention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)\n(Dozat \u0026 Manning, 2016). The parser consists of two parts:\nan edge predicter and an edge labeler. For example:\n\n```ini\n[components.experimental_arc_predicter]\nfactory = \"experimental_arc_predicter\"\n\n[components.experimental_arc_labeler]\nfactory = \"experimental_arc_labeler\"\n```\n\nThe arc predicter requires that a previous component (such as `senter`) sets\nsentence boundaries during training. 
Therefore, such a component must be added\nto `annotating_components`:\n\n```ini\n[training]\nannotating_components = [\"senter\"]\n```\n\nThe [biaffine parser sample project](projects/biaffine_parser) provides an\nexample biaffine parser pipeline.\n\n### Span Finder\n\nThe SpanFinder is a new experimental component that identifies span boundaries\nby tagging potential start and end tokens. It is a machine-learning approach\nfor suggesting candidate spans with higher precision.\n\n`SpanFinder` uses the following parameters:\n\n- `threshold`: Probability threshold for predicted spans.\n- `predicted_key`: Name of the [SpanGroup](https://spacy.io/api/spangroup) the\n  predicted spans are saved to.\n- `training_key`: Name of the [SpanGroup](https://spacy.io/api/spangroup) the\n  training spans are read from.\n- `max_length`: Max length of the predicted spans. No limit when set to `0`.\n  Defaults to `0`.\n- `min_length`: Min length of the predicted spans. No limit when set to `0`.\n  Defaults to `0`.\n\nHere is a config excerpt for the `SpanFinder` together with a `SpanCategorizer`:\n\n```ini\n[nlp]\nlang = \"en\"\npipeline = [\"tok2vec\",\"span_finder\",\"spancat\"]\nbatch_size = 128\ndisabled = []\nbefore_creation = null\nafter_creation = null\nafter_pipeline_creation = null\ntokenizer = {\"@tokenizers\":\"spacy.Tokenizer.v1\"}\n\n[components]\n\n[components.tok2vec]\nfactory = \"tok2vec\"\n\n[components.tok2vec.model]\n@architectures = \"spacy.Tok2Vec.v1\"\n\n[components.tok2vec.model.embed]\n@architectures = \"spacy.MultiHashEmbed.v2\"\nwidth = ${components.tok2vec.model.encode.width}\nattrs = [\"ORTH\", \"SHAPE\"]\nrows = [5000, 2500]\ninclude_static_vectors = false\n\n[components.tok2vec.model.encode]\n@architectures = \"spacy.MaxoutWindowEncoder.v2\"\nwidth = 96\ndepth = 4\nwindow_size = 1\nmaxout_pieces = 3\n\n[components.span_finder]\nfactory = \"experimental_span_finder\"\nthreshold = 0.35\npredicted_key = \"span_candidates\"\ntraining_key = ${vars.spans_key}\nmin_length = 
0\nmax_length = 0\n\n[components.span_finder.scorer]\n@scorers = \"spacy-experimental.span_finder_scorer.v1\"\npredicted_key = ${components.span_finder.predicted_key}\ntraining_key = ${vars.spans_key}\n\n[components.span_finder.model]\n@architectures = \"spacy-experimental.SpanFinder.v1\"\n\n[components.span_finder.model.scorer]\n@layers = \"spacy.LinearLogistic.v1\"\nnO = 2\n\n[components.span_finder.model.tok2vec]\n@architectures = \"spacy.Tok2VecListener.v1\"\nwidth = ${components.tok2vec.model.encode.width}\n\n[components.spancat]\nfactory = \"spancat\"\nmax_positive = null\nspans_key = ${vars.spans_key}\nthreshold = 0.5\n\n[components.spancat.model]\n@architectures = \"spacy.SpanCategorizer.v1\"\n\n[components.spancat.model.reducer]\n@layers = \"spacy.mean_max_reducer.v1\"\nhidden_size = 128\n\n[components.spancat.model.scorer]\n@layers = \"spacy.LinearLogistic.v1\"\nnO = null\nnI = null\n\n[components.spancat.model.tok2vec]\n@architectures = \"spacy.Tok2VecListener.v1\"\nwidth = ${components.tok2vec.model.encode.width}\n\n[components.spancat.suggester]\n@misc = \"spacy-experimental.span_finder_suggester.v1\"\npredicted_key = ${components.span_finder.predicted_key}\n```\n\nThis package includes a [spaCy project](./projects/span_finder) which shows how\nto train and use the `SpanFinder` together with `SpanCategorizer`.\n\n### Coreference Components\n\nThe [CoreferenceResolver](https://spacy.io/api/coref) and\n[SpanResolver](https://spacy.io/api/span-resolver) are designed to be used\ntogether to build a coreference pipeline, which allows you to identify which\nspans in a document refer to the same thing. Each component also includes an\narchitecture and scorer. 
For more details, see their pages in the main spaCy\ndocs.\n\nFor an example of how to build a pipeline with the components, see the\n[example coref project](https://github.com/explosion/projects/tree/v3/experimental/coref).\n\n## Architectures\n\nNone currently.\n\n## Other\n\n### Tokenizers\n\n- `spacy-experimental.char_pretokenizer.v1`: Tokenize a text into individual\n  characters.\n\n### Scorers\n\n- `spacy-experimental.tokenizer_scorer.v1`: Score tokenization.\n- `spacy-experimental.tokenizer_senter_scorer.v1`: Score tokenization and\n  sentence segmentation.\n\n### Misc\n\nSuggester functions for spancat:\n\n**Subtree suggester**: Uses dependency annotation to suggest tokens with their\nsyntactic descendants.\n\n- `spacy-experimental.subtree_suggester.v1`\n- `spacy-experimental.ngram_subtree_suggester.v1`\n\n**Chunk suggester**: Suggests noun chunks using the noun chunk iterator, which\nrequires POS and dependency annotation.\n\n- `spacy-experimental.chunk_suggester.v1`\n- `spacy-experimental.ngram_chunk_suggester.v1`\n\n**Sentence suggester**: Uses sentence boundaries to suggest sentence spans.\n\n- `spacy-experimental.sentence_suggester.v1`\n- `spacy-experimental.ngram_sentence_suggester.v1`\n\nThe package also contains a\n[`merge_suggesters`](spacy_experimental/span_suggesters/merge_suggesters.py)\nfunction which can be used to combine suggestions from multiple suggesters.\n\nHere are two config excerpts for using the subtree suggester with and without\nthe ngram functionality:\n\n```ini\n[components.spancat.suggester]\n@misc = \"spacy-experimental.subtree_suggester.v1\"\n```\n\n```ini\n[components.spancat.suggester]\n@misc = \"spacy-experimental.ngram_subtree_suggester.v1\"\nsizes = [1, 2, 3]\n```\n\nNote that all the suggester functions are registered in `@misc`.\n\n## Bug reports and issues\n\nPlease report bugs in the\n[spaCy issue tracker](https://github.com/explosion/spaCy/issues) or open a new\nthread on the [discussion 
board](https://github.com/explosion/spaCy/discussions)\nfor other issues.\n\n## Older documentation\n\nSee the READMEs in earlier\n[tagged versions](https://github.com/explosion/spacy-experimental/tags) for\ndetails about components in earlier releases.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-experimental","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspacy-experimental","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-experimental/lists"}