{"id":15014050,"url":"https://github.com/explosion/spacy-huggingface-pipelines","last_synced_at":"2025-10-28T14:36:31.490Z","repository":{"id":154169322,"uuid":"617473621","full_name":"explosion/spacy-huggingface-pipelines","owner":"explosion","description":"💥 Use Hugging Face text and token classification pipelines directly in spaCy","archived":false,"fork":false,"pushed_at":"2024-03-18T16:32:06.000Z","size":48,"stargazers_count":63,"open_issues_count":1,"forks_count":5,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-01-29T18:38:17.358Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-22T13:21:21.000Z","updated_at":"2024-12-02T21:28:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"2fb9d6bb-a12a-4819-828c-8467938905f0","html_url":"https://github.com/explosion/spacy-huggingface-pipelines","commit_stats":{"total_commits":25,"total_committers":2,"mean_commits":12.5,"dds":"0.040000000000000036","last_synced_commit":"c7bb7a74d505d14374bd5bd97301e5836ec5e258"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-huggingface-pipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-huggingface-pipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy
-huggingface-pipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-huggingface-pipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacy-huggingface-pipelines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237152768,"owners_count":19263780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:45:07.279Z","updated_at":"2025-10-28T14:36:26.451Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification\n\nThis package provides [spaCy](https://github.com/explosion/spaCy) components to\nuse pretrained\n[Hugging Face Transformers pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)\nfor inference only.\n\n[![PyPi](https://img.shields.io/pypi/v/spacy-huggingface-pipelines.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.python.org/pypi/spacy-huggingface-pipelines)\n[![GitHub](https://img.shields.io/github/release/explosion/spacy-huggingface-pipelines/all.svg?style=flat-square\u0026logo=github)](https://github.com/explosion/spacy-huggingface-pipelines/releases)\n\n## Features\n\n- Apply pretrained 
transformers models like\n  [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) and\n  [`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).\n\n## 🚀 Installation\n\nInstalling the package from pip will automatically install all dependencies,\nincluding PyTorch and spaCy.\n\n```bash\npip install -U pip setuptools wheel\npip install spacy-huggingface-pipelines\n```\n\nFor GPU installation, follow the\n[spaCy installation quickstart with GPU](https://spacy.io/usage/), e.g.\n\n```bash\npip install -U spacy[cuda12x]\n```\n\nIf you are having trouble installing PyTorch, follow the\n[instructions](https://pytorch.org/get-started/locally/) on the official website\nfor your specific operating system and requirements.\n\n## 📖 Documentation\n\nThis module provides spaCy wrappers for the inference-only transformers\n[`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TokenClassificationPipeline)\nand\n[`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextClassificationPipeline)\npipelines.\n\nThe models are downloaded on initialization from the\n[Hugging Face Hub](https://huggingface.co/models) if they're not already in your\nlocal cache, or alternatively they can be loaded from a local path.\n\nNote that the transformer model data **is not saved with the pipeline** when you\ncall `nlp.to_disk`, so if you are loading pipelines in an environment with\nlimited internet access, make sure the model is available in your\n[transformers cache directory](https://huggingface.co/docs/transformers/main/en/installation#cache-setup)\nand enable offline mode if needed.\n\n### Token classification\n\nConfig settings for `hf_token_pipe`:\n\n```ini\n[components.hf_token_pipe]\nfactory = \"hf_token_pipe\"\nmodel = \"dslim/bert-base-NER\"     # Model name or path\nrevision = 
\"main\"                 # Model revision\naggregation_strategy = \"average\"  # \"simple\", \"first\", \"average\", \"max\"\nstride = 16                       # If stride \u003e= 0, process long texts in\n                                  # overlapping windows of the model max\n                                  # length. The value is the length of the\n                                  # window overlap in transformer tokenizer\n                                  # tokens, NOT the length of the stride.\nkwargs = {}                       # Any additional arguments for\n                                  # TokenClassificationPipeline\nalignment_mode = \"strict\"         # \"strict\", \"contract\", \"expand\"\nannotate = \"ents\"                 # \"ents\", \"pos\", \"spans\", \"tag\"\nannotate_spans_key = null         # Doc.spans key for annotate = \"spans\"\nscorer = null                     # Optional scorer\n```\n\n#### `TokenClassificationPipeline` settings\n\n- `model`: The model name or path.\n- `revision`: The model revision. For production use, a specific git commit is\n  recommended instead of the default `main`.\n- `stride`: For `stride \u003e= 0`, the text is processed in overlapping windows\n  where the `stride` setting specifies the number of overlapping tokens between\n  windows (NOT the stride length). If `stride` is `None`, then the text may be\n  truncated. `stride` is only supported for fast tokenizers.\n- `aggregation_strategy`: The aggregation strategy determines the word-level\n  tags for cases where subwords within one word do not receive the same\n  predicted tag. 
See:\n  https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy\n- `kwargs`: Any additional arguments to\n  [`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline).\n\n#### spaCy settings\n\n- `alignment_mode` determines how transformer predictions are aligned to spaCy\n  token boundaries as described for\n  [`Doc.char_span`](https://spacy.io/api/doc#char_span).\n- `annotate` and `annotate_spans_key` configure how the annotation is saved to\n  the spaCy doc. You can save the output as `token.tag_`, `token.pos_` (only for\n  UPOS tags), `doc.ents` or `doc.spans`.\n\n#### Examples\n\n1. Save named entity annotation as `Doc.ents`:\n\n```python\nimport spacy\nnlp = spacy.blank(\"en\")\nnlp.add_pipe(\"hf_token_pipe\", config={\"model\": \"dslim/bert-base-NER\"})\ndoc = nlp(\"My name is Sarah and I live in London\")\nprint(doc.ents)\n# (Sarah, London)\n```\n\n2. Save named entity annotation as `Doc.spans[spans_key]` and scores as\n   `Doc.spans[spans_key].attrs[\"scores\"]`:\n\n```python\nimport spacy\nnlp = spacy.blank(\"en\")\nnlp.add_pipe(\n    \"hf_token_pipe\",\n    config={\n        \"model\": \"dslim/bert-base-NER\",\n        \"annotate\": \"spans\",\n        \"annotate_spans_key\": \"bert-base-ner\",\n    },\n)\ndoc = nlp(\"My name is Sarah and I live in London\")\nprint(doc.spans[\"bert-base-ner\"])\n# [Sarah, London]\nprint(doc.spans[\"bert-base-ner\"].attrs[\"scores\"])\n# [0.99854773, 0.9996215]\n```\n\n3. 
3. Save fine-grained tags as `Token.tag`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
```

4. Save coarse-grained tags as `Token.pos`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
```

### Text classification

Config settings for `hf_text_pipe`:

```ini
[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english"  # Model name or path
revision = "main"                 # Model revision
kwargs = {}                       # Any additional arguments for
                                  # TextClassificationPipeline
scorer = null                     # Optional scorer
```

The input texts are truncated according to the transformers model max length.

#### `TextClassificationPipeline` settings

- `model`: The model name or path.
- `revision`: The model revision.
  For production use, a specific git commit is
  recommended instead of the default `main`.
- `kwargs`: Any additional arguments to
  [`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).

#### Example

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
```

### Batching and GPU

Both token and text classification support batching with `nlp.pipe`:

```python
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)
```

If the component runs into an error processing a batch (e.g. on an empty text),
`nlp.pipe` will back off to processing each text individually. If it runs into
an error on an individual text, a warning is shown and the doc is returned
without additional annotation.

Switch to GPU:

```python
import spacy
spacy.require_gpu()

for doc in nlp.pipe(texts):
    do_something(doc)
```

## Bug reports and issues

Please report bugs in the
[spaCy issue tracker](https://github.com/explosion/spaCy/issues) or open a new
thread on the [discussion board](https://github.com/explosion/spaCy/discussions)
for other issues.
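The batch back-off behavior described under "Batching and GPU" can be sketched in plain Python. This is a simplified illustration of the strategy, not the package's actual implementation; `process_batch` and `process_one` are hypothetical callables standing in for the underlying pipeline:

```python
import warnings


def pipe_with_backoff(process_batch, process_one, texts, batch_size=256):
    """Yield one result per text. On a batch error, retry the batch one text
    at a time; on an individual error, warn and yield the input unchanged."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            yield from process_batch(batch)
        except Exception:
            for text in batch:
                try:
                    yield process_one(text)
                except Exception:
                    warnings.warn(f"Unable to process: {text!r}")
                    yield text  # returned without additional annotation


results = list(
    pipe_with_backoff(
        lambda batch: [t.upper() for t in batch],  # stand-in batch pipeline
        lambda t: t.upper(),                       # stand-in per-text pipeline
        ["a", "b", "c"],
        batch_size=2,
    )
)
print(results)
# ['A', 'B', 'C']
```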