{"id":13857012,"url":"https://github.com/explosion/spacy-stanza","last_synced_at":"2025-05-15T15:05:27.841Z","repository":{"id":39706384,"uuid":"168454015","full_name":"explosion/spacy-stanza","owner":"explosion","description":"💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy","archived":false,"fork":false,"pushed_at":"2024-08-15T16:33:50.000Z","size":69,"stargazers_count":733,"open_issues_count":14,"forks_count":60,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-04-03T09:07:44.481Z","etag":null,"topics":["corenlp","data-science","machine-learning","natural-language-processing","nlp","spacy","spacy-pipeline","stanford-corenlp","stanford-machine-learning","stanford-nlp","stanza"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-31T03:08:06.000Z","updated_at":"2025-03-20T19:46:25.000Z","dependencies_parsed_at":"2024-02-13T12:04:28.629Z","dependency_job_id":"973dbdd8-5756-48ad-8952-daa40d98a7b4","html_url":"https://github.com/explosion/spacy-stanza","commit_stats":{"total_commits":97,"total_committers":8,"mean_commits":12.125,"dds":"0.35051546391752575","last_synced_commit":"b57f3482582535f2f07afb0e4ed01df650925d9d"},"previous_names":["explosion/spacy-stanfordnlp"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-stanza","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/explosion%2Fspacy-stanza/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-stanza/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-stanza/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacy-stanza/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248478049,"owners_count":21110627,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corenlp","data-science","machine-learning","natural-language-processing","nlp","spacy","spacy-pipeline","stanford-corenlp","stanford-machine-learning","stanford-nlp","stanza"],"created_at":"2024-08-05T03:01:22.573Z","updated_at":"2025-04-11T20:38:16.989Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# spaCy + Stanza (formerly StanfordNLP)\n\nThis package wraps the [Stanza](https://github.com/stanfordnlp/stanza) (formerly\nStanfordNLP) library, so you can use Stanford's models in a\n[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy in\nthe CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech\ntagging, morphological analysis, lemmatization and labeled dependency parsing in\n68 languages. 
As of v1.0, Stanza also supports named entity recognition for\nselected languages.\n\n\u003e ⚠️ Previous versions of this package were available as\n\u003e [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).\n\n[![tests](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml)\n[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)\n[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n\nUsing this wrapper, you'll be able to use the following annotations, computed by\nyour pretrained `stanza` model:\n\n- Statistical tokenization (reflected in the `Doc` and its tokens)\n- Lemmatization (`token.lemma` and `token.lemma_`)\n- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)\n- Morphological analysis (`token.morph`)\n- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)\n- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`,\n  `token.ent_iob`, `token.ent_iob_`)\n- Sentence segmentation (`doc.sents`)\n\n## ⌛️ Installation\n\nAs of v1.0.0, `spacy-stanza` is only compatible with **spaCy v3.x**. 
To install\nthe most recent version:\n\n```bash\npip install spacy-stanza\n```\n\nFor spaCy v2, install v0.2.x and refer to the\n[v0.2.x usage documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):\n\n```bash\npip install \"spacy-stanza\u003c0.3.0\"\n```\n\nMake sure to also\n[download](https://stanfordnlp.github.io/stanza/download_models.html) one of the\n[pre-trained Stanza models](https://stanfordnlp.github.io/stanza/models.html).\n\n## 📖 Usage \u0026 Examples\n\n\u003e ⚠️ **Important note:** This package has been refactored to take advantage of\n\u003e [spaCy v3.0](https://spacy.io). Previous versions that were built for\n\u003e [spaCy v2.x](https://v2.spacy.io) worked considerably differently. Please see\n\u003e previous tagged versions of this README for documentation on prior versions.\n\nUse `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to\nprocess a text with a Stanza pipeline and create a spaCy\n[`Doc` object](https://spacy.io/api/doc). By default, both the spaCy pipeline\nand the Stanza pipeline will be initialized with the same `lang`, e.g. \"en\":\n\n```python\nimport stanza\nimport spacy_stanza\n\n# Download the stanza model if necessary\nstanza.download(\"en\")\n\n# Initialize the pipeline\nnlp = spacy_stanza.load_pipeline(\"en\")\n\ndoc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")\nfor token in doc:\n    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)\nprint(doc.ents)\n```\n\nIf language data for the given language is available in spaCy, the respective\nlanguage class can be used as the base for the `nlp` object – for example,\n`English()`. This lets you use spaCy's lexical attributes like `is_stop` or\n`like_num`. 
The `nlp` object follows the same API as any other spaCy `Language`\nclass – so you can visualize the `Doc` objects with displaCy, add custom\ncomponents to the pipeline, use the rule-based matcher and do pretty much\nanything else you'd normally do in spaCy.\n\n```python\n# Access spaCy's lexical attributes\nprint([token.is_stop for token in doc])\nprint([token.like_num for token in doc])\n\n# Visualize dependencies\nfrom spacy import displacy\ndisplacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook\n\n# Process texts with nlp.pipe\nfor doc in nlp.pipe([\"Lots of texts\", \"Even more texts\", \"...\"]):\n    print(doc.text)\n\n# Combine with your own custom pipeline components\nfrom spacy import Language\n@Language.component(\"custom_component\")\ndef custom_component(doc):\n    # Do something to the doc here\n    print(f\"Custom component called: {doc.text}\")\n    return doc\n\nnlp.add_pipe(\"custom_component\")\ndoc = nlp(\"Some text\")\n\n# Serialize attributes to a numpy array\nnp_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])\n```\n\n### Stanza Pipeline options\n\nAdditional options for the Stanza\n[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be\nprovided as keyword arguments following the `Pipeline` API:\n\n- Provide the Stanza language as `lang`. 
For Stanza languages without spaCy\n  support, use \"xx\" for the spaCy language setting:\n\n  ```python\n  # Initialize a pipeline for Coptic\n  nlp = spacy_stanza.load_pipeline(\"xx\", lang=\"cop\")\n  ```\n\n- Provide Stanza pipeline settings following the `Pipeline` API:\n\n  ```python\n  # Initialize a German pipeline with the `hdt` package\n  nlp = spacy_stanza.load_pipeline(\"de\", package=\"hdt\")\n  ```\n\n- Tokenize with spaCy rather than the statistical tokenizer (only for English):\n\n  ```python\n  nlp = spacy_stanza.load_pipeline(\"en\", processors={\"tokenize\": \"spacy\"})\n  ```\n\n- Provide any additional processor settings as additional keyword arguments:\n\n  ```python\n  # Provide pretokenized texts (whitespace tokenization)\n  nlp = spacy_stanza.load_pipeline(\"de\", tokenize_pretokenized=True)\n  ```\n\nThe spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`\nblock. For example, the config for the last example above, a German pipeline\nwith pretokenized texts:\n\n```ini\n[nlp.tokenizer]\n@tokenizers = \"spacy_stanza.PipelineAsTokenizer.v1\"\nlang = \"de\"\ndir = null\npackage = \"default\"\nlogging_level = null\nverbose = null\nuse_gpu = true\n\n[nlp.tokenizer.kwargs]\ntokenize_pretokenized = true\n\n[nlp.tokenizer.processors]\n```\n\n### Serialization\n\nThe full Stanza pipeline configuration is stored in the spaCy pipeline\n[config](https://spacy.io/usage/training#config), so you can save and load the\npipeline just like any other `nlp` pipeline:\n\n```python\nimport spacy\n\n# Save to a local directory\nnlp.to_disk(\"./stanza-spacy-model\")\n\n# Reload the pipeline\nnlp = spacy.load(\"./stanza-spacy-model\")\n```\n\nNote that this **does not save any Stanza model data by default**. 
The Stanza\nmodels are very large, so for now, this package expects you to download the\nmodels separately with `stanza.download()` and have them available either in the\ndefault model directory or in the path specified under `[nlp.tokenizer.dir]` in\nthe config.\n\n### Adding additional spaCy pipeline components\n\nBy default, the spaCy pipeline in the `nlp` object returned by\n`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes\nare computed and set within the custom tokenizer,\n[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`\nobject, you can add your own components to the pipeline. For example, you could\nadd\n[your own custom text classification component](https://spacy.io/usage/training)\nwith `nlp.add_pipe(\"textcat\", source=source_nlp)`, or augment the named entities\nwith your own rule-based patterns using the\n[`EntityRuler` component](https://spacy.io/usage/rule-based-matching#entityruler).\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-stanza","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspacy-stanza","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-stanza/lists"}