{"id":15014165,"url":"https://github.com/riccorl/ipa","last_synced_at":"2025-04-12T06:05:11.537Z","repository":{"id":40566943,"uuid":"465649228","full_name":"Riccorl/ipa","owner":"Riccorl","description":"NLP Preprocessing Pipeline Wrappers","archived":false,"fork":false,"pushed_at":"2023-05-12T15:13:25.000Z","size":99,"stargazers_count":11,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T06:04:18.114Z","etag":null,"topics":["lemmatization","model","natural-language-processing","nlp","part-of-speech-tagger","pipeline","preprocessing","spacy","stanza","tagging","token","tokenizer","wrapper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Riccorl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-03-03T09:22:16.000Z","updated_at":"2024-12-30T22:26:45.000Z","dependencies_parsed_at":"2023-09-25T01:51:36.281Z","dependency_job_id":null,"html_url":"https://github.com/Riccorl/ipa","commit_stats":{"total_commits":33,"total_committers":4,"mean_commits":8.25,"dds":"0.18181818181818177","last_synced_commit":"54289c04b6408395ea8dc128a69c9458974cb8ee"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Fipa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Fipa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Fipa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Fipa/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Riccorl","download_url":"https://codeload.github.com/Riccorl/ipa/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525144,"owners_count":21118618,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lemmatization","model","natural-language-processing","nlp","part-of-speech-tagger","pipeline","preprocessing","spacy","stanza","tagging","token","tokenizer","wrapper"],"created_at":"2024-09-24T19:45:16.849Z","updated_at":"2025-04-12T06:05:11.491Z","avatar_url":"https://github.com/Riccorl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🍺IPA: import, preprocess, accelerate\n\n[//]: # ([![Open in Visual Studio Code]\u0026#40;https://open.vscode.dev/badges/open-in-vscode.svg\u0026#41;]\u0026#40;https://github.dev/Riccorl/ipa\u0026#41;)\n[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)\n[![Stanza](https://img.shields.io/badge/1.4-Stanza-5f0a09?logo=stanza)](https://stanfordnlp.github.io/stanza/)\n[![SpaCy](https://img.shields.io/badge/3.4.3-SpaCy-1a6f93?logo=spacy)](https://spacy.io/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)\n\n[![Upload to PyPi](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml)\n[![PyPi 
Version](https://img.shields.io/github/v/release/Riccorl/ipa)](https://github.com/Riccorl/ipa/releases)\n[![DeepSource](https://deepsource.io/gh/Riccorl/ipa.svg/?label=active+issues\u0026token=QC6Jty-YdgXjKh9mKZyeqa4I)](https://deepsource.io/gh/Riccorl/ipa/?ref=repository-badge)\n\n\u003c/div\u003e\n\n🍺IPA: import, preprocess, accelerate\n\n## How to use\n\n### Install\n\nInstall the library from [PyPI](https://pypi.org/project/ipa-core):\n\n```bash\npip install ipa-core\n```\n\n### Usage\n\nIPA is a Python library that provides a set of preprocessing wrappers for Stanza and\nspaCy, exposing a unified API that makes the two libraries interchangeable.\n\nLet's start with a simple example. Here we use the `SpacyTokenizer` wrapper to preprocess a text:\n\n```python\nfrom ipa import SpacyTokenizer\n\nspacy_tokenizer = SpacyTokenizer(language=\"en\", return_pos_tags=True, return_lemmas=True)\ntokenized = spacy_tokenizer(\"Mary sold the car to John.\")\nfor word in tokenized:\n    print(\"{:\u003c5} {:\u003c10} {:\u003c10} {:\u003c10}\".format(word.index, word.text, word.pos, word.lemma))\n\n\"\"\"\n0    Mary       PROPN      Mary\n1    sold       VERB       sell\n2    the        DET        the\n3    car        NOUN       car\n4    to         ADP        to\n5    John       PROPN      John\n6    .          PUNCT      .\n\"\"\"\n```\n\nYou can load any spaCy model either by its canonical name, such as `en_core_web_sm`, or by a simple alias, \nsuch as the `en` used here. By default, an alias loads the smaller version of each model. 
For a complete \nlist of available models, see the [spaCy documentation](https://spacy.io/usage/models).\n\nIn the same way, you can load any Stanza model using the `StanzaTokenizer` wrapper:\n\n```python\nfrom ipa import StanzaTokenizer\n\nstanza_tokenizer = StanzaTokenizer(language=\"en\", return_pos_tags=True, return_lemmas=True)\ntokenized = stanza_tokenizer(\"Mary sold the car to John.\")\nfor word in tokenized:\n    print(\"{:\u003c5} {:\u003c10} {:\u003c10} {:\u003c10}\".format(word.index, word.text, word.pos, word.lemma))\n\n\"\"\"\n0    Mary       PROPN      Mary\n1    sold       VERB       sell\n2    the        DET        the\n3    car        NOUN       car\n4    to         ADP        to\n5    John       PROPN      John\n6    .          PUNCT      .\n\"\"\"\n```\n\nFor simpler scenarios, you can use the `WhitespaceTokenizer` wrapper, which just splits the text \nby whitespace:\n\n```python\nfrom ipa import WhitespaceTokenizer\n\nwhitespace_tokenizer = WhitespaceTokenizer()\ntokenized = whitespace_tokenizer(\"Mary sold the car to John .\")\nfor word in tokenized:\n    print(\"{:\u003c5} {:\u003c10}\".format(word.index, word.text))\n\n\"\"\"\n0    Mary\n1    sold\n2    the\n3    car\n4    to\n5    John\n6    .\n\"\"\"\n```\n\n### Features\n\n#### Complete preprocessing pipeline\n\n`SpacyTokenizer` and `StanzaTokenizer` provide a unified API for both libraries, exposing most of their\nfeatures, like tokenization, Part-of-Speech tagging, lemmatization and dependency parsing. You can enable \nor disable each of the optional steps using `return_pos_tags`, `return_lemmas` and `return_deps`. 
So, for example,\n\n```python\nStanzaTokenizer(language=\"en\", return_pos_tags=True, return_lemmas=True)\n```\n\ncreates a tokenizer that returns a list of `Token` objects with the `pos` and `lemma` fields filled,\nwhile\n\n```python\nStanzaTokenizer(language=\"en\")\n```\n\ncreates a tokenizer that returns a list of `Token` objects with only the `text` field filled.\n\n### GPU support\n\nWith `use_gpu=True`, the library will use the GPU if it is available. To set up the environment for the GPU, \nrefer to the [Stanza documentation](https://stanfordnlp.github.io/stanza/) and the \n[spaCy documentation](https://spacy.io/usage/gpu).\n\n## API\n\n### Tokenizers\n\n`SpacyTokenizer`\n\n```python\nclass SpacyTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        language: str = \"en\",\n        return_pos_tags: bool = False,\n        return_lemmas: bool = False,\n        return_deps: bool = False,\n        split_on_spaces: bool = False,\n        use_gpu: bool = False,\n    ):\n```\n\n`StanzaTokenizer`\n\n```python\nclass StanzaTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        language: str = \"en\",\n        return_pos_tags: bool = False,\n        return_lemmas: bool = False,\n        return_deps: bool = False,\n        split_on_spaces: bool = False,\n        use_gpu: bool = False,\n    ):\n```\n\n`WhitespaceTokenizer`\n\n```python\nclass WhitespaceTokenizer(BaseTokenizer):\n    def __init__(self):\n```\n\n### Sentence Splitter\n\n`SpacySentenceSplitter`\n\n```python\nclass SpacySentenceSplitter(BaseSentenceSplitter):\n    def __init__(self, language: str = \"en\", model_type: str = \"statistical\"):\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friccorl%2Fipa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Friccorl%2Fipa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friccorl%2Fipa/lists"}