{"id":13436662,"url":"https://github.com/nlp-uoregon/trankit","last_synced_at":"2025-05-14T12:09:05.063Z","repository":{"id":37761340,"uuid":"328007342","full_name":"nlp-uoregon/trankit","owner":"nlp-uoregon","description":"Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing","archived":false,"fork":false,"pushed_at":"2024-10-13T15:45:15.000Z","size":1108,"stargazers_count":749,"open_issues_count":37,"forks_count":103,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-04-11T04:57:45.480Z","etag":null,"topics":["adapters","artificial-intelligence","deeplearning","dependency-parsing","language-model","lemmatization","machine-learning","morphological-tagging","multilingual","natural-language-processing","nlp","part-of-speech-tagging","pytorch","sentence-segmentation","tokenization","universal-dependencies","xlm-roberta"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlp-uoregon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-08T20:39:14.000Z","updated_at":"2025-04-10T15:44:39.000Z","dependencies_parsed_at":"2022-07-14T21:46:54.357Z","dependency_job_id":"4bc192ee-928b-448d-9646-f7cc40027f44","html_url":"https://github.com/nlp-uoregon/trankit","commit_stats":{"total_commits":110,"total_committers":9,"mean_commits":"12.222222222222221","dds":"0.11818181818181817","last_synced_commit":"7b064827bb0185dca8210c77b989859754d12aa1"},"previous_names":[],"tags_count":3,"template":false,"template_ful
l_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Ftrankit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Ftrankit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Ftrankit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Ftrankit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlp-uoregon","download_url":"https://codeload.github.com/nlp-uoregon/trankit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248345273,"owners_count":21088244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapters","artificial-intelligence","deeplearning","dependency-parsing","language-model","lemmatization","machine-learning","morphological-tagging","multilingual","natural-language-processing","nlp","part-of-speech-tagging","pytorch","sentence-segmentation","tokenization","universal-dependencies","xlm-roberta"],"created_at":"2024-07-31T03:00:51.070Z","updated_at":"2025-04-11T04:57:52.010Z","avatar_url":"https://github.com/nlp-uoregon.png","language":"Python","funding_links":[],"categories":["Uncategorized","其他_NLP自然语言处理","Python","**Tools, Libraries, Models**","Tools","Tasks and Methods"],"sub_categories":["Uncategorized","其他_文本生成、文本对话","Transformers, BERT","Pipelines with Hungarian NLP components","POS Tagging and Dependency Parsing"],"readme":"\u003ch2 align=\"center\"\u003eTrankit: A Light-Weight Transformer-based 
Python Toolkit for Multilingual Natural Language Processing\u003c/h2\u003e\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003ca href=\"https://github.com/nlp-uoregon/trankit/blob/master/LICENSE\"\u003e\r\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/github/license/nlp-uoregon/trankit.svg?color=blue\"\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href='https://trankit.readthedocs.io/en/latest/?badge=latest'\u003e\r\n    \u003cimg src='https://readthedocs.org/projects/trankit/badge/?version=latest' alt='Documentation Status' /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"http://nlp.uoregon.edu/trankit\"\u003e\r\n        \u003cimg alt=\"Demo Website\" src=\"https://img.shields.io/website/http/trankit.readthedocs.io/en/latest/index.html.svg?down_color=red\u0026down_message=offline\u0026up_message=online\"\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://pypi.org/project/trankit/\"\u003e\r\n        \u003cimg alt=\"PyPI Version\" src=\"https://img.shields.io/pypi/v/trankit?color=blue\"\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://pypi.org/project/trankit/\"\u003e\r\n        \u003cimg alt=\"Python Versions\" src=\"https://img.shields.io/pypi/pyversions/trankit?colorB=blue\"\u003e\r\n    \u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\n[Our technical paper](https://arxiv.org/pdf/2101.03289.pdf) for Trankit won the Outstanding Demo Paper Award at [EACL 2021](https://2021.eacl.org/). 
Please cite the paper if you use Trankit in your research.\r\n\r\n```bibtex\r\n@inproceedings{nguyen2021trankit,\r\n      title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing}, \r\n      author={Nguyen, Minh Van and Lai, Viet Dac and Veyseh, Amir Pouran Ben and Nguyen, Thien Huu},\r\n      booktitle=\"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations\",\r\n      year={2021}\r\n}\r\n```\r\n\r\n### :boom: :boom: :boom: Trankit v1.0.0 is out:\r\n\r\n* **90 new pretrained transformer-based pipelines for 56 languages**. The new pipelines are trained with XLM-Roberta large, which further boosts the performance significantly over 90 treebanks of the Universal Dependencies v2.5 corpus. Check out the new performance [here](https://trankit.readthedocs.io/en/latest/performance.html). This [page](https://trankit.readthedocs.io/en/latest/news.html#trankit-large) shows you how to use the new pipelines.\r\n\r\n* **Auto Mode for multilingual pipelines**. In the Auto Mode, the language of the input will be automatically detected, enabling the multilingual pipelines to process the input without specifying its language. Check out how to turn on the Auto Mode [here](https://trankit.readthedocs.io/en/latest/news.html#auto-mode-for-multilingual-pipelines). Thank you [loretoparisi](https://github.com/loretoparisi) for your suggestion on this.\r\n\r\n* **Command-line interface** is now available to use. This helps users who are not familiar with Python programming language use Trankit easily. Check out the tutorials on this [page](https://trankit.readthedocs.io/en/latest/commandline.html).\r\n\r\nTrankit is a **light-weight Transformer-based Python** Toolkit for multilingual Natural Language Processing (NLP). 
It provides a trainable pipeline for fundamental NLP tasks over [100 languages](https://trankit.readthedocs.io/en/latest/pkgnames.html#trainable-languages), and 90 [downloadable](https://trankit.readthedocs.io/en/latest/pkgnames.html#pretrained-languages-their-code-names) pretrained pipelines for [56 languages](https://trankit.readthedocs.io/en/latest/pkgnames.html#pretrained-languages-their-code-names).\r\n\r\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/nlp-uoregon/trankit/master/docs/source/architecture.jpg\" height=\"300px\"/\u003e\u003c/div\u003e\r\n\r\n**Trankit outperforms the current state-of-the-art multilingual toolkit Stanza (StanfordNLP)** in many tasks over [90 Universal Dependencies v2.5 treebanks of 56 different languages](https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5) while still being efficient in memory usage and\r\nspeed, making it *usable for general users*.\r\n\r\nIn particular, for **English**, **Trankit is significantly better than Stanza** on sentence segmentation (**+9.36%**) and dependency parsing (**+5.07%** for UAS and **+5.81%** for LAS). For **Arabic**, our toolkit substantially improves sentence segmentation performance by **16.36%** while **Chinese** observes **14.50%** and **15.00%** improvement of UAS and LAS for dependency parsing. Detailed comparison between Trankit, Stanza, and other popular NLP toolkits (i.e., spaCy, UDPipe) in other languages can be found [here](https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5) on [our documentation page](https://trankit.readthedocs.io/en/latest/index.html).\r\n\r\nWe also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit\r\n\r\n### Installation\r\nTrankit can be easily installed via one of the following methods:\r\n#### Using pip\r\n```\r\npip install trankit\r\n```\r\nThe command would install Trankit and all dependent packages automatically. 
\r\n\r\n#### From source\r\n```\r\ngit clone https://github.com/nlp-uoregon/trankit.git\r\ncd trankit\r\npip install -e .\r\n```\r\nThis clones our GitHub repository and then installs Trankit in editable mode.\r\n\r\n#### Fixing the compatibility issue of Trankit with Transformers\r\nPrevious versions of Trankit encountered a [compatibility issue](https://github.com/nlp-uoregon/trankit/issues/5) when used with recent versions of [transformers](https://github.com/huggingface/transformers). To fix this issue, please install the new version of Trankit as follows:\r\n```\r\npip install trankit==1.1.0\r\n```\r\nIf you encounter any other problem with the installation, please raise an issue [here](https://github.com/nlp-uoregon/trankit/issues/new) to let us know. Thanks.\r\n\r\n### Usage\r\nTrankit can process untokenized (raw) or pretokenized inputs at\r\nboth the sentence and document level. Currently, Trankit supports the following tasks:\r\n- Sentence segmentation.\r\n- Tokenization.\r\n- Multi-word token expansion.\r\n- Part-of-speech tagging.\r\n- Morphological feature tagging.\r\n- Dependency parsing.\r\n- Named entity recognition.\r\n#### Initialize a pretrained pipeline\r\nThe following code shows how to initialize a pretrained pipeline for English; it is instructed to run on GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.\r\n```python\r\nfrom trankit import Pipeline\r\n\r\n# initialize a pretrained pipeline for English\r\np = Pipeline(lang='english', gpu=True, cache_dir='./cache')\r\n```\r\n\r\n#### Perform all tasks on the input\r\nOnce initialized, a pretrained pipeline can be used to perform all tasks on the input, as shown below. If the input is a sentence, the argument `is_sent` must be set to True. 
\r\n```python\r\nfrom trankit import Pipeline\r\n\r\np = Pipeline(lang='english', gpu=True, cache_dir='./cache')\r\n\r\n######## document-level processing ########\r\nuntokenized_doc = '''Hello! This is Trankit.'''\r\npretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]\r\n\r\n# perform all tasks on the input\r\nprocessed_doc1 = p(untokenized_doc)\r\nprocessed_doc2 = p(pretokenized_doc)\r\n\r\n######## sentence-level processing ####### \r\nuntokenized_sent = '''This is Trankit.'''\r\npretokenized_sent = ['This', 'is', 'Trankit', '.']\r\n\r\n# perform all tasks on the input\r\nprocessed_sent1 = p(untokenized_sent, is_sent=True)\r\nprocessed_sent2 = p(pretokenized_sent, is_sent=True)\r\n```\r\nNote that, although pretokenized inputs can always be processed, using pretokenized inputs for languages that require multi-word token expansion such as Arabic or French might not be the correct way. Please check out the column `Requires MWT expansion?` of [this table](https://trankit.readthedocs.io/en/latest/pkgnames.html#pretrained-languages-their-code-names) to see if a particular language requires multi-word token expansion or not.  \r\nFor more detailed examples, please check out our [documentation page](https://trankit.readthedocs.io/en/latest/overview.html).\r\n\r\n#### Multilingual usage\r\nStarting from version v1.0.0, Trankit supports a handy [Auto Mode](https://trankit.readthedocs.io/en/latest/news.html#auto-mode-for-multilingual-pipelines) in which users do not have to set a particular language active before processing the input. 
In the Auto Mode, Trankit will automatically detect the language of the input and use the corresponding language-specific models, thus avoiding switching back and forth between languages in a multilingual pipeline.\r\n\r\n```python\r\nfrom trankit import Pipeline\r\n\r\np = Pipeline('auto')\r\n\r\n# Tokenizing an English input\r\nen_output = p.tokenize('''I figured I would put it out there anyways.''')\r\n\r\n# POS tagging, morphological tagging, and dependency parsing of a French input\r\nfr_output = p.posdep('''On pourra toujours parler à propos d'Averroès de \"décentrement du Sujet\".''')\r\n\r\n# NER tagging a Vietnamese input\r\nvi_output = p.ner('''Cuộc tiêm thử nghiệm tiến hành tại Học viện Quân y, Hà Nội''')\r\n```\r\nIn this example, the code name `'auto'` is used to initialize a multilingual pipeline in the Auto Mode. For more information, please visit [this page](https://trankit.readthedocs.io/en/latest/news.html#auto-mode-for-multilingual-pipelines). Note that, besides the new Auto Mode, the [manual mode](https://trankit.readthedocs.io/en/latest/overview.html#multilingual-usage) can still be used as before.\r\n\r\n#### Building a customized pipeline\r\nTraining customized pipelines is easy with Trankit via the class `TPipeline`. Below we show how to train a token and sentence splitter on customized data.\r\n```python\r\nfrom trankit import TPipeline\r\n\r\ntp = TPipeline(training_config={\r\n    'task': 'tokenize',\r\n    'save_dir': './saved_model',\r\n    'train_txt_fpath': './train.txt',\r\n    'train_conllu_fpath': './train.conllu',\r\n    'dev_txt_fpath': './dev.txt',\r\n    'dev_conllu_fpath': './dev.conllu'\r\n    }\r\n)\r\n\r\ntp.train()\r\n```\r\nDetailed guidelines for training and loading a customized pipeline can be found [here](https://trankit.readthedocs.io/en/latest/training.html).\r\n\r\n#### Sharing your customized pipelines\r\n\r\nIf you want to share your customized pipelines with other users, 
please create an issue [here](https://github.com/nlp-uoregon/trankit/issues/new) and provide us with the following information:\r\n\r\n- Training data that you used to train your models, e.g., data license, data source, and some data statistics (i.e., sizes of training, development, and test data).\r\n- Performance of your pipelines on your test data using the official [evaluation script](https://universaldependencies.org/conll18/evaluation.html).\r\n- A download link to your trained model files (a Google Drive link would be great).\r\n\r\nAfter we receive your request, we will check and test your pipelines. Once everything is done, we will make the pipelines accessible to other users via new language codes.\r\n\r\n### Acknowledgements\r\nThis project has been supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the [Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program](https://www.iarpa.gov/index.php/research-programs/better).\r\n\r\nWe use [XLM-Roberta](https://arxiv.org/abs/1911.02116) and [Adapters](https://arxiv.org/abs/2005.00247) as our shared multilingual encoder for different tasks and languages. The [AdapterHub](https://github.com/Adapter-Hub/adapter-transformers) is used to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from [Stanza](https://github.com/stanfordnlp/stanza). To implement the language detection module, we leverage the [langid](https://github.com/saffsd/langid.py) library.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlp-uoregon%2Ftrankit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlp-uoregon%2Ftrankit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlp-uoregon%2Ftrankit/lists"}