{"id":19723125,"url":"https://github.com/bnosac/udpipe","last_synced_at":"2025-04-04T20:13:26.656Z","repository":{"id":52283259,"uuid":"101394229","full_name":"bnosac/udpipe","owner":"bnosac","description":"R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit","archived":false,"fork":false,"pushed_at":"2023-03-01T17:20:02.000Z","size":6017,"stargazers_count":214,"open_issues_count":32,"forks_count":33,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-03-28T19:08:42.187Z","etag":null,"topics":["conll","dependency-parser","lemmatization","natural-language-processing","nlp","pos-tagging","r","r-package","r-pkg","rcpp","text-mining","tokenizer","udpipe"],"latest_commit_sha":null,"homepage":"https://bnosac.github.io/udpipe/en","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bnosac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-08-25T10:39:11.000Z","updated_at":"2025-03-22T11:00:49.000Z","dependencies_parsed_at":"2023-10-20T18:30:01.598Z","dependency_job_id":null,"html_url":"https://github.com/bnosac/udpipe","commit_stats":{"total_commits":388,"total_committers":3,"mean_commits":"129.33333333333334","dds":"0.0077319587628865705","last_synced_commit":"6a974c52fea0eb021f230c9a49f591edf739f3cf"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fudpipe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fudpipe/tags","releases_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fudpipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fudpipe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bnosac","download_url":"https://codeload.github.com/bnosac/udpipe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242680,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conll","dependency-parser","lemmatization","natural-language-processing","nlp","pos-tagging","r","r-package","r-pkg","rcpp","text-mining","tokenizer","udpipe"],"created_at":"2024-11-11T23:19:37.095Z","updated_at":"2025-04-04T20:13:26.634Z","avatar_url":"https://github.com/bnosac.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe \n\nThis repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).\n\n- UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part in natural language processing.\n- The techniques used are explained in detail in the paper: \"Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe\", available at \u003chttps://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf\u003e. 
In that paper, you'll also find accuracy figures for different languages and processing speed (measured in words per second).\n\n![](vignettes/udpipe-rlogo.png)\n\n## General\n\nThe udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:\n\n- Give R users a simple way to tokenize, tag, lemmatize or perform dependency parsing on text in any language\n- Provide easy access to pre-trained annotation models\n- Allow R users to easily construct their own annotation model based on data in CONLL-U format as provided in more than 100 treebanks available at http://universaldependencies.org\n- Don't rely on Python or Java so that R users can easily install this package without configuration hassle\n- No external R package dependencies except those strictly necessary (Rcpp and data.table, no tidyverse)\n\n## Installation \u0026 License\n\nThe package is available under the Mozilla Public License Version 2.0.\nInstallation can be done as follows. 
Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.\n\n```\ninstall.packages(\"udpipe\")\nvignette(\"udpipe-tryitout\", package = \"udpipe\")\nvignette(\"udpipe-annotation\", package = \"udpipe\")\nvignette(\"udpipe-universe\", package = \"udpipe\")\nvignette(\"udpipe-usecase-postagging-lemmatisation\", package = \"udpipe\")\n# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html\nvignette(\"udpipe-usecase-topicmodelling\", package = \"udpipe\")\nvignette(\"udpipe-parallel\", package = \"udpipe\")\nvignette(\"udpipe-train\", package = \"udpipe\")\n```\n\nFor installing the development version of this package: `remotes::install_github(\"bnosac/udpipe\", build_vignettes = TRUE)`\n\n## Example\n\nCurrently the package allows you to do tokenisation, tagging, lemmatization and dependency parsing with one convenient function called `udpipe`\n\n```\nlibrary(udpipe)\nudmodel \u003c- udpipe_download_model(language = \"dutch\")\nudmodel\n\n    language                                                                             file_model\ndutch-alpino C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.5-191206.udpipe\n\nx \u003c- udpipe(x = \"Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.\",\n            object = udmodel)\nx\n```\n\n```\n doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc\n   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            \u003cNA\u003e\n   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev 
Number=Sing|Tense=Past|VerbForm=Fin             0         root            \u003cNA\u003e\n   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                \u003cNA\u003e             4         case            \u003cNA\u003e\n   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            \u003cNA\u003e\n   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                \u003cNA\u003e             7           cc            \u003cNA\u003e\n   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            \u003cNA\u003e\n   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            \u003cNA\u003e\n   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                \u003cNA\u003e             7 compound:prt   SpaceAfter=No\n   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                \u003cNA\u003e             7        punct            \u003cNA\u003e\n...\n```\n\n\n## Pre-trained models\n\nPre-trained models built on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:\n\nafrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, 
catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb. \n\nUsers of the package can download any of these models with `udpipe_download_model`.\n\n### How good are these models? 
\n\n- Accuracy statistics of models provided by the UDPipe authors which you download with udpipe_download_model from the default repository are available at [this link](https://github.com/jwijffels/udpipe.models.ud.2.5/blob/master/inst/udpipe-ud-2.5-191206/README).\n- Accuracy statistics of models trained using this R package which you download with udpipe_download_model from the bnosac/udpipe.models.ud repository are available at https://github.com/bnosac/udpipe.models.ud.\n- For a comparison between UDPipe and spaCy visit https://github.com/jwijffels/udpipe-spacy-comparison\n\n## Train your own models based on CONLL-U data\n\nThe package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format.\nThese are provided for many languages at https://universaldependencies.org, mostly under the CC-BY-SA license.\nHow this is done is detailed in the package vignette.\n\n```\nvignette(\"udpipe-train\", package = \"udpipe\")\n```\n\n\n## Support in text mining\n\nNeed support in text mining?\nContact BNOSAC: http://www.bnosac.be\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fudpipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbnosac%2Fudpipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fudpipe/lists"}