{"id":13741345,"url":"https://github.com/universaldependencies/tools","last_synced_at":"2025-05-08T21:33:25.404Z","repository":{"id":17085075,"uuid":"19850189","full_name":"UniversalDependencies/tools","owner":"UniversalDependencies","description":"Various utilities for processing the data.","archived":false,"fork":false,"pushed_at":"2024-10-29T14:32:07.000Z","size":22370,"stargazers_count":204,"open_issues_count":2,"forks_count":44,"subscribers_count":157,"default_branch":"master","last_synced_at":"2024-10-29T17:33:54.686Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UniversalDependencies.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-05-16T08:46:56.000Z","updated_at":"2024-10-29T14:33:25.000Z","dependencies_parsed_at":"2023-10-13T02:14:01.603Z","dependency_job_id":"34dbe93a-c70e-4988-a7d2-c8df4aa19423","html_url":"https://github.com/UniversalDependencies/tools","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniversalDependencies%2Ftools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniversalDependencies%2Ftools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniversalDependencies%2Ftools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UniversalDependencies%2Ftools/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/UniversalDependencies","download_url":"https://codeload.github.com/UniversalDependencies/tools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224774752,"owners_count":17367790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:58.112Z","updated_at":"2024-11-15T11:31:10.540Z","avatar_url":"https://github.com/UniversalDependencies.png","language":"Python","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"readme":"# UD Tools\n\n[![alt text](https://avatars0.githubusercontent.com/u/7457237?s=200\u0026v=4 \"Universal Dependencies\")](http://universaldependencies.org/)\n\nThis repository contains various scripts in Perl and Python that can be used as tools for Universal Dependencies.\n\n\n\n## [validate.py](https://github.com/UniversalDependencies/tools/blob/master/validate.py)\n\nReads a CoNLL-U file and verifies that it complies with the UD specification. It must be run with\nthe language code and there must exist corresponding lists of treebank-specific features and\ndependency relations in order to check that they are valid, too.\n\nThe script runs under Python 3 and needs the third-party module **regex**. 
If you do not have the\n**regex** module, install it using `pip install --user regex`.\n\nNOTE: Depending on the configuration of your system, it is possible that both Python 2 and 3 are\ninstalled; then you may have to run `python3` instead of `python`, and `pip3` instead of `pip`.\n\n```\ncat la_proiel-ud-train.conllu | python validate.py --lang la --max-err=0\n```\n\nYou can run `python validate.py --help` for a list of available options.\n\n\n\n## [eval.py](https://github.com/UniversalDependencies/tools/blob/master/eval.py)\n\nEvaluates the accuracy of a UD tokenizer / lemmatizer / tagger / parser against gold-standard data.\nThe script was originally developed for the [CoNLL 2017](http://universaldependencies.org/conll17/)\nand [2018 shared tasks](http://universaldependencies.org/conll18/) in UD parsing, and later extended\nto handle the enhanced dependency representation in the [IWPT 2020](https://universaldependencies.org/iwpt20/)\nand [2021 shared tasks](https://universaldependencies.org/iwpt21/).\n\n```\npython eval.py -v goldstandard.conllu systemoutput.conllu\n```\n\nFor more details on usage, see the comments in the script. For more details on the metrics reported,\nsee the overview papers of the shared tasks linked above.\n\n\n\n## [check_sentence_ids.pl](https://github.com/UniversalDependencies/tools/blob/master/check_sentence_ids.pl)\n\nReads CoNLL-U files from STDIN and verifies that every sentence has a unique id in the sent_id\ncomment. All files of one treebank (repository) must be supplied at once in order to test\ntreebank-wide id uniqueness.\n\n```\ncat *.conllu | perl check_sentence_ids.pl\n```\n\n\n\n## [normalize_unicode.pl](https://github.com/UniversalDependencies/tools/blob/master/normalize_unicode.pl)\n\nConverts Unicode to the NFC normalized form. Can be applied to any UTF-8-encoded text file, including\nCoNLL-U. 
As a result, if there are character combinations that by definition must look the same,\nthe same sequence of bytes will be used to represent the glyph, thus improving accuracy of models\n(as long as they are applied to normalized data too).\n\n**Beware**: The output may slightly differ depending on your version of Perl because the Unicode\nstandard evolves and newer Perl versions incorporate newer versions of Unicode data.\n\n```\nperl normalize_unicode.pl \u003c input.conllu \u003e normalized_output.conllu\n```\n\n\n\n## [conllu-stats.pl](https://github.com/UniversalDependencies/tools/blob/master/conllu-stats.pl)\n\nReads a CoNLL-U file, collects various statistics and prints them.\nThis Perl script (conllu-stats.pl) is used to generate the stats.xml files in each data repository.\n\nThe script depends on Perl libraries `YAML` and `JSON::Parse` that may not be installed\nautomatically with Perl. If they are not installed on your system, you should be able\nto install them with the `cpan` command: `cpan YAML` and `cpan JSON::Parse`.\n\n```\nperl conllu-stats.pl *.conllu \u003e stats.xml\n```\n\n\n\n## [mwtoken-stats.pl](https://github.com/UniversalDependencies/tools/blob/master/mwtoken-stats.pl)\n\nReads a CoNLL-U file, collects statistics of multi-word tokens and prints them.\n\n```\ncat *.conllu | perl mwtoken-stats.pl \u003e mwtoken-stats.txt\n```\n\n\n\n## [enhanced_graph_properties.pl](https://github.com/UniversalDependencies/tools/blob/master/enhanced_graph_properties.pl)\n\nReads a CoNLL-U file, collects statistics about the enhanced graphs in the DEPS column and prints\nthem. This script uses the modules Graph.pm and Node.pm that lie in the same folder. On UNIX-like\nsystems it should be able to tell Perl where to find the modules even if the script is invoked from\na remote folder. If that does not work, use `perl -I libfolder script` to invoke it. 
Also note that\nother third-party modules are needed that are not automatically included in the installation of\nPerl: Moose, MooseX::SemiAffordanceAccessor, List::MoreUtils. You may need to install these modules\nusing the `cpan` tool (simply go to the command line and type `sudo cpan Moose`).\n```\ncat *.conllu | perl enhanced_graph_properties.pl \u003e eud-stats.txt\n```\n\n\n\n## [enhanced_collapse_empty_nodes.pl](https://github.com/UniversalDependencies/tools/blob/master/enhanced_collapse_empty_nodes.pl)\n\nReads a CoNLL-U file, removes empty nodes and adjusts the enhanced graphs so that a path traversing\none or more empty nodes is contracted into a single edge: if there was a **conj** edge from node 27\nto node 33.1, and an **nsubj** edge from node 33.1 to node 33, the resulting graph will have an edge\nfrom 27 to 33, labeled **conj\u003ensubj**.\n\nThis script uses the modules Graph.pm and Node.pm that lie in the same folder. On UNIX-like systems\nit should be able to tell Perl where to find the modules even if the script is invoked from a\nremote folder. If that does not work, use `perl -I libfolder script` to invoke it. Also note that\nother third-party modules are needed that are not automatically included in the installation of\nPerl: Moose, MooseX::SemiAffordanceAccessor, List::MoreUtils. You may need to install these modules\nusing the `cpan` tool (simply go to the command line and type `sudo cpan Moose`).\n```\nperl enhanced_collapse_empty_nodes.pl enhanced.conllu \u003e collapsed.conllu\n```\n\n\n\n## [overlap.py](https://github.com/UniversalDependencies/tools/blob/master/overlap.py)\n\nCompares two CoNLL-U files and searches for sentences that occur in both (verbatim duplicates of\ntoken sequences). 
Some treebanks, especially those where the original text had been acquired from\nthe web, contained duplicate documents that were found at different addresses and downloaded twice.\nThis tool helps to find out whether one of the duplicates fell in the training data and the other\nin development or test. The output has to be verified manually, as some “duplicates” are\nrepetitions that occur naturally in the language (in particular short sentences such as “Thank you.”)\n\nThe script can also help to figure out whether training-dev-test data split has been changed\nbetween two releases so that a previously training sentence is now in test or vice versa. That is\nsomething we want to avoid.\n\n\n\n## [find_duplicate_sentences.pl](https://github.com/UniversalDependencies/tools/blob/master/find_duplicate_sentences.pl) \u0026 [remove_duplicate_sentences.pl](https://github.com/UniversalDependencies/tools/blob/master/remove_duplicate_sentences.pl)\n\nSimilar to overlap.py but it works with the sentence-level attribute **text**. It remembers all\nsentences from STDIN or from input files whose names are given as arguments. The find script prints\nthe duplicate sentences (ordered by length and number of occurrences) to STDOUT. The remove script\nworks as a filter: it prints the CoNLL-U data from the input, except for the second and any\nsubsequent occurrence of the duplicate sentences.\n\n\n\n## [conllu_to_conllx.pl](https://github.com/UniversalDependencies/tools/blob/master/conllu_to_conllx.pl)\n\nConverts a file in the CoNLL-U format to the old CoNLL-X format. Useful with old tools (e.g.\nparsers) that require CoNLL-X as their input. Usage:\n```\nperl conllu_to_conllx.pl \u003c file.conllu \u003e file.conll\n```\n\n\n\n## [restore_conllu_lines.pl](https://github.com/UniversalDependencies/tools/blob/master/restore_conllu_lines.pl)\n\nMerges a CoNLL-X and a CoNLL-U file, taking only the CoNLL-U-specific lines from CoNLL-U. 
Can be\nused to merge the output of an old parser that only works with CoNLL-X with the original annotation\nthat the parser could not read.\n```\nrestore_conllu_lines.pl file-parsed.conll file.conllu\n```\n\n\n\n## [conllu_to_text.pl](https://github.com/UniversalDependencies/tools/blob/master/conllu_to_text.pl)\n\nConverts a file in the CoNLL-U format to plain text, word-wrapped to lines of 80 characters (but\nthe output line will be longer if there is a word that is longer than the limit). The script can\nuse either the sentence-level text attribute, or the word forms plus the SpaceAfter=No MISC\nattribute to output detokenized text. It also observes the sentence-level newdoc and newpar\nattributes, and the NewPar=Yes MISC attribute, if they are present, and prints an empty line\nbetween paragraphs or documents.\n\nOptionally, the script takes the language code as a parameter. Codes 'zh' and 'ja' will trigger\na different word-wrapping algorithm that is more suitable for Chinese and Japanese.\n\n**Usage**:\n```\nperl conllu_to_text.pl --lang zh \u003c file.conllu \u003e file.txt\n```\n\n\n\n## [conll_convert_tags_to_uposf.pl](https://github.com/UniversalDependencies/tools/blob/master/conll_convert_tags_to_uposf.pl)\nThis script takes the CoNLL columns CPOS, POS and FEAT and converts their combined values to the universal POS tag and features.\n\nYou need Perl. On Linux, you probably already have it; on Windows, you may have to download and install Strawberry Perl. You also need the Interset libraries. Once you have Perl, it is easy to get them via the following (call `cpan` instead of `cpanm` if you do not have cpanm).\n```\ncpanm Lingua::Interset\n```\nThen use the script like this:\n```\nperl conll_convert_tags_to_uposf.pl -f source_tagset \u003c input.conll \u003e output.conll\n```\nThe source tagset is the identifier of the tagset used in your data and known to Interset. Typically it is the language code followed by two colons and **conll**, e.g. 
**sl::conll** for the Slovenian data of CoNLL 2006. See the [tagset conversion tables](http://universaldependencies.github.io/docs/tagset-conversion/index.html) for more tagset codes.\n\n**IMPORTANT**:\n\nThe script assumes the CoNLL-X (2006 and 2007) file format. If your data is in another format (most notably CoNLL-U, but also e.g. CoNLL 2008/2009, which is not identical to 2006/2007), you have to modify the data or the script. Furthermore,\nyou have to know something about the tagset driver (-f source_tagset above) you are going to use. Some drivers do not expect to receive three values joined by TAB characters. Some expect two values and many expect just a single tag, perhaps the one you have in your POS column. These factors may also require you to adapt the script to your needs. You may want to consult the [documentation](https://metacpan.org/pod/Lingua::Interset). Go to Browse / Interset / Tagset, look up your language code and tagset name, then locate the list() function in the source code. That will give you an idea of what the input tags should look like (usually the driver is able to decode even some tags that are not on the list but have the same structure and feature values).\n\n\n\n## [check_files.pl](https://github.com/UniversalDependencies/tools/blob/master/check_files.pl)\nThis script checks the contents of a single data repository for missing/extra files,\ninvalid metadata in the README, etc. Together with validate.py, which checks the contents\nof individual CoNLL-U files, this script assesses whether a treebank is valid and\nready to be released.\n\n\n\n## [check_release.pl](https://github.com/UniversalDependencies/tools/blob/master/check_release.pl)\nThis script must be run in a folder where all the data repositories (UD_*) are\nstored as subfolders. 
It checks the contents of the data repositories for various\nissues that we want to solve before a new release of UD is published.\n\n\n\n## [conllu_align_tokens.pl](https://github.com/UniversalDependencies/tools/blob/master/conllu_align_tokens.pl)\nCompares tokenization and word segmentation of two CoNLL-U files. Assumes that no normalization was performed, that is, the sequence of non-whitespace characters is identical on both sides. Use case: We want to merge a gold-standard file, which has no lemmas, with lemmatization predicted by an external tool. But the tool also performed tokenization and we have no guarantee that it matches the gold-standard tokenization. Despite its name, the script now does exactly that, i.e., copies the system lemma to the gold-standard annotation if the tokens match, and prints the merged file to STDOUT. If something other than the lemma should be copied, the source code must be adjusted.\n```\nperl conllu_align_tokens.pl UD_Turkish-PUD/tr_pud-ud-test.conllu media/conll17-ud-test-2017-05-09/UFAL-UDPipe-1-2/2017-05-15-02-00-38/output/tr_pud.conllu\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funiversaldependencies%2Ftools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funiversaldependencies%2Ftools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funiversaldependencies%2Ftools/lists"}