https://github.com/rhdunn/conllu-en-validator
A tool for validating English CoNLL-U data files.
- Host: GitHub
- URL: https://github.com/rhdunn/conllu-en-validator
- Owner: rhdunn
- License: apache-2.0
- Created: 2023-10-27T21:38:11.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-12-04T08:59:49.000Z (over 1 year ago)
- Last Synced: 2025-01-31T23:28:04.737Z (4 months ago)
- Topics: conllu, universal-dependencies
- Language: Python
- Size: 334 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# English CoNLL-U Validator
> A tool for validating English CoNLL-U data files.

Usage:
```
./validate input.conllu OPTIONS | tee output.log
```

- `--language LANG` -- The default language to use if none is specified in the document metadata.
- `--validator VALIDATOR` -- The validation check to perform on the input file.

## Validators
The validator can be one of the following:

abbreviations
: Check that tokens such as `Mrs.` are single tokens.

contractions
: Check that dialectal contractions containing `'` are kept as a single token instead of
being incorrectly split into a multi-word token.

form
: Check that the token and word `FORM` field is consistent with the assigned `UPOS`,
for example, that punctuation tokens contain a single punctuation character.

lemma
: Check that the token and word `LEMMA` field is consistent with the assigned `XPOS`
and relevant `MISC` features. __Note:__ If the token has a `CorrectForm`, the corrected
lemma should be in a [`CorrectLemma`](https://universaldependencies.org/misc.html#correctfeature)
annotation per the [guideline for typos](https://universaldependencies.org/u/overview/typos.html).

mwt-tokens
: Check that `SpaceAfter` is not used within multi-word tokens. This will flag the use
of `SpaceAfter` between other tokens that should be annotated as multi-word tokens.

mwt-words
: Check that the words in the multi-word token are correct.

pos-tags
: Check that the `UPOS` values are valid Universal Dependencies values for all treebanks.
Check that the `XPOS` values are valid Penn Treebank values for English treebanks.

sentence-text
: Check that the token stream matches the sentence text for all treebanks.
Check that the word stream matches the sentence text for English treebanks.

split-sentences
: Check that the sentences are split correctly.

## License
Copyright (C) 2023 Reece H. Dunn

`SPDX-License-Identifier:` [Apache-2.0](LICENSE)
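
## Example: a UPOS check

The `UPOS` half of the `pos-tags` check described under Validators can be pictured with a short Python sketch. This is not the tool's implementation: `UD_UPOS`, `find_invalid_upos`, and the sample sentence are hypothetical names and data introduced here for illustration. The sketch assumes well-formed, tab-separated CoNLL-U rows with ten fields and ignores `XPOS` entirely.

```python
# The 17 universal part-of-speech tags defined by Universal Dependencies.
UD_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def find_invalid_upos(conllu_text):
    """Return (line_number, form, upos) for each word whose UPOS is not a UD tag.

    Hypothetical helper for illustration; not part of conllu-en-validator.
    """
    problems = []
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank sentence separators and `# key = value` comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # Malformed rows are out of scope for this sketch.
        token_id, form, upos = fields[0], fields[1], fields[3]
        # Multi-word token ranges (e.g. "1-2") carry "_" instead of a UPOS.
        if "-" in token_id:
            continue
        if upos not in UD_UPOS:
            problems.append((line_number, form, upos))
    return problems


# Made-up sample data: the second word carries the misspelled tag "VRB".
sample = "\n".join([
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVRB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
])
print(find_invalid_upos(sample))  # [(3, 'bark', 'VRB')]
```

A set lookup keeps the check simple. A fuller validator would also need to decide how to treat unannotated `_` values and empty nodes, which this sketch does not attempt.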