https://github.com/gatenlp/corpusconversion-universal-dependencies
Tool to convert the Universal Dependencies Treebanks to GATE format
- Host: GitHub
- URL: https://github.com/gatenlp/corpusconversion-universal-dependencies
- Owner: GateNLP
- Created: 2017-02-03T15:04:03.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2017-08-22T19:50:33.000Z (almost 9 years ago)
- Last Synced: 2025-03-09T19:55:20.603Z (about 1 year ago)
- Language: Groovy
- Size: 11.7 KB
- Stars: 0
- Watchers: 17
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Tool to convert Universal Dependencies corpora to GATE
This is an attempt at a script that converts Universal Dependencies corpora into GATE documents.
Most treebanks do not seem to record document boundaries, so the conversion is done by
choosing the number of sentences to put in each output GATE document. The default is one (one document per sentence).
If more than one sentence is put into a document, each sentence starts on a new line.
The CoNLL format does not include any information about whitespace, so a few simple heuristics are used to make
the output look reasonable. However, some treebanks include the actual text of each sentence, whitespace included,
in a comment line; this text can be used instead of the heuristics to reconstruct the whitespace.
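To see whether a given treebank file carries the sentence text in comment lines, it can help to count them before converting. The sketch below is not part of this repository. It assumes the `# text = ...` comment convention used by Universal Dependencies v2 (older releases may use a different comment form), and the sample file it builds is made up for illustration.

```shell
#!/bin/sh
# Build a tiny, made-up CoNLL-U sample: two sentences,
# only the first one has a "# text = " comment line.
sample=$(mktemp)
{
  printf '# text = Hello, world!\n'
  printf '1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n'
  printf '2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n'
  printf '3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n'
  printf '4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n'
  printf '\n'
  printf '1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n'
  printf '2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n'
  printf '\n'
} > "$sample"

# Every sentence has exactly one token line whose first column is "1".
sentences=$(awk -F'\t' '$1 == "1" { n++ } END { print n + 0 }' "$sample")

# Count comment lines that hold the original sentence text
# (assumed UD v2 convention).
text_lines=$(grep -c '^# text = ' "$sample")

echo "$sentences sentences, $text_lines with a text comment"
rm -f "$sample"
```

If the two counts match, the original text is available for every sentence. If the second count is lower or zero, the whitespace for the remaining sentences has to come from the heuristics.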
## How to run
* Make sure `convert.sh` is executable, `groovy` is installed and on the `PATH`, and `GATE_HOME` is set.
* Create a directory to contain the GATE documents.
* Optionally, set `JAVA_OPTS`; if set, it overrides the default in the script.
* Run `./convert.sh [options] infile outdir`.
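The steps above can be sketched as a small shell session. The `preflight` helper is hypothetical and not part of this repository. The names `infile.conllu` and `outdir`, and the `JAVA_OPTS` value in the comment, are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the tool): checks the
# prerequisites from the list above and reports what is missing.
preflight() {
  script=$1
  outdir=$2
  rc=0
  [ -x "$script" ] || {
    echo "not executable: $script (try: chmod +x $script)" >&2; rc=1; }
  command -v groovy >/dev/null 2>&1 || {
    echo "groovy not found on PATH" >&2; rc=1; }
  [ -n "${GATE_HOME:-}" ] || {
    echo "GATE_HOME is not set" >&2; rc=1; }
  [ -d "$outdir" ] || {
    echo "output directory missing: $outdir (try: mkdir -p $outdir)" >&2; rc=1; }
  return "$rc"
}

# Optional: export JAVA_OPTS="-Xmx2g"   # example value; overrides the script default

# Run the conversion only when every prerequisite is met.
if preflight ./convert.sh outdir; then
  ./convert.sh infile.conllu outdir
fi
```

Running the check first gives one message per missing prerequisite, which is easier to read than a failure from inside the conversion script.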
## Annotations and features created