https://github.com/gatenlp/corpusconversion-conll2003
Tool/scripts to help converting the CoNLL 2003 corpora to GATE format
- Host: GitHub
- URL: https://github.com/gatenlp/corpusconversion-conll2003
- Owner: GateNLP
- Created: 2017-09-01T15:25:02.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-05-04T14:32:08.000Z (about 8 years ago)
- Last Synced: 2025-01-13T06:11:01.865Z (over 1 year ago)
- Language: Scala
- Size: 12.7 KB
- Stars: 1
- Watchers: 13
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Tools to convert the CoNLL2003 NER corpora to GATE format
This repository contains two sets of scripts for creating the CoNLL 2003 NER corpora
in GATE FastInfoset format:
* `prepare-deu.sh` and `prepare-eng.sh` create the CoNLL-format files from
the text corpora (licensed, must be obtained separately) and the annotation files
(available from https://www.clips.uantwerpen.be/conll2003/ner/)
* `convert-deu.sh` and `convert-eng.sh` convert the CoNLL-format files to GATE format.
NOTE: the prepare-deu.sh script does not work correctly at the moment and cannot be
used, so you must create the German CoNLL-format files separately and then
put them into the directory ./conll2003-deu.
## Preparing the CONLL-format files
Create the English CoNLL 2003 format files by running `./prepare-eng.sh` with two directory arguments:
* the directory that contains the downloaded ner.tgz file
* the directory that contains all the zip files of the Reuters corpus
This should place the three files eng.train, eng.testa and eng.testb into the
./conll2003-eng directory.
NOTE: The preparation of the German conll2003-format files does not work properly right now.
If you find the problem, please let me know or provide a pull request!
## Converting the CONLL-format files to GATE format
Requirements:
* Needs Java and Scala installed
* Needs GATE 8.4.x installed and the environment variable `GATE_HOME` set to the installation directory
* Only works on Linux and Mac; under Windows it probably works only in some form of Linux-compatibility mode
Make sure that the CONLL-format files are in conll2003-eng or conll2003-deu as needed!
Now run `./convert-eng.sh` to convert the English files and/or `./convert-deu.sh` to convert the German files.
Each of the result directories contains one GATE document in GATE XML format for each document identified in the corresponding input file.
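
The requirements above can be checked before starting a conversion. The following is a minimal sketch, not part of this repository: the function name `preflight` and its parameters are hypothetical, and it only tests what the list above states (a set `GATE_HOME`, `java` and `scala` on the `PATH`, and a non-empty input directory). It does not verify the GATE version.

```python
import os
import shutil
from pathlib import Path


def preflight(lang, env=None, base_dir="."):
    """Return a list of problems that would likely stop convert-<lang>.sh.

    lang is "eng" or "deu"; env defaults to the real process environment.
    This helper is hypothetical and is not shipped with the repository.
    """
    env = os.environ if env is None else env
    problems = []

    # GATE_HOME must be set and should point to the GATE installation directory.
    if not env.get("GATE_HOME"):
        problems.append("GATE_HOME is not set")
    elif not Path(env["GATE_HOME"]).is_dir():
        problems.append("GATE_HOME does not point to a directory")

    # Java and Scala must be installed and reachable on the PATH.
    for tool in ("java", "scala"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # The CoNLL-format files are expected in conll2003-eng or conll2003-deu.
    input_dir = Path(base_dir) / f"conll2003-{lang}"
    if not input_dir.is_dir() or not any(input_dir.iterdir()):
        problems.append(f"{input_dir} is missing or empty")

    return problems


if __name__ == "__main__":
    for problem in preflight("eng"):
        print("problem:", problem)
```

An empty result means none of the checked conditions failed; it does not guarantee that the conversion itself succeeds.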
## Conversion Strategy
The following annotations are placed into the annotation set "Original markups":
* LOC, MISC, ORG, PER: for the entity annotations from the input. These annotations have the single feature startLineNr, which identifies the (1-based) line number in the original CoNLL input file where this entity started
* Token: for each input token one annotation is created. It contains the following features:
  * chunkBIO: the original BIO value of the column for chunks
  * lemma: the original value of the lemma column (German only)
  * lineNr: the line number (1-based) of that token in the original CoNLL input file as generated by the script described above
  * neBIO: the original BIO value of the NE column
  * pos: the original value of the POS column
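
To illustrate how these annotations relate to the input lines, here is a minimal sketch in Python. It is not the repository's Scala code. It assumes the English column layout (word, POS, chunk tag, NE tag, separated by whitespace) and `-DOCSTART-` marker lines; the function names and the `text` and `tokens` keys are hypothetical, while `lineNr`, `startLineNr`, `pos`, `chunkBIO` and `neBIO` follow the feature names listed above. A new entity is started on a `B-` tag or whenever the entity type changes, which is one plausible reading of the BIO column.

```python
def read_tokens(lines):
    """Turn CoNLL lines into token dicts; None marks a sentence or document boundary."""
    tokens = []
    for line_nr, line in enumerate(lines, start=1):  # lineNr is 1-based
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            tokens.append(None)
            continue
        word, pos, chunk, ne = fields  # assumed English layout: 4 columns
        tokens.append({"text": word, "pos": pos, "chunkBIO": chunk,
                       "neBIO": ne, "lineNr": line_nr})
    return tokens


def read_entities(tokens):
    """Group consecutive NE-tagged tokens into entities with a startLineNr."""
    entities = []
    current = None
    previous_type = None
    for token in tokens:
        tag = token["neBIO"] if token else "O"
        if tag == "O":
            current, previous_type = None, None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "B" or entity_type != previous_type:
            # A B- tag or a change of type starts a new entity.
            current = {"type": entity_type,
                       "startLineNr": token["lineNr"],
                       "tokens": [token["text"]]}
            entities.append(current)
        else:
            current["tokens"].append(token["text"])
        previous_type = entity_type
    return entities


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "German JJ I-NP I-MISC",
    "call NN I-NP O",
    "",
    "Peter NNP I-NP I-PER",
    "Blackburn NNP I-NP I-PER",
]

for entity in read_entities(read_tokens(sample)):
    print(entity["type"], entity["startLineNr"], " ".join(entity["tokens"]))
```

For the German files the lemma would be an additional column, so the unpacking of `fields` would need to be adapted.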
The conversion algorithm separates all tokens by spaces, with these exceptions: no space is inserted before the punctuation characters !,-.:;? and
no space is inserted after opening brackets ({[ or before closing brackets )}]
No space is inserted before the token 's, but a space is inserted before a single ' because we cannot know whether it marks
the genitive of a plural, is used for quoting, or serves some other purpose.
For the quote character ", a space is inserted before but not after every odd occurrence in a document, and after but not before
every even occurrence.
All other tokens, including quote-like characters such as ' and `, or characters such as $£#~%, are separated by
spaces if they are separate tokens in the input file.
No space is added at the beginning or end of a sentence or beginning or end of a document.
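
The spacing rules above can be sketched as follows. This is an illustrative Python reimplementation, not the repository's Scala code, and the function name `detokenize` is hypothetical. It assumes that a document is given as a list of sentences, each a list of token strings, and that sentences are joined with a newline, which the description above does not specify. Where two rules conflict (for example an opening quote directly after an opening bracket), the sketch lets the "no space" rule win, which is also an assumption.

```python
NO_SPACE_BEFORE = set("!,-.:;?") | set(")}]") | {"'s"}
NO_SPACE_AFTER = set("({[")


def detokenize(sentences):
    """Rebuild document text from tokenized sentences using the spacing rules."""
    quote_count = 0  # double quotes are counted per document, not per sentence
    lines = []
    for tokens in sentences:
        parts = []
        glue_next = False  # True if the previous token forbids a following space
        for index, token in enumerate(tokens):
            # Default: a space before every token except the first in a sentence.
            space_before = (index > 0 and not glue_next
                            and token not in NO_SPACE_BEFORE)
            glue_next = token in NO_SPACE_AFTER
            if token == '"':
                quote_count += 1
                if quote_count % 2 == 1:
                    glue_next = True      # odd occurrence: space before, none after
                else:
                    space_before = False  # even occurrence: none before, space after
            if space_before:
                parts.append(" ")
            parts.append(token)
        lines.append("".join(parts))
    return "\n".join(lines)


example = [["He", "said", ",", '"', "Hello", "(", "world", ")", "!", '"', "today", "."]]
print(detokenize(example))  # He said, "Hello (world)!" today.
```

Because the quote counter is kept per document, an unbalanced quote in one sentence flips the treatment of every later quote in the same document.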