https://github.com/gatenlp/corpusconversion-conll2003

Tool/scripts to help converting the CoNLL 2003 corpora to GATE format

# Tools to convert the CoNLL2003 NER corpora to GATE format

This repository contains two sets of scripts for creating the CoNLL 2003 NER corpora
in GATE FastInfoset format:
* `prepare-deu.sh` and `prepare-eng.sh` create the CoNLL-format files from
  the text corpora (licensed, must be obtained separately) and the annotation files
  (available from https://www.clips.uantwerpen.be/conll2003/ner/)
* `convert-deu.sh` and `convert-eng.sh` convert the CoNLL-format files to GATE format.

NOTE: the prepare-deu.sh script currently does not work correctly and cannot be
used, so you must create the German CoNLL-format files separately and then
place them in the directory ./conll2003-deu.

## Preparing the CONLL-format files

Create the English CoNLL 2003-format files by running `./prepare-eng.sh` with two directory arguments:
* the directory that contains the downloaded ner.tgz file
* the directory that contains all the zip files of the Reuters corpus

This should place the three files eng.train, eng.testa and eng.testb into the
./conll2003-eng directory.

NOTE: The preparation of the German conll2003-format files does not work properly right now.
If you find the problem, please let me know or provide a pull request!

## Converting the CONLL-format files to GATE format

Requirements:
* Needs Java and Scala installed
* Needs GATE 8.4.x installed and the environment variable `GATE_HOME` set to the installation directory
* Works on Linux and Mac; on Windows it probably only works in some form of Linux-compatibility mode

Make sure that the CONLL-format files are in conll2003-eng or conll2003-deu as needed!

Now run `./convert-eng.sh` to convert the English files and/or `./convert-deu.sh` to convert the German files.

Each of the result directories contains one GATE document in GATE XML format for each document identified in the corresponding input file.
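How the conversion scripts identify document boundaries is not described here. As a minimal sketch, assuming that documents are delimited by the `-DOCSTART-` marker lines found in the CoNLL 2003 files, the input could be grouped like this. The function name `split_documents` is hypothetical and not part of this repository.

```python
from typing import Iterable, List, Tuple

# Assumption: a line whose first column is -DOCSTART- begins a new document.
DOCSTART = "-DOCSTART-"


def split_documents(lines: Iterable[str]) -> List[List[Tuple[int, str]]]:
    """Group (1-based line number, line text) pairs into documents.

    The marker line itself is dropped; blank lines (sentence separators)
    are kept so that line numbers stay traceable to the input file.
    """
    docs: List[List[Tuple[int, str]]] = []
    current: List[Tuple[int, str]] = []

    def flush() -> None:
        # Only keep a group that contains at least one non-blank line.
        if any(text.strip() for _, text in current):
            docs.append(list(current))
        current.clear()

    for line_nr, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.split(" ", 1)[0] == DOCSTART:
            flush()
            continue
        current.append((line_nr, line))
    flush()
    return docs
```

Each returned group would then correspond to one output document, with the original line numbers preserved for later use.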

## Conversion Strategy

The following annotations are placed into the annotation set "Original markups":
* LOC, MISC, ORG, PER: the entity annotations from the input. These annotations have the single feature startLineNr, which gives the (1-based) line number in the original CoNLL input file where the entity started
* Token: one annotation is created for each input token. It contains the following features:
  * chunkBIO: the original BIO value of the column for chunks
  * lemma: the original value of the lemma column (German only)
  * lineNr: the line number (1-based) of that token in the original CoNLL input file as generated by the script described above
  * neBIO: the original BIO value of the NE column
  * pos: the original value of the POS column
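The mapping from input columns to these annotations can be illustrated with a short Python sketch. It is not the repository's converter: the names `Token`, `Entity` and `parse_sentence` are hypothetical, and the column order (word, POS, chunk, NE for English; word, lemma, POS, chunk, NE for German) is an assumption about the CoNLL 2003 files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str
    features: Dict[str, object] = field(default_factory=dict)


@dataclass
class Entity:
    label: str           # LOC, MISC, ORG or PER
    start_line_nr: int   # corresponds to the startLineNr feature
    words: List[str] = field(default_factory=list)


def parse_sentence(rows: List[Tuple[int, str]], german: bool = False):
    """Turn (line number, line) pairs of one sentence into tokens and entities."""
    tokens: List[Token] = []
    entities: List[Entity] = []
    open_entity = None
    for line_nr, line in rows:
        cols = line.split()
        if german:
            word, lemma, pos, chunk, ne = cols[:5]
        else:
            word, pos, chunk, ne = cols[:4]
            lemma = None
        feats = {"chunkBIO": chunk, "lineNr": line_nr, "neBIO": ne, "pos": pos}
        if lemma is not None:
            feats["lemma"] = lemma
        tokens.append(Token(word, feats))

        prefix, _, label = ne.partition("-")
        if ne == "O":
            open_entity = None
        elif prefix == "B" or open_entity is None or open_entity.label != label:
            # A B- tag, or an I- tag that does not continue the open entity,
            # starts a new entity at this line.
            open_entity = Entity(label, line_nr, [word])
            entities.append(open_entity)
        else:
            open_entity.words.append(word)
    return tokens, entities
```

Treating an `I-` tag without a preceding entity of the same type as the start of a new entity lets the same loop handle both the `I-`-only and the `B-`/`I-` tagging variants.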

The conversion algorithm separates all tokens by spaces, with the following exceptions:
* No space is inserted before the punctuation characters `!,-.:;?`.
* No space is inserted after the opening brackets `({[` or before the closing brackets `)}]`.
* No space is inserted before the token `'s`, but a space is inserted before a single `'` because we cannot know whether it is
  the genitive of a plural, a quotation mark, or something else.
* For the quote character `"`, a space is inserted before but not after every odd occurrence in a document, and after but not before
  every even occurrence.

All other tokens, including quote-like characters such as `'` and `` ` ``, or characters such as `$£#~%`, are separated by
spaces if they are separate tokens in the input file.
No space is added at the beginning or end of a sentence, or at the beginning or end of a document.
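The spacing rules above can be sketched in Python as follows. This is an illustration based only on the description in this section, not the repository's actual code. The function name `detokenize_document` is hypothetical, and it returns one string per sentence because the way sentences are joined is not specified here.

```python
from typing import List

# Single-character tokens that never get a space in front of them.
NO_SPACE_BEFORE = set("!,-.:;?") | set(")}]")
# Single-character tokens that never get a space after them.
NO_SPACE_AFTER = set("({[")


def detokenize_document(sentences: List[List[str]]) -> List[str]:
    """Rebuild the text of each sentence from its tokens.

    Double quotes are counted across the whole document: odd occurrences
    are treated as opening quotes, even occurrences as closing quotes.
    """
    quote_count = 0
    result = []
    for tokens in sentences:
        parts = []
        suppress_next = False  # previous token forbids a following space
        for i, tok in enumerate(tokens):
            space = i > 0 and not suppress_next
            suppress_next = False
            if tok in NO_SPACE_BEFORE or tok == "'s":
                space = False
            elif tok in NO_SPACE_AFTER:
                suppress_next = True
            elif tok == '"':
                quote_count += 1
                if quote_count % 2 == 1:
                    suppress_next = True  # opening quote: space before, none after
                else:
                    space = False         # closing quote: none before, space after
            if space:
                parts.append(" ")
            parts.append(tok)
        result.append("".join(parts))
    return result
```

For example, the tokens `He said , " Hello ( world ) ! "` would be joined as `He said, "Hello (world)!"`, while a lone `'` keeps a space on both sides.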