https://github.com/gatenlp/corpusconversion-bnc

Tool to convert the British National Corpus to GATE format

Host: GitHub
URL: https://github.com/gatenlp/corpusconversion-bnc
Owner: GateNLP
Created: 2017-02-03T14:24:30.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2017-02-03T14:27:06.000Z (over 9 years ago)
Last Synced: 2025-03-09T19:55:27.453Z (over 1 year ago)
Language: Java
Size: 6.84 KB
Stars: 2
Watchers: 17
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Conversion of the original British National Corpus documents (XML edition) to GATE

The files in this repository can be used to convert the original XML files from the BNC
(see http://ota.ox.ac.uk/desc/2554) to usable GATE documents.

NOTE: this depends on the following tools and software, which are not included here:
* the `runPipeline.sh` command from https://github.com/johann-petrak/gatetool-runpipeline; the bin directory of
that tool must be on the binary path (`PATH`)
* the Java plugin, which is added as a submodule (https://github.com/johann-petrak/gateplugin-Java)
* GATE (version 8.x)
* Java SDK
* Ant
## Preparation

* Make sure the Java submodule is actually fetched and compiled:
  * `git submodule init`
  * `git submodule update`
  * `git submodule foreach ant`
* Make sure the British National Corpus is available in some directory in unzipped form; the directory is usually called "2554"
* The conversion script will copy the BNC corpus into a local temporary directory, so make sure there is enough disk space
on the disk which contains the current directory (about 4.4G needed)
* The GATE documents will require about 114G of disk space
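
The disk-space figures above can be checked before starting a long conversion run. The following is a minimal sketch using only the JDK, not part of this repository: the class name `DiskSpaceCheck`, its helper methods, and the reading of "G" as GiB are assumptions made for illustration.

```java
import java.io.File;

final class DiskSpaceCheck {
    private static final long GIB = 1024L * 1024L * 1024L;

    /** Converts a size in GiB to bytes, rounding up. */
    static long gibToBytes(double gib) {
        return (long) Math.ceil(gib * GIB);
    }

    /** Walks up from the given path to the nearest path that already exists. */
    static File nearestExisting(File path) {
        File current = path.getAbsoluteFile();
        while (current != null && !current.exists()) {
            current = current.getParentFile();
        }
        return current;
    }

    /** True if the file system holding the path has at least requiredBytes usable. */
    static boolean hasRoom(File path, long requiredBytes) {
        File existing = nearestExisting(path);
        return existing != null && existing.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        // The temporary copy of the corpus goes below the current directory.
        File workDir = new File(".");
        // Hypothetical default; pass the real output directory as first argument.
        File outputDir = new File(args.length > 0 ? args[0] : "outputdir");

        System.out.println("temporary BNC copy (about 4.4G): "
                + (hasRoom(workDir, gibToBytes(4.4)) ? "ok" : "not enough space"));
        System.out.println("GATE documents (about 114G): "
                + (hasRoom(outputDir, gibToBytes(114)) ? "ok" : "not enough space"));
    }
}
```

The two checks are independent, so if the current directory and the output directory are on the same disk, both amounts have to fit at the same time.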

## Run the conversion

Just run the convert.sh script and pass the location of the BNC corpus and the desired output directory as arguments:

`./convert.sh bnccorpusdir outputdir`

## Overview of how the conversion is done:
* Load original files into GATE in XML format, but set the option
"add space on markup unpack if needed" to false
* Now, in the "Original markups" set we get all the XML fields as annotations.
* Remember the following fields for document features (if nothing else is specified, the document text of the field is used):
  * availability
  * bibl
  * bncDoc.xml:id gets converted to id
  * catRef.targets
  * change: the doc text is the documentation of the change; the features date and who give additional info. Should this get
    converted to a single list? Should get converted to change.\.\ and change.\
  * classCode: text and feature scheme
  * creation.date
  * date
  * distributor
  * edition
  * extent
  * imprint (text and feature n)
  * keywords
  * profileDesc
  * pubPlace
  * publicationStmt
  * publisher
  * respStmt
  * sourceDesc
  * tagUsage: empty span annotations with features gi and occurs. Should get converted to features
    tagUsage.\=\
  * titleStmt
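
The field-to-feature mapping listed above can be illustrated on a tiny header. This is a sketch only: it uses the JDK's DOM parser rather than GATE, it is not the code used by `convert.sh`, and the class `BncHeaderFeatures`, the hand-written `SAMPLE` fragment, and the feature key `classCode.scheme` are assumptions made for the example. Only a few of the listed fields are handled.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BncHeaderFeatures {
    // Hand-written fragment for illustration; not copied from the corpus.
    static final String SAMPLE =
        "<bncDoc xml:id=\"X00\"><teiHeader>"
        + "<fileDesc><extent>7 tokens</extent>"
        + "<publicationStmt><distributor>Example Distributor</distributor>"
        + "</publicationStmt></fileDesc>"
        + "<profileDesc><textClass><catRef targets=\"CAT1 CAT2\"/>"
        + "<classCode scheme=\"EX\">X1</classCode></textClass></profileDesc>"
        + "</teiHeader><wtext/></bncDoc>";

    // Fields whose element text becomes the feature value directly.
    private static final String[] TEXT_FIELDS =
        {"distributor", "edition", "extent", "pubPlace", "publisher"};

    /** Collects a few header fields into a feature map (key names partly assumed). */
    static Map<String, String> extract(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        Map<String, String> features = new LinkedHashMap<>();
        // bncDoc.xml:id gets converted to id
        features.put("id", doc.getDocumentElement().getAttribute("xml:id"));
        for (String name : TEXT_FIELDS) {
            Element field = first(doc, name);
            if (field != null) {
                features.put(name, field.getTextContent().trim());
            }
        }
        // catRef.targets comes from an attribute, not from the element text.
        Element catRef = first(doc, "catRef");
        if (catRef != null) {
            features.put("catRef.targets", catRef.getAttribute("targets"));
        }
        // classCode keeps both its text and its scheme feature.
        Element classCode = first(doc, "classCode");
        if (classCode != null) {
            features.put("classCode", classCode.getTextContent().trim());
            features.put("classCode.scheme", classCode.getAttribute("scheme"));
        }
        return features;
    }

    private static Element first(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? (Element) nodes.item(0) : null;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```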

Annotations relevant for the actual text:
* wtext: covers the part we are interested in. Anything before it should eventually get removed
* w: words; features: c5, hw (lemma), pos
* c: punctuation and quotes, needed in addition to "w"; features: c5
* mw: overlaps w annotations for multi-word units like "up to" and has feature c5=??
* s: sentences; feature: n
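
To make the annotation types above concrete, the sketch below reads `s`, `w`, `c` and `mw` elements from a small fragment and prints the features of each token. It uses the JDK's DOM parser on the raw XML, not GATE annotations, and is not taken from this repository; the class `BncTokens`, the output format, and the hand-written `SAMPLE` (including its tag values) are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BncTokens {
    // Hand-written fragment for illustration; not copied from the corpus.
    static final String SAMPLE =
        "<bncDoc xml:id=\"X00\"><teiHeader/><wtext><div><p><s n=\"1\">"
        + "<w c5=\"AT0\" hw=\"the\" pos=\"ART\">The </w>"
        + "<w c5=\"NN1\" hw=\"cat\" pos=\"SUBST\">cat </w>"
        + "<w c5=\"VVD\" hw=\"walk\" pos=\"VERB\">walked </w>"
        + "<mw c5=\"PRP\"><w c5=\"AVP\" hw=\"up\" pos=\"ADV\">up </w>"
        + "<w c5=\"PRP\" hw=\"to\" pos=\"PREP\">to </w></mw>"
        + "<w c5=\"PNP\" hw=\"it\" pos=\"PRON\">it</w>"
        + "<c c5=\"PUN\">.</c>"
        + "</s></p></div></wtext></bncDoc>";

    /** Lists every w and c token inside wtext, in document order. */
    static List<String> describeTokens(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        NodeList wtexts = doc.getElementsByTagName("wtext");
        if (wtexts.getLength() == 0) {
            throw new IllegalArgumentException("no wtext element found");
        }
        List<String> tokens = new ArrayList<>();
        // Only look below wtext; the header before it is ignored.
        NodeList sentences = ((Element) wtexts.item(0)).getElementsByTagName("s");
        for (int i = 0; i < sentences.getLength(); i++) {
            Element sentence = (Element) sentences.item(i);
            collect(sentence, sentence.getAttribute("n"), tokens);
        }
        return tokens;
    }

    private static void collect(Node parent, String sentenceNo, List<String> tokens) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (!(child instanceof Element)) {
                continue;
            }
            Element e = (Element) child;
            String text = e.getTextContent().trim();
            if (e.getTagName().equals("w")) {
                tokens.add("w[" + text + "] c5=" + e.getAttribute("c5")
                        + " hw=" + e.getAttribute("hw")
                        + " pos=" + e.getAttribute("pos") + " s=" + sentenceNo);
            } else if (e.getTagName().equals("c")) {
                tokens.add("c[" + text + "] c5=" + e.getAttribute("c5") + " s=" + sentenceNo);
            } else {
                // Wrappers such as mw contain further w elements.
                collect(e, sentenceNo, tokens);
            }
        }
    }

    public static void main(String[] args) {
        describeTokens(SAMPLE).forEach(System.out::println);
    }
}
```

In the XML source, `mw` wraps its `w` elements, which is why the sketch descends into it; after loading into GATE the same relationship shows up as an `mw` annotation overlapping the `w` annotations.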
