https://github.com/hltcoe/concrete-agiga

Tools to map between concrete and agiga representations

concrete-agiga
==============

concrete-agiga is a Java library that maps Annotated Gigaword documents to Concrete.

Maven dependency
---
```xml
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-agiga</artifactId>
  <version>4.4.0</version>
</dependency>
```

## TLDR / Quick start ##
```sh
mvn clean compile assembly:single
java -cp target/concrete-agiga-4.4.0-jar-with-dependencies.jar \
edu.jhu.hlt.concrete.agiga.AgigaConverter \
path/to/output/dir \
drop-annotations \
path/to/xml/or/xml/gz/file
```

Arguments:
* `path/to/output/dir` - where annotated files will end up
* `drop-annotations` - `boolean` - whether to drop the annotations present in the `.xml` files
    * for RAW files, set to `true`; for ANNOTATED files, set to `false`
* `path/to/xml/or/xml/gz/file` - path to one or more `.xml` or `.xml.gz` files to process
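
If you want to drive the converter from another Java program, the argument order above can be assembled into a command line. The following is only a sketch: `ConverterCommand` and `build` are hypothetical names that are not part of concrete-agiga, and the jar and input paths are placeholders taken from the quick start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of concrete-agiga) that assembles the quick-start command. */
final class ConverterCommand {

    /** Argument order: output directory, drop-annotations flag, then one or more input files. */
    static List<String> build(String jarPath, String outputDir,
                              boolean dropAnnotations, List<String> inputFiles) {
        if (inputFiles.isEmpty()) {
            throw new IllegalArgumentException("at least one input file is required");
        }
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", jarPath,
                "edu.jhu.hlt.concrete.agiga.AgigaConverter",
                outputDir,
                Boolean.toString(dropAnnotations)));
        cmd.addAll(inputFiles);
        return cmd;
    }

    public static void main(String[] args) {
        // ANNOTATED input, so annotations are kept (drop-annotations = false).
        List<String> cmd = build(
                "target/concrete-agiga-4.4.0-jar-with-dependencies.jar",
                "path/to/output/dir",
                false,
                Arrays.asList("path/to/xml/or/xml/gz/file"));
        System.out.println(String.join(" ", cmd));
        // To actually launch it: new ProcessBuilder(cmd).inheritIO().start()
    }
}
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.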

Requirements:
* `java >= 1.8`
* `mvn >= 3.0.4`

## Notes ##
One implementation detail to be aware of:
The [anno-pipeline](https://github.com/hltcoe/anno-pipeline) outputs tokens
as strings, without character offsets, so the original document cannot be
recreated exactly. Instead, the text is rebuilt with one space between tokens
and a newline after every sentence. This only affects you if you rely on
character distances and use Concrete's TextSpan
(e.g. "Mike's house" => Token("Mike") Token("'s") Token("house") => "Mike 's house").
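
To see how the rebuilt text shifts character offsets, here is a minimal sketch of the rule described above (one space between tokens, a newline after every sentence). It is not the library's actual code: `TextRebuilder`, `Span`, and `Result` are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of the reconstruction rule; not concrete-agiga's implementation. */
final class TextRebuilder {

    /** Half-open character span [start, end) into the rebuilt text. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Rebuilt text plus one span per token, in document order. */
    static final class Result {
        final String text;
        final List<Span> spans;
        Result(String text, List<Span> spans) { this.text = text; this.spans = spans; }
    }

    static Result rebuild(List<List<String>> sentences) {
        StringBuilder sb = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                if (i > 0) {
                    sb.append(' ');              // one space between tokens
                }
                int start = sb.length();
                sb.append(sentence.get(i));
                spans.add(new Span(start, sb.length()));
            }
            sb.append('\n');                     // newline after every sentence
        }
        return new Result(sb.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = rebuild(Arrays.asList(Arrays.asList("Mike", "'s", "house")));
        System.out.print(r.text);                // Mike 's house
        for (Span s : r.spans) {
            System.out.println(s.start + "-" + s.end);
        }
    }
}
```

In the original string "Mike's house", the token `'s` starts at offset 4; in the rebuilt text it starts at offset 5, and every later token is shifted as well. Offsets computed against the source text therefore cannot be reused against the rebuilt text.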