https://github.com/hltcoe/concrete-agiga
Tools to map between concrete and agiga representations
- Host: GitHub
- URL: https://github.com/hltcoe/concrete-agiga
- Owner: hltcoe
- Created: 2013-05-29T15:36:34.000Z (about 13 years ago)
- Default Branch: master
- Last Pushed: 2015-03-11T05:22:30.000Z (about 11 years ago)
- Last Synced: 2025-07-22T10:38:25.039Z (11 months ago)
- Language: Java
- Size: 791 KB
- Stars: 1
- Watchers: 15
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
concrete-agiga
==============
concrete-agiga is a Java library that maps Annotated Gigaword documents to Concrete.
Maven dependency
---
```xml
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-agiga</artifactId>
  <version>4.4.0</version>
</dependency>
```
## TLDR / Quick start ##
```sh
mvn clean compile assembly:single
java -cp target/concrete-agiga-4.4.0-jar-with-dependencies.jar \
edu.jhu.hlt.concrete.agiga.AgigaConverter \
path/to/output/dir \
drop-annotations \
path/to/xml/or/xml/gz/file
```
Arguments:
* `path/to/output/dir` - directory where the converted files are written
* `drop-annotations` - `boolean` - whether to drop the annotations present in the `.xml` files
  * for RAW files, set to `true`; for ANNOTATED files, set to `false`
* `path/to/xml/or/xml/gz/file` - path to one or more `.xml` or `.xml.gz` files to process
Requirements:
* `java >= 1.8`
* `mvn >= 3.0.4`
## Notes ##
One implementation detail to be aware of:
The [anno-pipeline](https://github.com/hltcoe/anno-pipeline) outputs tokens
that contain strings rather than character offsets, so we are not able to
perfectly recreate the original document. The rule this library uses is to
insert one space between tokens and a newline after every sentence. This will
only affect you if you rely on character distances and you use Concrete's TextSpan
(e.g. "Mike's house" => Token("Mike") Token("'s") Token("house") => "Mike 's house").
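
The effect of this rule on character offsets can be illustrated with a minimal, standalone sketch. It is not the code used by concrete-agiga: the class `TextReconstructionSketch`, its `Span` and `Result` types, and the half-open `[start, end)` offset convention are all hypothetical and chosen only for illustration. The sketch joins tokens with one space, ends each sentence with a newline, and records where each token lands in the rebuilt text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical names, not part of concrete-agiga. */
public class TextReconstructionSketch {

    /** Half-open character span [start, end) into the rebuilt text (assumed convention). */
    public static final class Span {
        public final int start;
        public final int end;

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Rebuilt text plus one span per token, in document order. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        public Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Joins tokens with a single space and ends every sentence with a newline. */
    public static Result rebuild(List<List<String>> sentences) {
        StringBuilder sb = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                if (i > 0) {
                    sb.append(' '); // one space between tokens
                }
                int start = sb.length();
                sb.append(sentence.get(i));
                spans.add(new Span(start, sb.length()));
            }
            sb.append('\n'); // newline after every sentence
        }
        return new Result(sb.toString(), spans);
    }

    public static void main(String[] args) {
        List<List<String>> doc = Arrays.asList(Arrays.asList("Mike", "'s", "house"));
        Result r = rebuild(doc);
        System.out.print(r.text);
        for (Span s : r.spans) {
            System.out.println(s.start + "-" + s.end);
        }
    }
}
```

For the example above, the token `'s` covers offsets 4 to 6 in the original string "Mike's house" but 5 to 7 in the rebuilt text, so any offset computed against the original document shifts by the number of inserted spaces that precede it.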