https://github.com/hltcoe/annotated-nyt
Java wrappers and utilities for reading the Annotated NYT corpus
- Host: GitHub
- URL: https://github.com/hltcoe/annotated-nyt
- Owner: hltcoe
- License: apache-2.0
- Created: 2015-04-13T22:08:16.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2022-01-04T16:31:07.000Z (over 4 years ago)
- Last Synced: 2025-07-13T18:13:58.626Z (11 months ago)
- Language: Java
- Homepage:
- Size: 68.4 KB
- Stars: 1
- Watchers: 8
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# annotated-nyt
Utilities for reading the [Annotated NYT corpus](https://catalog.ldc.upenn.edu/LDC2008T19).

[Javadoc](http://www.javadoc.io/doc/edu.jhu.hlt/annotated-nyt/)

Latest Maven dependency
---
```xml
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>annotated-nyt</artifactId>
  <version>1.1.5</version>
</dependency>
```
## Quick start
Create a `NYTCorpusDocumentParser` object:
```java
NYTCorpusDocumentParser parser = new NYTCorpusDocumentParser();
```
Read a single `.xml` document from the annotated NYT corpus:
```java
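import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path p = Paths.get("/your/path/.xml");
byte[] bytes = Files.readAllBytes(p);  // throws IOException if the file cannot be read
NYTCorpusDocument ncd = parser.fromByteArray(bytes, false);
AnnotatedNYTDocument and = new AnnotatedNYTDocument(ncd);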
```
## API
All fields in `AnnotatedNYTDocument` objects are guaranteed to
be non-`null`.
Many fields in the corpus can be empty or `null` in the
documents themselves. The wrapper object, `AnnotatedNYTDocument`,
represents these as `Optional` fields.
Convenience methods expose naturally list-based items as lists (e.g.,
the body as a `List` of paragraphs). When such a section is missing
from the source document, the API returns an empty `List` object.
These lists are never `null`.
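The null-handling contract described above can be illustrated with a minimal, self-contained sketch. The classes `RawDoc` and `SafeDoc` and their methods are hypothetical stand-ins invented for this example; they are not the library's classes, and the real `AnnotatedNYTDocument` accessors may be named and typed differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for illustration only: NOT the library's classes.
final class NullSafeWrapperSketch {

    /** Raw parsed record whose fields may be null, as in the corpus documents. */
    static final class RawDoc {
        final String headline; // may be null
        final String body;     // may be null; paragraphs separated by newlines

        RawDoc(String headline, String body) {
            this.headline = headline;
            this.body = body;
        }
    }

    /** Wrapper that never hands a null back to the caller. */
    static final class SafeDoc {
        private final RawDoc raw;

        SafeDoc(RawDoc raw) {
            this.raw = raw;
        }

        /** A nullable scalar field is exposed as an Optional. */
        Optional<String> getHeadline() {
            return Optional.ofNullable(raw.headline);
        }

        /** A list-like field: a missing body yields an empty list, not null. */
        List<String> getBodyParagraphs() {
            if (raw.body == null) {
                return Collections.emptyList();
            }
            return Arrays.asList(raw.body.split("\n"));
        }
    }

    public static void main(String[] args) {
        SafeDoc empty = new SafeDoc(new RawDoc(null, null));
        System.out.println(empty.getHeadline().orElse("(no headline)"));
        System.out.println(empty.getBodyParagraphs().size());

        SafeDoc full = new SafeDoc(new RawDoc("Headline", "First.\nSecond."));
        System.out.println(full.getHeadline().orElse("(no headline)"));
        System.out.println(full.getBodyParagraphs());
    }
}
```

With this shape, callers can chain `Optional` operations or iterate over a list without a preceding `null` check.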
## Running the integration test
The integration test can be executed with the following command:
```sh
mvn clean verify -Pitest -DanytDataPath=/path/to/your/LDC/corpus/data/dir
```
The `anytDataPath` property should point to the `data` directory
of the extracted ANYT corpus. This directory contains folders named
by number, one for each year of annotated NYT data.
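Before running the integration test, it may help to confirm that the path really points at a directory with numerically named year folders. The following is a small sketch using only the Java standard library; the class `YearDirLister` and the method `listYearDirs` are hypothetical helpers written for this example and are not part of this project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only: not part of annotated-nyt.
final class YearDirLister {

    /** Returns the names of numerically named subdirectories, sorted. */
    static List<String> listYearDirs(Path dataDir) throws IOException {
        // Files.list holds an open directory handle, so close it when done.
        try (Stream<Path> children = Files.list(dataDir)) {
            return children
                .filter(Files::isDirectory)
                .map(p -> p.getFileName().toString())
                .filter(name -> name.matches("\\d+"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the same directory you would give to -DanytDataPath.
        Path dataDir = Paths.get(args[0]);
        System.out.println(listYearDirs(dataDir));
    }
}
```

An empty result suggests the path points somewhere other than the corpus `data` directory.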