https://github.com/quadrama/dramanlp
UIMA NLP components for dramatic texts
https://github.com/quadrama/dramanlp
drama dramatic-texts german natural-language-processing nlp uima-components
Last synced: 3 months ago
JSON representation
UIMA NLP components for dramatic texts
- Host: GitHub
- URL: https://github.com/quadrama/dramanlp
- Owner: quadrama
- License: apache-2.0
- Created: 2016-05-03T16:30:51.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2023-06-06T15:17:02.000Z (about 3 years ago)
- Last Synced: 2025-07-26T04:44:41.153Z (11 months ago)
- Topics: drama, dramatic-texts, german, natural-language-processing, nlp, uima-components
- Language: Java
- Size: 114 MB
- Stars: 9
- Watchers: 3
- Forks: 3
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[](https://github.com/quadrama/DramaNLP/releases/latest)
[](https://travis-ci.org/quadrama/DramaNLP)
[](https://www.zenodo.org/badge/latestdoi/57984264)
[](https://github.com/quadrama/DramaNLP/blob/master/LICENSE)
# DramaNLP
This repository contains a number of UIMA components to process dramatic texts, as well as an executable pipeline. We follow general design ideas implemented in [DKPro Core](https://dkpro.github.io/dkpro-core/). The full pipeline reads in files in several TEI/XML dialects (see below), and applies the most important NLP tools on them, while keeping the structural annotation of the plays intact (and, if necessary, processing different text layers separately).
## Compiling from source
1. Clone the repository: `git clone https://github.com/quadrama/DramaNLP.git`
1. Enter the directory: `cd DramaNLP`
- If necessary, switch to a branch `git checkout develop/1.0`
1. Download dependencies, compile everything and install it locally: `mvn compile install`
This produces a lot of output, but at the end, you should see something like `BUILD SUCCESS`
1. To compile a runnable binary, enter the directory: `cd de.unistuttgart.ims.drama.main` and run `mvn package`. This creates a file called `drama.Main.jar` in the directory `target/assembly/`. This file contains the code and all its dependencies.
## Running entire pipeline
As an example, we'll work on the data from the GerDraCor collection (which is based on TextGrid). Download the files from [GitHub](https://github.com/quadrama/gerdracor) and store the XML files in a directory. We will call the directory `$TEIDIR` in the following examples. The directory `$OUTDIR` is used to store the output of the pipeline. You'll need the file `drama.Main.jar`.
Enter the following command in the command line interface:
`java -cp target/assembly/drama.Main.jar de.unistuttgart.ims.drama.main.TEI2XMI --input $TEIDIR --output $OUTDIR/xmi --csvOutput $OUTDIR/csv --conllOutput $OUTDIR/conll --skipSpeakerIdentifier --corpus GERDRACOR --collectionId "gdc" --doCleanup`
After running, the directory `$OUTDIR` contains three sub directories, `xmi`, `csv` and `conll`, which are different file formats for the plays.
## TEI/XML dialects
This package supports the following drama corpora
- TextGrid (German)
- GerDraCor (German)
- theatre classique (French)