https://github.com/bcdh/standoffconverter4dariah-campus
This is a work in progress. Once complete, this course will be published on DARIAH-Campus.
- Host: GitHub
- URL: https://github.com/bcdh/standoffconverter4dariah-campus
- Owner: BCDH
- License: cc0-1.0
- Created: 2024-06-09T07:10:47.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-07-03T13:56:02.000Z (7 months ago)
- Last Synced: 2024-11-22T21:48:58.940Z (2 months ago)
- Language: Jupyter Notebook
- Size: 36.1 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Standoff Converter: The Missing Link between TEI and NLP
The goal of the course is to help humanities researchers apply NLP (natural language processing) tools and methods to TEI-encoded texts, even though such tools are usually not built to work with XML natively.
## Learning outcomes
Upon completion of this course, students will be able to:
- understand the usefulness of adding linguistic annotation to TEI-encoded texts
- conceptualize the differences, advantages and disadvantages of _inline_ and _standoff_ modes of annotation
- use Standoff Converter to convert TEI-encoded texts from _inline_ to _standoff_ and vice versa
- use the open-source NLP library spaCy to enrich TEI-encoded texts with information on lemmas, POS (part of speech) and NER (named entity recognition)
## TEI and NLP: never the twain shall meet?
TEI and the philological tradition of manual annotation. Digital editions vs. corpora. Yet digital editions can benefit from NLP annotation: better search and retrieval, indexing, and pattern recognition.
But how do we do it? The problem is one of scale. We can't do linguistic annotation manually: it would take forever. And applying NLP tools is not easy, because they're usually not made to work natively with XML.
There is a way forward: TEI is flexible.
## Inline and standoff annotation
Explain the differences, advantages and disadvantages of storing annotation in the text or separately from it.
Concrete examples.
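To make the contrast concrete, here is a minimal sketch in Python that uses only the standard library, not Standoff Converter's own API. The sample sentence, the `to_standoff` function and the layout of the annotation records are all invented for illustration, and the TEI namespace is left out for brevity. It separates an inline-annotated paragraph into plain text plus a list of annotations that point into that text by character offset.

```python
import xml.etree.ElementTree as ET


def to_standoff(xml_string):
    """Split inline XML into plain text plus a list of standoff annotations.

    Hypothetical helper for illustration; not part of Standoff Converter.
    """
    root = ET.fromstring(xml_string)
    chunks = []       # pieces of plain text, in document order
    annotations = []  # one record per element, with character offsets
    position = 0      # current offset into the plain text

    def walk(element):
        nonlocal position
        record = {
            "tag": element.tag,
            "attrib": dict(element.attrib),
            "begin": position,
            "end": None,
        }
        annotations.append(record)
        if element.text:
            chunks.append(element.text)
            position += len(element.text)
        for child in element:
            walk(child)
            # Text after a child's closing tag belongs to the parent.
            if child.tail:
                chunks.append(child.tail)
                position += len(child.tail)
        record["end"] = position

    walk(root)
    return "".join(chunks), annotations


# Invented sample: annotation stored inline, inside the text.
inline = (
    '<p>Dear <persName ref="#anna">Anna</persName>, '
    "greetings from <placeName>Belgrade</placeName>.</p>"
)

# Standoff view: the text on its own, the annotation next to it.
plain, annotations = to_standoff(inline)
print(plain)
for a in annotations:
    print(a["tag"], a["begin"], a["end"], repr(plain[a["begin"]:a["end"]]))
```

In the standoff view the text is an ordinary string that any NLP tool can read, while the markup survives as offsets that can later be used to put it back.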
## What is Standoff Converter?
A tool which lets you convert TEI datasets from inline to standoff and vice versa.
## Applying NLP tools to TEI-encoded texts
In this section, we'll take you step by step through the process of adding linguistic annotations to TEI-encoded texts.
### Choose the dataset you want to work with
We provide a sample dataset.
A paragraph or so about the letter(s) we chose.
### Set clear annotation goals
What is the end goal? What do we want the final TEI to look like? What attributes and elements are we going to use to annotate lemmas, POS, NER...
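One possible answer, sketched in Python with the standard library: tokens become TEI `<w>` elements carrying `lemma` and `pos` attributes, punctuation goes into `<pc>`, and a named entity is wrapped in `<persName>`. This is only one candidate encoding, not a decision the course has made. The sentence, the `make_w` helper and the Universal Dependencies tag values are assumptions chosen for illustration, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET


def make_w(parent, form, lemma, pos, tail=""):
    """Append a <w> element for one token (hypothetical helper)."""
    w = ET.SubElement(parent, "w", {"lemma": lemma, "pos": pos})
    w.text = form
    w.tail = tail  # whitespace that follows the token, if any
    return w


# Invented sentence: "Anna wrote letters."
s = ET.Element("s")

# NER layer: the person name wraps its token.
person = ET.SubElement(s, "persName")
make_w(person, "Anna", "Anna", "PROPN")
person.tail = " "

# Lemma and POS layers: attributes on each <w>.
make_w(s, "wrote", "write", "VERB", tail=" ")
make_w(s, "letters", "letter", "NOUN")

# Punctuation gets its own element.
pc = ET.SubElement(s, "pc")
pc.text = "."

target = ET.tostring(s, encoding="unicode")
print(target)
```

Writing out a target like this before running any tools makes it easier to check later whether the enriched output actually matches the goal.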
### Convert the dataset to standoff TEI
### Apply NLP tools to standoff TEI
### Convert enriched standoff TEI back to inline
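The idea behind this step can be sketched in plain Python, without Standoff Converter's own API. The sketch assumes a deliberately simplified case: flat, non-overlapping annotations over a single paragraph, with hand-written records standing in for what an NLP tool such as spaCy would produce. The `to_inline` function, the record layout and the sample sentence are invented for illustration. Real TEI documents are nested, which is exactly the part a dedicated tool has to handle.

```python
import xml.etree.ElementTree as ET


def to_inline(text, annotations, wrapper="p"):
    """Wrap non-overlapping character spans of `text` in child elements.

    Hypothetical helper for illustration; handles flat annotations only.
    """
    root = ET.Element(wrapper)
    previous = None  # last child element added so far
    cursor = 0       # offset up to which the text has been consumed
    for ann in sorted(annotations, key=lambda a: a["begin"]):
        if ann["begin"] < cursor:
            raise ValueError("overlapping annotations are not supported here")
        # Unannotated text between two spans goes into text or tail.
        gap = text[cursor:ann["begin"]]
        if previous is None:
            root.text = (root.text or "") + gap
        else:
            previous.tail = (previous.tail or "") + gap
        child = ET.SubElement(root, ann["tag"], ann.get("attrib", {}))
        child.text = text[ann["begin"]:ann["end"]]
        previous = child
        cursor = ann["end"]
    # Whatever remains after the last span.
    rest = text[cursor:]
    if previous is None:
        root.text = (root.text or "") + rest
    else:
        previous.tail = rest
    return root


# Invented plain text and standoff records (offsets are end-exclusive).
text = "She wrote letters."
enriched = [
    {"tag": "w", "begin": 0, "end": 3, "attrib": {"lemma": "she", "pos": "PRON"}},
    {"tag": "w", "begin": 4, "end": 9, "attrib": {"lemma": "write", "pos": "VERB"}},
    {"tag": "w", "begin": 10, "end": 17, "attrib": {"lemma": "letter", "pos": "NOUN"}},
    {"tag": "pc", "begin": 17, "end": 18, "attrib": {}},
]

paragraph = to_inline(text, enriched)
inline_xml = ET.tostring(paragraph, encoding="unicode")
print(inline_xml)
```

A useful sanity check after any such conversion is that stripping the markup from the result gives back the original text unchanged.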
## Conclusions