https://github.com/slub/mets-mods2tei
Convert bibliographic meta data in MODS format to TEI headers
- Host: GitHub
- URL: https://github.com/slub/mets-mods2tei
- Owner: slub
- License: apache-2.0
- Created: 2019-05-08T08:32:51.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2025-02-12T10:41:02.000Z (8 months ago)
- Last Synced: 2025-03-25T18:41:01.111Z (7 months ago)
- Topics: conversion, mets, mods, tei
- Language: Python
- Size: 1.03 MB
- Stars: 8
- Watchers: 5
- Forks: 6
- Open Issues: 18
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# mets-mods2tei

Convert bibliographic meta data in METS/MODS format to TEI headers and optionally serialize linked ALTO-encoded OCR to TEI text.

## Background
[MODS](http://www.loc.gov/standards/mods/) is the de-facto standard for encoding bibliographic
meta data in libraries. It is usually embedded as a separate section in
[METS](http://www.loc.gov/standards/mets/) XML files. Physical and logical structure of a document
are expressed in terms of structural mappings (`structMap` elements).

[TEI](https://tei-c.org/) is the de-facto standard for representing digital text for research
purposes. It usually includes detailed bibliographic meta data in its
[header](https://tei-c.org/release/doc/tei-p5-doc/de/html/ref-teiHeader.html).

Since these standards allow considerable freedom in how a document is encoded, the conversion uses
well-defined subsets. For MODS, this is the
[*MODS Anwendungsprofil für digitalisierte Medien*](https://dfg-viewer.de/fileadmin/groups/dfgviewer/MODS-Anwendungsprofil_2.3.1.pdf).
For METS, the [METS Anwendungsprofil für digitalisierte Medien 2.1](https://www.zvdd.de/fileadmin/AGSDD-Redaktion/METS_Anwendungsprofil_2.1.pdf) is consulted.
For the TEI Header, the conversion is roughly based on the [*DTA base format*](https://github.com/deutschestextarchiv/dtabf).

`mets-mods2tei` is developed at the [Saxon State and University Library in Dresden](https://www.slub-dresden.de).
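
To make the layout described in this section concrete, the following minimal sketch reads a MODS title and the structural mappings from a tiny, hand-written METS document using only the Python standard library. The sample XML, the `summarize` function and all IDs are illustrative assumptions; this is not how `mets-mods2tei` itself parses its input.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the METS and MODS schemas
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Hand-written sample (an assumption for illustration, not real library data)
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                     xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="DMDLOG_0001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example title</mods:title></mods:titleInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph" DMDID="DMDLOG_0001"/>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page" ORDER="1"/>
      <mets:div ID="PHYS_0002" TYPE="page" ORDER="2"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def summarize(mets_xml):
    """Return (title, structMap types, physical page IDs) of a METS string."""
    root = ET.fromstring(mets_xml)
    # The MODS record is wrapped inside a METS descriptive metadata section
    title = root.findtext(".//mods:mods/mods:titleInfo/mods:title", namespaces=NS)
    # Logical and physical structure live in separate structMap elements
    struct_types = [sm.get("TYPE") for sm in root.findall("mets:structMap", NS)]
    physical = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    pages = [
        div.get("ID")
        for div in physical.iter("{http://www.loc.gov/METS/}div")
        if div.get("TYPE") == "page"
    ]
    return title, struct_types, pages


print(summarize(SAMPLE_METS))
```

Real METS files that follow the application profiles linked above carry far more detail (file groups, structural links, several metadata sections), so this sketch only shows where the pieces sit relative to each other.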
## Installation
`mets-mods2tei` is implemented in Python 3. In the following, we assume a working Python 3
installation (tested with versions 3.5, 3.6 and 3.7).

### Setup Python

Using [virtual environments](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) is highly recommended,
although not strictly necessary for installing `mets-mods2tei`.

To create a virtual environment in a subdirectory of your choice (e.g. `env`), run

    python3 -m venv env

(once) and then activate it (each time you open the shell) via

    . env/bin/activate

Depending on the age of the packages provided by your base system,
you might have to update pip first:

    pip install -U pip setuptools

### Get Python package
`mets-mods2tei` can be installed via `pip3` directly.
You can install either the prebuilt distribution from PyPI or
the repository sources:

#### From PyPI

If you have an active virtual environment, do

    pip install mets-mods2tei

Otherwise, try

    pip3 install --user mets-mods2tei

#### From source
Get the repository:

    git clone https://github.com/slub/mets-mods2tei.git
    cd mets-mods2tei

If you have an active virtual environment, do

    pip install .

Otherwise, try

    pip3 install --user .

## Testing
`mets-mods2tei` uses `pytest`-based testing.
To install the prerequisites for testing (in your venv), do

    pip install -r requirements-test.txt

(once) and then run the tests via:

    pytest

## Code coverage

Determine code coverage by running

    make coverage

## Usage
### mm2tei
Installing `mets-mods2tei` makes the command-line tool `mm2tei` available:

    mm2tei --help

```
Usage: mm2tei [OPTIONS] METS

  METS: File containing or URL pointing to the METS/MODS XML to be converted

  Parse given METS and its meta-data, and convert it to TEI.

  If `--ocr` is given, then also read the ALTO full-text files from the
  fileGrp in `--text-group`, and convert page contents accordingly (in
  physical order).

  Decorate page boundaries with image and page numbers. Moreover, if `--add-
  refs` contains `page`, then reference the corresponding base image files (by
  file name) from `--img-group`. Likewise, if `--add-refs` contains `line`,
  then reference the corresponding textline segments (by XML ID) from `--text-
  group`.

  Output XML to `--output (use '-' for stdout), log to stderr.`

Options:
  -O, --output FILENAME           File path to write TEI output to
  -o, --ocr                       Serialize OCR into resulting TEI
  -T, --text-group TEXT           File group which contains the full-text
  -I, --img-group TEXT            File group which contains the images
  -r, --add-refs [page|line]
  -l, --log-level [DEBUG|INFO|WARN|ERROR|OFF]
  -h, --help                      Show this message and exit.
```

It reads METS XML via URL or file argument and prints the resulting TEI,
including the extracted information from the MODS part of the METS.

Example:

    mm2tei -O tei.xml "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263"

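
When `mm2tei` is called from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch under stated assumptions: `build_mm2tei_command` is a hypothetical helper (not part of `mets-mods2tei`) that only uses long options listed in the help text above, and the file group name `FULLTEXT` is merely an example value.

```python
import shlex


def build_mm2tei_command(mets, output="-", ocr=False, text_group=None, img_group=None):
    """Build an argument list for the mm2tei command-line tool.

    Hypothetical helper for illustration; it only assembles options that
    appear in the `mm2tei --help` output and does not run anything.
    """
    cmd = ["mm2tei", "--output", output]
    if ocr:
        cmd.append("--ocr")
    if text_group:
        cmd += ["--text-group", text_group]
    if img_group:
        cmd += ["--img-group", img_group]
    cmd.append(mets)  # METS file path or URL comes last as the positional argument
    return cmd


cmd = build_mm2tei_command(
    "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets"
    "&identifier=oai:de:slub-dresden:db:id-453779263",
    output="tei.xml",
    ocr=True,
    text_group="FULLTEXT",  # example value, adjust to the file groups in your METS
)

# Show a shell-safe rendering of the command
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run it (requires mets-mods2tei to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the `&` characters in OAI URLs.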
### mm-update
Installing `mets-mods2tei` also provides the multi-command command-line tool `mm-update`:

    mm-update --help

```
Usage: mm-update [OPTIONS] COMMAND [ARGS]...

  Entry-point of multi-purpose CLI for DFG Viewer compliant METS updates

Options:
  --version                       Show the version and exit.
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -d, --directory WORKSPACE_DIR   Changes the workspace folder location
                                  [default: METS_URL directory or .]"
  -m, --mets METS_URL             The path/URL of the METS file [default:
                                  WORKSPACE_DIR/mets.xml]
  --backup                        Backup METS whenever it is saved.
  --help                          Show this message and exit.

Commands:
  add-agent     add agent headers, optionally from external METS
  add-file      add a file reference, optionally as URL
  download      download files into subdirectories, as path or URL
  remove-file   remove all file references for a specific location,...
  remove-files  remove all file references for a specific fileGrp / MIME...
  validate      custom OcrdWorkspaceValidator
```

    mm-update add-agent --help

```
Usage: mm-update add-agent [OPTIONS]

  add agent headers, optionally from external METS

Options:
  -m, --mets TEXT  copy metsHdr/agent from this file, too
  --help           Show this message and exit.
```

    mm-update add-file --help

```
Usage: mm-update add-file [OPTIONS] PATH

  add a file reference, optionally as URL

Options:
  -G, --file-grp FILE_GRP  fileGrp to add to  [required]
  -m, --mimetype TYPE      Media type of the file. Guessed from extension if
                           not provided
  -g, --page-id PAGE_ID    ID of the physical page (or empty if document-
                           global)
  -u, --url-prefix TEXT    URL prefix to add to path before storing references
                           (or else keep local file refs)
  --help                   Show this message and exit.
```
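
When many pages have to be registered, the `add-file` options above are typically repeated once per page. The following minimal sketch generates such invocations in a loop. It rests on assumptions: `add_file_commands` is a hypothetical helper that is not part of `mets-mods2tei`, and the URL prefix, page IDs and paths are placeholder values chosen for illustration.

```python
def add_file_commands(file_grp, mimetype, url_prefix, pages):
    """Build one `mm-update add-file` argument list per page.

    `pages` is a sequence of (page_id, relative_path) pairs. This is a
    hypothetical helper that only assembles options shown in the
    `mm-update add-file --help` output; it does not run anything.
    """
    commands = []
    for page_id, path in pages:
        commands.append([
            "mm-update", "add-file",
            "-G", file_grp,      # fileGrp to add to
            "-m", mimetype,      # media type of the file
            "-u", url_prefix,    # URL prefix to add before storing the reference
            "-g", page_id,       # ID of the physical page
            path,
        ])
    return commands


# Placeholder values for illustration only
pages = [("PHYS_%04d" % n, "ocr/alto_%04d.xml" % n) for n in range(1, 4)]
commands = add_file_commands("FULLTEXT", "text/xml", "https://example.org/data/", pages)

for cmd in commands:
    print(" ".join(cmd))

# To actually run them (requires mets-mods2tei and a METS workspace):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```

Keeping the commands as lists makes it straightforward to hand them to `subprocess.run` later without additional quoting.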

    mm-update remove-file --help

```
Usage: mm-update remove-file [OPTIONS] PATH

  remove all file references for a specific location, optionally as URL

Options:
  -u, --url-prefix TEXT  URL prefix to add to path before removing references
                         (or else search verbatim file refs)
  --help                 Show this message and exit.
```

    mm-update remove-files --help

```
Usage: mm-update remove-files [OPTIONS]

  remove all file references for a specific fileGrp / MIME type / page ID
  combination

Options:
  -G, --file-grp FILE_GRP  fileGrp to add to  [required]
  -m, --mimetype TYPE      Media type of the file. Guessed from extension if
                           not provided
  -g, --page-id PAGE_ID    ID of the physical page (or empty if document-
                           global)
  --help                   Show this message and exit.
```

    mm-update validate --help

```
Usage: mm-update validate [OPTIONS]

  custom OcrdWorkspaceValidator

Options:
  -u, --url-prefix TEXT  validate each file has this URL prefix
  --help                 Show this message and exit.
```

    mm-update download --help

```
Usage: mm-update download [OPTIONS]

  download files into subdirectories, as path or URL

Options:
  -G, --file-grp FILE_GRP         fileGrp USE (or empty if all fileGrps)
  -g, --page-id PAGE_ID           ID of the physical page (or empty if all
                                  pages)
  -p, --path-names [URL|GRP/ID.SUF]
                                  how to generate local path names (from URL
                                  or from fileGrp, file ID and suffix)
                                  [default: URL]
  -u, --url-prefix TEXT           URL prefix to remove from path before
                                  storing downloaded files (to avoid creating
                                  host directories)
  -r, --reference [no-change|replace-by-local|insert-local|append-local]
                                  whether and how to update the FLocat
                                  reference in METS  [default: no-change]
  --help                          Show this message and exit.
```

Example:

    # dump files (without changing METS):
    mm-update download -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/
    ...
    # add TEI
    mm-update add-file -G TEI -m application/tei+xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ tei.xml
    ...
    # remove old PDF:
    mm-update remove-files -G DOWNLOAD
    # add new PDF:
    mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0001 pdf/file_0001.pdf
    mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0002 pdf/file_0002.pdf
    mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0003 pdf/file_0003.pdf
    mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ pdf/all.pdf
    ...
    # remove old ALTO:
    mm-update remove-files -G FULLTEXT -g PHYS_0001
    mm-update remove-files -G FULLTEXT -g PHYS_0002
    mm-update remove-files -G FULLTEXT -g PHYS_0003
    # add new ALTO:
    mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0001 ocr/alto_0001.xml
    mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0002 ocr/alto_0002.xml
    mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0003 ocr/alto_0003.xml
    ...
    # validate:
    mm-update validate -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/

## Maintainers

If you have any questions or encounter any problems, please do not hesitate to contact us.
- [Kay-Michael Würzner](https://github.com/wrznr)
- [Robert Sachunsky](https://github.com/bertsky)