# `xtriples-micro` – An XTriples Processor for Micro Services and Local Usage

[![Tests](https://github.com/SCDH/xtriples-micro/actions/workflows/test.yaml/badge.svg)](https://github.com/SCDH/xtriples-micro/actions/workflows/test.yaml)
[![Create release](https://github.com/SCDH/xtriples-micro/actions/workflows/deploy.yaml/badge.svg)](https://github.com/SCDH/xtriples-micro/actions/workflows/deploy.yaml)

`xtriples-micro` is an implementation of an
[XTriples](https://xtriples.lod.academy/) processor that works without
an eXist database.

XTriples? In XTriples, instead of writing specialized programs in
XSLT, XQuery, Python, etc. for extracting RDF triples from XML
documents, we write configuration files containing selectors. These
config files are evaluated by an XTriples processor, which returns RDF
triples. Here's an example of such a configuration file:

```










/@id
type
Person


/@id
label
/name/english


/@id
label
/name/greek


/@id
seeAlso
/concat("http://en.wikipedia.org/wiki/", $currentResource/name/english)





```

For more examples, have a look at the extraction
[recipes](recipes).

While the original XTriples processor requires an eXist database and
applies a configuration only to the fixed set of XML files contained
in it, this implementation runs outside of a database, e.g., on
a local set of documents. It can also be deployed on the [SEED
XML Transformer](https://github.com/scdh/seed-xc). This deployment gives you a lightweight microservice
to which you send a single XML document and a config file and get
RDF triples in return.

## Getting started

### Microservice

TODO

### Oxygen

This project offers an Oxygen framework that assists in writing XTriples
configuration files and also provides transformation scenarios for
applying a configuration to a single document or a collection of
documents. Installation is as simple as using the following
installation link in the installation dialog found in **Help** ->
**Install New Addons**:

```
https://scdh.github.io/xtriples-micro/descriptor.xml
```

See the detailed description in the [Wiki](https://github.com/SCDH/xtriples-micro/wiki/Oxygen-Framework)!

### XSLT Package

For using the XTriples engine in CI/CD pipelines or in downstream
projects, installation of a released package is the way to go. The
[Wiki](https://github.com/SCDH/xtriples-micro/wiki/Installation-of-a-Release)
gives detailed instructions!

### Playing around and Testing

For playing around with XTriples and validating that it is a suitable
technology, you can also clone this repository. It comes with a fully
reproducible [tooling](https://github.com/scdh/tooling) environment
that installs all tools needed for running and testing in a
sandbox. You only need a Java development kit (JDK) installed. On
Debian-based systems, you can install one with
`sudo apt install default-jdk`.

To set up the tooling environment, clone this repository, `cd` into
your working copy and run:

```
./mvnw package # Linux
```

or

```
mvnw.cmd package # Windows
```

This will download Saxon-HE etc. and generate wrapper scripts that set
up the classpath for using them.

After running the command above, the wrapper scripts are in
`target/bin/`. E.g., there are wrappers around
[Saxon-HE](https://www.saxonica.com/documentation12/index.html#!using-xsl/commandline)
and [Jena RIOT](https://jena.apache.org/documentation/io/):

```
target/bin/xslt.sh -?
```

```
target/bin/riot.sh -h
```

## Extracting RDF Triples

Several XSLT stylesheets do the work of evaluating an XTriples
configuration file and applying it to XML documents.

### `extract.xsl`

[`xsl/extract.xsl`](xsl/extract.xsl) extracts triples
from an XML document given as source by applying a configuration
passed in via the stylesheet parameter `config-uri`.

```shell
target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml)
```

The output should look like this:

```ntriples
.
"Aphrodite"@en .
"Ἀφροδίτη"@gr .
.
```

Debug messages are printed to stderr. If they pollute your output, you
can append `2> /dev/null` to silence them or use Saxon's `-o:` option
to send the result to a file.

If you want another format, pipe the result to Jena RIOT like so:

```
target/bin/xslt.sh -xsl:xsl/extract.xsl -s:test/gods/1.xml config-uri=$(realpath test/gods/configuration.xml) | target/bin/riot.sh --out rdf/xml
```

Here's the result:

```xml



Ἀφροδίτη
Aphrodite

```

This is the only transformation that makes sense to deploy as a
microservice. See [seed](seed.md).

### `extract-collection.xsl`

[`xsl/extract-collection.xsl`](xsl/extract-collection.xsl) takes a
configuration as source document and applies it to the collection of
XML documents given in `/xtriples/collection/@uri`, which is
interpreted as a Saxon collection URI. See section [Implementation of
the Specs](#implementation-of-the-specs) for details. This is
compatible with the reference implementation.

Example:

```
target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/configuration.xml
```

This will extract triples from all the God files in
[`test/gods`](test/gods) due to the collection URI ``. It is a relative URI (current directory
`.`), and the [`select` query
string](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/collection-directories)
is interpreted by the Saxon processor.

### `extract-doc-param.xsl`

[`xsl/extract-doc-param.xsl`](xsl/extract-doc-param.xsl) takes a
configuration as source document and applies it to an XML document
referenced by the `source-uri` stylesheet parameter.

```shell
target/bin/xslt.sh -xsl:xsl/extract-param-doc.xsl -s:test/gods/configuration.xml source-uri=$(realpath test/gods/1.xml)
```

The `is-collection-uri` stylesheet parameter can be used to indicate
that the URI is a collection:

```shell
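# Quote the parameter: an unquoted `;` would end the command, and `?` and `*` are glob characters
target/bin/xslt.sh -xsl:xsl/extract-param-doc.xsl -s:test/gods/configuration.xml is-collection-uri=true 'source-uri=/path/to/edition?select=*.tei.xml;recurse=true'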
```

This works with any [type of collection](#collections).
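Because a directory collection URI contains characters that the shell treats specially (`?`, `*`, `;`), it can help to assemble it in a variable first. The following is a minimal sketch, not part of this project: `collection_uri` is a hypothetical helper name, and the path is the placeholder used above.

```shell
#!/bin/sh
# Hypothetical helper: assemble a directory collection URI of the form
#   DIR?select=PATTERN[;recurse=VALUE]
# The third argument is optional.
collection_uri() {
    dir=$1
    pattern=$2
    uri="${dir}?select=${pattern}"
    if [ -n "${3:-}" ]; then
        uri="${uri};recurse=$3"
    fi
    printf '%s\n' "$uri"
}

# Single quotes keep the shell from expanding the glob pattern.
uri=$(collection_uri /path/to/edition '*.tei.xml' true)
printf '%s\n' "$uri"
```

The value can then be passed on as `"source-uri=$uri"`; the double quotes keep the shell from interpreting the special characters.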

So this stylesheet can be used for passing user-defined collections to
extraction recipes that have been written once.

## Extraction Recipes

The [`recipes`](recipes) folder has XTriples configurations for
extraction tasks that occur in many projects.

## Writing configurations

1. The content of `<subject>`, `<predicate>`, `<object>` and
`<condition>` is evaluated as an **XPath** expression if and only
if the content starts with a **slash** `/`. Before the expression
is evaluated, it is prepended with `$currentResource` (or
`$externalResource` respectively). E.g., `/@id` is evaluated as
`$currentResource/@id`. In `<condition>` the XPath is constructed
like this: `xs:boolean($currentResource CONDITION )`.
1. Keep the difference between **document** and **resource** in mind: Each
document may contain multiple resources if
`/xtriples/collection/resource/@uri` is used to unnest resources
from a document. The variables `$currentResource` and
`$resourceIndex` provide access to the resource and its index.
1. This resource context is transparent to the underlying
document. Thus, accessing parts of the document outside of the
context subtree is possible:
`$currentResource/ancestor::TEI/teiHeader`.
1. The XPath evaluation uses **namespaces** made up from the
prefix-to-URI mapping in the `<vocabularies>` section of the
configuration file. Thus:
- If you want to extract RDF from non-namespace XML sources, do not
use the empty string prefix in the vocabularies, since that would
bind the default namespace for XPath evaluation to this
vocabulary URI.
- Be careful about using the default namespace, since it is not
compatible with the reference implementation. See
[below](#implementation-of-the-specs)!
1. Using **BNodes** may be a bit tricky. See [these hints](bnodes.md).
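
The prefixing rule from the first item can be illustrated with a few lines of shell. This is only a sketch of the string handling described above, not the processor's actual code; `to_xpath` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: mimic the prefixing rule for selector content.
# Content starting with a slash is prefixed with the context variable
# (default: $currentResource); any other content is printed unchanged.
to_xpath() {
    content=$1
    default_variable='$currentResource'
    variable=${2:-$default_variable}
    case $content in
        /*) printf '%s%s\n' "$variable" "$content" ;;
        *)  printf '%s\n' "$content" ;;
    esac
}

to_xpath '/@id'                               # starts with a slash: becomes an XPath
to_xpath 'Person'                             # no leading slash: left as it is
to_xpath '/name/english' '$externalResource'  # explicit context variable
```

The first call prints `$currentResource/@id`, which matches the example given in the list.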

## Implementation of the Specs

This is a full implementation of the [XTriples
spec](https://xtriples.lod.academy/documentation.html).

### Additional Features

In addition to the specs, this implementation adds the following
features:

1. In addition to a static ISO 639 language identifier, `object/@lang`
can also be an XPath expression that returns such a language
identifier. This feature is handy for projects that record the
language in their XML documents.
1. By omitting `@prefix` on a `<vocabulary>` or setting it to the
empty string, the default namespace when evaluating XPath
expressions binds to this vocabulary URI. Thus, when setting
``, you can write
XPaths like this: `//(teiHeader/fileDesc/titleStmt/title)[1]`
without prefixing the element names. See
[`test/config-02.xml`](test/config-02.xml) for a self contained
test case. Evaluating it on the [reference
implementation](https://xtriples.lod.academy/index.html) fails,
while the implementation at hand processes it correctly.
1. It is possible to use your own functions in the XPath expressions
in the `` section: You can load an additional XSLT
stylesheet by using the `libraries` (sequence of xs:anyURI) or
`libraries-csv` (a string of comma separated URIs) stylesheet
parameters. Please note that you have to declare your function's
visibility as non-private and non-hidden, e.g., `@visibility=public`,
cf. [XSLT 3.0 TR](https://www.w3.org/TR/xslt-30/#evaluate-static-context).
```shell
target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:... libraries-csv=$(realpath my-utils.xsl)
```
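
When more than one library is needed, the value for `libraries-csv` has to be a single comma-separated string. A minimal sketch of building it in shell follows; `join_csv` is a hypothetical helper name and the file paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with commas.
join_csv() {
    csv=
    for item in "$@"; do
        # Prepend "previous," only when csv is already non-empty.
        csv=${csv:+$csv,}$item
    done
    printf '%s\n' "$csv"
}

# Made-up absolute paths standing in for your own stylesheets.
libraries=$(join_csv /opt/xsl/my-utils.xsl /opt/xsl/dates.xsl)
printf '%s\n' "$libraries"
```

The result can be passed as `"libraries-csv=$libraries"`. Paths that themselves contain a comma would not survive this simple joining.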

### Collections

Because this implementation does not run inside an eXist database, the
evaluation of the `<collection>` section of the configuration differs
from the reference implementation. However, you can get full
compatibility (see the end of this section).

In contrast to the specs, **`/xtriples/collection/@uri`** is ignored
when a single XML source document is passed to the processor, i.e.,
when using `xsl/extract.xsl` or `xsl/extract-param-doc.xsl`.

When using `xsl/extract-collection.xsl`, it is evaluated as a [Saxon
collection
URI](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/collection-uris). It
can thus be

- a [directory
URI](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/collection-directories)
with a select pattern for finding files (relative URIs are resolved
against the evaluated configuration file),
- a [zip collection](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/ZIP-collections)
(zip, jar, docx), which is automatically unpacked and
crawled,
- a [collection
catalog](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/collection-catalogs)
listing the files to crawl, or
- your own collection type, provided you have written your own
[collection
finder](https://www.saxonica.com/documentation12/index.html#!sourcedocs/collections/user-collections).

*Link based resource crawling* and *literal resource crawling* are
supported exactly as in the reference implementation. In both modes,
there is no `@uri` attribute present for the collection.

You can get full compatibility by setting the `is-collection-uri`
stylesheet parameter to `false`. This way, the `@uri` attribute of
each `<collection>` is read not as a Saxon collection URI, but as a
single document URI. Using this attribute, *XPath based resource
crawling with resources spread over multiple files* is also supported.

You can evaluate the examples in `test/gods` with
`is-collection-uri=false` and by using the XML catalog in
`test/catalog.xml`, which maps lod academy URIs to local files:

```
target/bin/xslt.sh -xsl:xsl/extract-collection.xsl -s:test/gods/conf-NN.xml -catalog:test/catalog.xml is-collection-uri=false
```

## Output: NTriples

There's only one output format: NTriples. In a microservice
architecture, converting to other formats is done in a converter
service. NTriples is the RDF serialization of choice because the
response bodies of multiple requests can simply be concatenated into
one graph.
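
As a small illustration of that property, the sketch below concatenates two result files into one graph. The triples and file names are made up for the example and merely stand in for the output of two extraction runs.

```shell
#!/bin/sh
# Work in a throwaway directory.
workdir=$(mktemp -d)

# Two made-up NTriples results, standing in for two response bodies.
cat > "$workdir/1.nt" <<'EOF'
<http://example.org/gods/1> <http://www.w3.org/2000/01/rdf-schema#label> "Aphrodite"@en .
EOF
cat > "$workdir/2.nt" <<'EOF'
<http://example.org/gods/2> <http://www.w3.org/2000/01/rdf-schema#label> "Apollo"@en .
EOF

# NTriples is line-based and has no prefixes or header, so plain
# concatenation gives a valid document; sort -u drops duplicate lines.
cat "$workdir/1.nt" "$workdir/2.nt" | sort -u > "$workdir/graph.nt"

# Count the triples in the merged graph.
wc -l < "$workdir/graph.nt"
```

One thing to keep in mind: blank node labels are local to each NTriples document, so identical labels coming from different results would be merged into one node by plain concatenation.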

## Development

Run tests with

```
target/bin/test.sh
```

or

```
source target/bin/classpath.sh # only needed once per shell session
ant -Dcatalog=test/catalog.xml test
```

## License

This is distributed under the MIT license.

The test cases directly in `test/gods/` were taken from the
[original eXist-db
implementation](https://github.com/digicademy/xtriples/tree/master),
which is licensed under the terms of the MIT license.