Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/stain/2016-provweek-tavernaprov

Abstract submitted to http://provenanceweek.org/2016/p3yl/
https://github.com/stain/2016-provweek-tavernaprov

Last synced: about 2 months ago
JSON representation

Abstract submitted to http://provenanceweek.org/2016/p3yl/

Awesome Lists containing this project

README

        

# Tracking workflow execution with TavernaProv

[![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.51314.svg)](https://doi.org/10.5281/zenodo.51314)

* Authors: [Stian Soiland-Reyes](http://orcid.org/0000-0001-9842-9718),
[Pinar Alper](http://orcid.org/0000-0002-2224-0780),
[Carole Goble](http://orcid.org/0000-0003-1219-2137); [eScience Lab](http://www.esciencelab.org.uk/), University of Manchester
* Document id:
* Mirror:
* DOI: https://doi.org/10.5281/zenodo.51314
* In reply to: [PROV: Three Years Later](http://provenanceweek.org/2016/p3yl/)
* Published: 2016-05-09
* Modified: 2016-05-11
* Accepted: 2016-05-16
* Presented: 2016-06-06 (Pinar Alper)
* Mirrored: 2020-08-04 (moved to s11.no)
* License: Creative Commons Attribution 4.0 International License
* Keywords:
+ scientific workflow
+ provenance
+ research object
+ reproducibility
* Also available in formats:
+ [xhtml](https://s11.no/2016/provweek-tavernaprov/2016-provweek-tavernaprov.xhtml)
+ [pdf](https://s11.no/2016/provweek-tavernaprov/2016-provweek-tavernaprov.pdf)
+ [turtle](https://s11.no/2016/provweek-tavernaprov/2016-provweek-tavernaprov.ttl)

Creative Commons License

[Apache Taverna](https://taverna.incubator.apache.org) [1] is a scientific workflow
system for combining web services and local tools. Taverna
[records provenance](https://taverna.incubator.apache.org/documentation/provenance/) [2] of
workflow runs, intermediate values and user interactions, both as an aid for
debugging while designing the workflow, but also as a record for later
reproducibility and comparison.

Taverna also records provenance of the evolution
of the workflow definition
(including a chain of `wasDerivedFrom` relations),
attributions and annotations; for brevity we here focus on
how Taverna's workflow run provenance extends PROV and
is embedded with Research Objects.

## Data bundle

A workflow run can be exported from Taverna as a
[workflow data bundle](https://github.com/apache/incubator-taverna-engine/tree/master/taverna-prov#structure-of-exported-provenance); a [Research Object bundle](https://w3id.org/bundle/) [3]
in the form of a ZIP archive that contains the
[workflow definition](https://taverna.incubator.apache.org/documentation/scufl2/)
(itself a Research Object), annotations, inputs, outputs and intermediate
values, and a [PROV-O](https://www.w3.org/TR/prov-o/) trace of the workflow
execution, showing every process execution within the workflow
run, linking to the produced and consumed values using relative paths.

A workflow run can thus be downloaded as a single file from a
[Taverna Server](https://taverna.incubator.apache.org/download/server/),
shared ([myExperiment](http://myexperiment.org/),
[SEEK](http://seek4science.org/)), published ([ROHub](http://www.rohub.org/),
[Zenodo](https://zenodo.org/)), imported in another
[Taverna Workbench](https://taverna.incubator.apache.org/download/workbench/), shown in the
[Databundle Viewer](https://github.com/apache/incubator-taverna-databundle-viewer)
or modified with the [DataBundle API](https://github.com/apache/incubator-taverna-language/tree/0.15.1-incubating/taverna-databundle).

## Abstraction levels

[PROV](https://www.w3.org/TR/prov-dm/)
is a generic model for describing provenance. While this means there are
generally many multiple ways to express the same history in PROV,
a scientific workflow run with processors and data values naturally match the
[`Activity`]((https://www.w3.org/TR/prov-dm/#term-entity)) and [`Entity`](https://www.w3.org/TR/prov-dm/#term-entity) relations
[`wasGeneratedBy`](https://www.w3.org/TR/prov-dm/#dfn-wasgeneratedby) and
[`used`](https://www.w3.org/TR/prov-dm/#dfn-used).

However, using PROV-O to describe the details of a Taverna execution
meant a significant [increase in verbosity](https://github.com/apache/incubator-taverna-engine/blob/master/taverna-prov/example/helloanyone.bundle/workflowrun.prov.ttl). To simplify query and interoperability with PROV tools, we declare
relations both with [starting point terms](https://www.w3.org/TR/prov-o/#description-starting-point-terms)
and as [qualified terms](https://www.w3.org/TR/prov-o/#description-qualified-terms),
e.g. to represent an `Activity` that
used different values as different input parameters, we provide
both a direct [used](http://www.w3.org/TR/prov-o/#used), but also a
[qualifiedUsage](http://www.w3.org/TR/prov-o/#qualifiedUsage) to a
[Usage](http://www.w3.org/TR/prov-o/#Usage) that specify [hadRole](http://www.w3.org/TR/prov-o/#hadRole) and [entity](http://www.w3.org/TR/prov-o/entity).

PROV deliberately does not mandate how to make the
design decisions on
what activities, entities
and agents participate in a particular scenario,
but for interoperability purposes this
flexibility means that PROV is a kind of
"XML for provenance" - a common
language with a defined [semantics](https://www.w3.org/TR/prov-sem/),
but which can be applied in many different ways.

One interoperability design question for representing computational
workflow runs is how much of the
workflow engine's internal logic and language should be explicit in the PROV
representation. As we primarily wanted to convey what happened in the workflow
at the same granularity as its definition, we tried to hide provenance that
would be intrinsic to the Taverna Engine, e.g. [implicit iteration](http://taverna.knowledgeblog.org/2010/12/13/iteration-in-taverna-workflows/) is
not shown as a separate `Activity`.

However keeping provenance only at the dataflow level
(input/outputs of workflow processes) meant that TavernaProv
could not easily represent "deeper" provenance such as the
intermediate values of [while-loops](http://dev.mygrid.org.uk/wiki/display/tav250/Loops)
or intermittent failures that were automatically recovered by Taverna's [retry mechanism](http://dev.mygrid.org.uk/wiki/display/tav250/Retries), as we wanted to
avoid [unrolled workflow provenance](http://sites.computer.org/debull/A07dec/susan.pdf).

Keeping the link between the workflow definition and execution is essential to
understanding Taverna provenance, yet PROV doesn't describe the structure of a
[Plan](http://www.w3.org/TR/prov-o/#Plan). Taverna's workflow definition is in Linked Data using the
[SCUFL2](http://www.essepuntato.it/lode/owlapi/http://taverna.incubator.apache.org/ns/2010/scufl2.ttl) vocabulary, which includes many
implementation details for the Taverna Engine,
and so forming a meaningful query like
_"What is the value made by calls to webservice X"_ means understanding the
whole conceptual model of Taverna workflow definitions.

Therefore Taverna's PROV export also includes an
annotation with the
[wfdesc](https://w3id.org/ro/2016-01-28/wfdesc/) [4] abstraction of the
workflow definition, embedding user annotations and higher-level
information like [web service location](https://w3id.org/ro/2016-01-28/wf4ever/#rootURI).
_wfdesc_ deliberately leaves out execution details like iteration and parallelism controls
as it primarily functions as a target for user-driven annotations about the
workflow steps.

Correspondingly the provenance bundle from TavernaPROV
includes a higher level [wfprov](https://w3id.org/ro/2016-01-28/wfprov/)
abstraction of the workflow execution, with direct shortcuts like
[describedByProcess](https://w3id.org/ro/2016-01-28/wfprov/#describedByProcess)
and
[describedByParameter](https://w3id.org/ro/2016-01-28/wfprov/#describedByParameter)
to bypass the indirection of PROV qualified terms; simplifying
[queries](https://github.com/apache/incubator-taverna-engine/tree/master/taverna-prov#querying-provenance) like
_"Which web service consumed value Y?"_.

The duality between wfdesc and wfprov is similar to the
"future provenance" model of [P-Plan](http://purl.org/net/p-plan) and [OPMW](ttp://www.opmw.org/model/OPMW/#WorkflowTemplateProcess)
and its [workflow templates](http://www.isi.edu/~gil/papers/garijo-etal-works14.pdf) [5],
and similarly the the split between "prospective provenance"
and "retrospective provenance" of the
[ProvONE Data Model for Scientific Workflow Provenance](http://vcvcomputing.com/provone/provone.html). [6]

## Identifiers and interoperability

A great advantage of using Linked Data was that we could use the same
identifiers in all three formats. One challenge was that Taverna workflows are often run within a
desktop user interface or on the command line, and with privacy concerns
we didn't have the luxury of a server to mint URIs; we already [learnt our lesson with LSIDs](http://dev.mygrid.org.uk/blog/2016/02/what-exactly-happened-to-lsid/) [7].

Taverna therefore generate UUID-based structured `http://` URIs within [our namespaces](http://ns.taverna.org.uk/),
e.g.:

* `http://ns.taverna.org.uk/2011/run/d5ee659e-e11e-43a5-bc0a-58d93674e5e2/process/1e027057-2aeb-47f7-97dc-03e19e9772be/`
* `http://ns.taverna.org.uk/2010/workflowBundle/2f0e94ef-b5c4-455d-aeab-1e9611f46b8b/workflow/HelloWorld/processor/hello/`

Resolving these URIs gives the [scufl2-info](https://github.com/stain/scufl2-info),
web-service, which provide a minimal [JSON-LD](http://json-ld.org/) wfprov/wfdesc representation
identifying the URI as a provenance or workflow item, but (by design) not having access to the data bundle it can't say anything more.

We found that our UUID-based URIs don't play too well with
[PROV Toolbox](http://lucmoreau.github.io/ProvToolbox/) and alternative
PROV formats like
[PROV-N](https://www.w3.org/TR/prov-n/) and
[PROV-XML](https://www.w3.org/TR/prov-xml/), as
every URI ending in ``/`` is registered as
a separate namespace in order to form valid QNames.
A suggested improvement for TavernaProv
is to generate [provly identifiers](https://github.com/lucmoreau/ProvToolbox/wiki/Mapping-PROV-Qualified-Names-to-xsd:QName#4-provly-identifiers), while remaining compliant with the [10 simple rules for identifiers](https://github.com/ResearchObject/identifier-rules). [8]

Similarly, [OWL reasoning](https://www.w3.org/TR/owl2-profiles/#Reasoning_in_OWL_2_RL_and_RDF_Graphs_using_Rules) is not generally applied by PROV-O consumers, so even though
_wfprov_ formally extends
PROV in its ontology definitions, we needed to add explicitly the
implied PROV-O statements in TavernaProv's Turtle output.

## Common Workflow Language

[Common Workflow Language](http://commonwl.org/) has created a
workflow language [specification](http://www.commonwl.org/draft-3/),
a reference implementation
[cwltool](https://github.com/common-workflow-language/cwltool),
and a large [community](http://www.commonwl.org/#Participating_Organizations)
of workflow system developers who are adding
CWL support across bioinformatics,
including Apache Taverna and Galaxy. Unlike wfdesc, OPMW and P-Plan, CWL workflows are
primarily intended to be executed,
with a strong emphasis on the dataflow between
command line tools packaged as [Docker](https://docker.com/) images.

CWL is specified using [Schema Salad](http://www.commonwl.org/draft-3/SchemaSalad.html),
which provides [JSON-LD](http://json-ld.org/) constructs in
[YAML](http://yaml.org/). The CWL dataflow
model is inspired by wfdesc and Apache Taverna and thus have similar
execution semantics and provenance requirements.
CWL is [planning its provenance format](https://github.com/common-workflow-language/common-workflow-language/issues/84)
based on PROV-O, wfprov and JSON-LD. As one of the CWL adopters, Apache Taverna
will naturally also aim to support the CWL provenance model.

## Future work

To face the verbosity issue, we are considering to split out wfprov statements
to a different file; as a ZIP archive the Taverna Data Bundle can contain
many provenance formats. Similarly splitting out the details using
PROV-O Qualified Terms to a separate file is worth considering, this could also
improve PROV visualization of workflow provenance.
Having such separate [PROV bundles](https://www.w3.org/TR/prov-dm/#component4)
would also make it easier for Taverna to support the ProvONE model
as an additional format.

[PROV Links](https://www.w3.org/TR/prov-links/) could be added to
Research Object Bundles to relate its data files to the
then multiple workflow provenance traces that describe their generation
and usage.

## References

1. Katherine Wolstencroft, Robert Haines, Donal Fellows, Alan Williams, David Withers, Stuart Owen, Stian Soiland-Reyes, Ian Dunlop, Aleksandra Nenadic, Paul Fisher, Jiten Bhagat, Khalid Belhajjame, Finn Bacall, Alex Hardisty, Abraham Nieva de la Hidalga, Maria P. Balcazar Vargas, Shoaib Sufi, Carole Goble (2013): **The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud.** In: _Nucleic Acids Research_, **41**(W1): W557–W561. [doi:10.1093/nar/gkt328](http://dx.doi.org/10.1093/nar/gkt328)
2. Paolo Missier, Satya Sahoo, Jun Zhao, Carole Goble, Amit Sheth. (2010): **Janus: from Workflows to Semantic Provenance and Linked Open Data** in _Provenance and Annotation of Data and Processes, Third International Provenance and Annotation Workshop, (IPAW'10)_, 15–16 Jun 2010. Springer, Berlin: 129–141. [doi:10.1007/978-3-642-17819-1_16](http://dx.doi.org/10.1007/978-3-642-17819-1_16) [[pdf]](http://tw.rpi.edu/media/2013/12/31/96a5/IPAW2010_FP_Missier.pdf)
3. Stian Soiland-Reyes, Matthew Gamble, Robert Haines (2014): **Research Object Bundle 1.0**. _researchobject.org Specification_. [https://w3id.org/bundle/](https://w3id.org/bundle/) 2014-11-05. [doi:10.5281/zenodo.12586](http://dx.doi.org/10.5281/zenodo.12586)
4. Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole Goble (2015): **Using a suite of ontologies for preserving workflow-centric research objects**, _Web Semantics: Science, Services and Agents on the World Wide Web_. [doi:10.1016/j.websem.2015.01.003](http://dx.doi.org/doi:10.1016/j.websem.2015.01.003)
5. Daniel Garijo, Yolanda Gil, Oscar Corcho (2014): **Towards workflow ecosystems through semantic and standard representations**. _Proceeding
WORKS '14 Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science_. [doi:10.1109/WORKS.2014.13](http://dx.doi.org/10.1109/WORKS.2014.13) [[pdf]](http://conferences.computer.org/works/2014/papers/7067a094.pdf)
6. Víctor Cuevas-Vicenttín, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, Saumen Dey (2014): **The PBase Scientific Workflow Provenance Repository**. _International Journal of Digital Curation_ **9**(2). [doi:10.2218/ijdc.v9i2.332](http://dx.doi.org/10.2218/ijdc.v9i2.332)
7. Stian Soiland-Reyes, Alan R Williams (2016): **What exactly happened to LSID?** _myGrid developer blog_, 2016-02-26. [http://dev.mygrid.org.uk/blog/2016/02/what-exactly-happened-to-lsid/](http://dev.mygrid.org.uk/blog/2016/02/what-exactly-happened-to-lsid/) [doi:10.5281/zenodo.46804](http://dx.doi.org/10.5281/zenodo.46804)
8. Julie A McMurry, Niklas Blomberg, Tony Burdett, Nathalie Conte, Michel Dumontier, Donal Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Janna Hastings, Melissa A Haendel, Henning Hermjakob, Jean-Karim Hériché, Jon C Ison, Rafael C Jimenez, Simon Jupp, Nick Juty, Camille Laibe, Nicolas Le Novère, James Malone, Maria Jesus Martin, Johanna R McEntyre, Chris Morris, Juha Muilu, Wolfgang Müller, Christopher J Mungall, Philippe Rocca-Serra, Susanna-Assunta Sansone, Murat Sariyar, Jacky L Snoep, Natalie J Stanford, Neil Swainston, Nicole L Washington, Alan R Williams, Katherine Wolstencroft, Carole Goble, Helen Parkinson (2015): **10 Simple rules for design, provision, and reuse of identifiers for web-based life science data**. _Zenodo_. Submitted to PLoS Computational Biology.
[doi:10.5281/zenodo.31765](http://dx.doi.org/10.5281/zenodo.31765)
9. Davidson, Susan B., Sarah Cohen Boulakia, Anat Eyal, Bertram Ludäscher, Timothy M. McPhillips, Shawn Bowers, Manish Kumar Anand, and Juliana Freire (2007): **Provenance in Scientific Workflow Systems.** IEEE Data Eng. Bull. 30, no. 4 (2007): 44-50.