https://github.com/sd2e/pipelinejobs-agave-proxy

(Mirror) Provides a generalized interface to run Agave jobs in the PipelineJobs framework

agaveapi metadata reactor workflow


Host: GitHub
URL: https://github.com/sd2e/pipelinejobs-agave-proxy
Owner: SD2E
License: other
Created: 2018-12-09T13:27:25.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2019-01-31T23:16:22.000Z (over 7 years ago)
Last Synced: 2025-01-28T22:51:16.624Z (over 1 year ago)
Topics: agaveapi, metadata, reactor, workflow
Language: Python
Homepage:
Size: 31.3 KB
Stars: 0
Watchers: 7
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- License: LICENSE


========================
PipelineJobs Agave Proxy
========================

This Reactor provides a generalized proxy for running Agave API jobs such that
their inputs, parameterization, and outputs are connected to (and thus
discoverable from within) the Data Catalog.

Register Agave App as a Pipeline
--------------------------------

Before an Agave App can be run by this proxy, three things must happen:

#. It must be architected to fit the PipelineJobs workflow

#. It must be public or shared with user **sd2eadm**

#. It must be registered as a Data Catalog ``Pipeline``

App Architecture
^^^^^^^^^^^^^^^^

The app must generate filenames that are distinguishable between runs. This is
enforced to prevent accidental overwriting of files when multiple jobs
share an archiving destination. Furthermore, the app definition and any
interior runtime logic must use fully-qualified Agave files URLs to
define inputs. Finally, the app's ``id`` must be unique not only in the
**Agave Apps Catalog** (this is automatically enforced) but also in the
**Data Catalog Pipelines** collection.
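
The first two requirements can be sketched in Python as follows. This is only an illustration: the helper names ``run_unique_filename`` and ``is_agave_files_url`` are hypothetical, and the UUID-based naming scheme is an assumption (one of many ways to make filenames distinguishable), not something the PipelineJobs framework prescribes.

```python
import uuid
from urllib.parse import urlparse


def run_unique_filename(stem, extension, run_token=None):
    """Return a filename that differs between runs.

    Hypothetical helper: a random UUID fragment is an assumed naming
    scheme, used here only to make names distinguishable per run.
    """
    token = run_token or uuid.uuid4().hex[:12]
    return "{}-{}.{}".format(stem, token, extension.lstrip("."))


def is_agave_files_url(value):
    """Check that a value looks like a fully-qualified agave:// files URL."""
    parsed = urlparse(value)
    # Require the agave scheme, a storage system name, and a path.
    return parsed.scheme == "agave" and bool(parsed.netloc) and bool(parsed.path)


print(run_unique_filename("tacobot-output", "txt"))
print(is_agave_files_url("agave://data.tacc.cloud/examples/tacobot/test1.txt"))
```

A check like ``is_agave_files_url`` could be run over every input before a job is submitted, so that relative paths are caught early.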

Share or Publish the App

^^^^^^^^^^^^^^^^^^^^^^^^

*Coming soon...*

Registering a Pipeline

^^^^^^^^^^^^^^^^^^^^^^

*Coming soon...*

Launching a Managed Agave Job
-----------------------------

Construct and send a message including the following components to the
**PipelineJobs Agave Proxy** Reactor.

#. An Agave job definition

#. A metadata linkage parameter

#. Optional control parameters

.. note:: The **agave_pipelinejob** format is documented in JSONSchemas_.

Agave Job Definition

^^^^^^^^^^^^^^^^^^^^

The Agave job definition must be included as a subdocument in the message. To
illustrate this, start with a basic Agave job definition. Here is an
example for an imaginary Agave app ``tacobot9000-0.1.0u1``:

.. code-block:: json

    {
      "appId": "tacobot9000-0.1.0u1",
      "name": "TACObot job",
      "inputs": {"file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"},
      "parameters": {"salsa": true, "avocado": false, "cheese": true},
      "maxRunTime": "01:00:00"
    }

To launch this job directly through Agave, this document would be sent to the
``/jobs`` endpoint. To send it to the proxy instead, nest it under the key
``job_definition`` in a JSON document.

.. code-block:: json

    {
      "job_definition": {
        "appId": "tacobot9000-0.1.0u1",
        "name": "TACObot job",
        "inputs": {
          "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
        },
        "parameters": {
          "salsa": true,
          "avocado": false,
          "cheese": true
        },
        "maxRunTime": "01:00:00"
      }
    }
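
The same nesting step can be done programmatically. The sketch below assumes the job definition is already available as a Python dictionary; ``wrap_job_definition`` is a hypothetical helper name, not a function provided by this Reactor.

```python
import json


def wrap_job_definition(agave_job):
    """Nest an Agave job definition under the ``job_definition`` key."""
    if "appId" not in agave_job:
        raise ValueError("an Agave job definition needs an appId")
    # Copy so that later edits to the message do not alter the original.
    return {"job_definition": dict(agave_job)}


agave_job = {
    "appId": "tacobot9000-0.1.0u1",
    "name": "TACObot job",
    "inputs": {"file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"},
    "parameters": {"salsa": True, "avocado": False, "cheese": True},
    "maxRunTime": "01:00:00",
}

message = wrap_job_definition(agave_job)
print(json.dumps(message, indent=2))
```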

Metadata Linkage Parameter

^^^^^^^^^^^^^^^^^^^^^^^^^^

An explicit linkage to objects in the Data Catalog must be established. This is
done via the ``parameters`` key, which must contain a valid value for one of
the following:

- ``experiment_id``

- ``sample_id``

- ``measurement_id``

Either a single value or an array of values may be passed, and each value may
be given as the readable text value or as the corresponding UUID.
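
Because a linkage value may be a single string or an array, client code may find it convenient to normalize it before building a message. The following is a minimal sketch under that assumption; ``normalize_linkage`` is a hypothetical helper, and it does not check whether the identifiers actually exist in the Data Catalog.

```python
LINKAGE_KEYS = ("experiment_id", "sample_id", "measurement_id")


def normalize_linkage(parameters):
    """Return the linkage parameters found, each as a list of strings.

    Raises ValueError when no linkage parameter is present or when a
    value is empty. Existence in the Data Catalog is NOT verified here.
    """
    found = {}
    for key in LINKAGE_KEYS:
        if key not in parameters:
            continue
        value = parameters[key]
        # Accept either a single value or an array of values.
        values = [value] if isinstance(value, str) else list(value)
        if not values or not all(isinstance(v, str) and v for v in values):
            raise ValueError("{} must be a non-empty string or list".format(key))
        found[key] = values
    if not found:
        raise ValueError("one of {} is required".format(", ".join(LINKAGE_KEYS)))
    return found


print(normalize_linkage({"sample_id": "sample.tacc.abcde"}))
```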

Which Parameter to Pass
#######################

A PipelineJob is always linked to a set of measurements by way of the
linkage parameter. The job's archive path is also determined by the linkage
parameter. To illustrate:

If a job's ``measurement_id=['measurement.tacc.1234',
'measurement.tacc.2345']``, it will be linked to these two measurements
and its archive path will end with a hash of the two ``measurement_id`` values.

Assuming those measurements are children of ``sample.tacc.abcde`` and the only
linkage parameter sent was ``sample_id='sample.tacc.abcde'``, the job will
still be linked to all the child measurements of that sample. Its archive path
will end with a hash of ``sample.tacc.abcde``. However, if both
``measurement_id`` and ``sample_id`` are passed, the linkages are made to the
specified measurement(s) while the archive path is a function of the
``sample_id`` value(s).

For ``experiment_id``, the specific samples are linked to the job and the
archive path is a function of the ``experiment_id`` value(s).

This design allows files generated by the job to be linked to only one level
of the metadata hierarchy, while allowing collection of outputs at higher
levels of organization in the file system.
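
The rule for choosing which parameter drives the archive path can be sketched as below. Several details are assumptions made for illustration: the helper names are hypothetical, the precedence of ``experiment_id`` over ``sample_id`` is extrapolated from the sample/measurement case described above, and SHA-256 over the sorted values stands in for whatever hash the PipelineJobs system actually uses.

```python
import hashlib

# Highest level of the hierarchy first. The sample-over-measurement order
# comes from the text; experiment-over-sample is an assumed extension.
ARCHIVE_PRECEDENCE = ("experiment_id", "sample_id", "measurement_id")


def as_list(value):
    """Treat a single value and an array of values uniformly."""
    return [value] if isinstance(value, str) else list(value)


def archive_level(parameters):
    """Return the linkage key that determines the archive path."""
    for key in ARCHIVE_PRECEDENCE:
        if parameters.get(key):
            return key
    raise ValueError("no linkage parameter present")


def archive_suffix(parameters):
    """Hash the values of the deciding key (stand-in hash, not the real one)."""
    values = sorted(as_list(parameters[archive_level(parameters)]))
    digest = hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()
    return digest[:16]


both = {
    "sample_id": "sample.tacc.abcde",
    "measurement_id": ["measurement.tacc.1234", "measurement.tacc.2345"],
}
print(archive_level(both))
```

Sorting the values before hashing keeps the suffix stable when the same identifiers are passed in a different order, which is a design choice of this sketch rather than documented behavior.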

Here is the example job request as it stands so far, with a linkage parameter added:

.. code-block:: json

    {
      "parameters": {
        "sample_id": "sample.tacc.abcde"
      },
      "job_definition": {
        "appId": "tacobot9000-0.1.0u1",
        "name": "TACObot job",
        "inputs": {
          "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
        },
        "parameters": {
          "salsa": true,
          "avocado": false,
          "cheese": true
        },
        "maxRunTime": "01:00:00"
      }
    }

Additional Control Parameters

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Job behavior can be refined with additional control parameters.

instanced
#########

Each PipelineJob has a distinct archive path derived from its Pipeline UUID,
the ``data`` dictionary passed at job ``init()`` and/or ``setup()``, and a
function of its linkage parameters to experiments, samples, or measurements.
To avoid inadvertent over-writes, the archive path is extended with an
*instancing directory* named in the form ``adjective-animal-YYYYMMDDTHHmmssZ``.
To avoid use of the instancing directory, include ``instanced: false`` in the
job request message.

Example: ``"instanced": false``
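
To make the naming form concrete, here is a small sketch that produces a directory name of the shape ``adjective-animal-YYYYMMDDTHHmmssZ``. The word lists and the helper names ``instancing_directory`` and ``archive_path`` are invented for illustration; the actual vocabulary and implementation used by PipelineJobs are not shown in this document.

```python
import random
from datetime import datetime, timezone

# Hypothetical word lists; the real vocabulary is not documented here.
ADJECTIVES = ("brave", "calm", "eager")
ANIMALS = ("otter", "heron", "lynx")


def instancing_directory(now=None, rng=random):
    """Build a name of the form adjective-animal-YYYYMMDDTHHmmssZ (UTC)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "{}-{}-{}".format(rng.choice(ADJECTIVES), rng.choice(ANIMALS), stamp)


def archive_path(base, instanced=True, now=None):
    """Extend a base archive path unless instancing is switched off."""
    if not instanced:
        return base
    return base.rstrip("/") + "/" + instancing_directory(now)


print(archive_path("/products/v2/example"))
print(archive_path("/products/v2/example", instanced=False))
```

The base path ``/products/v2/example`` is a placeholder. With ``instanced=False`` the base path is returned unchanged, which is why repeated runs could overwrite each other's files in that mode.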

index_patterns
##############

By default, the PipelineJobs System indexes every file found under a job's
archive path and links it to that specific job. To select only specific
files, include one or more Python regular expressions in ``index_patterns``.
Only files matching these patterns will be linked to the job.

Example: ``"index_patterns": []``
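
The effect of ``index_patterns`` can be previewed locally before a job is submitted. The sketch below makes two assumptions that this document does not confirm: that an empty list behaves like the default (every file is indexed), and that patterns are applied with ``re.search`` against the file path rather than with a full match. ``select_indexed_files`` is a hypothetical helper.

```python
import re


def select_indexed_files(paths, index_patterns=None):
    """Return the paths that would be linked to the job.

    Assumptions: no patterns means "index everything", and a path is
    selected when any pattern matches somewhere in it (re.search).
    """
    if not index_patterns:
        return list(paths)
    compiled = [re.compile(pattern) for pattern in index_patterns]
    return [p for p in paths if any(c.search(p) for c in compiled)]


files = ["run1/output.csv", "run1/debug.log", "run1/plots/summary.png"]
print(select_indexed_files(files, [r"\.csv$", r"\.png$"]))
```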

processing_level
################

The default behavior of the PipelineJobs System is to index files under a job's
archive path as processing level "1". To change this, an alternative
``processing_level`` may be passed in the job request message.

Example: ``"processing_level": "2"``

.. note:: Only one automatic indexing configuration can be active for a given
          job. Additional indexing actions with other configurations may be
          initiated by sending a message directly to **PipelineJobs Indexer**.

Job Life Cycle
--------------

Here is a complete record from the Pipelines system showing how the information
from job creation and subsequent events is stored and discoverable. A few key
highlights:

* The top-level ``data`` field holds the original parameterization of the job

* Three events are noted in the ``history``: create, run, finish

* The actor and execution for the managing instance of **PipelineJobs Agave Proxy** are available under ``agent`` and ``task``, respectively

.. literalinclude:: jobdocument.json
   :language: json
   :linenos:

.. _JSONSchemas:

JSON Schemas
------------

.. literalinclude:: schemas/agave2.jsonschema
   :language: json
   :linenos:
   :caption: agave_pipelinejob
