# siskin
Various tasks for heterogeneous metadata handling for the
[finc](https://finc.info) project at [Leipzig University Library](https://www.ub.uni-leipzig.de). Based on
[luigi](https://github.com/spotify/luigi) from Spotify.

We use a couple of [scripts](bin) in the repository to harvest about twenty
[data sources](siskin/sources) of various flavors (FTP, OAI, HTTP), mix and
match CSV, XML and JSON, and run conversions and deduplication to create a single
file that is indexable and conforms to a customized VuFind SOLR schema, running
on a unified index host that serves part of the data in the online catalogs of
[partners](https://finc.info/de/anwender).

[![DOI](https://zenodo.org/badge/21136499.svg)](https://zenodo.org/badge/latestdoi/21136499) [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
* Overview in a [few markdown slides](https://github.com/miku/siskin/blob/master/docs/ai-overview/slides.md)
Luigi (and other frameworks) allows dividing complex workflows into a set of
tasks, which form a
[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). The task logic is
implemented in Python, but it is easy to use external tools, e.g. via
[ExternalProgram](https://github.com/spotify/luigi/blob/master/luigi/contrib/external_program.py)
or [shellout](https://github.com/miku/gluish#easy-shell-calls). Luigi is
workflow glue and scales up (HDFS) and down (local scheduler).
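To make this concrete, here is a minimal sketch of two tasks forming a tiny DAG; the task names and file paths are invented for illustration and are not part of siskin:

```python
# Minimal sketch of a two-task Luigi DAG (illustrative only, not actual siskin code).
import luigi


class Download(luigi.Task):
    """Fetch a file; an external tool (curl, metha-sync, ...) could be shelled out here."""

    def output(self):
        return luigi.LocalTarget("download.xml")

    def run(self):
        with self.output().open("w") as handle:
            handle.write("<record/>\n")


class Extract(luigi.Task):
    """Depends on Download; together the two tasks form a tiny DAG."""

    def requires(self):
        return Download()

    def output(self):
        return luigi.LocalTarget("extract.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())


if __name__ == "__main__":
    luigi.build([Extract()], local_scheduler=True)
```

Running the module builds `Extract`, which first builds `Download` if its output is missing.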
More on Luigi:

* [Luigi docs](https://luigi.readthedocs.io/en/stable/)
* [Luigi presentation at LPUG 2015](https://github.com/miku/lpug-luigi)
* [Luigi workshop at PyCon Balkan 2018](https://github.com/miku/batchdata)
* [Data pipelines, Luigi, Airflow: everything you need to know](https://towardsdatascience.com/data-pipelines-luigi-airflow-everything-you-need-to-know-18dc741449b7)

More about the project:
* [Blog about index](https://finc.info/Archive/268) [de], 2015
* [Presentation at 4th VuFind Meetup](https://swop.bsz-bw.de/frontdoor/index/index/docId/1104) [de], 2015
* [Metadaten zwischen Autopsie und Automatisierung](http://web.archive.org/web/20220617102023id_/https://www.bibliotheksverband.de/sites/default/files/2021-11/Erwkomm_Fortbild_Ddorf2018_Wiesenmueller.pdf#page=26) [de], 2018

Contents.
* [Install](#install)
* [Update](#update)
* [Run](#run)
* [Create an aggregated file for finc](#create-an-aggregated-file-for-finc)
* [Configuration](#configuration)
* [Software versioning](#software-versioning)
* [Schema changes](#schema-changes)
* [Task dependencies](#task-dependencies)
* [Evolving workflows](#evolving-workflows)
* [Development](#development)
* [Naming conventions](#naming-conventions)
* [Deployment](#deployment)
* [TODO](#todo)

----
## Install
```
$ pip install -U siskin
```

The siskin project includes a [bunch of
scripts](https://github.com/miku/siskin/tree/master/bin) that allow you to create,
inspect or remove tasks and task artifacts.

Starting 02/2020, only Python 3 is supported.
Run `taskchecksetup` to see what additional tools might need to be installed
(this is a manually [curated](https://git.io/fhZvG) list; not everything is
required for every task).

```shell
$ taskchecksetup
ok 7z
ok csvcut
ok curl
ok filterline
ok flux.sh
ok groupcover
ok iconv
ok iconv-chunks
ok jq
ok metha-sync
ok pigz
ok solrbulk
ok span-import
ok unzip
ok wget
ok xmllint
ok yaz-marcdump
```

## Update
For siskin updates, a

```
$ pip install -U siskin
```

should suffice. If newer versions of external programs are required, please
update those manually (e.g. via your OS package manager).

## Run
List tasks:

    $ tasknames
A task is an encapsulation of a processing step and can be, in theory, anything.
Typical tasks are: fetching data from an FTP server, an OAI endpoint or an HTTP API, format
conversions, filters or reports. Many tasks are parameterized by date (with the
default often being *today*), which allows siskin to keep track of whether an artifact
is up to date or not.
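As a sketch of how the date parameter feeds into artifact tracking (the task name and output path below are invented, not actual siskin tasks), the date typically becomes part of the output file name, so an existing file marks the task as complete for that day:

```python
# Illustrative only: a date-parameterized task whose output path encodes the date.
import datetime

import luigi


class Harvest(luigi.Task):
    """Hypothetical task; real siskin tasks follow the same pattern."""

    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # If this file exists, the task counts as done for that date.
        return luigi.LocalTarget("harvest-{}.xml".format(self.date))

    def run(self):
        with self.output().open("w") as handle:
            handle.write("<records/>\n")


if __name__ == "__main__":
    # Roughly what running the task via the command line wrappers would trigger.
    luigi.build([Harvest()], local_scheduler=True)
```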
Run a simple task:

    $ taskdo DOAJHarvest
Documentation:

    $ taskdocs | less -R
Remove artifacts of a task:

    $ taskrm DOAJHarvest
Inspect the source code of a task:
```python
$ taskinspect AILocalData
class AILocalData(AITask):
    """
    Extract a CSV about source, id, doi and institutions for deduplication.
    """
    date = ClosestDateParameter(default=datetime.date.today())
    batchsize = luigi.IntParameter(default=25000, significant=False)

    def requires(self):
        return AILicensing(date=self.date)
    ...
```

## Create an aggregated file for finc
There are a couple of prerequisites:
* [ ] siskin is [installed](https://github.com/miku/siskin/#install)
* [ ] most additional tools are installed (or: output of the `taskchecksetup` is mostly green)
* [ ] credentials are [configured](https://github.com/miku/siskin/#configuration) in */etc/siskin/siskin.ini* or *~/.config/siskin/siskin.ini*
* [ ] some static data (that cannot be accessed over the net) is put into place (and configured in *siskin.ini*)
* [ ] sufficient disk space is available

The update process itself consists of various updates:
* all data sources (crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)
* the licensing data is fetched from [AMSL](https://amsl.technology)

The dependency graph of these operations can become complex:
![](docs/catalog/AIUpdate.png)
However, if everything is put into place, a single command will suffice:
```shell
$ taskdo AIUpdate --workers 4
```

This can be a long-running command (hours, days), depending on the state of the already cached data.
Note: Currently a jour fixe (the 15th of each month) is used as the default for the
licensing information (another task, called *AMSLFilterConfigFreeze*, should be
run daily for this to work). The jour fixe can be overridden with the *current* information by passing a parameter to the *AILicensing* task:

```
$ taskdo AIUpdate --workers 4 --AILicensing-override
```

Once the task is completed, the output of the two tasks:
* AIExport (solr)
* AIRedact (blob, currently [microblob](https://github.com/miku/microblob))

can be put into their respective data stores (e.g. via [solrbulk](https://github.com/miku/solrbulk)).
## Configuration
The siskin package harvests all kinds of data sources, some of which might be
protected. All credentials and a few other configuration options go into a
`siskin.ini`, either in `/etc/siskin/` or `~/.config/siskin/`. If both files
are present, the local options take precedence.

Luigi uses a bit of configuration as well; put it under `/etc/luigi/`.
Completion on task names will save you typing and time, so put
`siskin_completion.sh` under `/etc/bash_completion.d` or somewhere else.

```shell
$ tree etc
etc
├── bash_completion.d
│   └── siskin_completion.sh
├── luigi
│   ├── luigi.cfg
│   └── logging.ini
└── siskin
    └── siskin.ini
```

All configuration values can be inspected quickly with:
```
$ taskconfig
[core]
home = /var/siskin

[imslp]
listings-url = https://example.org/abc

[jstor]
ftp-username = abc
ftp-password = d3f
...
```
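For illustration, the precedence between the two files could be expressed with `configparser` like this; this is only a sketch, not siskin's actual configuration loader:

```python
# Sketch only: how ~/.config/siskin/siskin.ini can override /etc/siskin/siskin.ini.
import configparser
import os

config = configparser.ConfigParser()
# Later files win for duplicate keys, so the local file takes precedence.
config.read([
    "/etc/siskin/siskin.ini",
    os.path.expanduser("~/.config/siskin/siskin.ini"),
])

print(config.get("core", "home", fallback="/var/siskin"))
```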
## Software versioning

Since siskin works mostly *on data*, software versioning differs a bit, but we
try to adhere to the following rules:

* *major* changes: *You need to recreate all your data from scratch*.
* *minor* changes: We added, renamed or removed *at least one task*. You will
have to recreate a subset of the tasks to see the changes. You might need to change
pipelines depending on those tasks, because they might not exist any more or have been renamed.
* *revision* changes: A modification within existing tasks (e.g. bugfixes).
You will have to recreate a subset of the tasks to see these changes, but no new
task is introduced. *No pipeline is broken that wasn't already*.

These rules apply for version 0.2.0 and later. To see the current version, use:
```shell
$ taskversion
0.43.3
```

## Schema changes
To remove all files of a certain format (due to schema changes or such), it helps if naming is uniform:
```shell
$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}
...
```

Apart from that, all upstream tasks need to be removed manually (consult the
[map](https://git.io/v5sdS)), as this is not automatic yet.

## Task dependencies
Inspect task dependencies with:
```shell
$ taskdeps JstorIntermediateSchema
└─ JstorIntermediateSchema(date=2018-05-25)
   └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
   └─ JstorCollectionMapping(date=2018-05-25)
   └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)
```

Or visually via [graphviz](https://www.graphviz.org/):
```shell
$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png
```
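Conceptually, listing dependencies amounts to walking `requires()` recursively; a rough sketch of that idea (not the actual `taskdeps` implementation):

```python
# Sketch: print a task's dependency tree by walking requires() recursively.
from luigi.task import flatten


def print_deps(task, indent=0):
    print("   " * indent + "└─ " + repr(task))
    for dep in flatten(task.requires()):
        print_deps(dep, indent + 1)


# Usage, assuming the task class in question is importable:
# print_deps(JstorIntermediateSchema())
```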
## Evolving workflows

![](http://i.imgur.com/8bFvSvN.gif)
## Development
To converge the project on a common format, run:
```shell
$ make imports style
```

This will fix import order and code style in place. It requires isort and yapf to be
installed and should be executed under Python 3 only (as the Python 2 isort seems to
have differing opinions).

Other tools:
* use [pylint](https://github.com/PyCQA/pylint), currently 9.18/10 with many errors ignored, maybe with [git commit hook](https://github.com/sebdah/git-pylint-commit-hook)
* use [pytest](https://docs.pytest.org/), [pytest-cov](https://pypi.org/project/pytest-cov/), coverage at 9%
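A sketch of what a pytest-style check for a task might look like (the task is hypothetical and not part of the repository):

```python
# Sketch: run a small task end-to-end with the local scheduler and check its output.
import luigi


class Fixture(luigi.Task):
    """Hypothetical task used only for this example."""

    def output(self):
        return luigi.LocalTarget("fixture.txt")

    def run(self):
        with self.output().open("w") as handle:
            handle.write("ok\n")


def test_fixture_output():
    task = Fixture()
    assert luigi.build([task], local_scheduler=True)
    assert task.output().exists()
```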
## Naming conventions

Some conventions are enforced by tools (e.g. imports, yapf), but the following
may be considered as well.

### Task names and filenames
* task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC
* task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. ArchiveIntermediateSchema
* tasks for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList
* source prefix names should follow the source names (e.g. site of publisher), in German: *vorlagegetreu*, e.g. DOAJHarvest, GallicaMARC
* potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py
* it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py

### Module docstrings for tasks (and scripts)
Rough examples:
```python
# coding: utf-8
# pylint: ...
#
# Copyright 2019 ... GPL-3.0+ snippet
# ...
# @license GPL-3.0+

"""
Source: Gallica
SID: 20
Ticket: #14793
Origin: OAI
Updates: monthly

Config:
[vkfilm]
input = /path/to/file
password = helloadmin
"""
```
### Quoting style
* use double quotes, if possible
### Executable
* if a module can be used as a standalone script, it should include the following as its first line:
```
#!/usr/bin/env python
```

### Python 2/3 considerations
* use [six](https://six.readthedocs.io/), if necessary
* use `__future__` imports if necessary
* prefer [io.open](https://docs.python.org/3/library/io.html#io.open) to the raw open, e.g. the Python 2 builtin has no `encoding` keyword
* string literals should be written with the `u` prefix (obsolete in Python 3, but required in Python 2)
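A minimal sketch combining these points (the file name is made up):

```python
# coding: utf-8
# Sketch: Python 2/3-compatible file handling as recommended above.
from __future__ import print_function

import io

with io.open(u"example.tsv", encoding=u"utf-8") as handle:
    for line in handle:
        print(line.strip())
```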
### Debugging

* prefer logging over print statements
### Open for discussion
* one suffix for data acquisition tasks, e.g. Harvest, Get, Fetch, Download, ...
## Deployment
A distribution can be created via the Makefile.
```shell
$ make dist
$ tree dist/
dist/
└── siskin-0.62.0.tar.gz
```

The tarball can be installed via [pip](https://pypi.org/project/pip/):
```
$ pip install siskin-0.62.0.tar.gz
```

If access to PyPI is possible, one can upload the tarball there with:
```
$ make upload
```

This in turn allows installing siskin via:
```
$ pip install -U siskin
```on the target machine.
## TODO
* [ ] The naming of the scripts is a bit unfortunate, `taskdo`, `taskcat`,
.... Maybe better `siskin run`, `siskin cat`, `siskin rm` and so on.
* [ ] Investigate [pytest](https://docs.pytest.org/en/latest/) for testing tasks, given inputs.

# Misc
A short video using luigi's [on_success and
on_failure](https://luigi.readthedocs.io/en/stable/api/luigi.task.html#luigi.task.Task.on_failure)
handlers to make the processing audible.

[![](docs/screenie_14.png)](https://archive.org/details/the-sound-of-data-being-processed-2014)
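A minimal sketch of such handlers; the `play()` helper is hypothetical and only stands in for whatever produces the sound:

```python
# Sketch: hook sounds into task completion and failure (play() is hypothetical).
import luigi


def play(soundfile):
    """Hypothetical helper that plays a sound file."""


class NoisyTask(luigi.Task):

    def run(self):
        pass

    def on_success(self):
        play("success.wav")

    def on_failure(self, exception):
        play("failure.wav")
        return super(NoisyTask, self).on_failure(exception)
```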