# siskin
Various tasks for heterogeneous metadata handling for the
[finc](https://finc.info) project at [Leipzig University Library](https://www.ub.uni-leipzig.de). Based on
[luigi](https://github.com/spotify/luigi) from Spotify.

We use a couple of [scripts](bin) in the repository to harvest about twenty
[data sources](siskin/sources) of various flavors (FTP, OAI, HTTP), mix and
match CSV, XML and JSON, and run conversions and deduplication to create a single
file that is indexable and conforms to a customized VuFind SOLR schema, running
on a unified index host that serves part of the data in the online catalogs of
[partners](https://finc.info/de/anwender).

[![DOI](https://zenodo.org/badge/21136499.svg)](https://zenodo.org/badge/latestdoi/21136499) [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
* Overview in a [few markdown slides](https://github.com/miku/siskin/blob/master/docs/ai-overview/slides.md)
Luigi (and other frameworks) allows dividing complex workflows into a set of
tasks, which form a
[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). The task logic is
implemented in Python, but it is easy to use external tools, e.g. via
[ExternalProgram](https://github.com/spotify/luigi/blob/master/luigi/contrib/external_program.py)
or [shellout](https://github.com/miku/gluish#easy-shell-calls). Luigi is
workflow glue and scales up (HDFS) and down (local scheduler).
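To make this concrete, here is a minimal sketch of two tasks forming a tiny DAG; the task names and file paths are invented for illustration and are not part of siskin:

```python
# Minimal sketch of a two-task Luigi DAG (illustrative only, not actual siskin code).
import luigi


class Download(luigi.Task):
    """Fetch a file; an external tool (curl, metha-sync, ...) could be shelled out here."""

    def output(self):
        return luigi.LocalTarget("download.xml")

    def run(self):
        with self.output().open("w") as handle:
            handle.write("<record/>\n")


class Extract(luigi.Task):
    """Depends on Download; together the two tasks form a tiny DAG."""

    def requires(self):
        return Download()

    def output(self):
        return luigi.LocalTarget("extract.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())


if __name__ == "__main__":
    luigi.build([Extract()], local_scheduler=True)
```

Running the module builds `Extract`, which first builds `Download` if its output is missing.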
More on Luigi:

* [Luigi docs](https://luigi.readthedocs.io/en/stable/)
* [Luigi presentation at LPUG 2015](https://github.com/miku/lpug-luigi)
* [Luigi workshop at PyCon Balkan 2018](https://github.com/miku/batchdata)
* [Data pipelines, Luigi, Airflow: everything you need to know](https://towardsdatascience.com/data-pipelines-luigi-airflow-everything-you-need-to-know-18dc741449b7)

More about the project:
* [Blog about index](https://finc.info/Archive/268) [de], 2015
* [Presentation at 4th VuFind Meetup](https://swop.bsz-bw.de/frontdoor/index/index/docId/1104) [de], 2015
* [Metadaten zwischen Autopsie und Automatisierung](http://web.archive.org/web/20220617102023id_/https://www.bibliotheksverband.de/sites/default/files/2021-11/Erwkomm_Fortbild_Ddorf2018_Wiesenmueller.pdf#page=26) [de], 2018

Contents.
* [Install](#install)
* [Update](#update)
* [Run](#run)
* [Create an aggregated file for finc](#create-an-aggregated-file-for-finc)
* [Configuration](#configuration)
* [Software versioning](#software-versioning)
* [Schema changes](#schema-changes)
* [Task dependencies](#task-dependencies)
* [Evolving workflows](#evolving-workflows)
* [Development](#development)
* [Naming conventions](#naming-conventions)
* [Deployment](#deployment)
* [TODO](#todo)

----
## Install
```
$ pip install -U siskin
```

The siskin project includes a [bunch of
scripts](https://github.com/miku/siskin/tree/master/bin) that allow you to create,
inspect or remove tasks and task artifacts.

Starting 02/2020, only Python 3 is supported.
Run `taskchecksetup` to see what additional tools might need to be installed
(this is a manually [curated](https://git.io/fhZvG) list; not everything is
required for every task).

```shell
$ taskchecksetup
ok 7z
ok csvcut
ok curl
ok filterline
ok flux.sh
ok groupcover
ok iconv
ok iconv-chunks
ok jq
ok metha-sync
ok pigz
ok solrbulk
ok span-import
ok unzip
ok wget
ok xmllint
ok yaz-marcdump
```

## Update
For siskin updates, a

```
$ pip install -U siskin
```

should suffice. If newer versions of external programs are required, please
update those manually (e.g. via your OS package manager).

## Run
List tasks:

    $ tasknames
A task is an encapsulation of a processing step and can be, in theory, anything.
Typical tasks are: fetching data from an FTP server, an OAI endpoint or an HTTP API, format
conversions, filters or reports. Many tasks are parameterized by date (with the
default often being *today*), which allows siskin to keep track of whether an artifact
is up to date or not.
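As a sketch of how the date parameter feeds into artifact tracking (the task name and output path below are invented, not actual siskin tasks), the date typically becomes part of the output file name, so an existing file marks the task as complete for that day:

```python
# Illustrative only: a date-parameterized task whose output path encodes the date.
import datetime

import luigi


class Harvest(luigi.Task):
    """Hypothetical task; real siskin tasks follow the same pattern."""

    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # If this file exists, the task counts as done for that date.
        return luigi.LocalTarget("harvest-{}.xml".format(self.date))

    def run(self):
        with self.output().open("w") as handle:
            handle.write("<records/>\n")


if __name__ == "__main__":
    # Roughly what running the task via the command line wrappers would trigger.
    luigi.build([Harvest()], local_scheduler=True)
```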
Run a simple task:

    $ taskdo DOAJHarvest
Documentation:

    $ taskdocs | less -R
Remove artifacts of a task:

    $ taskrm DOAJHarvest
Inspect the source code of a task:
```python
$ taskinspect AILocalData
class AILocalData(AITask):
    """
    Extract a CSV about source, id, doi and institutions for deduplication.
    """
    date = ClosestDateParameter(default=datetime.date.today())
    batchsize = luigi.IntParameter(default=25000, significant=False)

    def requires(self):
        return AILicensing(date=self.date)
    ...
```

## Create an aggregated file for finc
There are a couple of prerequisites:
* [ ] siskin is [installed](https://github.com/miku/siskin/#install)
* [ ] most additional tools are installed (or: output of the `taskchecksetup` is mostly green)
* [ ] credentials are [configured](https://github.com/miku/siskin/#configuration) in */etc/siskin/siskin.ini* or *~/.config/siskin/siskin.ini*
* [ ] some static data (that cannot be accessed over the net) is put into place (and configured in *siskin.ini*)
* [ ] sufficient disk space is available

The update process itself consists of various updates:
* all data sources (crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)
* the licensing data is fetched from [AMSL](https://amsl.technology)

The dependency graph of these operations can become complex:
![](docs/catalog/AIUpdate.png)
However, if everything is put into place, a single command will suffice:
```shell
$ taskdo AIUpdate --workers 4
```

This can be a long-running command (hours, days), depending on the state of the already cached data.
Note: Currently a jour fixe (the 15th of each month) is used as the default for the
licensing information (another task, called *AMSLFilterConfigFreeze*, should be
run daily for this to work). The jour fixe can be overridden with the *current* information by passing a parameter to the *AILicensing* task:

```
$ taskdo AIUpdate --workers 4 --AILicensing-override
```

Once the task is completed, the output of the two tasks:
* AIExport (solr)
* AIRedact (blob, currently [microblob](https://github.com/miku/microblob))

can be put into their respective data stores (e.g. via [solrbulk](https://github.com/miku/solrbulk)).
## Configuration
The siskin package harvests all kinds of data sources, some of which might be
protected. All credentials and a few other configuration options go into a
`siskin.ini`, either in `/etc/siskin/` or `~/.config/siskin/`. If both files
are present, the local options take precedence.

Luigi uses a bit of configuration as well; put it under `/etc/luigi/`.
Completion on task names will save you typing and time, so put
`siskin_completion.sh` under `/etc/bash_completion.d` or somewhere else.

```shell
$ tree etc
etc
├── bash_completion.d
│   └── siskin_completion.sh
├── luigi
│   ├── luigi.cfg
│   └── logging.ini
└── siskin
    └── siskin.ini
```

All configuration values can be inspected quickly with:
```
$ taskconfig
[core]
home = /var/siskin

[imslp]
listings-url = https://example.org/abc

[jstor]
ftp-username = abc
ftp-password = d3f
...
```
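For illustration, the precedence between the two files could be expressed with `configparser` like this; this is only a sketch, not siskin's actual configuration loader:

```python
# Sketch only: how ~/.config/siskin/siskin.ini can override /etc/siskin/siskin.ini.
import configparser
import os

config = configparser.ConfigParser()
# Later files win for duplicate keys, so the local file takes precedence.
config.read([
    "/etc/siskin/siskin.ini",
    os.path.expanduser("~/.config/siskin/siskin.ini"),
])

print(config.get("core", "home", fallback="/var/siskin"))
```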
## Software versioning

Since siskin works mostly *on data*, software versioning differs a bit, but we
try to adhere to the following rules:

* *major* changes: *You need to recreate all your data from scratch*.
* *minor* changes: We added, renamed or removed *at least one task*. You will
have to recreate a subset of the tasks to see the changes. You might need to change
pipelines depending on those tasks, because they might not exist any more or have been renamed.
* *revision* changes: A modification within existing tasks (e.g. bugfixes).
You will have to recreate a subset of the tasks to see these changes, but no new
task is introduced. *No pipeline is broken that wasn't already*.

These rules apply for version 0.2.0 and later. To see the current version, use:
```shell
$ taskversion
0.43.3
```

## Schema changes
To remove all files of a certain format (due to schema changes or such), it helps if naming is uniform:
```shell
$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}
...
```

Apart from that, all upstream tasks need to be removed manually (consult the
[map](https://git.io/v5sdS)), as this is not automatic yet.

## Task dependencies
Inspect task dependencies with:
```shell
$ taskdeps JstorIntermediateSchema
└─ JstorIntermediateSchema(date=2018-05-25)
   └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
   └─ JstorCollectionMapping(date=2018-05-25)
   └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)
```

Or visually via [graphviz](https://www.graphviz.org/):
```shell
$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png
```
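Conceptually, listing dependencies amounts to walking `requires()` recursively; a rough sketch of that idea (not the actual `taskdeps` implementation):

```python
# Sketch: print a task's dependency tree by walking requires() recursively.
from luigi.task import flatten


def print_deps(task, indent=0):
    print("   " * indent + "└─ " + repr(task))
    for dep in flatten(task.requires()):
        print_deps(dep, indent + 1)


# Usage, assuming the task class in question is importable:
# print_deps(JstorIntermediateSchema())
```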
## Evolving workflows

![](http://i.imgur.com/8bFvSvN.gif)
## Development
To converge the project on a common format, run:
```shell
$ make imports style
```

This will fix import order and code style in place. It requires isort and yapf to be
installed and should be executed under Python 3 only (as the Python 2 isort seems to
have differing opinions).

Other tools:
* use [pylint](https://github.com/PyCQA/pylint), currently 9.18/10 with many errors ignored, maybe with [git commit hook](https://github.com/sebdah/git-pylint-commit-hook)
* use [pytest](https://docs.pytest.org/), [pytest-cov](https://pypi.org/project/pytest-cov/), coverage at 9%
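A sketch of what a pytest-style check for a task might look like (the task is hypothetical and not part of the repository):

```python
# Sketch: run a small task end-to-end with the local scheduler and check its output.
import luigi


class Fixture(luigi.Task):
    """Hypothetical task used only for this example."""

    def output(self):
        return luigi.LocalTarget("fixture.txt")

    def run(self):
        with self.output().open("w") as handle:
            handle.write("ok\n")


def test_fixture_output():
    task = Fixture()
    assert luigi.build([task], local_scheduler=True)
    assert task.output().exists()
```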
## Naming conventions

Some conventions are enforced by tools (e.g. imports, yapf), but the following
may be considered as well.

### Task names and filenames
* task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC
* task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. ArchiveIntermediateSchema
* tasks for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList
* source prefix names should follow the source names (e.g. site of publisher), in German: *vorlagegetreu*, e.g. DOAJHarvest, GallicaMARC
* potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py
* it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py

### Module docstrings for tasks (and scripts)
Rough examples:
```python
# coding: utf-8
# pylint: ...
#
# Copyright 2019 ... GPL-3.0+ snippet
# ...
# @license GPL-3.0+

"""
Source: Gallica
SID: 20
Ticket: #14793
Origin: OAI
Updates: monthly

Config:
[vkfilm]
input = /path/to/file
password = helloadmin
"""
```
### Quoting style
* use double quotes, if possible
### Executable
* if a module can be used as a standalone script, it should include the following as its first line:
```
#!/usr/bin/env python
```

### Python 2/3 considerations
* use [six](https://six.readthedocs.io/), if necessary
* use `__future__` imports if necessary
* prefer [io.open](https://docs.python.org/3/library/io.html#io.open) to the raw open, e.g. the Python 2 builtin has no `encoding` keyword
* string literals should be written with the `u` prefix (obsolete in Python 3, but required in Python 2)
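A minimal sketch combining these points (the file name is made up):

```python
# coding: utf-8
# Sketch: Python 2/3-compatible file handling as recommended above.
from __future__ import print_function

import io

with io.open(u"example.tsv", encoding=u"utf-8") as handle:
    for line in handle:
        print(line.strip())
```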
### Debugging

* prefer logging over print statements
### Open for discussion
* one suffix for data acquisition tasks, e.g. Harvest, Get, Fetch, Download, ...
## Deployment
A distribution can be created via the Makefile.
```shell
$ make dist
$ tree dist/
dist/
└── siskin-0.62.0.tar.gz
```

The tarball can be installed via [pip](https://pypi.org/project/pip/):
```
$ pip install siskin-0.62.0.tar.gz
```

If access to PyPI is possible, one can upload the tarball there with:
```
$ make upload
```

This in turn allows installing siskin via:
```
$ pip install -U siskin
```on the target machine.
## TODO
* [ ] The naming of the scripts is a bit unfortunate, `taskdo`, `taskcat`,
.... Maybe better `siskin run`, `siskin cat`, `siskin rm` and so on.
* [ ] Investigate [pytest](https://docs.pytest.org/en/latest/) for testing tasks, given inputs.

# Misc
A short video using luigi's [on_success and
on_failure](https://luigi.readthedocs.io/en/stable/api/luigi.task.html#luigi.task.Task.on_failure)
handlers to make the processing audible.

[![](docs/screenie_14.png)](https://archive.org/details/the-sound-of-data-being-processed-2014)
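A minimal sketch of such handlers; the `play()` helper is hypothetical and only stands in for whatever produces the sound:

```python
# Sketch: hook sounds into task completion and failure (play() is hypothetical).
import luigi


def play(soundfile):
    """Hypothetical helper that plays a sound file."""


class NoisyTask(luigi.Task):

    def run(self):
        pass

    def on_success(self):
        play("success.wav")

    def on_failure(self, exception):
        play("failure.wav")
        return super(NoisyTask, self).on_failure(exception)
```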