{"id":21620778,"url":"https://github.com/miku/siskin","last_synced_at":"2025-04-11T09:16:22.092Z","repository":{"id":18074362,"uuid":"21136499","full_name":"miku/siskin","owner":"miku","description":"Tasks around metadata.","archived":false,"fork":false,"pushed_at":"2025-03-19T15:21:00.000Z","size":95706,"stargazers_count":21,"open_issues_count":1,"forks_count":5,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-25T06:33:33.394Z","etag":null,"topics":["code4lib","library","luigi","luigi-pipeline","metadata"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-06-23T17:59:57.000Z","updated_at":"2025-03-19T15:21:04.000Z","dependencies_parsed_at":"2023-01-11T20:27:58.380Z","dependency_job_id":"c6364fd5-2ab9-4623-95d8-b163eb0248da","html_url":"https://github.com/miku/siskin","commit_stats":null,"previous_names":[],"tags_count":66,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fsiskin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fsiskin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fsiskin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fsiskin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/siskin/tar.gz/refs/heads/master
","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248366378,"owners_count":21091955,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code4lib","library","luigi","luigi-pipeline","metadata"],"created_at":"2024-11-24T23:12:47.595Z","updated_at":"2025-04-11T09:16:22.006Z","avatar_url":"https://github.com/miku.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# siskin\n\nVarious tasks for heterogeneous metadata handling for project\n[finc](https://finc.info) at [Leipzig University Library](https://www.ub.uni-leipzig.de). 
Based on\n[luigi](https://github.com/spotify/luigi) from Spotify.\n\nWe use a couple of [scripts](bin) in the repository to harvest about twenty\n[data sources](siskin/sources) of various flavors (FTPs, OAIs, HTTPs), mix and\nmatch CSV, XML and JSON, run conversions and deduplication to create a single\nfile that is indexable and conforms to a customized VuFind SOLR schema, running\non a unified index host serving part of the data in the online catalogs of\n[partners](https://finc.info/de/anwender).\n\n[![DOI](https://zenodo.org/badge/21136499.svg)](https://zenodo.org/badge/latestdoi/21136499) [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\n* Overview in a [few markdown slides](https://github.com/miku/siskin/blob/master/docs/ai-overview/slides.md)\n\nLuigi (and other frameworks) allow dividing complex workflows into a set of\ntasks, which form a\n[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). The task logic is\nimplemented in Python, but it is easy to use external tools, e.g. via\n[ExternalProgram](https://github.com/spotify/luigi/blob/master/luigi/contrib/external_program.py)\nor [shellout](https://github.com/miku/gluish#easy-shell-calls). 
Luigi is\nworkflow glue and scales up (HDFS) and down (local scheduler).\n\nMore on Luigi:\n\n* [Luigi docs](https://luigi.readthedocs.io/en/stable/)\n* [Luigi presentation at LPUG 2015](https://github.com/miku/lpug-luigi)\n* [Luigi workshop at PyCon Balkan 2018](https://github.com/miku/batchdata)\n* [Data pipelines, Luigi, Airflow: everything you need to know](https://towardsdatascience.com/data-pipelines-luigi-airflow-everything-you-need-to-know-18dc741449b7)\n\nMore about the project:\n\n* [Blog about index](https://finc.info/Archive/268) [de], 2015\n* [Presentation at 4th VuFind Meetup](https://swop.bsz-bw.de/frontdoor/index/index/docId/1104) [de], 2015\n* [Metadaten zwischen Autopsie und Automatisierung](http://web.archive.org/web/20220617102023id_/https://www.bibliotheksverband.de/sites/default/files/2021-11/Erwkomm_Fortbild_Ddorf2018_Wiesenmueller.pdf#page=26) [de], 2018\n\nContents.\n\n* [Install](#install)\n* [Update](#update)\n* [Run](#run)\n* [Create an aggregated file for finc](#create-an-aggregated-file-for-finc)\n* [Configuration](#configuration)\n* [Software versioning](#software-versioning)\n* [Schema changes](#schema-changes)\n* [Task dependencies](#task-dependencies)\n* [Evolving workflows](#evolving-workflows)\n* [Development](#development)\n* [Naming conventions](#naming-conventions)\n* [Deployment](#deployment)\n* [TODO](#todo)\n\n----\n\n## Install\n\n```\n$ pip install -U siskin\n```\n\nThe siskin project includes a [bunch of\nscripts](https://github.com/miku/siskin/tree/master/bin), that allow to create,\ninspect or remove tasks or task artifacts.\n\nStarting 02/2020, only Python 3 is supported.\n\nRun `taskchecksetup` to see, what additional tools might need to be installed\n(this is a manually [curated](https://git.io/fhZvG) list, not everything is\nrequired for every task).\n\n```shell\n$ taskchecksetup\nok      7z\nok      csvcut\nok      curl\nok      filterline\nok      flux.sh\nok      groupcover\nok      iconv\nok      
iconv-chunks\nok      jq\nok      metha-sync\nok      pigz\nok      solrbulk\nok      span-import\nok      unzip\nok      wget\nok      xmllint\nok      yaz-marcdump\n```\n\n## Update\n\nFor siskin updates, a\n\n```\n$ pip install -U siskin\n```\n\nshould suffice. If newer versions of external programs are required, then\nplease update those manually (e.g. via your OS' package manager).\n\n## Run\n\nList tasks:\n\n    $ tasknames\n\nA task is an encapsulation of a processing step and can be, in theory, anything.\nTypical tasks are: fetching data from FTP, OAI endpoint or an HTTP API, format\nconversions, filters or reports. Many tasks are parameterized by date (with the\ndefault often being *today*), which allows siskin to keep track of whether an artifact\nis up-to-date or not.\n\nRun a simple task:\n\n    $ taskdo DOAJHarvest\n\nDocumentation:\n\n    $ taskdocs | less -R\n\nRemove artifacts of a task:\n\n    $ taskrm DOAJHarvest\n\nInspect the source code of a task:\n\n```python\n$ taskinspect AILocalData\nclass AILocalData(AITask):\n    \"\"\"\n    Extract a CSV about source, id, doi and institutions for deduplication.\n    \"\"\"\n    date = ClosestDateParameter(default=datetime.date.today())\n    batchsize = luigi.IntParameter(default=25000, significant=False)\n\n    def requires(self):\n        return AILicensing(date=self.date)\n    ...\n```\n\n## Create an aggregated file for finc\n\nThere are a couple of prerequisites:\n\n* [ ] siskin is [installed](https://github.com/miku/siskin/#install)\n* [ ] most additional tools are installed (or: output of the `taskchecksetup` is mostly green)\n* [ ] credentials are [configured](https://github.com/miku/siskin/#configuration) in */etc/siskin/siskin.ini* or *~/.config/siskin/siskin.ini*\n* [ ] some static data (that cannot be accessed over the net) is put into place (and configured in *siskin.ini*)\n* [ ] sufficient disk space is available\n\nThe update process itself consists of various updates:\n\n* all data sources 
(crossref, doaj, ...) are updated, as needed (e.g. FTP is synced, OAI is harvested, API, ...)\n* the licensing data is fetched from [AMSL](https://amsl.technology)\n\nThe dependency graph of these operations can become complex:\n\n![](docs/catalog/AIUpdate.png)\n\nHowever, if everything is put into place, a single command will suffice:\n\n```shell\n$ taskdo AIUpdate --workers 4\n```\n\nThis can be a long-running (hours, days) command, depending on the state of the already cached data.\n\nNote: Currently a jour fixe (the 15th of a month) is used as default for the\nlicensing information (another task, called *AMSLFilterConfigFreeze* should be\nrun daily for this to work). The jour fixe can be overridden with the *current* information by passing a parameter to the *AILicensing* task:\n\n```\n$ taskdo AIUpdate --workers 4 --AILicensing-override\n```\n\nOnce the task is completed, the output of the two tasks:\n\n* AIExport (solr)\n* AIRedact (blob, currently [microblob](https://github.com/miku/microblob))\n\ncan be put into their respective data stores (e.g. via [solrbulk](https://github.com/miku/solrbulk)).\n\n## Configuration\n\nThe siskin package harvests all kinds of data sources, some of which might be\nprotected. All credentials and a few other configuration options go into a\n`siskin.ini`, either in `/etc/siskin/` or `~/.config/siskin/`. 
If both files\nare present, the local options take precedence.\n\nLuigi uses a bit of configuration as well; put it under `/etc/luigi/`.\n\nCompletions on task names will save you typing and time, so put\n`siskin_completion.sh` under `/etc/bash_completion.d` or somewhere else.\n\n```shell\n$ tree etc\netc\n├── bash_completion.d\n│   └── siskin_completion.sh\n├── luigi\n│   ├── luigi.cfg\n│   └── logging.ini\n└── siskin\n    └── siskin.ini\n```\n\nAll configuration values can be inspected quickly with:\n\n```\n$ taskconfig\n[core]\nhome = /var/siskin\n\n[imslp]\nlistings-url = https://example.org/abc\n\n[jstor]\n\nftp-username = abc\nftp-password = d3f\n...\n```\n\n## Software versioning\n\nSince siskin works mostly *on data*, software versioning differs a bit, but we\ntry to adhere to the following rules:\n\n* *major* changes: *You need to recreate all your data from scratch*.\n* *minor* changes: We added, renamed or removed *at least one task*. You will\n  have to recreate a subset of the tasks to see the changes. You might need to change\n  pipelines depending on those tasks, because they might not exist any more or have been renamed.\n* *revision* changes: A modification within existing tasks (e.g. bugfixes).\n  You will have to recreate a subset of the tasks to see these changes, but no new\n  task is introduced. *No pipeline is broken that wasn't already*.\n\nThese rules apply for version 0.2.0 and later. 
To see the current version, use:\n\n```shell\n$ taskversion\n0.43.3\n```\n\n## Schema changes\n\nTo remove all files of a certain format (due to schema changes or such) it helps, if naming is uniform:\n\n```shell\n$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}\n...\n```\n\nApart from that, all upstream tasks need to be removed manually (consult the\n[map](https://git.io/v5sdS)) as this is not automatic yet.\n\n## Task dependencies\n\nInspect task dependencies with:\n\n```shell\n$ taskdeps JstorIntermediateSchema\n  └─ JstorIntermediateSchema(date=2018-05-25)\n      └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)\n      └─ JstorCollectionMapping(date=2018-05-25)\n      └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)\n```\n\nOr visually via [graphviz](https://www.graphviz.org/).\n\n```shell\n$ taskdeps-dot JstorIntermediateSchema | dot -Tpng \u003e deps.png\n```\n\n## Evolving workflows\n\n![](http://i.imgur.com/8bFvSvN.gif)\n\n## Development\n\nTo converge the project on a common format run:\n\n```shell\n$ make imports style\n```\n\nThis will fix import order and code style in-place. Requires isort and yapf\ninstalled. Should be executed under Python 3 only (as Python 2 isort seems to\nhave differing opinions).\n\nOther tools:\n\n* use [pylint](https://github.com/PyCQA/pylint), currently 9.18/10 with many errors ignored, maybe with [git commit hook](https://github.com/sebdah/git-pylint-commit-hook)\n* use [pytest](https://docs.pytest.org/), [pytest-cov](https://pypi.org/project/pytest-cov/), coverage at 9%\n\n## Naming conventions\n\nSome conventions are enforced by tools (e.g. imports, yapf), but the following\nmay be considered as well.\n\n### Task names and filenames\n\n* task class names that produce MARC21 should have suffix MARC, e.g. ArchiveMARC\n* task class names that produce intermediate schema files should have suffix IntermediateSchema, e.g. 
ArchiveIntermediateSchema\n* tasks for a single source should share a prefix, e.g. ArchiveMARC, ArchiveISSNList\n* source prefix names should follow the source names (e.g. site of publisher), in German: *vorlagegetreu*, e.g. DOAJHarvest, GallicaMARC\n* potentially long source names can be shortened, e.g. Umweltbibliothek can become UmBi... in umbi.py\n* it is recommended that the source file name follows the source name, e.g. DOAJ tasks live in doaj.py\n\n### Module docstrings for tasks (and scripts)\n\nRough examples:\n\n```python\n# coding: utf-8\n# pylint: ...\n#\n# Copyright 2019 ... GPL-3.0+ snippet\n# ...\n# @license GPL-3.0+ \u003chttp://spdx.org/licenses/GPL-3.0+\u003e\n\n\"\"\"\n\nSource: Gallica\nSID: 20\nTicket: #14793\nOrigin: OAI\nUpdates: monthly\n\nConfig:\n\n[vkfilm]\ninput = /path/to/file\npassword = helloadmin\n\n\"\"\"\n\n```\n\n### Quoting style\n\n* use double quotes, if possible\n\n### Executable\n\n* if a module can be used as a standalone script, then it should include the following line as its first line:\n\n```\n#!/usr/bin/env python\n```\n\n### Python 2/3 considerations\n\n* use [six](https://six.readthedocs.io/), if necessary\n* use `__future__` imports if necessary\n* prefer [io.open](https://docs.python.org/3/library/io.html#io.open) to raw open, e.g. Python 2 builtin has no keyword `encoding`\n* string literals should be written with the `u` prefix (obsolete in Python 3, but required in Python 2)\n\n### Debugging\n\n* prefer logging over print statements\n\n### Open for discussion\n\n* one suffix for data acquisition tasks, e.g. 
Harvest, Get, Fetch, Download, ...\n\n## Deployment\n\nA distribution can be created via the Makefile.\n\n```shell\n$ make dist\n$ tree dist/\ndist/\n└── siskin-0.62.0.tar.gz\n```\n\nThe tarball can be installed via [pip](https://pypi.org/project/pip/):\n\n```\n$ pip install siskin-0.62.0.tar.gz\n```\n\nIf access to PyPI is possible, one can upload the tarball there with:\n\n```\n$ make upload\n```\n\nThis in turn allows installing siskin via:\n\n```\n$ pip install -U siskin\n```\n\non the target machine.\n\n## TODO\n\n* [ ] The naming of the scripts is a bit unfortunate, `taskdo`, `taskcat`,\n  .... Maybe better `siskin run`, `siskin cat`, `siskin rm` and so on.\n* [ ] Investigate [pytest](https://docs.pytest.org/en/latest/) for testing tasks, given inputs.\n\n# Misc\n\nA short video using luigi's [on_success and\non_failure](https://luigi.readthedocs.io/en/stable/api/luigi.task.html#luigi.task.Task.on_failure)\nhandlers to make the processing audible.\n\n[![](docs/screenie_14.png)](https://archive.org/details/the-sound-of-data-being-processed-2014)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fsiskin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Fsiskin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fsiskin/lists"}