{"id":43238792,"url":"https://github.com/libris/swepub-redux","last_synced_at":"2026-02-01T11:14:32.176Z","repository":{"id":53819871,"uuid":"446438675","full_name":"libris/swepub-redux","owner":"libris","description":null,"archived":false,"fork":false,"pushed_at":"2025-09-16T08:02:21.000Z","size":17953,"stargazers_count":3,"open_issues_count":10,"forks_count":0,"subscribers_count":8,"default_branch":"develop","last_synced_at":"2025-09-16T10:17:00.513Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/libris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-01-10T13:40:34.000Z","updated_at":"2025-09-16T08:02:25.000Z","dependencies_parsed_at":"2023-12-15T14:24:27.550Z","dependency_job_id":"d24ae382-b540-4656-b2fe-8d30a699a4fd","html_url":"https://github.com/libris/swepub-redux","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/libris/swepub-redux","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Fswepub-redux","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Fswepub-redux/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Fswepub-redux/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Fswepub-redux/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/libris","download_url":"https://codeload.github.com/libris/swepub-redux/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Fswepub-redux/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28977317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T09:57:52.632Z","status":"ssl_error","status_checked_at":"2026-02-01T09:57:49.143Z","response_time":56,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-01T11:14:31.724Z","updated_at":"2026-02-01T11:14:32.170Z","avatar_url":"https://github.com/libris.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Swepub\n\nSwepub consists of two programs\n\n1. A pipeline, which fetches and cleans up publication metadata from a set of sources, using OAI-PMH. From this data an sqlite3 database is created which serves as the source for the service (the other program).\n1. A service, the \"Swepub website\", which makes the aggregate Swepub data available for analysis.\n\n## Setup\n\nAssuming https://github.com/astral-sh/uv is installed, just clone this repo.\n\n### Setup: Annif\n\nFor automated subject classification we use [Annif](https://annif.org/). 
This is optional in the default DEV environment;\nif Annif is not detected, autoclassification will be automatically disabled.\nTo set it up, follow the instructions in https://github.com/libris/swepub-annif.\n\n## Pipeline\n\nTo run the pipeline and harvest a few sources do:\n\n```bash\nuv run python -m pipeline.harvest --update --skip-unpaywall mdh miun mau\n```\n\n(`--skip-unpaywall` avoids hitting a non-public Unpaywall mirror; alternatively, you could set `SWEPUB_SKIP_REMOTE` which skips both Unpaywall and other remote services (e.g. shortdoi.org, issn.org).)\n\nExpect this to take a few minutes. If you don't specify source(s) you instead get the full production data which takes a lot longer (~8 hours). Sources must exist in `pipeline/sources.json`. If the database doesn't exist, it will be created; if it already exists, sources will be incrementally updated (harvesting records added/updated/deleted since the previous execution of `pipeline.harvest --update`).\n\nTo forcibly create a new database, run `uv run python -m pipeline.harvest --force` (or `-f`).\n\nRun `uv run python -m pipeline.harvest -h` to see all options.\n\nThere are no running \"services\", nor any global state. Each time the pipeline is executed, an sqlite3 database is either created or updated. You may even run more than one harvester (with a different database path) in parallel if you like.\n\nYou can `purge` (delete) one or more sources. 
In combination with a subsequent `--update` command, this lets you completely remove a source and then harvest it fully, while keeping records from other sources in the database intact:\n\n```bash\nuv run python -m pipeline.harvest --purge uniarts\nuv run python -m pipeline.harvest --update uniarts\n```\n\n(If you omit the source name, all sources' records will be purged and fully harvested.)\n\nFor sources that keep track of deleted records, a much quicker way is:\n\n```bash\nuv run python -m pipeline.harvest --reset-harvest-time uniarts\nuv run python -m pipeline.harvest --update uniarts\n```\n\n`--reset-harvest-time` removes the `last_harvest` entry for the specified source(s), meaning the next `--update` will trigger a full harvest.\n\nThe resulting sqlite3 database has roughly the following structure (see storage.py for details):\n\n| Table | Description |\n| --- | --- |\n|original| Original XML data for every harvested record. |\n|converted| Converted+validated+normalized versions of each record. There is a foreign key for each row which references the 'original' table. |\n|cluster| Clusters of converted records that are all considered to be duplicates of the same publication. There is a foreign key which references the 'converted' table. |\n|finalized| A union or \"master\" record for each cluster, containing a merger of all the records in the cluster. There is a foreign key which references the 'cluster' table. |\n\n\n## Service\n\nIf you want the frontend, make sure Node.js/npm/yarn are installed, and then:\n\n```bash\nnpm ci --prefix service/vue-client/\nyarn --cwd service/vue-client/ build\n```\n\nFor development, you can run `yarn --cwd service/vue-client/ build --mode=development` instead; this lets you use e.g. Vue Devtools.\n\nTo start the Swepub service (which provides the API and serves the static frontend files, if they exist):\n\n```bash\nuv run python -m service.swepub\n```\n\nThen visit http://localhost:5000. 
API docs are available at http://localhost:5000/api/v2/apidocs.\n\nTo use a port other than 5000, set `SWEPUB_SERVICE_PORT`.\n\n\n## Training/updating Annif\nSee https://github.com/libris/swepub-annif\n\n\n## Tests\n\nUnit tests:\n\n```bash\nuv run pytest\n\n# Also test embedded doctests:\nuv run pytest --doctest-modules\n```\n\nTo harvest specific test records, first start the mock API server:\n\n```bash\nuv run python -m tests.mock_api\n```\n\nThen, in another terminal:\n\n```bash\nexport SWEPUB_SOURCE_FILE=tests/sources_test.json\nuv run python -m pipeline.harvest -f dedup\nuv run python -m service.swepub\n```\n\nTo add a new test record, for example `foobar.xml`:\n\nFirst, add `foobar.xml` to `tests/test_data`.\n\nThen, edit `tests/sources_test.json` and add an entry for the file:\n\n```json\n  \"foobar\": {\n    \"code\": \"whatever\",\n    \"name\": \"TEST - whatever\",\n    \"sets\": [\n      {\n        \"url\": \"http://localhost:8549/foobar\",\n        \"subset\": \"SwePub-whatever\",\n        \"metadata_prefix\": \"swepub_mods\"\n      }\n    ]\n  }\n```\n\nMake sure `foobar` in `url` corresponds to the filename (minus the `.xml` suffix).\n\nNow you should be able to harvest it:\n\n```bash\n# Again, make sure the mock API server is running and that you've set\n# export SWEPUB_SOURCE_FILE=tests/sources_test.json\nuv run python -m pipeline.harvest -f foobar\n```\n\n\n## Working with local XML files\n\nTo quickly test changes without having to hit real OAI-PMH servers over and over,\nyou can download the XML files locally and start a local OAI-PMH server from which\nyou can harvest.\n\n\n```bash\n# Harvest to disk\nuv run python -m misc.fetch_records uniarts ths # Saves to ./_xml by default; -h for options\n# _or_, to fetch all sources: uv run python -m misc.fetch_records\n\n# Start OAI-PMH server in the background or in another terminal\nuv run python -m misc.oai_pmh_server # See -h for options\n\n# Now harvest (--local-server defaults to 
http://localhost:8383/oai)\nuv run python -m pipeline.harvest -f uniarts ths --local-server\n```\n\nThis \"OAI-PMH server\" supports only the very bare minimum for `pipeline.harvest` to work\n(in non-incremental mode). Remember that if you run `misc.fetch_records` again, you need\nto restart `misc.oai_pmh_server` for it to pick up the changes.\n\nYou can also download only specific records (and when downloading specific records,\nthe XML will be pretty-printed):\n\n```bash\nuv run python -m misc.fetch_records oai:DiVA.org:uniarts-1146 oai:DiVA.org:lnu-108145\n```\n\n\n## Testing XML-\u003eJSON-LD conversion\n\nHaving downloaded the XML of a record (see \"Working with local XML files\" above):\n```bash\nuv run python -m misc.fetch_records oai:DiVA.org:uniarts-1146\n```\n\n...you can then test conversion from Swepub MODS XML to KBV/JSON-LD like so:\n\n\n```bash\nuv run python -m misc.mods_to_json resources/mods_to_xjsonld.xsl _xml/uniarts/oaiDiVA.orguniarts-1146.xml\n```\n\nThen you can edit `resources/mods_to_xjsonld.xsl` and/or `_xml/uniarts/oaiDiVA.orguniarts-1146.xml`,\nrun `misc.mods_to_json` again, and see what happens.\n\n(With `xsltproc` you can also see the intermediary post-XSLT, pre-JSON-conversion XML:\n`xsltproc resources/mods_to_xjsonld.xsl _xml/uniarts/oaiDiVA.orguniarts-1146.xml`)\n\n## Testing conversion, enrichment and \"legacy\" export pipeline\n\n```bash\nuv run python -m misc.testpipe _xml/sometestfile.xml /tmp/swepub-testpipe/\n```\nThis produces three files corresponding to the conversion, audit and legacy steps:\n```bash\n/tmp/swepub-testpipe/sometestfile-out-1.jsonld\n/tmp/swepub-testpipe/sometestfile-out-2-audited.jsonld\n/tmp/swepub-testpipe/sometestfile-out-3-legacy.jsonld\n```\n\n## Testing changes to Swepub \"legacy\" export\n\nInstall `mysql-server` or `mariadb-server`. 
Follow the instructions in `pipeline/legacy_sync.py`.\nThen, if you've set the relevant env vars (`SWEPUB_LEGACY_SEARCH_USER`, etc.), `pipeline.harvest`\nwill update the MySQL database. You can then inspect the JSON data of a specific record:\n\n```bash\nmysql -u\u003cuser\u003e -p\u003cpassword\u003e swepub_legacy -sN -e \"select data from enriched where identifier = 'oai:DiVA.org:uniarts-1146';\" | jq\n```\n\n## Updating Resources\n\nSome resources are generated from external data, and may eventually become stale and in need of a refresh. See `resources/Makefile`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flibris%2Fswepub-redux","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flibris%2Fswepub-redux","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flibris%2Fswepub-redux/lists"}