{"id":43238731,"url":"https://github.com/libris/unpaywallmirror","last_synced_at":"2026-02-01T11:14:27.763Z","repository":{"id":145791488,"uuid":"426264824","full_name":"libris/unpaywallmirror","owner":"libris","description":null,"archived":false,"fork":false,"pushed_at":"2026-01-20T13:51:14.000Z","size":142,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-01-20T22:24:20.413Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/libris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-11-09T14:38:30.000Z","updated_at":"2026-01-20T13:51:18.000Z","dependencies_parsed_at":"2023-07-28T22:03:00.553Z","dependency_job_id":null,"html_url":"https://github.com/libris/unpaywallmirror","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/libris/unpaywallmirror","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Funpaywallmirror","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Funpaywallmirror/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Funpaywallmirror/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Funpaywallmirror/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/libris","download_url":"https://codeload.github.com/libris/unpaywallmirror/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/libris%2Funpaywallmirror/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28977317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T09:57:52.632Z","status":"ssl_error","status_checked_at":"2026-02-01T09:57:49.143Z","response_time":56,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-01T11:14:27.224Z","updated_at":"2026-02-01T11:14:27.755Z","avatar_url":"https://github.com/libris.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unpaywall mirror\n\nThis creates a local mirror of the Unpaywall and Crossref dataset APIs, based on dump files.\n\nGetting entries by DOI is the only supported action. There is no searching. \n\n### Ingest a dump \nThis application assumes that source data is in the following form:\nA directory with 8-digit-numbered gzipped files named NNNNNNNN.gz. 
Each file must contain a number (for example 128) of JSON lines.\nEach line should be a complete JSON object and must have the \"doi\" property (containing the DOI).\nThe object as a whole is what will be served when the DOI is looked up.\n\nNeither Unpaywall nor Crossref provides its data in precisely this form, so the data needs a bit\nof processing (using ordinary, widely available CLI tools) before it is ready to be served.\n\nFirst decide where in your filesystem you want the Unpaywall and/or Crossref data placed.\nThen make sure the servlet is started with the following system properties (typically passed in JAVA_OPTS), for example:\n```-Dunpaywall.datadir=\"/srv/unpaywalldata\" -Dcrossref.datadir=\"/srv/crossrefdata\"```\n\n#### Unpaywall\n\nTo ingest a dump from Unpaywall, first make sure you have permission to download one from Unpaywall.\nOnce you have obtained a download link, do the following:\n\n1. Shut the service down\n1. Delete or move anything (like older dumps) already in place at ```$UNPAYWALL_DEST_DIR_WITH_TRAILING_SLASH```, and make sure the directory exists and is writable.\n1. ```curl -Ss $DOWNLOAD_URL | gunzip | split -l 128 --numeric-suffixes=1 --suffix-length=8 --filter='gzip \u003e $FILE.gz' - $UNPAYWALL_DEST_DIR_WITH_TRAILING_SLASH ``` When run against a dump already on disk, expect this process to take about 2 hours. When run as suggested, while downloading, it will take longer.\n1. Start the service up again\n\n#### Crossref\n\n1. Download a dump from Crossref (it is provided as a torrent, so be careful on organization networks)\n1. cd to where the dump was downloaded, and make sure there is only data there (remove any robots.txt, for example)\n1. Shut the service down\n1. Delete or move anything (like older dumps) already in place at ```$CROSSREF_DEST_DIR_WITH_TRAILING_SLASH```, and make sure the directory exists and is writable.\n1. 
```parallel -j$(nproc) 'pigz -dc {}' ::: *.gz | sed 's/\"DOI\":/\"doi\":/g' | split -l 128 --numeric-suffixes=1 --suffix-length=8 --filter='pigz \u003e $FILE.gz' - $CROSSREF_DEST_DIR_WITH_TRAILING_SLASH ```\n1. Start the service up again\n\nThe first time the service starts (with a new data dump) it will spend some time building an index of the dump. This happens only once, but it may take several hours.\n\n### Local development\n\nIf you need to test things out or make some changes, and have a dump or part of one, use the head command to limit the amount of data you need to work with, like so:\n```\nzcat unpaywall_snapshot_2021-07-02T151134.jsonl.gz | head -10000 | split -l 128 --numeric-suffixes=1 --suffix-length=8 --filter='gzip \u003e $FILE.gz' - $DEST_DIR_WITH_TRAILING_SLASH\n```\nThen run the dev server with\n```\n./gradlew appRun -Dunpaywall.datadir=\"$UNPAYWALL_DIR_WITHOUT_TRAILING_SLASH\" -Dcrossref.datadir=\"$CROSSREF_DIR_WITHOUT_TRAILING_SLASH\" \n```\nNote that it is perfectly fine to run with only one of the dumps specified. For example, simply do not set crossref.datadir if you wish to mirror only Unpaywall data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flibris%2Funpaywallmirror","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flibris%2Funpaywallmirror","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flibris%2Funpaywallmirror/lists"}