{"id":28764387,"url":"https://github.com/bonk-dev/saturday","last_synced_at":"2026-04-29T09:07:31.778Z","repository":{"id":298487655,"uuid":"961098825","full_name":"bonk-dev/saturday","owner":"bonk-dev","description":"A Python 3 app designed to scrape science publication metadata from various sources","archived":false,"fork":false,"pushed_at":"2025-06-11T10:21:07.000Z","size":2758,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-11T11:37:00.321Z","etag":null,"topics":["csv","csv-parser","elsevier","google-scholar","html-scraping","scopus","scraper","sqlite"],"latest_commit_sha":null,"homepage":"https://bonk-dev.github.io/saturday/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bonk-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-05T18:45:16.000Z","updated_at":"2025-06-11T10:24:08.000Z","dependencies_parsed_at":"2025-06-11T11:37:04.555Z","dependency_job_id":"18b30e31-4059-4ee0-9a8d-ae842b98a820","html_url":"https://github.com/bonk-dev/saturday","commit_stats":null,"previous_names":["bonk-dev/saturday"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bonk-dev/saturday","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bonk-dev%2Fsaturday","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bonk-dev%2Fsaturday/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bonk-dev%2Fsaturday/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bonk-dev%2Fsaturday/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bonk-dev","download_url":"https://codeload.github.com/bonk-dev/saturday/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bonk-dev%2Fsaturday/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32418255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","csv-parser","elsevier","google-scholar","html-scraping","scopus","scraper","sqlite"],"created_at":"2025-06-17T09:37:29.368Z","updated_at":"2026-04-29T09:07:31.773Z","avatar_url":"https://github.com/bonk-dev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# saturday\nA Python 3 app designed to scrape science publication metadata from various sources like: \n- Scopus \n  - Elsevier API\n  - Scopus' website export gateway\n- Google Scholar\n\n## Build\n1. Install [just](https://github.com/casey/just?tab=readme-ov-file#installation)\n2. Install pip requirements (preferably in venv): `pip install -r requirements.txt`\n3. 
Install [Node.js](https://nodejs.org/en/download)\n4. Run `just all` - the artifacts will be in the `dist/` directory.\n\nAll available recipes:\n```shell\n$ just -l\nAvailable recipes:\n    all  # Build everything\n    cli  # Build the Python CLI into an exe\n    docs # Build Docusaurus docs\n    env  # Copy .env.sample to dist/.env\n    gui  # Build backend + frontend into an exe\n```\n\n## Usage\n### Command-line\n```shell\n$ python3 main.py\nusage: main.py [-h] [-a] [-p PROXY] [--debug-proxy DEBUG_PROXY] [-g] [-s]\n               [--scopus-api-output SCOPUS_API_OUTPUT] [-b] [--scopus-batch-file SCOPUS_BATCH_FILE]\n               [--scopus-batch-output SCOPUS_BATCH_OUTPUT] [--ssl-insecure]\n               search_query\n\nScience publication metadata scraper\n\npositional arguments:\n  search_query          Generic search query to use when scraping metadata\n\noptions:\n  -h, --help            show this help message and exit\n  -a, --all             Use all methods (google-scholar, scopus)\n  -p, --proxy PROXY     HTTP(S) proxy address, example: -p http://127.0.0.1:8080 -p\n                        http://127.0.0.2:1234. Not used when making requests to IP-authenticated\n                        services (Elsevier, Scopus, etc.)\n  --debug-proxy DEBUG_PROXY\n                        HTTP(S) proxy address, used for ALL requests, including ones made to\n                        services based on IP authentication (Elsevier, Scopus)\n  -g, --google-scholar  Use Google Scholar for scraping metadata\n  -s, --scopus-api      Use Scopus API for scraping metadata\n  --scopus-api-output SCOPUS_API_OUTPUT\n                        Path to a file where raw data fetched from Elsevier API will be saved. 
File\n                        type: JSON.\n  -b, --scopus-batch    Use Scopus batch export for scraping metadata\n  --scopus-batch-file SCOPUS_BATCH_FILE\n                        Use a local .CSV dump instead of exporting from Scopus\n  --scopus-batch-output SCOPUS_BATCH_OUTPUT\n                        Path to a file where raw data fetched from Scopus batch export will be\n                        saved. File type: CSV.\n  --ssl-insecure        Do not verify upstream server SSL/TLS certificates\n```\n\n#### All scrapers\n```shell\n$ python3 main.py --all \"python3 C++\" \n```\n#### Scopus (batch gateway)\n```shell\n$ python3 main.py --scopus-batch \"python3 C++\" \n```\n\n#### Scopus (batch gateway, save dump to file)\n```shell\n$ python3 main.py --scopus-batch --scopus-batch-output \"/tmp/sc-batch.csv\" \"python3 C++\" \n```\n\n#### Scopus (batch gateway) and Google Scholar\n```shell\n$ python3 main.py --scopus-batch --google-scholar \"python3 C++\" \n```\n\n## Tests\nCurrently, only the `fetchers` module contains automated unit tests:\n```shell\n$ python -m unittest discover fetcher/tests\n...........\n----------------------------------------------------------------------\nRan 11 tests in 0.200s\n\nOK\n```\n\n## Setup\n### Python requirements\nIn order to install required Python packages:\n- (optionally) set up a venv: ```python -m venv .venv \u0026\u0026 source .venv/bin/activate```\n- install packages: ```pip install -r requirements.txt```\n\n### Environment variables\nSome fetcher modules require additional setup (API keys, cookies etc.).\nHere are the required steps for all implemented fetchers.\n\n### Elsevier (Scopus) API\n\n#### SCOPUS_API_KEY\nIn order to use the Elsevier API a key is required. 
You can create \nsuch a key (which is linked to your account) on [https://dev.elsevier.com/apikey/manage](https://dev.elsevier.com/apikey/manage).\n\nAfter acquiring the API key, it needs to be supplied with an environment\nvariable: `SCOPUS_API_KEY`.\nThe app supports .env files (see `.env.sample`).\n\n#### SCOPUS_API_BASE\nThe app also supports using a different API endpoint, which can be controlled\nwith the `SCOPUS_API_BASE` environment variable (or in .env).\n\n#### Example .env\n```\nSCOPUS_API_BASE=https://api.elsevier.com\nSCOPUS_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n```\n\n### Scopus batch export\nScopus batch export is an alternative method to the official Elsevier API.\nIt uses the endpoints utilized by the Scopus website, allowing the user to export\nmore data at once, while also skipping the weekly limit enforced by the Elsevier API.\n\nThe downside is that it is harder to set up. Instead of a single, long-lived API key, a cookie\ndump from the user's browser is needed. These cookies are short-lived and need to be refreshed\nonce in a while (implemented in the app).\n\nThe required cookies are:\n- `SCSessionID`\n- `scopusSessionUUID`\n- `AWSELB`\n- `SCOPUS_JWT`\n\n#### Example .env\n```\nSCOPUS_BATCH_BASE=https://www.scopus.com\nSCOPUS_BATCH_COOKIE_FILE=/tmp/scopus-batch-cookies\nSCOPUS_BATCH_COOKIE_JWT_DOMAIN=.scopus.com\nSCOPUS_BATCH_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3\n```\n\nAfter creating the file, the user needs to supply the path to the app with\nan environment variable (or the .env file): `SCOPUS_BATCH_COOKIE_FILE`.\nFor example `SCOPUS_BATCH_COOKIE_FILE=/tmp/scopus-cookies`.\n\n#### SCOPUS_BATCH_BASE\nThis env variable allows the user to change the base URI of the endpoints used\nby the scraper module. 
Default value: `https://www.scopus.com`.\n\nFor example, if you change this to `https://example.org`, then the scraper\nwill send a request to `https://example.org/api/documents/search/eids` \nand **NOT** to `https://www.scopus.com/api/documents/search/eids` when searching\nfor documents to export.\n\n#### SCOPUS_BATCH_COOKIE_FILE\nThis env variable allows the user to set the path to a file containing the user's\nauthentication cookies.\n\nThe cookies are to be supplied as sent by the browser in the `Cookie:` header. \nThis means that a user can copy and paste the `Cookie:` header value \ninto a file and use it as is (unexpected cookies will be ignored).\n\nExample file:\n```\nSCSessionID=cookie_val; scopusSessionUUID=cookie_val2; AWSELB=cookie_val3; SCOPUS_JWT=cookie_val4\n```\n\n#### SCOPUS_BATCH_COOKIE_JWT_DOMAIN\nThis env variable controls the `Domain` parameter of the `SCOPUS_JWT` cookie.\nDefault value: `.scopus.com`.\n`SCOPUS_JWT` needs to have the correct `Domain` value, because it's refreshed\nperiodically (by the `Set-Cookie` header) and the [HTTPX](https://github.com/encode/httpx)\nclient won't accept a cookie with a different domain than the one previously set.\n\n#### SCOPUS_BATCH_USER_AGENT\nThis env variable is used by the app to set the correct `User-Agent` header\nwhen sending requests to the Scopus endpoints.\n\nIt is important to use the user's web-browser `User-Agent` value, because\nusing anything different **will** trigger Cloudflare anti-bot mechanisms.\n\n### Google Scholar\nThis fetcher module doesn't require any authentication cookies or keys, but\na user can configure the base URI for requests and the User-Agent to be used\nby the HTTP(S) client.\n\nAlso, a user can supply a proxy server to use while scraping with \nthe `--proxy` option (see [Usage](#usage)).\n\n#### GOOGLE_SCHOLAR_BASE\nThis env variable allows the user to change the base URI of the endpoints used\nby the scraper module. 
Default value: `https://scholar.google.com`.\n\nFor example, if you change this to `https://example.org`, then the scraper\nwill send a request to `https://example.org/scholar?q=...` \nand **NOT** to `https://scholar.google.com/scholar?q=...` when searching\nfor publications.\n\n#### GOOGLE_SCHOLAR_USER_AGENT\nThis env variable is used by the app to set the `User-Agent` header\nwhen sending requests to the Google Scholar website.\n\n#### Example .env\n```\nGOOGLE_SCHOLAR_BASE=https://scholar.google.com\nGOOGLE_SCHOLAR_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbonk-dev%2Fsaturday","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbonk-dev%2Fsaturday","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbonk-dev%2Fsaturday/lists"}