{"id":17060755,"url":"https://github.com/jyasskin/pbot-crawler","last_synced_at":"2025-03-23T08:15:01.948Z","repository":{"id":60116765,"uuid":"540251989","full_name":"jyasskin/pbot-crawler","owner":"jyasskin","description":"Crawler for PBOT's website to show what has changed.","archived":false,"fork":false,"pushed_at":"2023-01-24T04:15:42.000Z","size":217,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-20T17:54:12.273Z","etag":null,"topics":["crawler"],"latest_commit_sha":null,"homepage":"https://webserver-26h4rfwp7a-uw.a.run.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jyasskin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-09-23T02:39:35.000Z","updated_at":"2022-10-29T23:13:26.000Z","dependencies_parsed_at":"2023-02-13T16:46:38.027Z","dependency_job_id":null,"html_url":"https://github.com/jyasskin/pbot-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jyasskin%2Fpbot-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jyasskin%2Fpbot-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jyasskin%2Fpbot-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jyasskin%2Fpbot-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jyasskin","download_url":"https://codeload.github.com/jyasskin/pbot-crawler/tar.gz/refs/heads/main","host":{"nam
e":"GitHub","url":"https://github.com","kind":"github","repositories_count":245072266,"owners_count":20556353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler"],"created_at":"2024-10-14T10:45:07.551Z","updated_at":"2025-03-23T08:15:01.923Z","avatar_url":"https://github.com/jyasskin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PBOT Crawler\n\nWe'd like to crawl the [PBOT website](https://www.portland.gov/transportation)\napproximately weekly in order to identify new and changed pages. This shouldn't\nput too much load on the site, which I approximate by limiting requests to\n1/second.\n\n`local/` has a crawler that runs on a single machine and dumps the crawl to the\nlocal filesystem. It optimizes fetches by treating the previous crawl as a\ncache, but it doesn't identify changed pages.\n\n\n## Google Cloud design\n\nI've tried to limit this to the free tier services. For sizing estimates, PBOT's\nsite has about 3500 pages taking about 200M for bodies and 60M for headers.\n\n* [Cloud Scheduler](https://cloud.google.com/scheduler/docs) to kick off the\n  crawl weekly, by queuing the root of the PBOT site.\n* [Pub/Sub queue](https://cloud.google.com/pubsub/docs) to hold the list of\n  to-be-crawled URLs. This queue will wind up with lots of duplicates that the\n  next Function will need to deduplicate.\n* A Function to do the actual crawl. 
See [below](#the-crawl-function) for its\n  strategy.\n* [Firestore](https://cloud.google.com/firestore/docs) to record the set of URLs\n  crawled each week and the content of each URL, including its status, caching\n  headers, and outbound links. See [below](#firestore-schema) for its schema.\n\n### The crawl function\n\nUses an instance limit of 1 so we can use global variables to rate-limit\nexternal fetches and to deduplicate fetch URLs. It would be cleaner to use\nMemorystore or Firestore for both, but Memorystore isn't in the Cloud free tier,\nand doing this in Firestore would risk exceeding the 20k/day free writes.\n\n1. Receive a Pub/Sub event with a URL to crawl.\n1. Check the global set of crawled URLs to deduplicate.\n1. Query the URL from the current crawl in Firestore to deduplicate more\n   reliably. If it's there, we'll assume that its outbound URLs have been queued\n   to Pub/Sub.\n1. Query the URL from the previous crawl in Firestore to use its\n   [`ETag`](https://httpwg.org/specs/rfc9111.html) and discover whether the page\n   is new or updated.\n1. Fetch the URL from PBOT, with cache headers.\n1. If it's new or changed, record that somewhere (TODO), and ask the Web Archive\n   to archive it. Maybe this is another Pub/Sub queue and Function?\n1. Gather its outbound links, either from the previous crawl or using\n   [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).\n1. Queue its outbound links to Pub/Sub, deduplicating each one against the local\n   set of crawled URLs.\n1. Write the page to the current crawl in Firestore.\n1. 
Report success!\n\n### Firestore schema\n\n* `/`\n  * `content`\n    * Document IDs are the SHA-256 of the resource body\n      * `outbound_links`: array of outbound absolute URLs.\n  * `crawl-YYYY-MM-DD` collection for each crawl.\n    * Document IDs are SHA-256(URL).\n      * `url`: The actual URL.\n      * `new`: True if this document was new in this crawl.\n      * `changed`: True if this document was changed in this crawl.\n      * `status`: 200, etc.\n      * `headers`: Lowercase map of a few HTTP response headers.\n        * etag\n        * last-modified\n        * location\n      * `content`: Reference into `content` collection.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjyasskin%2Fpbot-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjyasskin%2Fpbot-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjyasskin%2Fpbot-crawler/lists"}