{"id":15722129,"url":"https://github.com/cldellow/datasette-scraper","last_synced_at":"2025-08-17T14:08:07.332Z","repository":{"id":65367224,"uuid":"571818683","full_name":"cldellow/datasette-scraper","owner":"cldellow","description":"Add website scraping abilities to Datasette","archived":false,"fork":false,"pushed_at":"2023-03-04T22:08:22.000Z","size":377,"stargazers_count":64,"open_issues_count":8,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-13T17:09:58.948Z","etag":null,"topics":["datasette","datasette-plugin","scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cldellow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-29T00:19:56.000Z","updated_at":"2025-07-01T14:57:43.000Z","dependencies_parsed_at":"2024-10-24T16:49:50.492Z","dependency_job_id":"d0e2d454-1f37-4d5f-aa5f-b6c0950f4b2c","html_url":"https://github.com/cldellow/datasette-scraper","commit_stats":{"total_commits":123,"total_committers":1,"mean_commits":123.0,"dds":0.0,"last_synced_commit":"25af45b9fd204e1068d82f5c04b4a14b9f4cbd5a"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/cldellow/datasette-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fdatasette-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fdatasette-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldello
w%2Fdatasette-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fdatasette-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cldellow","download_url":"https://codeload.github.com/cldellow/datasette-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fdatasette-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270856775,"owners_count":24657700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasette","datasette-plugin","scraping"],"created_at":"2024-10-03T22:04:11.463Z","updated_at":"2025-08-17T14:08:07.286Z","avatar_url":"https://github.com/cldellow.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# 
datasette-scraper\n\n[![PyPI](https://img.shields.io/pypi/v/datasette-scraper.svg)](https://pypi.org/project/datasette-scraper/)\n[![Changelog](https://img.shields.io/github/v/release/cldellow/datasette-scraper?include_prereleases\u0026label=changelog)](https://github.com/cldellow/datasette-scraper/releases)\n[![Tests](https://github.com/cldellow/datasette-scraper/workflows/Test/badge.svg)](https://github.com/cldellow/datasette-scraper/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/cldellow/datasette-scraper/blob/main/LICENSE)\n\n`datasette-scraper` is a Datasette plugin to manage small-ish (~100K pages) crawl and extract jobs.\n\n- Opinionated yet extensible\n  - Some useful tasks are possible out-of-the-box, or write your own pluggy hooks to go further\n- Leans heavily into SQLite\n  - Introspect your crawls via ops tables exposed in Datasette\n- Built on robust libraries\n  - [Datasette](https://datasette.io/) as a host\n  - [selectolax](https://github.com/rushter/selectolax) for HTML parsing\n  - [httpx](https://www.python-httpx.org/) for HTTP requests\n  - [pluggy](https://pluggy.readthedocs.io/en/stable/) for extensibility\n  - [zstandard](https://github.com/indygreg/python-zstandard) for efficiently compressing HTTP responses\n\n**Not for adversarial crawling**. Want to crawl a site that blocks bots? You're on your own.\n\n## Installation\n\nInstall this plugin in the same environment as Datasette.\n\n    datasette install datasette-scraper\n\n## Usage\n\nConfigure `datasette-scraper` via `metadata.json`. You need to enable the plugin\non a per-database level.\n\nTo enable it in the `my-database` database, write something like this:\n\n```\n{\n  \"databases\": {\n    \"my-database\": {\n      \"plugins\": {\n        \"datasette-scraper\": {\n        }\n      }\n    }\n  }\n}\n```\n\nThe next time you start datasette, the plugin will create several tables in\nthe specified database. 
Go to the `dss_crawl` table to define a crawl.\n\nA 10-minute end-to-end walkthrough video is available:\n\n\u003cdiv align=\"left\"\u003e\n      \u003ca href=\"https://www.youtube.com/watch?v=zrSGnz7ErNI\"\u003e\n         \u003cimg src=\"https://img.youtube.com/vi/zrSGnz7ErNI/maxresdefault.jpg\" style=\"width:100%;\"\u003e\n      \u003c/a\u003e\n\u003c/div\u003e\n\n## Usage notes\n\n`datasette-scraper` requires a database in which to track its operational data,\nand a database in which to store scraped data. They can be the same database.\n\nBoth databases will be put into WAL mode.\n\nThe ops database's `user_version` pragma will be used to track schema versions.\n\n## Architecture\n\n`datasette-scraper` handles the core bookkeeping for scraping--keeping track of\nURLs to be scraped, rate-limiting requests to origins, persisting data into the DB.\nIt relies on plugins to do almost all the interesting work. For example, fetching\nthe actual pages, following redirects, navigating sitemaps, extracting data.\n\nThe tool comes with plugins for common use cases. 
Some users may want to author\ntheir own `after_fetch_url` or `extract_from_response` implementations to do custom\nprocessing.\n\n### Overview\n\n```mermaid\nflowchart LR\ndirection TB\n\nsubgraph init\n  A(user starts crawl) --\u003e B[get_seed_urls]\nend\n\nsubgraph crawl [for each URL to crawl]\n  before_fetch_url --\u003e fetch_cached_url --\u003e fetch_url --\u003e after_fetch_url\n  fetch_cached_url --\u003e after_fetch_url\nend\n\nsubgraph discover [for each URL crawled]\n  discover_urls --\u003e canonicalize_url --\u003e canonicalize_url\n  canonicalize_url --\u003e x[queue URL to crawl]\n  extract_from_response\nend\n\ninit --\u003e crawl --\u003e discover\n```\n\n### Plugin hooks\n\nMost plugins will only implement a few of these hooks.\n\n- `conn` is a read/write `sqlite3.Connection` to the database\n- `config` is the crawl's config\n\n#### `get_seed_urls(config)`\n\nReturns a list of strings representing seed URLs to be fetched.\n\nThey will be considered to have a depth of 0, i.e. seeds.\n\n#### `before_fetch_url(conn, config, job_id, url, depth, request_headers)`\n\n`request_headers` is a dict; you can modify it to control what gets sent in the request.\n\nReturns:\n  - truthy to indicate this URL should not be crawled (for example, crawl max page limit)\n  - falsy to express no opinion\n\n\u003e **Note** `before_fetch_url` vs `canonicalize_url`\n\u003e\n\u003e You can also use the `canonicalize_url` hook to reject URLs prior to them entering\n\u003e the crawl queue.\n\u003e\n\u003e A URL rejected by `canonicalize_url` will not result in an entry in the\n\u003e `dss_crawl_queue` and `dss_crawl_queue_history` tables.\n\u003e\n\u003e Which one you use is a matter of taste. In general, if you _never_ want the URL,\n\u003e reject it at canonicalization time.\n\n#### `fetch_cached_url(conn, config, url, depth, request_headers)`\n\nFetch a previously-cached HTTP response. 
The system will not have checked that\nthere was rate limit available before calling this.\n\nReturns:\n  - `None`, to indicate not handled\n  - a response object, which is a dict with:\n    - `fetched_at` - an ISO 8601 time like `2022-12-26 01:23:45.00`\n    - `headers` - the response headers, e.g. `[['content-type', 'text/html']]`\n    - `status_code` - the response code, e.g. `200`\n    - `text` - the response body\n\nOnce any plugin has returned a truthy value, no other plugin's `fetch_url`\nhook will be invoked.\n\n\n#### `fetch_url(conn, config, url, request_headers)`\n\nFetch an HTTP response from the live server. The system will have checked that there\nwas rate limit available before calling this.\n\nSame return type and behaviour as `fetch_cached_url`.\n\n#### `after_fetch_url(conn, config, url, request_headers, response, fresh, fetch_duration)`\n\nDo something with a fetched URL.\n\n#### `discover_urls(config, url, response)`\n\nReturns a list of URLs to crawl.\n\nThe URLs can be either strings, in which case they'll get enqueued as depth + 1, or tuples of URL and depth. This can be useful for paginated index pages, where you'd like to crawl to a max depth of, say, 2, but treat all the index pages as being at depth 1.\n\n#### `canonicalize_url(config, from_url, to_url, to_url_depth)`\n\nReturns:\n  - `False` to filter out the URL\n  - a URL to be crawled instead\n  - `None` or `True` to no-op\n\nThe URL to be crawled can be a string, or a tuple of string and depth.\n\nThis hook is useful for:\n  - blocking URLs that we never want\n  - canonicalizing URLs, for example, by omitting query parameters\n  - restricting crawls to same origin\n  - resetting depth for pagination\n\n#### `extract_from_response(config, url, response)`\n\nReturns an object of rows-to-be-inserted-or-upserted:\n\n```jsonc\n{\n  \"dbname\": {  // can be omitted, in which case, current DB will be used\n    \"users\": [\n      {\n        \"id!\": \"cldellow@gmail.com\",  // ! 
indicates pkey, compound OK\n        \"name\": \"Colin\",\n      },\n      {\n        \"id!\": \"santa@northpole.com\",\n        \"name\": \"Santa Claus\",\n      }\n    ],\n    \"places\": [\n      {\n        \"id@\": \"santa@northpole.com\",\n        \"__delete\": true\n      },\n      {\n        \"id@\": \"cldellow@gmail.com\",\n        \"city\": \"Kitchener\",\n      },\n      {\n        \"id@\": \"cldellow@gmail.com\",\n        \"city\": \"Dawson Creek\"\n      }\n    ]\n  }\n}\n```\n\nColumn names can have sigils at the end:\n- `!` says the column is part of the pkey; there can be at most 1 row with this value\n- `@` says the column should be indexed; there can be multiple rows with this value\n\nColumns with sigils must be known at table creation time. Although you can have\nmultiple columns with sigils, you cannot mix `!` and `@` sigils in the same table.\n\nAny missing tables or columns will be created. Columns will have `ANY` data type.\nColumns will be nullable unless they have the `!` sigil.\n\nYou can indicate that a row should be deleted by emitting `__delete` key in your object.\n\n`datasette-scraper` may commit your changes to the database in batches in order to\nreduce write transactions and improve throughput. It may also elide\nDELETE/INSERT statements entirely if it determines that the state of the database\nwould be unchanged.\n\nIf you'd like to control the schema more carefully, please create the table manually.\n\n#### Metadata hooks\n\nThese hooks don't affect operation of the scrapes. They provide metadata to\nhelp validate a user's configuration and show UI to configure a crawl.\n\n##### config_schema()\n\nReturns a `ConfigSchema` option that defines how this plugin is configured.\n\nConfiguration is done via [JSON schema](https://json-schema.org/understanding-json-schema/). 
UI is done via [JSON Forms](https://jsonforms.io/).\n\nLook at the existing plugins to learn how to use this hook.\n\nThe schema is optional; if omitted, you will need to configure the plugin\nout of band.\n\n##### config_default_value()\n\nReturns `None` to indicate that new crawls should not use this plugin by default.\n\nOtherwise, returns a reasonable default value that conforms to the schema in `config_schema()`.\n\n## Development\n\nTo set up this plugin locally, first check out the code. Then create a new virtual environment:\n\n    cd datasette-scraper\n    python3 -m venv venv\n    source venv/bin/activate\n\nNow install the dependencies and test dependencies:\n\n    pip install -e '.[test]'\n\nTo run the tests:\n\n    pytest\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fdatasette-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcldellow%2Fdatasette-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fdatasette-scraper/lists"}