{"id":17694962,"url":"https://github.com/asg017/sqlite-robotstxt","last_synced_at":"2025-05-13T03:43:06.540Z","repository":{"id":191038897,"uuid":"683813818","full_name":"asg017/sqlite-robotstxt","owner":"asg017","description":"A SQLite extension for parsing robots.txt files","archived":false,"fork":false,"pushed_at":"2024-01-05T01:04:33.000Z","size":85,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-05T02:05:02.712Z","etag":null,"topics":["sqlite","sqlite-extension"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asg017.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-08-27T19:25:29.000Z","updated_at":"2024-07-22T01:08:08.000Z","dependencies_parsed_at":"2023-08-27T20:28:35.604Z","dependency_job_id":"9766e078-b7d8-4d40-8347-a6c5e12efe15","html_url":"https://github.com/asg017/sqlite-robotstxt","commit_stats":null,"previous_names":["asg017/sqlite-robotstxt"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-robotstxt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-robotstxt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-robotstxt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-robotstxt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asg017","download
_url":"https://codeload.github.com/asg017/sqlite-robotstxt/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253870821,"owners_count":21976610,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["sqlite","sqlite-extension"],"created_at":"2024-10-24T13:50:39.511Z","updated_at":"2025-05-13T03:43:06.517Z","avatar_url":"https://github.com/asg017.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sqlite-robotstxt\n\nA SQLite extension for parsing [`robots.txt`](https://en.wikipedia.org/wiki/Robots.txt) files. 
Based on [`sqlite-loadable-rs`](https://github.com/asg017/sqlite-loadable-rs) and the [`robotstxt` crate](https://docs.rs/robotstxt/latest/robotstxt/).\n\n## Usage\n\nSee if a specified User-Agent can access a specific path, based on the rules of a `robots.txt`.\n\n```sql\nselect robotstxt_matches(\n  readfile('robots.txt'),\n  'My-Agent',\n  '/path'\n); -- 0 or 1\n```\n\nFind all individual rules specified in a `robots.txt` file.\n\n```sql\nselect *\nfrom robotstxt_rules(\n  readfile('tests/examples/en.wikipedia.org.robots.txt')\n)\nlimit 10;\n/*\n┌────────────────────────────┬────────┬───────────┬──────┐\n│         user_agent         │ source │ rule_type │ path │\n├────────────────────────────┼────────┼───────────┼──────┤\n│ MJ12bot                    │ 12     │ disallow  │ /    │\n│ Mediapartners-Google*      │ 16     │ disallow  │ /    │\n│ IsraBot                    │ 20     │ disallow  │      │\n│ Orthogaffe                 │ 23     │ disallow  │      │\n│ UbiCrawler                 │ 28     │ disallow  │ /    │\n│ DOC                        │ 31     │ disallow  │ /    │\n│ Zao                        │ 34     │ disallow  │ /    │\n│ sitecheck.internetseer.com │ 39     │ disallow  │ /    │\n│ Zealbot                    │ 42     │ disallow  │ /    │\n│ MSIECrawler                │ 45     │ disallow  │ /    │\n└────────────────────────────┴────────┴───────────┴──────┘\n*/\n```\n\nUse with `sqlite-http` to request `robots.txt` files on the fly.\n\n```sql\nselect *\nfrom robotstxt_rules(\n  http_get_body('https://www.reddit.com/robots.txt')\n)\nlimit 10;\n\n\n/*\n┌────────────┬────────┬───────────┬─────────────────────┐\n│ user_agent │ source │ rule_type │        path         │\n├────────────┼────────┼───────────┼─────────────────────┤\n│ 008        │ 3      │ disallow  │ /                   │\n│ voltron    │ 7      │ disallow  │ /                   │\n│ bender     │ 10     │ disallow  │ /my_shiny_metal_ass │\n│ Gort       │ 13     │ disallow  │ /earth             
 │\n│ MJ12bot    │ 16     │ disallow  │ /                   │\n│ PiplBot    │ 19     │ disallow  │ /                   │\n│ *          │ 22     │ disallow  │ /*.json             │\n│ *          │ 23     │ disallow  │ /*.json-compact     │\n│ *          │ 24     │ disallow  │ /*.json-html        │\n│ *          │ 25     │ disallow  │ /*.xml              │\n└────────────┴────────┴───────────┴─────────────────────┘\n*/\n```\n\n## TODO\n\n- [ ] `robotstxt_allowed(rules, path)` overload on `robotstxt_user_agents`\n- [ ] sitemaps?\n- [ ] unknown directives?\n\n- [ ] pytest + syrupy\n- [ ] ensure no panics\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-robotstxt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasg017%2Fsqlite-robotstxt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-robotstxt/lists"}