{"id":15308161,"url":"https://github.com/danielmorell/se_bot_checker","last_synced_at":"2025-04-15T00:54:47.622Z","repository":{"id":57465123,"uuid":"253949550","full_name":"danielmorell/se_bot_checker","owner":"danielmorell","description":"Validate search engine user agents and IP addresses.","archived":false,"fork":false,"pushed_at":"2022-02-03T11:41:28.000Z","size":43,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-15T00:54:42.349Z","etag":null,"topics":["crawler","googlebot","python","search-engine","spider"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danielmorell.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-08T01:02:03.000Z","updated_at":"2022-06-04T13:37:12.000Z","dependencies_parsed_at":"2022-09-17T17:52:05.590Z","dependency_job_id":null,"html_url":"https://github.com/danielmorell/se_bot_checker","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielmorell%2Fse_bot_checker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielmorell%2Fse_bot_checker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielmorell%2Fse_bot_checker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielmorell%2Fse_bot_checker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danielmorell","download_url":"https://codeload.github.com/dani
elmorell/se_bot_checker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248986279,"owners_count":21194025,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","googlebot","python","search-engine","spider"],"created_at":"2024-10-01T08:14:21.723Z","updated_at":"2025-04-15T00:54:47.606Z","avatar_url":"https://github.com/danielmorell.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Search Engine Bot Checker\n\n[![Version](https://flat.badgen.net/badge/PyPI/v1.0.3)](https://pypi.org/project/se-bot-checker/)\n\nThis is a simple Python library that verifies the validity of a search engine crawler based on its IP and user agent.\n\nIt is designed to help SEOs and DevOps engineers validate `googlebot` and other search engine bots.\n\n## Installation\n\n```commandline\npip install se-bot-checker\n```\n\n## Usage\n\nUsing SE Bot Checker to validate a search engine crawler is simple. There are two basic steps.\n\n1. Instantiate the bot class.\n2. 
Call the bot class with IP and user agent arguments.\n\n```python\nfrom se_bot_checker.bots import GoogleBot\ngooglebot = GoogleBot()\ntest_one = googlebot(\n    '66.249.66.1', \n    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'\n)\ntest_two = googlebot(\n    '127.0.0.1', \n    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'\n)\nprint(test_one)\nprint(test_two)\n```\n\n**Output:**\n\n```text\n(True, 'googlebot')\n(False, 'unknown')\n```\n\n## Prebuilt Bots\n\nSeveral bot definitions are already created, have been tested, and will be maintained. The prebuilt \ncrawlers are the most common search engine crawlers.\n\n### Crawler validation methods\n\n| Bot           | User Agent | IP | DNS |\n|---------------|------------|----|-----|\n| `BaiduSpider` | X          | X* | X** |\n| `BingBot`     | X          | X* | X   |\n| `DuckDuckBot` | X          | X  |     |\n| `GoogleBot`   | X          | X* | X   |\n| `YandexBot`   | X          | X* | X   |\n\n\\* IP validation is only used on consecutive checks run using the same bot checker instance. This means that in the \nfollowing example there will be only one DNS network request, since the IP in `test_two` has already been validated when \n`test_one` was run.\n\n\\** BaiduSpider only supports reverse DNS validation, not reverse and forward. Although at first glance it appears\nBaiduSpider should support reverse/forward DNS validation, I have never had a successful forward lookup for BaiduSpider. 
\n\n```python\nfrom se_bot_checker.bots import GoogleBot\ngooglebot = GoogleBot()\ntest_one = googlebot(\n    '66.249.66.1', \n    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'\n)\nprint(test_one)  # (True, 'googlebot')\ntest_two = googlebot(\n    '66.249.66.1', \n    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'\n)\nprint(test_two)  # (True, 'googlebot')\n```\n\n### `BaiduSpider`\n\nBaiduSpider validation only uses reverse DNS lookup not reverse and forward.\n\n- **Name:** `baiduspider`\n- **Domains:** `.baidu.com`, `.baidu.jp`\n- **User Agents:** `baiduspider`\n- **Use RegEx:** `False`\n\n\n### `BingBot`\n\n- **Name:** `bingbot`\n- **Domains:** `.search.msn.com`\n- **User Agents:** `bingbot`, `msnbot`, `bingpreview`\n- **Use RegEx:** `True`\n\n### `DuckDuckBot`\n\nDuckDuckBot only uses IP validation from the list of valid IPs.\n\n- **Name:** `duckduckbot`\n- **IPs:** See list below\n- **User Agents:** `duckduckbot`, `duckduckgo`\n- **Use RegEx:** `True`\n\n```text\n20.191.45.212\n40.88.21.235\n40.76.173.151\n40.76.163.7\n20.185.79.47\n52.142.26.175\n20.185.79.15\n52.142.24.149\n40.76.162.208\n40.76.163.23\n40.76.162.191\n40.76.162.247\n54.208.102.37\n107.21.1.8\n\nUpdated: January 31, 2022\n```\n\n### `GoogleBot`\n\n- **Name:** `googlebot`\n- **Domains:** `.googlebot.com`, `.google.com`\n- **User Agents:** `googlebot`\n- **Use RegEx:** `False`\n\n### `YandexBot`\n\n- **Name:** `bingbot`\n- **Domains:** `.search.msn.com`\n- **User Agents:** `bingbot`, `msnbot`, `bingpreview`\n- **Use RegEx:** `True`\n\n## Creating Your Own Bot Definition\n\nSE Bot Checker was designed to be extensible. The core of SE Bot Checker is the `Bot` class. 
To create your own \nbot you can simply extend `Bot`.\n\nHere is a custom bot that will only validate Googlebot mobile.\n\n```python\nfrom se_bot_checker.bots import Bot\n\nclass MobileGoogleBot(Bot):\n    \"\"\"\n    Mobile googlebot checker\n    \"\"\"\n    name = 'googlebot-mobile'\n    domains = ['.googlebot.com', '.google.com']\n    user_agent = 'android.*googlebot'\n    use_regex = True  # user_agent is a RegEx pattern, not a plain substring\n```\n\nThat is all there is to it. However, we could simplify this a little by extending the `GoogleBot` class.\n\n```python\nfrom se_bot_checker.bots import GoogleBot\n\nclass MobileGoogleBot(GoogleBot):\n    \"\"\"\n    Mobile googlebot checker\n    \"\"\"\n    name = 'googlebot-mobile'\n    user_agent = 'android.*googlebot'\n    use_regex = True  # GoogleBot uses substring matching, so enable RegEx here\n```\n\nBoth the desktop and mobile versions of Googlebot use the same domains for the reverse/forward DNS validation. This \nmeans we can simply extend `GoogleBot`. This is the recommended approach when possible.\n\n### `Bot` API\n\nThis class is the core of SE Bot Checker. It handles the validation process. New bot definitions should subclass this \nclass.\n\nA single bot class can be instantiated once and called many times. This allows base settings to be configured once and \nmultiple IP and user agent pairs to be validated simply.\n\n**`Bot.name`:** `str` This is the name the bot will return if it validates to `True`.\n\n**`Bot.ips`:** `iterable` A list of known valid IPs.\n\n**`Bot.domains`:** `iterable` A list of known valid domains. This is used to validate the results of the reverse\nDNS lookup. An exact match or a super domain of the DNS lookup results is considered a positive match.\n\n**`Bot.user_agent`:** `str` A substring or RegEx pattern to use to validate the request user agent. For the best\nperformance and compatibility, request user agent strings are converted to lowercase prior to matching, so the `user_agent` \nstring should be lowercase. 
If you need to validate upper- or mixed-case user agents, you can override the \n`Bot.valid_user_agent()` method.\n\n**`Bot.use_regex`:** `bool` Whether the user agent validation should use substring or RegEx matching. If \n`user_agent` is just a string and not a RegEx pattern, this should be `False`, which is slightly faster. Defaults to `False`.\n\n## Contributors\n\n[@danielmorell](https://github.com/danielmorell)\n\nCopyright © 2020 [Daniel Morell](https://www.danielmorell.com/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielmorell%2Fse_bot_checker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanielmorell%2Fse_bot_checker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielmorell%2Fse_bot_checker/lists"}