{"id":20594861,"url":"https://github.com/scrapy/protego","last_synced_at":"2026-01-29T12:18:08.980Z","repository":{"id":37466047,"uuid":"192757224","full_name":"scrapy/protego","owner":"scrapy","description":"A pure-Python robots.txt parser with support for modern conventions. ","archived":false,"fork":false,"pushed_at":"2025-03-24T18:45:37.000Z","size":3576,"stargazers_count":65,"open_issues_count":5,"forks_count":28,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-05T23:05:51.863Z","etag":null,"topics":["hacktoberfest","python","robots-parser","robots-txt"],"latest_commit_sha":null,"homepage":"","language":"DIGITAL Command Language","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-19T15:17:48.000Z","updated_at":"2025-04-03T21:18:05.000Z","dependencies_parsed_at":"2024-12-08T07:03:22.467Z","dependency_job_id":"326f3a98-1847-4719-9426-377f78e56d29","html_url":"https://github.com/scrapy/protego","commit_stats":{"total_commits":114,"total_committers":16,"mean_commits":7.125,"dds":0.5087719298245614,"last_synced_commit":"7bef537448ff9bfef343f8db8f6008f4a7942d5b"},"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fprotego","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fprotego/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fprotego/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fprotego/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapy","download_url":"https://codeload.github.com/scrapy/protego/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254453646,"owners_count":22073616,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hacktoberfest","python","robots-parser","robots-txt"],"created_at":"2024-11-16T08:10:32.066Z","updated_at":"2026-01-29T12:18:08.973Z","avatar_url":"https://github.com/scrapy.png","language":"DIGITAL Command Language","readme":"=======\nProtego\n=======\n\n.. image:: https://img.shields.io/pypi/pyversions/protego.svg\n   :target: https://pypi.python.org/pypi/protego\n   :alt: Supported Python Versions\n\n.. image:: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml/badge.svg\n   :target: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml\n   :alt: CI\n\nProtego is a pure-Python ``robots.txt`` parser with support for modern\nconventions.\n\n\nInstall\n=======\n\nTo install Protego, simply use pip:\n\n.. code-block:: none\n\n    pip install protego\n\n\nUsage\n=====\n\n.. code-block:: pycon\n\n   \u003e\u003e\u003e from protego import Protego\n   \u003e\u003e\u003e robotstxt = \"\"\"\n   ... User-agent: *\n   ... Disallow: /\n   ... Allow: /about\n   ... Allow: /account\n   ... Disallow: /account/contact$\n   ... Disallow: /account/*/profile\n   ... Crawl-delay: 4\n   ... 
Request-rate: 10/1m                 # 10 requests every 1 minute\n   ...\n   ... Sitemap: http://example.com/sitemap-index.xml\n   ... Host: http://example.co.in\n   ... \"\"\"\n   \u003e\u003e\u003e rp = Protego.parse(robotstxt)\n   \u003e\u003e\u003e rp.can_fetch(\"http://example.com/profiles\", \"mybot\")\n   False\n   \u003e\u003e\u003e rp.can_fetch(\"http://example.com/about\", \"mybot\")\n   True\n   \u003e\u003e\u003e rp.can_fetch(\"http://example.com/account\", \"mybot\")\n   True\n   \u003e\u003e\u003e rp.can_fetch(\"http://example.com/account/myuser/profile\", \"mybot\")\n   False\n   \u003e\u003e\u003e rp.can_fetch(\"http://example.com/account/contact\", \"mybot\")\n   False\n   \u003e\u003e\u003e rp.crawl_delay(\"mybot\")\n   4.0\n   \u003e\u003e\u003e rp.request_rate(\"mybot\")\n   RequestRate(requests=10, seconds=60, start_time=None, end_time=None)\n   \u003e\u003e\u003e list(rp.sitemaps)\n   ['http://example.com/sitemap-index.xml']\n   \u003e\u003e\u003e rp.preferred_host\n   'http://example.co.in'\n\n\nUsing Protego with Requests_:\n\n.. code-block:: pycon\n\n   \u003e\u003e\u003e from protego import Protego\n   \u003e\u003e\u003e import requests\n   \u003e\u003e\u003e r = requests.get(\"https://google.com/robots.txt\")\n   \u003e\u003e\u003e rp = Protego.parse(r.text)\n   \u003e\u003e\u003e rp.can_fetch(\"https://google.com/search\", \"mybot\")\n   False\n   \u003e\u003e\u003e rp.can_fetch(\"https://google.com/search/about\", \"mybot\")\n   True\n   \u003e\u003e\u003e list(rp.sitemaps)\n   ['https://www.google.com/sitemap.xml']\n\n.. 
_Requests: https://3.python-requests.org/\n\n\nComparison\n==========\n\nThe following table compares Protego to the most popular ``robots.txt`` parsers\nimplemented in Python or featuring Python bindings:\n\n+----------------------------+---------+-----------------+--------+---------------------------+\n|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |\n+============================+=========+=================+========+===========================+\n| Implementation language    | Python  | Python          | C++    | Python                    |\n+----------------------------+---------+-----------------+--------+---------------------------+\n| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |\n+----------------------------+---------+-----------------+--------+---------------------------+\n| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |\n+----------------------------+---------+-----------------+--------+---------------------------+\n| `Length-based precedence`_ | ✓       |                 | ✓      |                           |\n+----------------------------+---------+-----------------+--------+---------------------------+\n| Performance_               |         | +40%            | +1300% | -25%                      |\n+----------------------------+---------+-----------------+--------+---------------------------+\n\n.. _Google: https://developers.google.com/search/reference/robots_txt\n.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines\n.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt\n.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/\n.. 
_Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values\n\n\nAPI Reference\n=============\n\nClass ``protego.Protego``:\n\nProperties\n----------\n\n*   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs\n    specified in ``robots.txt``.\n\n*   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.\n\n\nMethods\n-------\n\n*   ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of\n    ``protego.Protego``.\n\n*   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch\n    the URL, otherwise return ``False``.\n\n*   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user\n    agent as a float. If nothing is specified, return ``None``.\n\n*   ``request_rate(user_agent)`` Return the request rate specified for the user\n    agent as a named tuple ``RequestRate(requests, seconds, start_time,\n    end_time)``. If nothing is specified, return ``None``.\n\n*   ``visit_time(user_agent)`` Return the visit time specified for the user\n    agent as a named tuple ``VisitTime(start_time, end_time)``.\n    If nothing is specified, return ``None``.\n","funding_links":[],"categories":["\u003ca name=\"utils--user-agents\"\u003e\u003c/a\u003e🧰 Utils \u0026 User Agents"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fprotego","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy%2Fprotego","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fprotego/lists"}