{"id":18976081,"url":"https://github.com/jimsmart/grobotstxt","last_synced_at":"2025-04-19T17:10:35.653Z","repository":{"id":54998014,"uuid":"257495282","full_name":"jimsmart/grobotstxt","owner":"jimsmart","description":"grobotstxt is a native Go port of Google's robots.txt parser and matcher library.","archived":false,"fork":false,"pushed_at":"2022-03-16T16:41:25.000Z","size":244,"stargazers_count":110,"open_issues_count":1,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-16T12:18:41.959Z","etag":null,"topics":["go","robots-exclusion-protocol","robots-txt"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jimsmart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-21T06:01:05.000Z","updated_at":"2025-03-29T22:24:04.000Z","dependencies_parsed_at":"2022-08-14T08:40:16.823Z","dependency_job_id":null,"html_url":"https://github.com/jimsmart/grobotstxt","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimsmart%2Fgrobotstxt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimsmart%2Fgrobotstxt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimsmart%2Fgrobotstxt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimsmart%2Fgrobotstxt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jimsmart","download_url":"https://codeload.github.com/jimsmart/grobotstxt/tar.gz/refs/heads/main","host":
{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249746041,"owners_count":21319581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","robots-exclusion-protocol","robots-txt"],"created_at":"2024-11-08T15:22:17.206Z","updated_at":"2025-04-19T17:10:35.630Z","avatar_url":"https://github.com/jimsmart.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# grobotstxt\n\n[![Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)\n[![Build Status](https://github.com/jimsmart/grobotstxt/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/jimsmart/grobotstxt/actions/workflows/main.yml)\n[![codecov](https://codecov.io/gh/jimsmart/grobotstxt/branch/master/graph/badge.svg)](https://codecov.io/gh/jimsmart/grobotstxt)\n[![Go Report Card](https://goreportcard.com/badge/github.com/jimsmart/grobotstxt?cache-buster)](https://goreportcard.com/report/github.com/jimsmart/grobotstxt)\n[![Used By](https://img.shields.io/sourcegraph/rrc/github.com/jimsmart/grobotstxt.svg)](https://sourcegraph.com/github.com/jimsmart/grobotstxt)\n[![Godoc](https://img.shields.io/badge/godoc-reference-blue.svg)](https://godoc.org/github.com/jimsmart/grobotstxt)\n\ngrobotstxt is a native Go port of [Google's robots.txt parser and matcher C++\nlibrary](https://github.com/google/robotstxt).\n\n- Direct function-for-function conversion/port\n- Preserves all behaviour of the original library\n- Includes 100% of the original test suite functionality\n- Minor language-specific cleanups\n- Added a helper to extract Sitemap 
URIs\n- Super simple API\n\nAs per Google's original library, we include a small standalone binary executable,\nfor webmasters, that allows testing a single URL and user-agent against\na robots.txt. Ours is called `icanhasrobot`, and its inputs and outputs\nare compatible with the original tool.\n\n## About\n\nQuoting the README from Google's robots.txt parser and matcher repo:\n\n\u003e The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.\n\u003e\n\u003e Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.\n\u003e\n\u003e The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. 
The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.\n\nPackage grobotstxt aims to be a faithful conversion, from C++ to Go, of Google's robots.txt parser and matcher.\n\n## Quickstart\n\n### Installation\n\n#### For developers\n\nGet the package (only needed if not using modules):\n\n```bash\ngo get github.com/jimsmart/grobotstxt\n```\n\nUse the package within your code (see examples below):\n\n```go\nimport \"github.com/jimsmart/grobotstxt\"\n```\n\n#### For webmasters\n\nAssumes Go is installed, and its environment is already set up.\n\nFetch the package:\n\n```bash\ngo get github.com/jimsmart/grobotstxt\n```\n\nBuild and install the standalone binary executable:\n\n```bash\ngo install github.com/jimsmart/grobotstxt/...\n```\n\nBy default, the resulting binary executable will be `~/go/bin/icanhasrobot` (assuming no customisation has been made to `$GOPATH` or `$GOBIN`).\n\nUse the tool:\n\n```bash\n$ icanhasrobot ~/local/path/to/robots.txt YourBot https://example.com/url\nuser-agent 'YourBot' with URI 'https://example.com/url': ALLOWED\n```\n\nAdditionally, one can pass multiple user-agent names to the tool, using comma-separated values, e.g.\n\n```bash\n$ icanhasrobot ~/local/path/to/robots.txt Googlebot,Googlebot-image https://example.com/url\nuser-agent 'Googlebot,Googlebot-image' with URI 'https://example.com/url': ALLOWED\n```\n\nIf `$GOBIN` is not included in your environment's `$PATH`, use the full path `~/go/bin/icanhasrobot` when invoking the executable.\n\n### Example Code\n\n#### `AgentAllowed`\n\n```go\nimport \"github.com/jimsmart/grobotstxt\"\n\n// Contents of robots.txt file.\nrobotsTxt := `\n    # robots.txt with restricted area\n\n    User-agent: *\n    Disallow: /members/*\n\n    Sitemap: http://example.net/sitemap.xml\n`\n\n// Target URI.\nuri := \"http://example.net/members/index.html\"\n\n\n// Is bot allowed to visit this page?\nok := 
grobotstxt.AgentAllowed(robotsTxt, \"FooBot/1.0\", uri)\n\n```\n\nSee also `AgentsAllowed`.\n\n#### `Sitemaps`\n\nAdditionally, one can also extract all Sitemap URIs from a given robots.txt file:\n\n```go\nsitemaps := grobotstxt.Sitemaps(robotsTxt)\n```\n\n## Documentation\n\nGoDocs [https://godoc.org/github.com/jimsmart/grobotstxt](https://godoc.org/github.com/jimsmart/grobotstxt)\n\n## Testing\n\nTo run the tests, execute `go test` inside the project folder.\n\nFor a full coverage report, try:\n\n```bash\ngo test -coverprofile=coverage.out \u0026\u0026 go tool cover -html=coverage.out\n```\n\n## Notes\n\nThe original library required that the URI passed to the\n`AgentAllowed` and `AgentsAllowed` functions, or to the URI parameter\nof the standalone binary tool, should follow the encoding/escaping format specified by RFC3986, because the library itself does not perform URI normalisation.\n\nIn Go, with its native UTF-8 strings, this requirement is not in line with other commonly used APIs, and is therefore somewhat of a surprising/unexpected behaviour to Go developers.\n\nBecause of this, the Go API presented here has been amended to automatically handle UTF-8 URIs, and performs any necessary normalisation internally.\n\nThis is the only behavioural change between grobotstxt and the original C++ library.\n\n## License\n\nLike the original library, package grobotstxt is licensed under the terms of the\nApache License, Version 2.0.\n\nSee [LICENSE](LICENSE) for more information.\n\n## Links\n\n- Original project:\n    [Google robots.txt parser and matcher library](https://github.com/google/robotstxt)\n\n## History\n\n- v1.0.3 (2022-03-16) Updates from upstream: Allow additional misspelling of 'disallow'. Additional tests. Make icanhasrobot tool return better exit codes. Make icanhasrobot work with multiple UAs.\n- v1.0.2 (2022-03-16) Bugfix: Allow wider range of characters for user-agent.\n- v1.0.1 (2021-04-19) Updated modules. 
Switch from Travis CI to GitHub Actions.\n- v1.0.0 (2021-04-18) Tagged as stable.\n- v0.2.1 (2021-01-16) Expose more methods of RobotsMatcher as public. Thanks to [anatolym](https://github.com/anatolym)\n- v0.2.0 (2020-04-24) Removed requirement for pre-encoded RFC3986 URIs on front-facing API.\n- v0.1.0 (2020-04-23) Initial release.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimsmart%2Fgrobotstxt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimsmart%2Fgrobotstxt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimsmart%2Fgrobotstxt/lists"}