https://github.com/temoto/robotstxt
The robots.txt exclusion protocol implementation for Go language
go go-library golang golang-library production-ready robots-txt status-active web
- Host: GitHub
- URL: https://github.com/temoto/robotstxt
- Owner: temoto
- License: MIT
- Created: 2010-07-12T10:54:05.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2022-11-09T09:51:34.000Z (about 2 years ago)
- Last Synced: 2025-01-18T16:04:45.522Z (8 days ago)
- Topics: go, go-library, golang, golang-library, production-ready, robots-txt, status-active, web
- Language: Go
- Size: 94.7 KB
- Stars: 271
- Watchers: 10
- Forks: 55
- Open Issues: 4
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
What
====

This is a robots.txt exclusion protocol implementation for the Go language (golang).

Build
=====

To build and run the tests, run `go test` in the source directory.

Contribute
==========

Warm welcome.

* If desired, add your name in README.rst, section Who.
* Run `script/test && script/clean && echo ok`
* You can ignore linter warnings, but everything else must pass.
* Send your change as a pull request or just a regular patch to the current maintainer (see section Who).

Thank you.

Usage
=====

As usual, no special installation is required, just ::

    import "github.com/temoto/robotstxt"

run `go get` and you're ready.

1. Parse
^^^^^^^^

First of all, you need to parse robots.txt data. You can do it with the
function `FromBytes(body []byte) (*RobotsData, error)` or its `string` counterpart, `FromString`::

    robots, err := robotstxt.FromBytes([]byte("User-agent: *\nDisallow:"))
    robots, err := robotstxt.FromString("User-agent: *\nDisallow:")

As of 2012-10-03, `FromBytes` is the most efficient method; everything else
is a wrapper for this core function.

There are a few convenient constructors for various purposes:

* `FromResponse(*http.Response) (*RobotsData, error)` to init robots data
  from an HTTP response. It *does not* call `response.Body.Close()`::

      robots, err := robotstxt.FromResponse(resp)
      resp.Body.Close()
      if err != nil {
          log.Println("Error parsing robots.txt:", err.Error())
      }

* `FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error)` or
  `FromStatusAndString` if you prefer to read bytes (string) yourself.
  Passing the status code applies the following logic, in line with Google's
  interpretation of robots.txt files:

  * status 2xx -> parse body with `FromBytes` and apply rules listed there.
  * status 4xx -> allow all (even 401/403, as recommended by Google).
  * other (5xx) -> disallow all, consider this a temporary unavailability.

2. Query
^^^^^^^^

Parsing robots.txt content builds a kind of logic database, which you can
query with `(r *RobotsData) TestAgent(url, agent string) (bool)`.

Passing the agent explicitly is useful if you want to query for different agents. For
single-agent users there is a more efficient option: `RobotsData.FindGroup(userAgent string)`
returns a structure with a `.Test(path string)` method and a `.CrawlDelay time.Duration` field.

Simple query with explicit user agent. Each call will scan all rules.
::

    allow := robots.TestAgent("/", "FooBot")

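As a rough mental model of the status-code handling and the two query styles described above, here is a toy sketch that uses only the standard library. It is not this package's implementation: the types `group` and `rules` and the functions `fromStatus`, `findGroup` and `testAgent` are hypothetical stand-ins, the matching is plain prefix matching, and the catch-all branch follows the "other (5xx)" wording of the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// group is a toy stand-in for the rules that apply to one user agent.
// It is NOT the library's own group type.
type group struct {
	agent    string   // lower-case agent token; "*" matches any agent
	disallow []string // path prefixes that must not be fetched
}

// test reports whether path is allowed by this group's prefix rules.
func (g *group) test(path string) bool {
	for _, prefix := range g.disallow {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

// rules is a toy stand-in for parsed robots.txt data.
type rules struct {
	groups []group
}

// fromStatus mirrors the status-code policy from the list above:
// 2xx uses the parsed groups, 4xx allows everything,
// and any other status (the list names 5xx) disallows everything.
func fromStatus(status int, parsed []group) *rules {
	switch {
	case status >= 200 && status < 300:
		return &rules{groups: parsed}
	case status >= 400 && status < 500:
		return &rules{groups: []group{{agent: "*"}}}
	default:
		return &rules{groups: []group{{agent: "*", disallow: []string{"/"}}}}
	}
}

// findGroup selects the group for agent once, falling back to "*".
func (r *rules) findGroup(agent string) *group {
	var fallback *group
	name := strings.ToLower(agent)
	for i := range r.groups {
		g := &r.groups[i]
		if g.agent == "*" {
			fallback = g
		} else if strings.Contains(name, g.agent) {
			return g
		}
	}
	if fallback == nil {
		return &group{agent: "*"} // no matching group: nothing is disallowed
	}
	return fallback
}

// testAgent repeats the group lookup on every call.
func (r *rules) testAgent(path, agent string) bool {
	return r.findGroup(agent).test(path)
}

func main() {
	parsed := []group{
		{agent: "barbot", disallow: []string{"/download"}},
		{agent: "*"},
	}

	r := fromStatus(200, parsed)
	g := r.findGroup("BarBot") // look up once, then test many paths
	fmt.Println(g.test("/"), g.test("/download.mp3"))   // true false
	fmt.Println(r.testAgent("/download.mp3", "FooBot")) // true (falls back to "*")

	fmt.Println(fromStatus(404, parsed).testAgent("/download.mp3", "BarBot")) // true
	fmt.Println(fromStatus(503, parsed).testAgent("/", "BarBot"))             // false
}
```

In this sketch the per-call cost of `testAgent` is the repeated group lookup, which is the reason to hold on to the result of a single lookup when one agent checks many paths.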
To query several paths against the same user agent, look up the group once for better performance.

::

    group := robots.FindGroup("BarBot")
    group.Test("/")
    group.Test("/download.mp3")
    group.Test("/news/article-2012-1")

Who
===

Honorable contributors (in undefined order):

* Ilya Grigorik (igrigorik)
* Martin Angers (PuerkitoBio)
* Micha Gorelick (mynameisfiber)

Initial commit and other: Sergey Shepelev [email protected]

Flair
=====

.. image:: https://travis-ci.org/temoto/robotstxt.svg?branch=master
    :target: https://travis-ci.org/temoto/robotstxt

.. image:: https://codecov.io/gh/temoto/robotstxt/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/temoto/robotstxt

.. image:: https://goreportcard.com/badge/github.com/temoto/robotstxt
    :target: https://goreportcard.com/report/github.com/temoto/robotstxt