{"id":13464606,"url":"https://github.com/PuerkitoBio/fetchbot","last_synced_at":"2025-03-25T11:31:51.213Z","repository":{"id":12994725,"uuid":"15673762","full_name":"PuerkitoBio/fetchbot","owner":"PuerkitoBio","description":"A simple and flexible web crawler that follows the robots.txt policies and crawl delays.","archived":false,"fork":false,"pushed_at":"2021-05-19T15:11:17.000Z","size":2115,"stargazers_count":786,"open_issues_count":2,"forks_count":95,"subscribers_count":34,"default_branch":"master","last_synced_at":"2024-11-17T12:40:40.231Z","etag":null,"topics":["crawler","robots-txt"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PuerkitoBio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":["mna"],"custom":["https://www.buymeacoffee.com/mna"]}},"created_at":"2014-01-06T12:45:18.000Z","updated_at":"2024-10-24T16:55:41.000Z","dependencies_parsed_at":"2022-09-16T13:46:19.323Z","dependency_job_id":null,"html_url":"https://github.com/PuerkitoBio/fetchbot","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PuerkitoBio%2Ffetchbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PuerkitoBio%2Ffetchbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PuerkitoBio%2Ffetchbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PuerkitoBio%2Ffetchbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Puerkito
Bio","download_url":"https://codeload.github.com/PuerkitoBio/fetchbot/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245454056,"owners_count":20617968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","robots-txt"],"created_at":"2024-07-31T14:00:47.196Z","updated_at":"2025-03-25T11:31:48.592Z","avatar_url":"https://github.com/PuerkitoBio.png","language":"Go","readme":"# fetchbot [![build status](https://secure.travis-ci.org/PuerkitoBio/fetchbot.svg)](http://travis-ci.org/PuerkitoBio/fetchbot) [![Go Reference](https://pkg.go.dev/badge/github.com/PuerkitoBio/fetchbot.svg)](https://pkg.go.dev/github.com/PuerkitoBio/fetchbot)\n\nPackage fetchbot provides a simple and flexible web crawler that follows the robots.txt\npolicies and crawl delays.\n\nIt is very much a rewrite of [gocrawl](https://github.com/PuerkitoBio/gocrawl) with a\nsimpler API, less features built-in, but at the same time more flexibility. As for Go\nitself, sometimes less is more!\n\n## Installation\n\nTo install, simply run in a terminal:\n\n    go get github.com/PuerkitoBio/fetchbot\n\nThe package has a single external dependency, [robotstxt](https://github.com/temoto/robotstxt). 
It also integrates code from the [iq package](https://github.com/kylelemons/iq).

The [API documentation is available on pkg.go.dev](https://pkg.go.dev/github.com/PuerkitoBio/fetchbot).

## Changes

* 2019-09-11 (v1.2.0): update the robotstxt dependency (the import path/repo URL has changed; issue #31, thanks to [@michael-stevens][michael-stevens] for raising the issue).
* 2017-09-04 (v1.1.1): fix a goroutine leak when cancelling a Queue (issue #26, thanks to [@ryu-koui][ryu] for raising the issue).
* 2017-07-06 (v1.1.0): add `Queue.Done` to get the done channel of the queue, allowing callers to wait in a `select` statement (thanks to [@DennisDenuto][denuto]).
* 2015-07-25 (v1.0.0): add a `Cancel` method on the `Queue` to close and drain it without requesting any pending commands, unlike `Close`, which waits for all pending commands to be processed (thanks to [@buro9][buro9] for the feature request).
* 2015-07-24: add `HandlerCmd` and call the Command's `Handler` function if it implements the `Handler` interface, bypassing the `Fetcher`'s handler. Support a `Custom` matcher on the `Mux`, using a predicate (thanks to [@mmcdole][mmcdole] for the feature requests).
* 2015-06-18: add a `Scheme` criterion on the muxer (thanks to [@buro9][buro9]).
* 2015-06-10: add a `DisablePoliteness` field on the `Fetcher` to optionally bypass robots.txt checks (thanks to [@oli-g][oli]).
* 2014-07-04: change the type of `Fetcher.HttpClient` from `*http.Client` to the `Doer` interface. Low chance of breaking existing code, but it is possible if someone used the fetcher's client to run other requests (e.g. `f.HttpClient.Get(...)`).

## Usage

The following example (taken from /example/short/main.go) shows how to create and
start a Fetcher, one way to send commands, and how to stop the fetcher once all
commands have been handled.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}
```

A more complex and complete example can be found in the repository, at /example/full/.

### Fetcher

Basically, a **Fetcher** is an instance of a web crawler, independent of other Fetchers.
It receives Commands via the **Queue**, executes the requests, and calls a **Handler** to
process the responses. A **Command** is an interface that tells the Fetcher which URL to
fetch, and which HTTP method to use (i.e. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the
thread-safe object that can be used to send commands, or to stop the crawler.

Both the Command and the Handler are interfaces, and may be implemented in various ways.
They are defined like so:

```go
type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}
```

A **Context** is a struct that holds the Command and the Queue, so that the Handler always
knows which Command initiated this call, and has a handle to the Queue.

A Handler is similar to the net/http Handler, and middleware-style combinations can
be built on top of it.
A HandlerFunc type is provided so that simple functions
with the right signature can be used as Handlers (like net/http.HandlerFunc), and there
is also a multiplexer Mux that can be used to dispatch calls to different Handlers
based on some criteria.

### Command-related Interfaces

The Fetcher recognizes a number of interfaces that the Command may implement, for
more advanced needs.

* `BasicAuthProvider`: Implement this interface to specify the basic authentication
credentials to set on the request.

* `CookiesProvider`: If the Command implements this interface, the provided cookies
will be set on the request.

* `HeaderProvider`: Implement this interface to specify the headers to set on the
request.

* `ReaderProvider`: Implement this interface to set the body of the request via
an `io.Reader`.

* `ValuesProvider`: Implement this interface to set the body of the request as
form-encoded values. If the Content-Type is not specifically set via a `HeaderProvider`,
it is set to "application/x-www-form-urlencoded". `ReaderProvider` and `ValuesProvider`
should be mutually exclusive, as they both set the body of the request. If both are
implemented, the `ReaderProvider` interface is used.

* `Handler`: Implement this interface if the Command's response should be handled
by a specific callback function. By default, the response is handled by the Fetcher's
Handler, but if the Command implements this interface, its handler takes precedence
and the Fetcher's Handler is ignored.

Since the Command is an interface, it can be a custom struct that holds additional
information, such as an ID for the URL (e.g. from a database), or a depth counter
so that the crawling stops at a certain depth, etc. For basic commands that don't
require additional information, the package provides the Cmd struct that implements
the Command interface.
This Cmd struct is the Command implementation used by the
various Queue.SendString\* methods.

There is also a convenience `HandlerCmd` struct for commands that should be handled
by a specific callback function. It is a Command with a Handler interface implementation.

### Fetcher Options

The Fetcher has a number of fields that provide further customization:

* HttpClient: By default, the Fetcher uses the net/http default Client to make requests. A
different client can be set on the Fetcher.HttpClient field.

* CrawlDelay: The delay to apply between requests to a given host. This value is used only
if there is no delay specified by the robots.txt of that host.

* UserAgent: Sets the user agent string to use for the requests and to validate
against the robots.txt entries.

* WorkerIdleTTL: Sets the duration that a worker goroutine can wait without receiving
new commands to fetch. If the idle time-to-live is reached, the worker goroutine
is stopped and its resources are released. This can be especially useful for
long-running crawlers.

* AutoClose: If true, closes the queue automatically once the number of active hosts
reaches 0.

* DisablePoliteness: If true, ignores the robots.txt policies of the hosts.

What fetchbot doesn't do, especially compared to gocrawl, is keep track of already-visited
URLs or normalize URLs. This is outside the scope of this package: all commands sent on
the Queue will be fetched.
Normalization can easily be done (e.g. using [purell](https://github.com/PuerkitoBio/purell)) before sending the Command to the Fetcher.
How to keep track of visited URLs depends on the use case of the specific crawler,
but for an example, see /example/full/main.go.

## License

The [BSD 3-Clause license](http://opensource.org/licenses/BSD-3-Clause), the same as
the Go language.
The iq package source code is under the CDDL-1.0 license (details in
the source file).

[oli]: https://github.com/oli-g
[buro9]: https://github.com/buro9
[mmcdole]: https://github.com/mmcdole
[denuto]: https://github.com/DennisDenuto
[ryu]: https://github.com/ryu-koui
[michael-stevens]: https://github.com/michael-stevens