{"id":13581945,"url":"https://github.com/yields/ant","last_synced_at":"2025-05-16T15:02:15.115Z","repository":{"id":38376287,"uuid":"299020776","full_name":"yields/ant","owner":"yields","description":"A web crawler for Go","archived":false,"fork":false,"pushed_at":"2025-03-10T03:06:03.000Z","size":172,"stargazers_count":278,"open_issues_count":7,"forks_count":17,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-12T12:54:34.330Z","etag":null,"topics":["go","golang","scraper","spider","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yields.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-27T11:50:37.000Z","updated_at":"2025-03-16T01:49:04.000Z","dependencies_parsed_at":"2025-03-05T02:11:20.391Z","dependency_job_id":"97f02494-eaed-4b2f-94d1-7d6781fde32d","html_url":"https://github.com/yields/ant","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yields%2Fant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yields%2Fant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yields%2Fant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yields%2Fant/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yields","download_url":"https://codeload.github.com/yields/ant/tar.gz/refs/heads/master","host":{"name":"GitH
ub","url":"https://github.com","kind":"github","repositories_count":254553936,"owners_count":22090415,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","scraper","spider","web-crawler"],"created_at":"2024-08-01T15:02:20.530Z","updated_at":"2025-05-16T15:02:15.046Z","avatar_url":"https://github.com/yields.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\n\u003cp align=center\u003e\n  ant (\u003cem\u003ealpha\u003c/em\u003e) is a web crawler for Go.\n\u003c/p\u003e\n\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\n\u003cp align=center\u003e\n  \u003ca href=\"https://github.com/yields/ant/workflows/test\"\u003e\n    \u003cimg src=\"https://github.com/yields/ant/workflows/test/badge.svg?event=push\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pkg.go.dev/github.com/yields/ant\"\u003e\n    \u003cimg src=\"https://pkg.go.dev/badge/github.com/yields/ant\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://goreportcard.com/report/github.com/yields/ant\"\u003e\n    \u003cimg src=\"https://goreportcard.com/badge/github.com/yields/ant\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\n\n\u003cbr\u003e\n\n#### Declarative\n\n  The package includes functions that can scan data from the page into your structs\n  or slices of structs, which reduces the noise and complexity in your source code.\n\n  If needed, you can also use the jQuery-like API to scrape complex HTML pages.\n\n  ```go\n\n  var data struct 
{ Title string `css:\"title\"` }\n  page, _ := ant.Fetch(ctx, \"https://apple.com\")\n  page.Scan(\u0026data)\n  data.Title // =\u003e Apple\n  ```\n\n\u003cbr\u003e\n\n#### Headless\n\n  By default the crawler uses `http.Client`. If you're crawling SPAs,\n  you can use the `antcdp.Client` implementation, which crawls pages\n  with a headless Chrome browser.\n\n  ```go\n  eng, err := ant.NewEngine(ant.EngineConfig{\n    Fetcher: \u0026ant.Fetcher{\n      Client: antcdp.Client{},\n    },\n  })\n  ```\n\n\u003cbr\u003e\n\n#### Polite\n\n  The crawler automatically fetches and caches `robots.txt`, making sure that\n  it doesn't cause issues for small website owners. Of course, you can disable\n  this behavior.\n\n  ```go\n  eng, err := ant.NewEngine(ant.EngineConfig{\n    Impolite: true,\n  })\n  eng.Run(ctx)\n  ```\n\n\u003cbr\u003e\n\n#### Concurrent\n\n  The crawler maintains a configurable number of \"worker\" goroutines that read\n  URLs off the queue and spawn a goroutine for each URL.\n\n  Depending on your configuration, you may want to increase the number of workers\n  to speed up URL reads. If you don't have enough resources, you can reduce\n  the number of workers instead.\n\n  ```go\n  eng, err := ant.NewEngine(ant.EngineConfig{\n    // Spawn 5 worker goroutines that dequeue\n    // URLs and spawn a new goroutine for each URL.\n    Workers: 5,\n  })\n  eng.Run(ctx)\n  ```\n\n\u003cbr\u003e\n\n#### Rate limits\n\n  The package includes a powerful `ant.Limiter` interface that allows you to\n  define rate limits per URL. 
There are some built-in limiters as well.\n\n  ```go\n  ant.Limit(1) // 1 rps on all URLs.\n  ant.LimitHostname(5, \"amazon.com\") // 5 rps on amazon.com hostname.\n  ant.LimitPattern(5, \"amazon.com.*\") // 5 rps on URLs starting with `amazon.com.`.\n  ant.LimitRegexp(5, `^apple.com\\/iphone\\/*`) // 5 rps on URLs that match the regex.\n  ```\n\n  Note that `LimitPattern` and `LimitRegexp` only match on the host and path of the URL.\n\n\u003cbr\u003e\n\n#### Matchers\n\n  Another powerful interface is `ant.Matcher`, which allows you to define URL\n  matchers. Matchers are called before URLs are queued.\n\n  ```go\n  ant.MatchHostname(\"amazon.com\") // scrape amazon.com URLs only.\n  ant.MatchPattern(\"amazon.com/help/*\")\n  ant.MatchRegexp(`amazon\\.com\\/help/.+`)\n  ```\n\n\u003cbr\u003e\n\n#### Robust\n\n  The crawl engine automatically retries any error that implements a `Temporary()`\n  method that returns true.\n\n  Because the standard library returns errors that implement that interface,\n  the engine will retry most temporary network and HTTP errors.\n\n  ```go\n  eng, err := ant.NewEngine(ant.EngineConfig{\n    Scraper: myscraper{},\n    MaxAttempts: 5,\n  })\n\n  // Blocks until one of the following is true:\n  //\n  // 1. No more URLs to crawl (the scraper stops returning URLs).\n  // 2. A non-temporary error occurred.\n  // 3. 
MaxAttempts was reached.\n  //\n  err = eng.Run(ctx)\n  ```\n\n\u003cbr\u003e\n\n#### Built-in Scrapers\n\n  The whole point of scraping is to extract data from websites into a machine-readable\n  format such as CSV or JSON. ant comes with built-in scrapers that make this\n  easy. Here's a full crawler that writes quotes to stdout.\n\n\n[embedmd]:# (_examples/jsonquotes/main.go /func main/ $)\n```go\nfunc main() {\n\tvar url = \"http://quotes.toscrape.com\"\n\tvar ctx = context.Background()\n\tvar start = time.Now()\n\n\ttype quote struct {\n\t\tText string   `css:\".text\"   json:\"text\"`\n\t\tBy   string   `css:\".author\" json:\"by\"`\n\t\tTags []string `css:\".tag\"    json:\"tags\"`\n\t}\n\n\ttype page struct {\n\t\tQuotes []quote `css:\".quote\" json:\"quotes\"`\n\t}\n\n\teng, err := ant.NewEngine(ant.EngineConfig{\n\t\tScraper: ant.JSON(os.Stdout, page{}, `li.next \u003e a`),\n\t\tMatcher: ant.MatchHostname(\"quotes.toscrape.com\"),\n\t})\n\tif err != nil {\n\t\tlog.Fatalf(\"new engine: %s\", err)\n\t}\n\n\tif err := eng.Run(ctx, url); err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tlog.Printf(\"scraped in %s :)\", time.Since(start))\n}\n```\n\u003cbr\u003e\n\n#### Testing\n\n  The `anttest` package makes it easy to test your scraper implementation.\n  It fetches a page by URL, caches it in the OS's temporary directory, and reuses it.\n\n  The cache relies on the file's modtime and expires daily. You can adjust\n  the TTL by setting `anttest.FetchTTL`.\n\n  ```go\n  // Fetch calls `t.Fatal` on errors.\n  page := anttest.Fetch(t, \"https://apple.com\")\n  _, err := myscraper.Scrape(ctx, page)\n  assert.NoError(err)\n  ```\n\n\u003cbr\u003e\n\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyields%2Fant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyields%2Fant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyields%2Fant/lists"}