{"id":13424706,"url":"https://github.com/fredwu/crawler","last_synced_at":"2025-05-15T04:06:39.454Z","repository":{"id":37405903,"uuid":"65206632","full_name":"fredwu/crawler","owner":"fredwu","description":"A high performance web crawler / scraper in Elixir.","archived":false,"fork":false,"pushed_at":"2024-06-19T03:05:25.000Z","size":394,"stargazers_count":951,"open_issues_count":0,"forks_count":90,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-05-15T04:06:27.529Z","etag":null,"topics":["crawler","elixir","files","offline","scraper","scraper-engine","spider"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fredwu.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-08-08T13:32:20.000Z","updated_at":"2025-05-07T04:13:58.000Z","dependencies_parsed_at":"2024-06-19T10:39:06.733Z","dependency_job_id":null,"html_url":"https://github.com/fredwu/crawler","commit_stats":{"total_commits":236,"total_committers":8,"mean_commits":29.5,"dds":0.0423728813559322,"last_synced_commit":"aadf912600f3e840e3330084bac6b7b3cdb3544e"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fredwu%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fredwu%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fredwu%2Fcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/fredwu%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fredwu","download_url":"https://codeload.github.com/fredwu/crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270646,"owners_count":22042859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","elixir","files","offline","scraper","scraper-engine","spider"],"created_at":"2024-07-31T00:00:58.132Z","updated_at":"2025-05-15T04:06:34.436Z","avatar_url":"https://github.com/fredwu.png","language":"Elixir","funding_links":[],"categories":["Elixir","HTTP"],"sub_categories":[],"readme":"# Crawler\n\n[![Build Status](https://github.com/fredwu/crawler/actions/workflows/ci.yml/badge.svg)](https://github.com/fredwu/crawler/actions)\n[![CodeBeat](https://codebeat.co/badges/76916047-5b66-466d-91d3-7131a269899a)](https://codebeat.co/projects/github-com-fredwu-crawler-master)\n[![Coverage](https://img.shields.io/coveralls/fredwu/crawler.svg)](https://coveralls.io/github/fredwu/crawler?branch=master)\n[![Module Version](https://img.shields.io/hexpm/v/crawler.svg)](https://hex.pm/packages/crawler)\n[![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/crawler/)\n[![Total Download](https://img.shields.io/hexpm/dt/crawler.svg)](https://hex.pm/packages/crawler)\n[![License](https://img.shields.io/hexpm/l/crawler.svg)](https://github.com/fredwu/crawler/blob/master/LICENSE.md)\n[![Last 
Updated](https://img.shields.io/github/last-commit/fredwu/crawler.svg)](https://github.com/fredwu/crawler/commits/master)\n\nA high performance web crawler / scraper in Elixir, with worker pooling and rate limiting via [OPQ](https://github.com/fredwu/opq).\n\n## Features\n\n- Crawl assets (javascript, css and images).\n- Save to disk.\n- Hook for scraping content.\n- Restrict crawlable domains, paths or content types.\n- Limit concurrent crawlers.\n- Limit rate of crawling.\n- Set the maximum crawl depth.\n- Set timeouts.\n- Set retries strategy.\n- Set crawler's user agent.\n- Manually pause/resume/stop the crawler.\n\nSee [Hex documentation](https://hexdocs.pm/crawler/).\n\n## Architecture\n\nBelow is a very high level architecture diagram demonstrating how Crawler works.\n\n![](architecture.svg)\n\n## Usage\n\n```elixir\nCrawler.crawl(\"http://elixir-lang.org\", max_depths: 2)\n```\n\nThere are several ways to access the crawled page data:\n\n1. Use [`Crawler.Store`](https://hexdocs.pm/crawler/Crawler.Store.html)\n2. Tap into the registry([?](https://hexdocs.pm/elixir/Registry.html)) [`Crawler.Store.DB`](lib/crawler/store.ex)\n3. Use your own [scraper](#custom-modules)\n4. If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places\n5. 
Provide your own [custom parser](#custom-modules) and manage how data is stored and accessed yourself\n\n## Configurations\n\n| Option        | Type    | Default Value               | Description                                                                                                                                                                               |\n| ------------- | ------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `:assets`     | list    | `[]`                        | Whether to fetch any asset files, available options: `\"css\"`, `\"js\"`, `\"images\"`.                                                                                                         |\n| `:save_to`    | string  | `nil`                       | When provided, the path for saving crawled pages.                                                                                                                                         |\n| `:workers`    | integer | `10`                        | Maximum number of concurrent workers for crawling.                                                                                                                                        |\n| `:interval`   | integer | `0`                         | Rate limit control - number of milliseconds before crawling more pages, defaults to `0` which is effectively no rate limit.                                                               |\n| `:max_depths` | integer | `3`                         | Maximum nested depth of pages to crawl.                                                                                                                                                   |\n| `:max_pages`  | integer | `:infinity`                 | Maximum amount of pages to crawl.                                                
                                                                                                         |\n| `:timeout`    | integer | `5000`                      | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`.                                                                    |\n| `:retries`    | integer | `2`                         | Number of times to retry a fetch.                                                                                                                                                         |\n| `:store`      | module  | `nil`                       | Module for storing the crawled page data and crawling metadata. You can set it to `Crawler.Store` or use your own module, see `Crawler.Store.add_page_data/3` for implementation details. |\n| `:force`      | boolean | `false`                     | Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data.                                                                               |\n| `:scope`      | term    | `nil`                       | Similar to `:force`, but you can pass a custom `:scope` to determine how Crawler should perform on links already seen.                                                                    |\n| `:user_agent` | string  | `Crawler/x.x.x (...)`       | User-Agent value sent by the fetch requests.                                                                                                                                              |\n| `:url_filter` | module  | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types.                                                                                                      |\n| `:retrier`    | module  | `Crawler.Fetcher.Retrier`   | Custom fetch retrier, useful for retrying failed crawls, nullifies the `:retries` option.                
                                                                                 |\n| `:modifier`   | module  | `Crawler.Fetcher.Modifier`  | Custom modifier, useful for adding custom request headers or options.                                                                                                                     |\n| `:scraper`    | module  | `Crawler.Scraper`           | Custom scraper, useful for scraping content as soon as the parser parses it.                                                                                                              |\n| `:parser`     | module  | `Crawler.Parser`            | Custom parser, useful for handling parsing differently or to add extra functionalities.                                                                                                   |\n| `:encode_uri` | boolean | `false`                     | When set to `true` apply the `URI.encode` to the URL to be crawled.                                                                                                                       |\n| `:queue`      | pid     | `nil`                       | You can pass in an `OPQ` pid so that multiple crawlers can share the same queue.                                                                                                          |\n\n## Custom Modules\n\nIt is possible to swap in your custom logic as shown in the configurations section. 
Your custom modules need to conform to their respective behaviours:\n\n### Retrier\n\nSee [`Crawler.Fetcher.Retrier`](lib/crawler/fetcher/retrier.ex).\n\nCrawler uses [ElixirRetry](https://github.com/safwank/ElixirRetry)'s exponential backoff strategy by default.\n\n```elixir\ndefmodule CustomRetrier do\n  @behaviour Crawler.Fetcher.Retrier.Spec\nend\n```\n\n### URL Filter\n\nSee [`Crawler.Fetcher.UrlFilter`](lib/crawler/fetcher/url_filter.ex).\n\n```elixir\ndefmodule CustomUrlFilter do\n  @behaviour Crawler.Fetcher.UrlFilter.Spec\nend\n```\n\n### Scraper\n\nSee [`Crawler.Scraper`](lib/crawler/scraper.ex).\n\n```elixir\ndefmodule CustomScraper do\n  @behaviour Crawler.Scraper.Spec\nend\n```\n\n### Parser\n\nSee [`Crawler.Parser`](lib/crawler/parser.ex).\n\n```elixir\ndefmodule CustomParser do\n  @behaviour Crawler.Parser.Spec\nend\n```\n\n### Modifier\n\nSee [`Crawler.Fetcher.Modifier`](lib/crawler/fetcher/modifier.ex).\n\n```elixir\ndefmodule CustomModifier do\n  @behaviour Crawler.Fetcher.Modifier.Spec\nend\n```\n\n## Pause / Resume / Stop Crawler\n\nCrawler provides `pause/1`, `resume/1` and `stop/1`, as shown below.\n\n```elixir\n{:ok, opts} = Crawler.crawl(\"https://elixir-lang.org\")\n\nCrawler.running?(opts) # =\u003e true\n\nCrawler.pause(opts)\n\nCrawler.running?(opts) # =\u003e false\n\nCrawler.resume(opts)\n\nCrawler.running?(opts) # =\u003e true\n\nCrawler.stop(opts)\n\nCrawler.running?(opts) # =\u003e false\n```\n\nPlease note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`); otherwise the parser will time out due to unprocessed links.\n\n## Multiple Crawlers\n\nIt is possible to start multiple crawlers sharing the same queue.\n\n```elixir\n{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)\n\nCrawler.crawl(\"https://elixir-lang.org\", queue: queue)\nCrawler.crawl(\"https://github.com\", queue: queue)\n```\n\n## Find All Scraped URLs\n\n```elixir\nCrawler.Store.all_urls() # =\u003e 
[\"https://elixir-lang.org\", \"https://google.com\", ...]\n```\n\n## Examples\n\n### Google Search + GitHub\n\nThis example performs a Google search, then scrapes the results to find GitHub projects and outputs their names and descriptions.\n\nSee the [source code](examples/google_search.ex).\n\nYou can run the example by cloning the repo and running the command:\n\n```shell\nmix run -e \"Crawler.Example.GoogleSearch.run()\"\n```\n\n## API Reference\n\nPlease see https://hexdocs.pm/crawler.\n\n## Changelog\n\nPlease see [CHANGELOG.md](CHANGELOG.md).\n\n## Copyright and License\n\nCopyright (c) 2016 Fred Wu\n\nThis work is free. You can redistribute it and/or modify it under the\nterms of the [MIT License](http://fredwu.mit-license.org/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffredwu%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffredwu%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffredwu%2Fcrawler/lists"}