{"id":13502900,"url":"https://github.com/a11ywatch/crawler","last_synced_at":"2025-10-16T19:31:05.490Z","repository":{"id":39654064,"uuid":"351596803","full_name":"a11ywatch/crawler","owner":"a11ywatch","description":"gRPC web crawler turbo charged for performance","archived":false,"fork":false,"pushed_at":"2024-08-26T17:38:04.000Z","size":686,"stargazers_count":52,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-26T01:08:33.978Z","etag":null,"topics":["a11ywatch","crawler","grpc","scraper"],"latest_commit_sha":null,"homepage":"https://docs.rs/crate/website_crawler/latest","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/a11ywatch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["a11ywatch"]}},"created_at":"2021-03-25T22:48:54.000Z","updated_at":"2025-01-08T23:27:18.000Z","dependencies_parsed_at":"2024-06-19T03:02:24.452Z","dependency_job_id":"600f2a99-6016-4577-a92a-6f6703c9aed2","html_url":"https://github.com/a11ywatch/crawler","commit_stats":{"total_commits":242,"total_committers":2,"mean_commits":121.0,"dds":0.09090909090909094,"last_synced_commit":"f76d407fb83f380c819be7ed9cbb9d44923f44c6"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a11ywatch%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a11ywatch%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a11ywatch%2Fc
rawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a11ywatch%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/a11ywatch","download_url":"https://codeload.github.com/a11ywatch/crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236742436,"owners_count":19197508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["a11ywatch","crawler","grpc","scraper"],"created_at":"2024-07-31T22:02:29.139Z","updated_at":"2025-10-16T19:31:00.178Z","avatar_url":"https://github.com/a11ywatch.png","language":"Rust","funding_links":["https://github.com/sponsors/a11ywatch"],"categories":["Rust","Web Crawler"],"sub_categories":[],"readme":"# crawler\n\nA [gRPC](https://grpc.io/) web indexer turbo charged for performance. \n\nThis project is capable of handling millions of pages per second efficiently.\n\n## Getting Started\n\nMake sure to have [Rust](https://doc.rust-lang.org/book/ch01-01-installation.html) installed or Docker.\n\nThis project requires that you start up another gRPC server on port `50051` following the [proto spec](https://github.com/a11ywatch/protobuf/blob/main/website.proto).\n\nThe user agent is spoofed on each crawl to a random agent and the indexer extends [spider](https://github.com/spider-rs/spider) as the base.\n\n1. 
`cargo run` or `docker compose up`\n\n## Installation\n\nYou can install the crawler in any of the following ways:\n\n### Cargo\n\nThe [crate](https://crates.io/crates/website_crawler) is available to set up a gRPC server within Rust projects.\n\n```sh\ncargo install website_crawler\n```\n\n### Docker\n\nYou can also use the Docker image at [a11ywatch/crawler](https://hub.docker.com/repository/docker/a11ywatch/crawler).\n\nSet the `CRAWLER_IMAGE` env var to `darwin-arm64` to get the native M1 Mac image.\n\n```yml\ncrawler:\n  container_name: crawler\n  image: \"a11ywatch/crawler:${CRAWLER_IMAGE:-latest}\"\n  ports:\n    - 50055\n```\n\n### Node / Bun\n\nWe also release the package to npm as [@a11ywatch/crawler](https://www.npmjs.com/package/@a11ywatch/crawler).\n\n```sh\nnpm i @a11ywatch/crawler\n```\n\nThen import the package at the top of your project to start the gRPC server, or run node directly against the module.\n\n```ts\nimport \"@a11ywatch/crawler\";\n```\n\n## Example\n\nTo crawl a web page from your own project, add `website_crawler` to your `Cargo.toml`:\n\n```toml\n[dependencies]\nwebsite_crawler = \"0.9.4\"\n```\n\nA basic [example](./examples/example.rs) can also be run as follows.\n\nIn one terminal, run the server:\n\n```sh\ncargo run --example server --release\n```\n\nIn another terminal, run the client/server:\n\n```sh\ncargo run --example client --release\n```\n\nhttps://user-images.githubusercontent.com/8095978/221221122-cfed83aa-6ca1-4122-a1db-0d9948e9f9d9.mov\n\n### Dependencies\n\nTo build `crawler` \u003e= 0.5.0 locally, you need the `protoc` Protocol Buffers compiler, along with the Protocol Buffers resource files.\n\n#### Ubuntu\n\nThe `protoc` compiler must be at v3 in order to compile; on Ubuntu 18+ the packages below install it.\n\n```bash\nsudo apt update \u0026\u0026 sudo apt upgrade -y\nsudo apt install -y protobuf-compiler libprotobuf-dev\n```\n\n#### Alpine Linux\n\n```sh\nsudo apk add protoc protobuf-dev\n```\n\n#### macOS\n\nAssuming [Homebrew](https://brew.sh/) is already installed. 
(If not, see instructions for installing Homebrew on [the Homebrew website](https://brew.sh/).)\n\n```zsh\nbrew install protobuf\n```\n\n### Features\n\n1. `jemalloc` - use the jemalloc memory allocator (disabled by default).\n1. `regex` - use the regex crate for blacklist URL validation.\n1. `ua_generator` - use the ua_generator crate to spoof a random user agent.\n1. `smart` - use smart mode to run an HTTP request first and fall back to Chrome when JS is needed.\n1. `chrome` - enable Chrome headless rendering; use the env var `CHROME_URL` to connect remotely.\n\n## About\n\nThis crawler is optimized for reduced latency and uses isolation-based concurrency; it can handle over 10,000 pages within several milliseconds. To receive the links the crawler finds, you need to add the [`website.proto`](./proto/website.proto) to your server. This is required since every request spawns a thread. Isolating the context drastically improves performance by avoiding shared resources and cross-thread communication.\n\n## Help\n\nIf you need help implementing the gRPC server to receive the pages or links when found, check out the [gRPC node example](https://github.com/A11yWatch/a11ywatch-core/blob/main/src/proto/website-server.ts) for a starting point.\n\n## LICENSE\n\nCheck the license file in the root of the project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa11ywatch%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fa11ywatch%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa11ywatch%2Fcrawler/lists"}