{"id":15460523,"url":"https://github.com/schollz/crawdad","last_synced_at":"2025-04-22T10:35:14.899Z","repository":{"id":57492989,"uuid":"94834691","full_name":"schollz/crawdad","owner":"schollz","description":"Cross-platform persistent and distributed web crawler :crab:","archived":false,"fork":false,"pushed_at":"2019-05-10T02:22:28.000Z","size":8202,"stargazers_count":62,"open_issues_count":1,"forks_count":9,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-29T14:11:23.896Z","etag":null,"topics":["crawler","golang","redis","web"],"latest_commit_sha":null,"homepage":"https://schollz.github.io/crawdad/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schollz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-20T01:06:15.000Z","updated_at":"2024-11-13T08:33:05.000Z","dependencies_parsed_at":"2022-09-26T17:50:52.143Z","dependency_job_id":null,"html_url":"https://github.com/schollz/crawdad","commit_stats":null,"previous_names":["schollz/goredis-crawler"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Fcrawdad","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Fcrawdad/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Fcrawdad/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Fcrawdad/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schollz","download_url":"https://codeload.github.com/schollz/crawdad/tar.gz/refs/heads/master","host":{"name":"GitHu
b","url":"https://github.com","kind":"github","repositories_count":249259146,"owners_count":21239422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","golang","redis","web"],"created_at":"2024-10-01T23:22:22.065Z","updated_at":"2025-04-16T16:31:32.087Z","avatar_url":"https://github.com/schollz.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://user-images.githubusercontent.com/6550035/31456157-58663efe-ae76-11e7-8e53-6a2a5b7a196c.png\" width=\"450\" border=\"0\" alt=\"crawdad\" style=\"float:right;\" align=\"right\"\u003e\r\n\u003ch1\u003ecrawdad\u003c/h1\u003e\r\n\u003cp\u003e\r\n\u003ca href=\"https://github.com/schollz/crawdad/releases/latest\"\u003e\u003cimg src=\"https://img.shields.io/badge/version-3.0.0-brightgreen.svg?style=flat-square\" alt=\"Version\"\u003e\u003c/a\u003e\u0026nbsp;\u003cimg src=\"https://img.shields.io/badge/coverage-59%25-yellow.svg?style=flat-square\" alt=\"Code Coverage\"\u003e\r\n\u003cbr\u003e\r\n\u003cem\u003ecrawdad\u003c/em\u003e is a cross-platform web crawler that can also pinch data. \u003cem\u003ecrawdad\u003c/em\u003e is persistent, distributed, and fast. It uses a queue stored in a remote Redis database to persist across interruptions and to synchronize distributed instances. Data extraction is specified with the simple and powerful \u003ca href=\"https://github.com/schollz/pluck\"\u003e\u003cem\u003epluck\u003c/em\u003e\u003c/a\u003e syntax. 
\r\n\u003c/p\u003e\r\n\r\nCrawl responsibly.\r\n\r\nFor a tutorial on how to use *crawdad* see [my blog post](https://schollz.github.io/crawdad/).\r\n\r\n# Features\r\n\r\n- Written in Go\r\n- [Cross-platform releases](https://github.com/schollz/crawdad/releases/latest)\r\n- Persistent (interrupted crawls can be resumed)\r\n- Distributed (multiple crawdads can be run on different machines)\r\n- Scraping using [*pluck*](https://github.com/schollz/pluck)\r\n- Uses connection pools for lower latency\r\n- Uses threads for maximum parallelism\r\n\r\n# Install\r\n\r\nFirst [get Docker CE](https://www.docker.com/community-edition). This will make installing Redis a snap.\r\n\r\nThen, if you have Go installed, run:\r\n\r\n```\r\n$ go get github.com/schollz/crawdad\r\n```\r\n\r\nOtherwise, use the releases and [download crawdad](https://github.com/schollz/crawdad/releases/latest).\r\n\r\n# Run\r\n\r\nFirst run Redis:\r\n\r\n```sh\r\n$ docker run -d -v `pwd`:/data -p 6379:6379 redis \r\n```\r\n\r\nwhich will store the database in the current directory.\r\n\r\n## Crawling \r\n\r\nBy \"crawling\" the *crawdad* will follow every link that corresponds to the base URL. This is useful for generating sitemaps.\r\n\r\nStart up *crawdad* with the base URL:\r\n\r\n```sh\r\n$ crawdad -set -url https://rpiai.com\r\n```\r\n\r\nThis command will set the base URL to crawl as `https://rpiai.com`. You can run *crawdad* on a different machine without setting these parameters again. E.g., on computer 2 you can run:\r\n\r\n```sh\r\n$ crawdad -server X.X.X.X\r\n```\r\n\r\nwhere `X.X.X.X` is the IP address of the Redis server. This crawdad will now run with whatever parameters were set from the first one. If you need to re-set parameters, just use `-set` to specify them again.\r\n\r\nEach machine running *crawdad* will help to crawl the respective website and add collected links to a universal queue in the server. The current state of the crawler is saved. 
If the crawler is interrupted, you can simply run the command again and it will resume from the last saved state.\r\n\r\nWhen done, you can dump all the links:\r\n\r\n```sh\r\n$ crawdad -dump dump.txt\r\n```\r\n\r\nwhich will connect to Redis and dump all the links, whether to-do, doing, done, or trashed.\r\n\r\n## Pinching\r\n\r\nBy \"pinching\" the *crawdad* will follow the specified links and extract data from each URL that can be dumped later.\r\n\r\nYou will need to make a [*pluck* TOML configuration file](https://github.com/schollz/pluck). For instance, I would like to scrape from my site, rpiai.com, the meta description and the title. My configuration, `pluck.toml`, looks like:\r\n\r\n```toml\r\n[[pluck]]\r\nname = \"description\"\r\nactivators = [\"meta\",\"name\",\"description\",'content=\"']\r\ndeactivator = '\"'\r\nlimit = 1\r\n\r\n[[pluck]]\r\nname = \"title\"\r\nactivators = [\"\u003ctitle\u003e\"]\r\ndeactivator = \"\u003c/title\u003e\"\r\nlimit = 1\r\n```\r\n\r\n\r\nNow I can crawl the site the same way as before, but load in this *pluck* configuration with `-pluck` so it captures the content:\r\n\r\n```sh\r\n$ crawdad -set -url \"https://rpiai.com\" -pluck pluck.toml\r\n```\r\n\r\nTo retrieve the data, use the `-done` flag to collect a JSON map of all the plucked data.\r\n\r\n```sh\r\n$ crawdad -done data.json\r\n```\r\n\r\nThis JSON file maps each URL to a JSON string of the plucked data, which contains keys for the description and the title.\r\n\r\n```sh\r\n$ cat data.json | grep why\r\n\"https://rpiai.com/why-i-made-a-book-recommendation-service/index.html\": \"{\\\"description\\\":\\\"Why I made a book recommendation service from scratch: basically I found that all other book suggestions lacked so I made something that actually worked.\\\",\\\"title\\\":\\\"What book is similar to Weaveworld by Clive Barker?\\\"}\"\r\n```\r\n\r\n# Advanced usage\r\n\r\nThere are lots of other options:\r\n\r\n```\r\n   --server value, -s 
value       address for Redis server (default: \"localhost\")\r\n   --port value, -p value         port for Redis server (default: \"6379\")\r\n   --url value, -u value          set base URL to crawl\r\n   --exclude value, -e value      set comma-delimted phrases that must NOT be in URL\r\n   --include value, -i value      set comma-delimted phrases that must be in URL\r\n   --seed file                    file with URLs to add to queue\r\n   --pluck value                  set config file for a plucker (see github.com/schollz/pluck)\r\n   --stats X                      Print stats every X seconds (default: 1)\r\n   --connections value, -c value  number of connections to use (default: 25)\r\n   --workers value, -w value      number of connections to use (default: 8)\r\n   --verbose                      turn on logging\r\n   --proxy                        use tor proxy\r\n   --set                          set options across crawdads\r\n   --dump file                    dump all the keys to file\r\n   --done file                    dump the map of the done things file\r\n   --useragent useragent          set the specified useragent\r\n   --redo                         move items from 'doing' to 'todo'\r\n   --query                        allow query parameters in URL\r\n   --hash                         allow hashes in URL\r\n   --no-follow                    do not follow links (useful with -seed)\r\n   --errors value                 maximum number of errors before exiting (default: 10)\r\n   --help, -h                     show help\r\n   --version, -v                  print the version\r\n\r\n```\r\n\r\n# Dev\r\n\r\nTo run tests\r\n\r\n```\r\n$ docker run -d -v `pwd`:/data -p 6377:6379 redis\r\n$ cd src \u0026\u0026 go test -v -cover\r\n```\r\n\r\n# 
License\r\n\r\nMIT\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschollz%2Fcrawdad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschollz%2Fcrawdad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschollz%2Fcrawdad/lists"}