{"id":15460467,"url":"https://github.com/schollz/linkcrawler","last_synced_at":"2025-04-22T10:37:17.543Z","repository":{"id":144202309,"uuid":"84350589","full_name":"schollz/linkcrawler","owner":"schollz","description":"Cross-platform persistent and distributed web crawler :link:","archived":false,"fork":false,"pushed_at":"2017-09-09T14:56:30.000Z","size":61,"stargazers_count":111,"open_issues_count":0,"forks_count":9,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-11T19:06:42.382Z","etag":null,"topics":["crawler","hyperlinks","web"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schollz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-08T17:57:34.000Z","updated_at":"2024-01-04T16:12:08.000Z","dependencies_parsed_at":"2023-06-18T12:03:39.828Z","dependency_job_id":null,"html_url":"https://github.com/schollz/linkcrawler","commit_stats":{"total_commits":69,"total_committers":1,"mean_commits":69.0,"dds":0.0,"last_synced_commit":"f20aff4b8896c5b5c973de61609a7f877347da6a"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Flinkcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Flinkcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Flinkcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schollz%2Flinkcrawler/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schollz","download_url":"https://codeload.github.com/schollz/linkcrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249259146,"owners_count":21239422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","hyperlinks","web"],"created_at":"2024-10-01T23:22:00.251Z","updated_at":"2025-04-16T16:31:30.244Z","avatar_url":"https://github.com/schollz.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\r\n\u003cp align=\"center\"\u003e\r\n\u003cimg\r\n    src=\"logo.png\"\r\n    width=\"260\" height=\"80\" border=\"0\" alt=\"linkcrawler\"\u003e\r\n\u003cbr\u003e\r\n\u003ca href=\"https://travis-ci.org/schollz/linkcrawler\"\u003e\u003cimg src=\"https://img.shields.io/travis/schollz/linkcrawler.svg?style=flat-square\" alt=\"Build Status\"\u003e\u003c/a\u003e\r\n\u003ca href=\"http://gocover.io/github.com/schollz/linkcrawler/lib\"\u003e\u003cimg src=\"https://img.shields.io/badge/coverage-76%25-yellow.svg?style=flat-square\" alt=\"Code Coverage\"\u003e\u003c/a\u003e\r\n\u003ca href=\"https://godoc.org/github.com/schollz/linkcrawler/lib\"\u003e\u003cimg src=\"https://img.shields.io/badge/api-reference-blue.svg?style=flat-square\" alt=\"GoDoc\"\u003e\u003c/a\u003e\r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003eCross-platform persistent and distributed web crawler\u003c/a\u003e\u003c/p\u003e\r\n\r\n*linkcrawler* is persistent because the queue is stored in a remote database that is automatically 
re-initialized if interrupted. *linkcrawler* is distributed because multiple instances of *linkcrawler* work on the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. *linkcrawler* is also fast because it is threaded and uses connection pools.\r\n\r\nCrawl responsibly.\r\n\r\n# This repo has been superseded by [schollz/goredis-crawler](https://github.com/schollz/goredis-crawler)\r\n\r\nGetting Started\r\n===============\r\n\r\n## Install\r\n\r\nIf you have Go installed, just do\r\n```\r\ngo get github.com/schollz/linkcrawler/...\r\ngo get github.com/schollz/boltdb-server/...\r\n```\r\n\r\nOtherwise, [download linkcrawler](https://github.com/schollz/linkcrawler/releases/latest) and then [download the boltdb-server](https://github.com/schollz/boltdb-server/releases/latest) from the releases pages.\r\n\r\n\r\n## Run\r\n\r\n### Crawl a site\r\n\r\nFirst run the database server, which will create a LAN hub:\r\n\r\n```sh\r\n$ ./boltdb-server\r\nboltdb-server running on http://X.Y.Z.W:8050\r\n```\r\n\r\nThen, to capture all the links on a website:\r\n\r\n```sh\r\n$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com\r\n```\r\n\r\n\r\nMake sure to replace `http://X.Y.Z.W:8050` with the address printed by the boltdb-server.\r\n\r\nYou can run this last command on as many machines as you want; each instance helps crawl the website and adds the links it collects to a shared queue on the server.\r\n\r\nThe current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will resume from the last saved state.\r\n\r\nSee the help (`-help`) if you'd like to see more options, such as exclusions/inclusions and modifying the worker pool and connection pools.\r\n\r\n\r\n### Download a site\r\n\r\nYou can also use *linkcrawler* to download webpages from a newline-delimited list of websites. As before, first start a boltdb-server.  
Then you can run:\r\n\r\n```bash\r\n$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt\r\n```\r\n\r\nDownloads are saved into a folder named `downloaded`. Each file is named with the Base32-encoded URL of its link and is compressed using gzip.\r\n\r\n### Dump the current list of links\r\n\r\nTo dump the current database, just use\r\n\r\n```bash\r\n$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com\r\nWrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt\r\n```\r\n\r\n## License\r\n\r\nMIT\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschollz%2Flinkcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschollz%2Flinkcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschollz%2Flinkcrawler/lists"}