{"id":13581165,"url":"https://github.com/karust/gogetcrawl","last_synced_at":"2026-01-15T01:43:42.116Z","repository":{"id":111646667,"uuid":"191992923","full_name":"karust/gogetcrawl","owner":"karust","description":"Extract web archive data using Wayback Machine and Common Crawl","archived":false,"fork":false,"pushed_at":"2024-11-04T08:25:29.000Z","size":60,"stargazers_count":167,"open_issues_count":0,"forks_count":18,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-01-12T01:35:26.048Z","etag":null,"topics":["commoncrawl","concurrency","crawler","golang","wayback-machine","webarchive"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/karust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-14T19:02:05.000Z","updated_at":"2026-01-09T05:56:53.000Z","dependencies_parsed_at":"2025-04-06T06:32:52.242Z","dependency_job_id":"34a18276-4640-40c6-af96-604ccd5ec77d","html_url":"https://github.com/karust/gogetcrawl","commit_stats":null,"previous_names":["karust/gocommoncrawl"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/karust/gogetcrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karust%2Fgogetcrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karust%2Fgogetcrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karust%2Fgogetcrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karust%2Fgogetcrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/karust","download_url":"https://codeload.github.com/karust/gogetcrawl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karust%2Fgogetcrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28441031,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-15T00:55:22.719Z","status":"ssl_error","status_checked_at":"2026-01-15T00:55:20.945Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["commoncrawl","concurrency","crawler","golang","wayback-machine","webarchive"],"created_at":"2024-08-01T15:01:58.745Z","updated_at":"2026-01-15T01:43:42.093Z","avatar_url":"https://github.com/karust.png","language":"Go","readme":"# Go Get Crawl\n[![Go Report Card](https://goreportcard.com/badge/github.com/karust/goGetCrawl)](https://goreportcard.com/report/github.com/karust/gogetcrawl)\n[![Go Reference](https://pkg.go.dev/badge/github.com/karust/gogetcrawl.svg)](https://pkg.go.dev/github.com/karust/gogetcrawl)\n\n**gogetcrawl** is a tool and package that helps you download URLs and Files from popular Web Archives like [Common Crawl](http://commoncrawl.org) and [Wayback Machine](https://web.archive.org/). You can use it as a command line tool or import the solution into your Go project. \n\n## Installation\n### Source\n```\ngo install github.com/karust/gogetcrawl@latest\n```\n\n### Docker\n```\ndocker build -t gogetcrawl .\ndocker run gogetcrawl --help\n```\n\n### Binary\nCheck out the latest release [here](https://github.com/karust/gogetcrawl/releases).\n\n## Usage\n### Docker\n```\ndocker run uranusq/gogetcrawl url *.tutorialspoint.com/* --ext pdf --limit 5\n```\n### Docker compose\n```\ndocker-compose up --build\n```\n### CLI usage\n* See commands and flags:\n```\ngogetcrawl -h\n```\n\n#### Get URLs\n\n* You can get multiple-domain archive data, flags will be applied to each. By default, you will get all results displayed in your terminal (use `--collapse` to get **unique** results):\n```\ngogetcrawl url *.example.com *.tutorialspoint.com/* --collapse\n```\n\n* To **limit** the number of results, enable output to a file and select only Wayback as a **source** you can:\n```\ngogetcrawl url *.tutorialspoint.com/* --limit 10 --sources wb -o ./urls.txt\n```\n\n* Set **date range**:\n```\ngogetcrawl url *.tutorialspoint.com/* --limit 10 --from 20140131 --to 20231231\n```\n#### Download files\n* Download 5 `PDF` files to `./test` directory with 3 **workers**:\n```\ngogetcrawl download *.cia.gov/* --limit 5 -w 3 -d ./test -f \"mimetype:application/pdf\"\n```\n\n### Package usage\n```\ngo get github.com/karust/gogetcrawl\n```\nFor both Wayback and Common crawl you can use `concurrent` and `non-concurrent` ways to interract with archives: \n#### Wayback\n* **Get urls**\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/karust/gogetcrawl/common\"\n\t\"github.com/karust/gogetcrawl/wayback\"\n)\n\nfunc main() {\n\t// Get only 10 status:200 pages\n\tconfig := common.RequestConfig{\n\t\tURL:     \"*.example.com/*\",\n\t\tFilters: []string{\"statuscode:200\"},\n\t\tLimit:   10,\n\t}\n\n\t// Set request timout and retries\n\twb, _ := wayback.New(15, 2)\n\n\t// Use config to obtain all CDX server responses\n\tresults, _ := wb.GetPages(config)\n\n\tfor _, r := range results {\n\t\tfmt.Println(r.Urlkey, r.Original, r.MimeType)\n\t}\n}\n```\n\n* **Get files:**\n```go\n// Get all status:200 HTML files \nconfig := common.RequestConfig{\n\tURL:     \"*.tutorialspoint.com/*\",\n\tFilters: []string{\"statuscode:200\", \"mimetype:text/html\"},\n}\n\nwb, _ := wayback.New(15, 2)\nresults, _ := wb.GetPages(config)\n\n// Get first file from CDX response\nfile, err := wb.GetFile(results[0])\n\nfmt.Println(string(file))\n```\n\n#### CommonCrawl\n*To use CommonCrawl you just need to replace `wayback` module with `commoncrawl`. Let's use Common Crawl concurretly*\n\n* **Get urls**\n```go\ncc, _ := commoncrawl.New(30, 3)\n\nconfig1 := common.RequestConfig{\n\tURL:        \"*.tutorialspoint.com/*\",\n\tFilters:    []string{\"statuscode:200\", \"mimetype:text/html\"},\n\tLimit:      6,\n}\n\nconfig2 := common.RequestConfig{\n\tURL:        \"example.com/*\",\n\tFilters:    []string{\"statuscode:200\", \"mimetype:text/html\"},\n\tLimit:      6,\n}\n\nresultsChan := make(chan []*common.CdxResponse)\nerrorsChan := make(chan error)\n\ngo func() {\n\tcc.FetchPages(config1, resultsChan, errorsChan)\n}()\n\ngo func() {\n\tcc.FetchPages(config2, resultsChan, errorsChan)\n}()\n\nfor {\n\tselect {\n\tcase err := \u003c-errorsChan:\n\t\tfmt.Printf(\"FetchPages goroutine failed: %v\", err)\n\tcase res, ok := \u003c-resultsChan:\n\t\tif ok {\n\t\t\tfmt.Println(res)\n\t\t}\n\t}\n}\n```\n\n* **Get files:**\n```go\nconfig := common.RequestConfig{\n\tURL:     \"kamaloff.ru/*\",\n\tFilters: []string{\"statuscode:200\", \"mimetype:text/html\"},\n}\n\ncc, _ := commoncrawl.New(15, 2)\nresults, _ := wb.GetPages(config)\nfile, err := cc.GetFile(results[0])\n```\n\n## Bugs + Features\nIf you have some issues/bugs or feature request, feel free to open an issue.","funding_links":[],"categories":["Tools \u0026 Software","Go","[](#table-of-contents) Table of contents"],"sub_categories":["Utilities","[](#searchers-scrapers-extractors-parsers)Searchers, scrapers, extractors, parsers"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarust%2Fgogetcrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkarust%2Fgogetcrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarust%2Fgogetcrawl/lists"}