Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/karust/gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
- Host: GitHub
- URL: https://github.com/karust/gogetcrawl
- Owner: karust
- License: mit
- Created: 2019-06-14T19:02:05.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-06-04T22:46:04.000Z (over 1 year ago)
- Last Synced: 2024-08-02T15:47:05.943Z (5 months ago)
- Topics: commoncrawl, concurrency, crawler, golang, wayback-machine, webarchive
- Language: Go
- Homepage:
- Size: 53.7 KB
- Stars: 140
- Watchers: 5
- Forks: 17
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Go Get Crawl
[![Go Report Card](https://goreportcard.com/badge/github.com/karust/goGetCrawl)](https://goreportcard.com/report/github.com/karust/gogetcrawl)
[![Go Reference](https://pkg.go.dev/badge/github.com/karust/gogetcrawl.svg)](https://pkg.go.dev/github.com/karust/gogetcrawl)

**gogetcrawl** is a tool and package for downloading URLs and files from popular web archives such as [Common Crawl](http://commoncrawl.org) and the [Wayback Machine](https://web.archive.org/). You can use it as a command-line tool or import the package into your Go project.
## Installation
### Source
```
go install github.com/karust/gogetcrawl@latest
```

### Docker
```
docker build -t gogetcrawl .
docker run gogetcrawl --help
```

### Binary
Check out the latest release [here](https://github.com/karust/gogetcrawl/releases).

## Usage
### Docker
```
docker run uranusq/gogetcrawl url *.tutorialspoint.com/* --ext pdf --limit 5
```
### Docker compose
```
docker-compose up --build
```
### CLI usage
* See commands and flags:
```
gogetcrawl -h
```

#### Get URLs
* You can query archive data for multiple domains at once; the flags are applied to each of them. By default, all results are printed to your terminal (use `--collapse` to get **unique** results):
```
gogetcrawl url *.example.com *.tutorialspoint.com/* --collapse
```

* To **limit** the number of results, write the output to a file, and use only Wayback as the **source**:
```
gogetcrawl url *.tutorialspoint.com/* --limit 10 --sources wb -o ./urls.txt
```

* Set **date range**:
```
gogetcrawl url *.tutorialspoint.com/* --limit 10 --from 20140131 --to 20231231
```
#### Download files
* Download 5 `PDF` files to `./test` directory with 3 **workers**:
```
gogetcrawl download *.cia.gov/* --limit 5 -w 3 -d ./test -f "mimetype:application/pdf"
```

### Package usage
```
go get github.com/karust/gogetcrawl
```
For both Wayback and Common Crawl you can interact with the archives in `concurrent` and `non-concurrent` ways:
#### Wayback
* **Get urls**
```go
package main

import (
    "fmt"

    "github.com/karust/gogetcrawl/common"
    "github.com/karust/gogetcrawl/wayback"
)

func main() {
    // Get only 10 status:200 pages
    config := common.RequestConfig{
        URL:     "*.example.com/*",
        Filters: []string{"statuscode:200"},
        Limit:   10,
    }

    // Set request timeout and retries
    wb, _ := wayback.New(15, 2)

    // Use config to obtain all CDX server responses
    results, _ := wb.GetPages(config)

    for _, r := range results {
        fmt.Println(r.Urlkey, r.Original, r.MimeType)
    }
}
```

* **Get files:**
```go
// Get all status:200 HTML files
config := common.RequestConfig{
    URL:     "*.tutorialspoint.com/*",
    Filters: []string{"statuscode:200", "mimetype:text/html"},
}
wb, _ := wayback.New(15, 2)
results, _ := wb.GetPages(config)

// Get the first file from the CDX response
file, err := wb.GetFile(results[0])
if err == nil {
    fmt.Println(string(file))
}
```
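If you want to keep a capture instead of printing it, the bytes returned by `GetFile` can be written straight to disk. A minimal, self-contained sketch, assuming `GetFile` returns the raw body as a byte slice (as the `string(file)` conversion above suggests); the output filename, `Limit: 1`, and the error handling are illustrative additions:

```go
package main

import (
    "fmt"
    "log"
    "os"

    "github.com/karust/gogetcrawl/common"
    "github.com/karust/gogetcrawl/wayback"
)

func main() {
    config := common.RequestConfig{
        URL:     "*.tutorialspoint.com/*",
        Filters: []string{"statuscode:200", "mimetype:text/html"},
        Limit:   1,
    }

    wb, _ := wayback.New(15, 2)

    results, err := wb.GetPages(config)
    if err != nil {
        log.Fatal(err)
    }
    if len(results) == 0 {
        log.Fatal("no captures found")
    }

    // Fetch the first capture and write it to disk.
    file, err := wb.GetFile(results[0])
    if err != nil {
        log.Fatal(err)
    }
    if err := os.WriteFile("page.html", file, 0o644); err != nil {
        log.Fatal(err)
    }

    fmt.Println("saved", results[0].Original, "to page.html")
}
```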
#### CommonCrawl
To use Common Crawl, you just need to replace the `wayback` module with `commoncrawl`. Let's use Common Crawl concurrently:

* **Get urls**
```go
cc, _ := commoncrawl.New(30, 3)

config1 := common.RequestConfig{
    URL:     "*.tutorialspoint.com/*",
    Filters: []string{"statuscode:200", "mimetype:text/html"},
    Limit:   6,
}

config2 := common.RequestConfig{
    URL:     "example.com/*",
    Filters: []string{"statuscode:200", "mimetype:text/html"},
    Limit:   6,
}

resultsChan := make(chan []*common.CdxResponse)
errorsChan := make(chan error)

go func() {
    cc.FetchPages(config1, resultsChan, errorsChan)
}()

go func() {
    cc.FetchPages(config2, resultsChan, errorsChan)
}()

for {
    select {
    case err := <-errorsChan:
        fmt.Printf("FetchPages goroutine failed: %v", err)
    case res, ok := <-resultsChan:
        if ok {
            fmt.Println(res)
        }
    }
}
```
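Note that the loop above never terminates on its own. One simple way to bound it, using only standard Go (nothing from the package API), is a timeout case in the `select`; the 30-second cutoff below is an arbitrary illustration:

```go
// Requires "time" in the import list.
timeout := time.After(30 * time.Second)

loop:
for {
    select {
    case err := <-errorsChan:
        fmt.Printf("FetchPages goroutine failed: %v", err)
    case res, ok := <-resultsChan:
        if ok {
            fmt.Println(res)
        }
    case <-timeout:
        break loop
    }
}
```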
* **Get files:**
```go
config := common.RequestConfig{
    URL:     "kamaloff.ru/*",
    Filters: []string{"statuscode:200", "mimetype:text/html"},
}
cc, _ := commoncrawl.New(15, 2)
results, _ := cc.GetPages(config)

file, err := cc.GetFile(results[0])
if err == nil {
    fmt.Println(string(file))
}
```
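One more note: the package-usage intro says Wayback can also be used concurrently, but only Common Crawl is shown with the channel-based pattern above. Since the two modules are described as interchangeable, the same pattern should carry over; the sketch below assumes `wayback` exposes a `FetchPages(config, resultsChan, errorsChan)` method matching the `commoncrawl` one:

```go
config := common.RequestConfig{
    URL:     "*.example.com/*",
    Filters: []string{"statuscode:200"},
    Limit:   10,
}

wb, _ := wayback.New(15, 2)
resultsChan := make(chan []*common.CdxResponse)
errorsChan := make(chan error)

// Assumed to mirror the commoncrawl FetchPages call shown above.
go func() {
    wb.FetchPages(config, resultsChan, errorsChan)
}()

for {
    select {
    case err := <-errorsChan:
        fmt.Printf("FetchPages goroutine failed: %v", err)
    case res, ok := <-resultsChan:
        if ok {
            fmt.Println(res)
        }
    }
}
```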
## Bugs + Features
If you have any issues, bugs, or feature requests, feel free to open an issue.