# Go Get Crawl
[![Go Report Card](https://goreportcard.com/badge/github.com/karust/gogetcrawl)](https://goreportcard.com/report/github.com/karust/gogetcrawl)
[![Go Reference](https://pkg.go.dev/badge/github.com/karust/gogetcrawl.svg)](https://pkg.go.dev/github.com/karust/gogetcrawl)
**gogetcrawl** is a tool and package that helps you download URLs and files from popular web archives like [Common Crawl](http://commoncrawl.org) and [Wayback Machine](https://web.archive.org/). You can use it as a command-line tool or import it as a package into your Go project.
## Installation
### Source
```
go install github.com/karust/gogetcrawl@latest
```
### Docker
```
docker build -t gogetcrawl .
docker run gogetcrawl --help
```
### Binary
Check out the latest release [here](https://github.com/karust/gogetcrawl/releases).
## Usage
### Docker
```
docker run uranusq/gogetcrawl url *.tutorialspoint.com/* --ext pdf --limit 5
```
### Docker compose
```
docker-compose up --build
```
### CLI usage
* See commands and flags:
```
gogetcrawl -h
```
#### Get URLs
* You can fetch archive data for multiple domains at once; flags are applied to each. By default, all results are printed to the terminal (use `--collapse` to get **unique** results):
```
gogetcrawl url *.example.com *.tutorialspoint.com/* --collapse
```
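* Note: depending on your shell, unquoted `*` patterns may be expanded (or rejected; zsh, for example, errors on unmatched globs) before they reach `gogetcrawl`. Quoting the pattern avoids this, e.g. `gogetcrawl url '*.example.com' --collapse`.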
* To **limit** the number of results, write them to a file, and use only Wayback as a **source**:
```
gogetcrawl url *.tutorialspoint.com/* --limit 10 --sources wb -o ./urls.txt
```
* Set **date range**:
```
gogetcrawl url *.tutorialspoint.com/* --limit 10 --from 20140131 --to 20231231
```
#### Download files
* Download 5 `PDF` files to the `./test` directory using 3 **workers**:
```
gogetcrawl download *.cia.gov/* --limit 5 -w 3 -d ./test -f "mimetype:application/pdf"
```
### Package usage
```
go get github.com/karust/gogetcrawl
```
For both Wayback and Common Crawl you can interact with the archives in **concurrent** and **non-concurrent** ways:
#### Wayback
* **Get urls**
```go
package main

import (
	"fmt"

	"github.com/karust/gogetcrawl/common"
	"github.com/karust/gogetcrawl/wayback"
)

func main() {
	// Get only 10 status:200 pages
	config := common.RequestConfig{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}

	// Set request timeout and number of retries
	wb, _ := wayback.New(15, 2)

	// Use config to obtain all CDX server responses
	results, _ := wb.GetPages(config)

	for _, r := range results {
		fmt.Println(r.Urlkey, r.Original, r.MimeType)
	}
}
```
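The snippet above discards errors with `_` for brevity. In your own code you would typically check them; a minimal sketch (add `log` to the imports):
```go
// Fail fast if the client can't be created or the CDX query fails.
wb, err := wayback.New(15, 2)
if err != nil {
	log.Fatalf("failed to create Wayback client: %v", err)
}

results, err := wb.GetPages(config)
if err != nil {
	log.Fatalf("failed to fetch CDX results: %v", err)
}
```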
* **Get files:**
```go
// Get all status:200 HTML files
config := common.RequestConfig{
	URL:     "*.tutorialspoint.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
}

wb, _ := wayback.New(15, 2)
results, _ := wb.GetPages(config)

// Get the first file from the CDX response
file, err := wb.GetFile(results[0])
if err != nil {
	panic(err)
}

fmt.Println(string(file))
```
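Since `GetFile` returns the raw bytes of the archived snapshot, persisting it is a plain file write. A minimal sketch (the `snapshot.html` filename is arbitrary; uses `os.WriteFile`):
```go
// Save the fetched snapshot to disk with rw-r--r-- permissions.
if err := os.WriteFile("snapshot.html", file, 0644); err != nil {
	panic(err)
}
```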
#### CommonCrawl
*To use Common Crawl you just need to replace the `wayback` module with `commoncrawl`. Let's use Common Crawl concurrently:*
* **Get urls**
```go
cc, _ := commoncrawl.New(30, 3)

config1 := common.RequestConfig{
	URL:     "*.tutorialspoint.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
	Limit:   6,
}

config2 := common.RequestConfig{
	URL:     "example.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
	Limit:   6,
}

resultsChan := make(chan []*common.CdxResponse)
errorsChan := make(chan error)

go cc.FetchPages(config1, resultsChan, errorsChan)
go cc.FetchPages(config2, resultsChan, errorsChan)

for {
	select {
	case err := <-errorsChan:
		fmt.Printf("FetchPages goroutine failed: %v\n", err)
	case res, ok := <-resultsChan:
		if ok {
			fmt.Println(res)
		}
	}
}
```
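Note that the `for`/`select` loop above never terminates on its own. One way to stop once all fetches are done is to close the results channel after the worker goroutines return; a sketch of that pattern, assuming `FetchPages` returns once its config is exhausted (add `sync` to the imports):
```go
var wg sync.WaitGroup

// Drain errors in the background so a failed fetch can't block on send.
go func() {
	for err := range errorsChan {
		fmt.Printf("FetchPages goroutine failed: %v\n", err)
	}
}()

for _, cfg := range []common.RequestConfig{config1, config2} {
	wg.Add(1)
	go func(c common.RequestConfig) {
		defer wg.Done()
		cc.FetchPages(c, resultsChan, errorsChan)
	}(cfg)
}

// Close the results channel once both fetches finish, ending the range loop.
go func() {
	wg.Wait()
	close(resultsChan)
}()

for res := range resultsChan {
	fmt.Println(res)
}
```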
* **Get files:**
```go
config := common.RequestConfig{
	URL:     "kamaloff.ru/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
}

cc, _ := commoncrawl.New(15, 2)
results, _ := cc.GetPages(config)

file, err := cc.GetFile(results[0])
if err != nil {
	panic(err)
}

fmt.Println(string(file))
```
## Bugs + Features
If you run into issues/bugs or have a feature request, feel free to open an issue.