Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/joaooliveirapro/trawlergo
Basic HTTP Crawler in Golang
- Host: GitHub
- URL: https://github.com/joaooliveirapro/trawlergo
- Owner: joaooliveirapro
- License: mit
- Created: 2024-12-27T15:39:50.000Z (20 days ago)
- Default Branch: main
- Last Pushed: 2024-12-27T16:16:26.000Z (20 days ago)
- Last Synced: 2024-12-27T16:23:09.215Z (20 days ago)
- Topics: crawler, go, golang, http
- Language: Go
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Trawlergo 🐛
Basic HTTP crawler in Golang. Use it to discover all the URLs for a given domain, along with related information from each HTTP request.

## Features
- Regex match to include/exclude paths
- Concurrency safe
- HTTP request information includes:
  - Response status code
  - Added count (how many new links were found on the page)

## Install
```sh
$ go get github.com/joaooliveirapro/trawlergo # install
$ go mod tidy # clean up dependencies
```

## How to use
```go
package main

import "github.com/joaooliveirapro/trawlergo"

func main() {
	tg := trawlergo.App{
		Workers:      2,                                   // Number of goroutines
		MaxDepth:     1000,                                // Max HTTP requests (safe stop)
		Domain:       "www.mysite.com",                    // To standardize relative URLs. Don't include the protocol
		StartingURLs: []string{"https://www.mysite.com/"}, // Starting URLs
		ExcludeRegex: []string{"/no-go", `[\d]`},          // Don't include these paths (raw string avoids an invalid escape)
		IncludeRegex: []string{"/some-path-001"},          // Include these paths
	}
	tg.Run()
	tg.SaveToJSON("data.json")
}
```
`App` must have as many `StartingURLs` as `Workers`; otherwise some workers may exit prematurely.
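To make the `IncludeRegex` and `ExcludeRegex` options more concrete, here is a minimal sketch of how include/exclude path filtering could work. It is not trawlergo's actual implementation: the `pathFilter` type and its functions are hypothetical, and the precedence rule (exclude wins, an empty include list allows everything) is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathFilter is a hypothetical helper, not part of trawlergo's API.
type pathFilter struct {
	include []*regexp.Regexp
	exclude []*regexp.Regexp
}

// compileAll compiles each pattern once so it can be reused for every URL.
func compileAll(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// newPathFilter builds a filter from include and exclude pattern lists.
func newPathFilter(include, exclude []string) (*pathFilter, error) {
	inc, err := compileAll(include)
	if err != nil {
		return nil, err
	}
	exc, err := compileAll(exclude)
	if err != nil {
		return nil, err
	}
	return &pathFilter{include: inc, exclude: exc}, nil
}

// allowed reports whether path should be crawled.
// Assumption: exclude patterns take precedence over include patterns,
// and an empty include list allows every non-excluded path.
func (f *pathFilter) allowed(path string) bool {
	for _, re := range f.exclude {
		if re.MatchString(path) {
			return false
		}
	}
	if len(f.include) == 0 {
		return true
	}
	for _, re := range f.include {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	f, err := newPathFilter([]string{"/some-path"}, []string{"/no-go"})
	if err != nil {
		panic(err)
	}
	for _, p := range []string{"/some-path-001", "/no-go/page", "/other"} {
		fmt.Printf("%-16s allowed=%v\n", p, f.allowed(p))
	}
}
```

Compiling the patterns up front keeps the per-URL check cheap, and a compiled `*regexp.Regexp` is safe for concurrent use, so several workers could share one filter.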
```json
// data.json
[
{
"addedCount": 3,
"statusCode": 200,
"url": "https://crawler-test.com/mobile/separate_desktop_with_different_h1"
},
{
"addedCount": 0,
"statusCode": 200,
"url": "https://crawler-test.com/mobile/separate_desktop_with_different_links_in"
},
...
]
```
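The saved results can be read back with the standard library. The sketch below assumes the file is a plain JSON array with the three fields shown in the sample (the `// data.json` comment and the `...` in the sample are illustrative and would not parse). The `pageResult` type and `parseResults` function are hypothetical names, and the JSON is inlined so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageResult mirrors the fields seen in the sample output.
// The type name is hypothetical; it is not taken from trawlergo.
type pageResult struct {
	AddedCount int    `json:"addedCount"`
	StatusCode int    `json:"statusCode"`
	URL        string `json:"url"`
}

// parseResults decodes a JSON array of crawl results.
func parseResults(data []byte) ([]pageResult, error) {
	var results []pageResult
	if err := json.Unmarshal(data, &results); err != nil {
		return nil, fmt.Errorf("decode results: %w", err)
	}
	return results, nil
}

func main() {
	// In practice the bytes would come from os.ReadFile("data.json").
	sample := []byte(`[
		{"addedCount": 3, "statusCode": 200, "url": "https://crawler-test.com/mobile/separate_desktop_with_different_h1"},
		{"addedCount": 0, "statusCode": 200, "url": "https://crawler-test.com/mobile/separate_desktop_with_different_links_in"}
	]`)

	results, err := parseResults(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Printf("%d %s (new links: %d)\n", r.StatusCode, r.URL, r.AddedCount)
	}
}
```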
### License
The MIT License (MIT)