Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mindfiredigital/deepscanbot
It allows you to crawl websites with various configurations, including crawl depth, timeout settings, proxy support, and output options.
- Host: GitHub
- URL: https://github.com/mindfiredigital/deepscanbot
- Owner: mindfiredigital
- Created: 2024-06-19T09:16:57.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-30T06:22:00.000Z (2 months ago)
- Last Synced: 2024-11-07T08:20:17.493Z (about 11 hours ago)
- Topics: bot, crawl, crawler, go, golang, google, webcrawler
- Language: Go
- Homepage:
- Size: 18.6 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# DeepScanBot: Web Crawler
DeepScanBot is a customizable web crawler written in Go.
## Overview
DeepScanBot allows you to crawl websites with various configurations, including crawl depth, timeout settings, proxy support, and output options.
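
As a rough illustration of the idea, here is a minimal, self-contained sketch (not code from this repository) of a depth-limited crawl loop that visits each URL at most once; `fetchLinks` is a hypothetical stand-in for the real fetch-and-parse step:

```go
package main

import "fmt"

// crawl performs a breadth-first traversal up to maxDepth, visiting each
// URL at most once via the visited set.
func crawl(start string, maxDepth int, fetchLinks func(string) []string) {
	type item struct {
		url   string
		depth int
	}
	visited := map[string]bool{start: true}
	queue := []item{{start, 0}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		fmt.Println(cur.url)
		if cur.depth >= maxDepth {
			continue // depth limit reached; do not expand further
		}
		for _, link := range fetchLinks(cur.url) {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, item{link, cur.depth + 1})
			}
		}
	}
}

func main() {
	// Toy link graph standing in for real HTTP fetches.
	links := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}
	crawl("https://example.com", 2, func(u string) []string { return links[u] })
}
```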
## Features
- **Customizable Crawl Depth**: Set the maximum depth to crawl web pages.
- **Timeout Management**: Set a timeout for each HTTP request.
- **Proxy Support**: Specify a proxy server for the HTTP requests.
- **Output Options**: Choose between plain text or JSON output.
- **Page Size Limit**: Skip pages exceeding a certain size.
- **Disable Redirects**: Option to disable HTTP redirects.
- **TLS Verification**: Option to disable TLS verification for HTTPS requests.
- **Unique URL Tracking**: Ensures URLs are crawled only once if enabled.
- **Show URL Source**: Display where each URL was found (e.g., which HTML tag it appeared in).
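
Several of the options above (timeout, proxy, redirects, TLS verification) map naturally onto Go's standard `net/http` client. The following is a minimal, hypothetical sketch of that mapping; the function name and parameters are illustrative, not taken from this repository:

```go
package main

import (
	"crypto/tls"
	"net/http"
	"net/url"
	"time"
)

// buildClient assembles an HTTP client from options like those listed above.
func buildClient(timeout time.Duration, proxyURL string, disableRedirects, insecure bool) (*http.Client, error) {
	transport := &http.Transport{
		// -insecure: skip certificate verification on HTTPS requests.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: insecure},
	}
	if proxyURL != "" {
		p, err := url.Parse(proxyURL)
		if err != nil {
			return nil, err
		}
		// -proxy: route requests through the given proxy server.
		transport.Proxy = http.ProxyURL(p)
	}
	client := &http.Client{
		Timeout:   timeout, // -timeout: per-request deadline
		Transport: transport,
	}
	if disableRedirects {
		// -dr: return the redirect response instead of following it.
		client.CheckRedirect = func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		}
	}
	return client, nil
}

func main() {
	client, err := buildClient(2*time.Second, "", true, false)
	if err != nil {
		panic(err)
	}
	_ = client
}
```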
## Usage
To run the web crawler, use the following commands:
### Install Dependencies
```bash
go mod download
```

### Run the Crawler

```bash
go run main.go -url <starting_url> [options]
```

### Build the Crawler

```bash
go build
```

### Flags

- `-url <string>`: Required. The starting URL for the crawler.
- `-depth <int>`: Maximum depth to crawl. Default: 2.
- `-timeout <int>`: Timeout for each HTTP request in seconds. Default: 2.
- `-proxy <string>`: Proxy URL for HTTP requests. Example: `http://127.0.0.1:8080`.
- `-json`: Output results in JSON format. Default: false.
- `-size <int>`: Limit page size in KB. Default: -1 (no limit).
- `-dr`: Disable following HTTP redirects. Default: false.
- `-s`: Show the source of the URL based on where it was found. Default: false.
- `-insecure`: Disable TLS verification. Default: false.
- `-u`: Ensure unique URLs are crawled. Default: false.
- `-h`: Show help message.

### Example

To start crawling from https://example.com with a maximum depth of 3, run:

```bash
go run main.go -url https://example.com -depth 3
```
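
For orientation, the flags above could be declared with Go's standard `flag` package roughly as follows. This is a hypothetical reconstruction using the names and defaults from the list; the repository's actual `main.go` may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names and defaults mirror the Flags section above.
	startURL := flag.String("url", "", "Required. The starting URL for the crawler.")
	depth := flag.Int("depth", 2, "Maximum depth to crawl.")
	timeout := flag.Int("timeout", 2, "Timeout for each HTTP request in seconds.")
	proxy := flag.String("proxy", "", "Proxy URL for HTTP requests.")
	asJSON := flag.Bool("json", false, "Output results in JSON format.")
	sizeKB := flag.Int("size", -1, "Limit page size in KB (-1 means no limit).")
	disableRedirects := flag.Bool("dr", false, "Disable following HTTP redirects.")
	showSource := flag.Bool("s", false, "Show the source of each URL.")
	insecure := flag.Bool("insecure", false, "Disable TLS verification.")
	unique := flag.Bool("u", false, "Ensure unique URLs are crawled.")
	flag.Parse()

	// -url is required; print usage and exit if it is missing.
	if *startURL == "" {
		flag.Usage()
		return
	}
	fmt.Println(*startURL, *depth, *timeout, *proxy, *asJSON, *sizeKB,
		*disableRedirects, *showSource, *insecure, *unique)
}
```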