Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/twiny/spidy
Domain names collector - Crawl websites and collect domain names along with their availability status.
- Host: GitHub
- URL: https://github.com/twiny/spidy
- Owner: twiny
- License: mit
- Created: 2020-07-21T21:42:21.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-07-23T23:48:50.000Z (over 1 year ago)
- Last Synced: 2024-05-22T18:32:29.423Z (8 months ago)
- Topics: backlinks, crawler, domain, expired-domain, golang, scraper, seotools, spider
- Language: Go
- Homepage: https://github.com/twiny/spidy/wiki
- Size: 56.6 KB
- Stars: 142
- Watchers: 6
- Forks: 26
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
## Spidy
A tool that crawls websites to find domain names and checks their availability.
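At its core, the idea is to extract candidate host names from crawled pages and then test whether each domain is still registered. Below is a minimal, self-contained Go sketch of that idea; it is illustrative only, not Spidy's actual implementation, and both the regex and the DNS-based availability heuristic are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"regexp"
	"time"
)

// hostPattern is an illustrative pattern for pulling host names out of page text.
var hostPattern = regexp.MustCompile(`[a-z0-9][a-z0-9-]*(\.[a-z0-9][a-z0-9-]*)+`)

// looksAvailable is a rough heuristic: a domain with no NS records may be
// unregistered. A production check would also consult WHOIS/RDAP.
func looksAvailable(domain string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	ns, err := net.DefaultResolver.LookupNS(ctx, domain)
	return err != nil || len(ns) == 0
}

func main() {
	page := `see https://some-expired-example.com and https://golang.org/doc`
	for _, host := range hostPattern.FindAllString(page, -1) {
		fmt.Printf("%-30s available=%v\n", host, looksAvailable(host))
	}
}
```

A DNS lookup alone can produce false positives, which is why a tool like this typically filters by TLD and confirms availability through a registry lookup before reporting a domain as expired.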
### Install

```sh
git clone https://github.com/twiny/spidy.git
cd ./spidy

# build
go build -o bin/spidy -v cmd/spidy/main.go

# run
./bin/spidy -c config/config.yaml -u https://github.com
```
## Usage

```sh
NAME:
   Spidy - Domain name scraper

USAGE:
   spidy [global options] command [command options] [arguments...]

VERSION:
   2.0.0

COMMANDS:
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --config path, -c path  path to config file
   --help, -h              show help (default: false)
   --urls urls, -u urls    urls of page to scrape (accepts multiple inputs)
   --version, -v           print the version (default: false)
```
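The help text above follows the layout produced by the urfave/cli package. As a hypothetical sketch, an equivalent command could be declared as follows; the flag names mirror the README, but everything else is an assumption rather than the project's actual `main.go`.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/urfave/cli/v2"
)

func main() {
	app := &cli.App{
		Name:    "Spidy",
		Usage:   "Domain name scraper",
		Version: "2.0.0",
		Flags: []cli.Flag{
			&cli.StringFlag{Name: "config", Aliases: []string{"c"}, Usage: "path to config file"},
			&cli.StringSliceFlag{Name: "urls", Aliases: []string{"u"}, Usage: "urls of page to scrape"},
		},
		Action: func(c *cli.Context) error {
			// Hand the parsed options to the crawler (omitted in this sketch).
			fmt.Println("config:", c.String("config"), "urls:", c.StringSlice("urls"))
			return nil
		},
	}
	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}
```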
## Configuration

```yaml
# main crawler config
crawler:
  max_depth: 10 # max depth of pages to visit per website.
  # filter: [] # regexp filter
  rate_limit: "1/5s" # 1 request per 5 sec
  max_body_size: "20MB" # max page body size
  user_agents: # array of user-agents
    - "Spidy/2.1; +https://github.com/twiny/spidy"
  # proxies: [] # array of proxy. http(s), SOCKS5

# Logs
log:
  rotate: 7 # log rotation
  path: "./log" # log directory

# Store
store:
  ttl: "24h" # keep cache for 24h
  path: "./store" # store directory

# Results
result:
  path: ./result # result directory

parralle: 3 # number of concurrent workers
timeout: "5m" # request timeout
tlds: ["biz", "cc", "com", "edu", "info", "net", "org", "tv"] # array of domain extensions to check.
```
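For illustration, a configuration like this could be loaded in Go with `gopkg.in/yaml.v3`. The struct below simply mirrors the keys shown above and is an assumption, not Spidy's actual config types.

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors the sample YAML; field names here are illustrative.
type Config struct {
	Crawler struct {
		MaxDepth    int      `yaml:"max_depth"`
		RateLimit   string   `yaml:"rate_limit"`
		MaxBodySize string   `yaml:"max_body_size"`
		UserAgents  []string `yaml:"user_agents"`
	} `yaml:"crawler"`
	Log struct {
		Rotate int    `yaml:"rotate"`
		Path   string `yaml:"path"`
	} `yaml:"log"`
	Store struct {
		TTL  string `yaml:"ttl"`
		Path string `yaml:"path"`
	} `yaml:"store"`
	Result struct {
		Path string `yaml:"path"`
	} `yaml:"result"`
	Parallel int      `yaml:"parralle"` // key spelled as in the sample config
	Timeout  string   `yaml:"timeout"`
	TLDs     []string `yaml:"tlds"`
}

func main() {
	raw, err := os.ReadFile("config/config.yaml")
	if err != nil {
		panic(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%d workers, rate limit %s, %d TLDs\n", cfg.Parallel, cfg.Crawler.RateLimit, len(cfg.TLDs))
}
```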
## TODO

- [ ] Add support for more `writers`.
- [ ] Add terminal logging.
- [ ] Add test cases.

## Issues
NOTE: This package is provided "as is" with no guarantee. Use it at your own risk and always test it yourself before using it in a production environment. If you find any issues, please create a new issue.