# GoMine

A Go CLI tool to quickly crawl and mine (download) specific file types from websites.

## Limitations

- Will not work for some dynamic content sites.
- Will not work for sites with reCAPTCHA/CAPTCHA type protection.

## Installation

### Build from source (Go required)
To build the `gomine` binary from source, you need Go installed on your system (https://go.dev/doc/install). Once Go is installed, you can either clone and run from source (see the sketch below) or download and install with the following command:

```terminal
go install github.com/bradsec/gomine@latest
```
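Alternatively, to clone and run from source, a minimal sketch assuming the main package sits at the repository root (as the `go install` path suggests):

```terminal
git clone https://github.com/bradsec/gomine.git
cd gomine
go run . --url https://thisurlexamplesite.com
```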

## Basic Usage

```terminal
# With a URL only, gomine defaults to looking for document file types
gomine --url https://thisurlexamplesite.com

# Specify individual file types
gomine --url https://thisurlexamplesite.com --filetypes ".pdf,.jpg"
```

### Predefined File Type Groups

Use predefined groups with the `--filetypes` flag. More than one group can be combined, for example `--filetypes "images,documents"`; see the examples below.
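The group names here are taken from the defaults and examples in this README; the URL is a placeholder:

```terminal
# Use a single predefined group
gomine --url https://thisurlexamplesite.com --filetypes "images"

# Combine multiple groups
gomine --url https://thisurlexamplesite.com --filetypes "images,documents"
```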

## Full Usage Options

```terminal
  -depth int
        The maximum depth to follow links (default 10)
  -external
        Enable or disable downloading files from external domains (default true)
  -filetext string
        The text to be present in the filename (optional)
  -filetypes string
        Comma-separated list of file extensions to download (default "documents")
  -timeout int
        The maximum time the crawl will run (default 10)
  -url string
        The target URL to search, including http:// or https://
  -useragent string
        The User-Agent string to use (default "random")
```
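A hedged example combining several of the flags above (the URL and values are placeholders, not recommendations):

```terminal
# Limit crawl depth and run time, download only PDFs,
# and use a fixed User-Agent instead of the random default
gomine --url https://thisurlexamplesite.com --filetypes ".pdf" --depth 5 --timeout 30 --useragent "Mozilla/5.0"
```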

## Other Notes

### Logs
A list of the crawled/visited URLs is stored in a text file, `crawled.txt`, in the `logs` sub-directory of the target URL directory.

### External File Links

If files come from an external domain/URL, a sub-directory for that domain is created within the main target URL directory, containing the files from that site. You can disable downloading from external links using the `--external false` flag.
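As a rough illustration of the resulting layout (all file and domain names below are placeholders):

```terminal
thisurlexamplesite.com/          # main target URL directory
├── report.pdf                   # files mined from the target site
├── logs/
│   └── crawled.txt              # list of crawled/visited URLs
└── externaldomain.com/          # sub-directory for files from an external domain
    └── image.jpg
```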