Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/Skallwar/suckit

Suck the InTernet
https://github.com/Skallwar/suckit

hacktoberfest rust webscraping

Last synced: about 1 month ago
JSON representation

Suck the InTernet

Lists

README

        

![Build and test](https://github.com/Skallwar/suckit/workflows/Build%20and%20test/badge.svg)
[![codecov](https://codecov.io/gh/Skallwar/suckit/branch/master/graph/badge.svg?token=ZLD369AY2G)](https://codecov.io/gh/Skallwar/suckit)
[![Crates.io](https://img.shields.io/crates/v/suckit.svg)](https://crates.io/crates/suckit)
[![Docs](https://docs.rs/suckit/badge.svg)](https://docs.rs/suckit)
[![Deps](https://deps.rs/repo/github/Skallwar/suckit/status.svg)](https://deps.rs/repo/github/Skallwar/suckit)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![MSRV](https://img.shields.io/badge/MSRV-1.70.0-blue)

# SuckIT

`SuckIT` allows you to recursively visit and download a website's content to
your disk.

![SuckIT Logo](media/suckit_logo.png)

# Features

* [x] Vacuums the entirety of a website recursively
* [x] Uses multithreading
* [x] Writes the website's content to your disk
* [x] Enables offline navigation
* [x] Offers random delays to avoid IP banning
* [ ] Saves application state on CTRL-C for later pickup

# Options
```console
USAGE:
suckit [FLAGS] [OPTIONS]

FLAGS:
-c, --continue-on-error Flag to enable or disable exit on error
--disable-certs-checks Dissable SSL certificates verification
--dry-run Do everything without saving the files to the disk
-h, --help Prints help information
-V, --version Prints version information
-v, --verbose Enable more information regarding the scraping process
--visit-filter-is-download-filter Use the dowload filter in/exclude regexes for visiting as well

OPTIONS:
-a, --auth ...
HTTP basic authentication credentials space-separated as "username password host". Can be repeated for
multiple credentials as "u1 p1 h1 u2 p2 h2"
--cookie
Cookie to send with each request, format: key1=value1;key2=value2 [default: ]

--delay
Add a delay in seconds between downloads to reduce the likelihood of getting banned [default: 0]

-d, --depth
Maximum recursion depth to reach when visiting. Default is -1 (infinity) [default: -1]

-e, --exclude-download
Regex filter to exclude saving pages that match this expression [default: $^]

--exclude-visit
Regex filter to exclude visiting pages that match this expression [default: $^]

--ext-depth
Maximum recursion depth to reach when visiting external domains. Default is 0. -1 means infinity [default:
0]
-i, --include-download
Regex filter to limit to only saving pages that match this expression [default: .*]

--include-visit
Regex filter to limit to only visiting pages that match this expression [default: .*]

-j, --jobs Maximum number of threads to use concurrently [default: 1]
-o, --output Output directory
--random-range
Generate an extra random delay between downloads, from 0 to this number. This is added to the base delay
seconds [default: 0]
-t, --tries Maximum amount of retries on download failure [default: 20]
-u, --user-agent User agent to be used for sending requests [default: suckit]

ARGS:
Entry point of the scraping
```

# Example

A common use case could be the following:

`suckit http://books.toscrape.com -j 8 -o /path/to/downloaded/pages/`

![asciicast](media/suckit-adjusted-120cols-40rows-100ms.svg)

# Installation

As of right now, `SuckIT` does not work on Windows.

To install it, you need to have Rust installed.

* Check out [this link](https://www.rust-lang.org/learn/get-started) for
instructions on how to install Rust.

* If you just want to install the suckit executable, you can simply run
`cargo install --git https://github.com/skallwar/suckit`

* Now, run it from anywhere with the `suckit` command.

### Arch Linux

`suckit` can be installed from available [AUR packages](https://aur.archlinux.org/packages/?O=0&SeB=b&K=suckit&outdated=&SB=n&SO=a&PP=50&do_Search=Go) using an [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). For example,

```
yay -S suckit
```

__Want to contribute ? Feel free to
[open an issue](https://github.com/Skallwar/suckit/issues/new) or
[submit a PR](https://github.com/Skallwar/suckit/compare) !__

# License

SuckIT is primarily distributed under the terms of both the MIT license
and the Apache License (Version 2.0)

See [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT) for details.