https://github.com/everdrone/grab

Configurable Scraper & Downloader, Powered by RegExp and Go

GRAB

Greedy, Regex-Aware Binary Downloader





# Table of contents

- [Motivation](#why)
- [Installation](#installation)
- [Usage](#usage)
- [Quickstart](#quickstart)
- [Options](#command-options)
- [Next steps](#next-steps)

# Why

This project helps you automate scraping data and downloading assets from the internet. It is built on Go's regular expression engine and HCL for ease of use, performance, and flexibility.

# Installation

Download and install the [latest release](https://github.com/everdrone/grab/releases/latest).

# Usage

Run the following command to generate a new configuration file in the current directory.

```
grab config generate
```

> **Note**
> Grab's configuration file uses [HashiCorp's HCL](https://github.com/hashicorp/hcl).
> You can always refer to their specification for topics not covered by the documentation in this repo.

Once you're happy with your configuration, you can check that it is valid by running:

```
grab config check
```

To scrape and download assets, pass one or more URLs to the `get` subcommand:

```sh
# single URL
grab get https://url.to/scrape/files?from

# list of URLs
grab get urls.ini

# at least one of each
grab get https://my.url/and urls.ini list.ini
```

> **Note**
> A list of URLs can contain comments, as in the `ini` format: all lines starting with `#` or `;` are ignored.
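
The comment rule above can be illustrated with a minimal Go sketch. This is not Grab's actual parser: `parseURLList` is a hypothetical helper, and trimming surrounding whitespace and skipping blank lines are assumptions that the note does not state.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseURLList returns the entries of a URL list, one per line.
// Lines starting with '#' or ';' are treated as comments and skipped.
// Assumption: surrounding whitespace is trimmed and blank lines are ignored.
func parseURLList(content string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	// Made-up list content, in the shape a urls.ini file might have.
	list := "# galleries\nhttps://example.com/gallery/1\n; disabled for now\n\nhttps://example.com/gallery/2\n"
	for _, u := range parseURLList(list) {
		fmt.Println(u)
	}
}
```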

# Quickstart

The default configuration, generated with `grab config generate`, works out of the box.

```hcl
global {
  location = "/home/yourusername/Downloads/grab"
}

site "unsplash" {
  test = "unsplash"

  asset "image" {
    pattern = "contentUrl\":\"([^\"]+)\""
    capture = 1

    transform filename {
      pattern = "(?:.+)photos\\/(.*)"
      replace = "$${1}.jpg"
    }
  }

  info "title" {
    pattern = "meta[^>]+property=\"og:title\"[^>]+content=\"(?P<title>[^\"]+)\""
    capture = "title"
  }

  subdirectory {
    pattern = "\\(@(?P<username>\\w+)\\)"
    capture = "username"
    from    = body
  }
}
```
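
To see what the `asset` pattern and the `transform filename` block do, here is a small Go sketch that applies the same two expressions with the standard `regexp` package. It assumes that Grab uses group 1 of each match as the asset URL and applies the replacement in the style of `ReplaceAllString`; the helper names and the sample page body are made up for illustration. Note that the HCL string `"$${1}.jpg"` uses HCL's `$${` escape, so the regular expression engine receives `${1}.jpg`.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The `asset "image"` pattern, after HCL string unescaping.
	assetPattern = regexp.MustCompile(`contentUrl":"([^"]+)"`)
	// The `transform filename` pattern, after HCL string unescaping.
	filenamePattern = regexp.MustCompile(`(?:.+)photos\/(.*)`)
)

// extractAssets returns capture group 1 of every asset match in body
// (hypothetical helper, mirroring `capture = 1`).
func extractAssets(body string) []string {
	var urls []string
	for _, m := range assetPattern.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

// filenameFor rewrites an asset URL into a file name using the
// replacement `${1}.jpg` (hypothetical helper).
func filenameFor(assetURL string) string {
	return filenamePattern.ReplaceAllString(assetURL, "${1}.jpg")
}

func main() {
	// Made-up page body; real pages will differ.
	body := `{"contentUrl":"https://images.example.com/photos/abc123"}`
	for _, u := range extractAssets(body) {
		fmt.Println(u, "->", filenameFor(u))
	}
}
```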

To try it out, you can download a picture from [Unsplash](https://unsplash.com) with the following command:

```
grab get https://unsplash.com/photos/uOi3lg8fGl4
```

> **Warning**
> Please use this tool responsibly. Don't use it for denial-of-service attacks, and don't violate copyright or intellectual property rights!

Internally, the program checks each URL passed to `get`. If the URL matches the `test` pattern of any `site` block, the program parses the page and finds all matches for the assets and data defined in that site's `asset` and `info` blocks.
Once all the asset URLs are gathered, the download starts.
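
The matching flow described above can be sketched in Go as follows. This is a simplified illustration, not Grab's implementation: the `site` struct, `defaultSites`, `matchSite`, and `gatherAssets` are hypothetical names, the page body is passed in rather than fetched, and stopping at the first matching site is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// site is a hypothetical, simplified stand-in for a `site` block.
type site struct {
	name  string
	test  *regexp.Regexp // matched against each URL passed to `get`
	asset *regexp.Regexp // matched against the page body; group 1 is the asset URL
}

// defaultSites mirrors the "unsplash" site from the Quickstart configuration.
func defaultSites() []site {
	return []site{{
		name:  "unsplash",
		test:  regexp.MustCompile(`unsplash`),
		asset: regexp.MustCompile(`contentUrl":"([^"]+)"`),
	}}
}

// matchSite returns the first site whose test pattern matches pageURL.
// Assumption: the first match wins.
func matchSite(sites []site, pageURL string) (site, bool) {
	for _, s := range sites {
		if s.test.MatchString(pageURL) {
			return s, true
		}
	}
	return site{}, false
}

// gatherAssets collects group 1 of every asset match found in body.
func gatherAssets(s site, body string) []string {
	var urls []string
	for _, m := range s.asset.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	pageURL := "https://unsplash.com/photos/uOi3lg8fGl4"
	// Made-up stand-in for the fetched page body.
	body := `{"contentUrl":"https://images.example.com/photos/abc123"}`
	if s, ok := matchSite(defaultSites(), pageURL); ok {
		fmt.Println(s.name, gatherAssets(s, body))
	}
}
```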

After running the above command, you should have a new `grab` directory in your `~/Downloads` folder, containing a subdirectory for each site defined in the configuration. Inside each site directory you will find all the assets extracted from the provided URLs.

The configuration syntax is based on a few fundamental blocks:

- The `global` block defines the main download directory and global network options.
- `site` blocks group other blocks based on the site URL.
- `asset` blocks define what to look for on each site and how to download it.
- `info` blocks define what strings to extract from the page body.

Additional configuration settings can be specified:

- `network` blocks to pass headers and other network options when making requests.
- `transform url` blocks to replace the asset URL before downloading.
- `transform filename` blocks to replace the asset's destination path.
- `subdirectory` blocks to organize downloads into subdirectories named by strings present in the page body or URL.
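
As an illustration of how a `subdirectory` block might shape the destination path, the Go sketch below extracts a named group `username` from the page text and joins it into a path. The directory layout (`location/site/subdirectory/filename`), the helper names, and the sample text are assumptions made for this example, not a description of Grab's internals.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// subdirPattern is an example pattern with a named group "username".
var subdirPattern = regexp.MustCompile(`\(@(?P<username>\w+)\)`)

// namedCapture returns the text matched by the named group,
// or "" when the pattern does not match or the group does not exist.
func namedCapture(re *regexp.Regexp, group, text string) string {
	m := re.FindStringSubmatch(text)
	i := re.SubexpIndex(group)
	if m == nil || i < 0 {
		return ""
	}
	return m[i]
}

// destination builds the download path. filepath.Join drops empty
// elements, so an empty subdir falls back to the site directory.
func destination(location, siteName, subdir, filename string) string {
	return filepath.Join(location, siteName, subdir, filename)
}

func main() {
	body := "Photo by Jane Doe (@janedoe) on Unsplash" // made-up page text
	subdir := namedCapture(subdirPattern, "username", body)
	fmt.Println(destination("/home/yourusername/Downloads/grab", "unsplash", subdir, "abc123.jpg"))
}
```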

For a more in-depth look into Grab's configuration options, check out [the guide](/docs/guide.md).

# Command Options

To get help about any command, use the `help` subcommand or the `--help` flag:

```sh
# to list all available commands:
grab help

# to show instructions for a specific subcommand:
grab help <subcommand>
```

### `get`

#### Arguments

Accepts URLs, paths to files containing lists of URLs, or both at the same time.

```sh
# grab get [url|file...] [options]

grab get https://example.com/gallery/1 \
https://example.com/gallery/2 \
path/to/list.ini \
other/file.ini -n
```

#### Options

| Long | Short | Default | Description |
| ---------- | ----- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `force` | `f` | `false` | To overwrite already existing files |
| `config` | `c` | `nil` | To specify the path to a configuration file |
| `strict` | `s` | `false` | To stop the program at the first encountered error |
| `dry-run` | `n` | `false` | To send requests without writing to the disk |
| `progress` | `p` | `false` | To show a progress bar |
| `quiet` | `q` | `false` | To suppress all output to `stdout` (errors will still be printed to `stderr`). This option takes precedence over `verbose` |
| `verbose` | `v` | `1` | To set the verbosity level: `-v` is 1, `-vv` is 2 and so on... `quiet` overrides this option. |

## Next steps

- [x] Retries & Timeout
- [x] Network options with inheritance
- [x] URL manipulation
- [x] Destination manipulation
- [x] Improve logging
- [x] Check for updates
- [ ] Display a progress bar
- [ ] Add HCL eval context functions
- [ ] Distribute via various package managers:
  - [ ] Homebrew
  - [ ] Apt
  - [ ] Chocolatey
  - [ ] Scoop
- [ ] Scripting language integration
- [ ] Plugin system
- [ ] Sequential jobs (like GitHub workflows)

## Credits

- [Catppuccin](https://github.com/catppuccin/) for the color palette
- [Shields.io](https://github.com/badges/shields) for the badges

## License

Distributed under the [MIT License](/LICENSE).