An open API service indexing awesome lists of open source software.

https://github.com/jakewarren/scrape

A command line scraping utility supporting CSS selectors or XPath
https://github.com/jakewarren/scrape

css-selector css-selectors scraping-utility web-scraping xpath

Last synced: 5 months ago
JSON representation

A command line scraping utility supporting CSS selectors or XPath

Awesome Lists containing this project

README

          

# scrape

[![Build Status](https://github.com/jakewarren/scrape/workflows/lint/badge.svg)](https://github.com/jakewarren/scrape/actions)
[![GitHub release](http://img.shields.io/github/release/jakewarren/scrape.svg?style=flat-square)](https://github.com/jakewarren/scrape/releases])
[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat-square)](https://github.com/jakewarren/scrape/blob/master/LICENSE)
[![Go Report Card](https://goreportcard.com/badge/github.com/jakewarren/scrape)](https://goreportcard.com/report/github.com/jakewarren/scrape)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=shields)](http://makeapullrequest.com)

A command line scraping utility inspired by [scrape]( https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/scrape).

## Features

* Scrape using XPath or CSS selectors
* Process HTML from a URL, STDIN, or a local file
* Extract a particular attribute

## Install
### Option 1: Binary

Download the latest release from [https://github.com/jakewarren/scrape/releases/latest](https://github.com/jakewarren/scrape/releases/latest)

### Option 2: From source

```
go get github.com/jakewarren/scrape
```

## Usage

```
Usage of scrape:
-A, --agent string user agent string (default "Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 3.0.04506.30)")
-a, --attr string attribute to scrape (default "html")
-c, --css string css selector
-h, --help usage information
-k, --insecure skip SSL verification
-x, --xpath string xpath query
```

### Examples:

#### Read from URL:
```
❯ scrape -c "h4 a" -a href "https://www.webscraper.io/test-sites/e-commerce/allinone"
/test-sites/e-commerce/allinone/product/244
/test-sites/e-commerce/allinone/product/269
/test-sites/e-commerce/allinone/product/192
```

#### Read from STDIN:
```
❯ curl -A 'Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 3.0.04506.30)' -s "https://www.webscraper.io/test-sites/e-commerce/allinone" | scrape -x "//h4/a" -a href
/test-sites/e-commerce/allinone/product/223
/test-sites/e-commerce/allinone/product/280
/test-sites/e-commerce/allinone/product/278
```

#### Read from file:
```
❯ scrape -x "//h4/a" /tmp/webscrapetest.html
Aspire E1-510
Lenovo V510 Blac...
Lenovo V510 Blac...
```