https://github.com/jakewarren/scrape
A command line scraping utility supporting CSS selectors or XPath
css-selector css-selectors scraping-utility web-scraping xpath
- Host: GitHub
- URL: https://github.com/jakewarren/scrape
- Owner: jakewarren
- License: mit
- Created: 2018-11-06T22:34:30.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-12-25T14:17:26.000Z (over 2 years ago)
- Last Synced: 2025-10-19T01:22:41.672Z (8 months ago)
- Topics: css-selector, css-selectors, scraping-utility, web-scraping, xpath
- Language: Go
- Size: 1.62 MB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 9
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# scrape
A command-line scraping utility inspired by [scrape](https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/scrape).
## Features
* Scrape using XPath or CSS selectors
* Process HTML from a URL, STDIN, or a local file
* Extract a particular attribute
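
As a rough illustration of the second feature, the sketch below shows one way a tool could decide where its HTML comes from. `sourceKind` is a hypothetical helper written for this example, not code taken from scrape, and the rule it applies (no argument means STDIN, an `http://` or `https://` prefix means URL, anything else is a file path) is an assumption based on the examples in this README.

```go
package main

import (
	"fmt"
	"strings"
)

// sourceKind is a hypothetical helper (not from scrape's source).
// It classifies the positional argument under the assumed rule:
// empty -> STDIN, http(s) prefix -> URL, anything else -> local file.
func sourceKind(arg string) string {
	switch {
	case arg == "":
		return "stdin"
	case strings.HasPrefix(arg, "http://") || strings.HasPrefix(arg, "https://"):
		return "url"
	default:
		return "file"
	}
}

func main() {
	// Sample arguments mirror the three input modes listed above.
	args := []string{
		"",
		"https://www.webscraper.io/test-sites/e-commerce/allinone",
		"/tmp/webscrapetest.html",
	}
	for _, arg := range args {
		fmt.Printf("%q -> %s\n", arg, sourceKind(arg))
	}
}
```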
## Install
### Option 1: Binary
Download the latest release from [https://github.com/jakewarren/scrape/releases/latest](https://github.com/jakewarren/scrape/releases/latest)
### Option 2: From source
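With Go 1.17 or later (on older releases, use `go get github.com/jakewarren/scrape` instead):
```
go install github.com/jakewarren/scrape@latest
```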
## Usage
```
Usage of scrape:
  -A, --agent string   user agent string (default "Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 3.0.04506.30)")
  -a, --attr string    attribute to scrape (default "html")
  -c, --css string     css selector
  -h, --help           usage information
  -k, --insecure       skip SSL verification
  -x, --xpath string   xpath query
```
### Examples:
#### Read from URL:
```
❯ scrape -c "h4 a" -a href "https://www.webscraper.io/test-sites/e-commerce/allinone"
/test-sites/e-commerce/allinone/product/244
/test-sites/e-commerce/allinone/product/269
/test-sites/e-commerce/allinone/product/192
```
#### Read from STDIN:
```
❯ curl -A 'Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 3.0.04506.30)' -s "https://www.webscraper.io/test-sites/e-commerce/allinone" | scrape -x "//h4/a" -a href
/test-sites/e-commerce/allinone/product/223
/test-sites/e-commerce/allinone/product/280
/test-sites/e-commerce/allinone/product/278
```
#### Read from file:
```
❯ scrape -x "//h4/a" /tmp/webscrapetest.html
Aspire E1-510
Lenovo V510 Blac...
Lenovo V510 Blac...
```