https://github.com/dave/scrapy
Web scraper test project
- Host: GitHub
- URL: https://github.com/dave/scrapy
- Owner: dave
- License: mit
- Created: 2018-08-28T14:27:50.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-08-30T16:01:48.000Z (over 6 years ago)
- Last Synced: 2024-10-13T14:35:33.745Z (3 months ago)
- Language: Go
- Size: 89.8 KB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[![Build Status](https://travis-ci.org/dave/scrapy.svg?branch=master)](https://travis-ci.org/dave/scrapy)
[![Go Report Card](https://goreportcard.com/badge/github.com/dave/scrapy)](https://goreportcard.com/report/github.com/dave/scrapy)
[![codecov](https://codecov.io/gh/dave/scrapy/branch/master/graph/badge.svg)](https://codecov.io/gh/dave/scrapy)

# A simple web scraper
### Install
```
go get -u github.com/dave/scrapy
```

### Usage
```
scrapy [url]
```

The `scrapy` command fetches the page at `url`, parses it for links, and then fetches every linked page on the same domain.

Stats are printed while the job runs, and a list of URLs is printed when it finishes. You can end the job early with Ctrl+C.

### Flags
Several command line flags are available:
```
  -length int
        Length of the queue (default 1000)
  -timeout int
        Request timeout in ms (default 10000)
  -url string
        The start page (default "https://monzo.com")
  -workers int
        Number of concurrent workers (default 5)
```

### Library
This scraper can also be used as a library. See the [scraper](https://godoc.org/github.com/dave/scrapy/scraper) package.
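The sketch below does not use the `scraper` package, whose actual API is documented at the link above. It only illustrates, with the standard library, the "same domain" filtering step described under Usage. The helper name `sameDomainLinks` is hypothetical, and treating "same domain" as "identical host" is an assumption (so `www.monzo.com` would not match `monzo.com`).

```go
package main

import (
	"fmt"
	"net/url"
)

// sameDomainLinks is a hypothetical helper, not part of dave/scrapy.
// It resolves each href against base and returns the absolute http(s)
// URLs whose host equals the host of base, with fragments dropped and
// duplicates removed.
func sameDomainLinks(base string, hrefs []string) ([]string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip malformed links
		}
		abs := b.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // skip mailto:, javascript:, and similar
		}
		if abs.Host != b.Host {
			continue // assumption: "same domain" means identical host
		}
		abs.Fragment = "" // #anchors point at the same page
		s := abs.String()
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out, nil
}

func main() {
	links, err := sameDomainLinks("https://monzo.com/blog", []string{
		"/about",
		"https://monzo.com/blog/latest#top",
		"https://example.com/elsewhere",
		"mailto:hello@example.com",
		"/about",
	})
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Extracting the `href` values from the HTML is left out on purpose: the Go standard library has no HTML parser, so a real crawler would need an additional package for that step.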
### Notes
See [here](https://github.com/dave/scrapy/blob/master/NOTES.md) for design notes and brainstorming.
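For a rough feel of the moving parts before reading the notes, here is a generic Go sketch of how a bounded queue, a fixed number of workers, and a per-request timeout (the three knobs exposed by `-length`, `-workers`, and `-timeout`) commonly fit together. It is not taken from NOTES.md or from the project's code. The names `crawlAll`, `newClient`, and `errFake` are hypothetical, and the fetch function is faked so the example runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// errFake stands in for a failed request in the demo below.
var errFake = errors.New("fake failure")

// newClient builds an HTTP client whose requests give up after timeoutMs
// milliseconds, mirroring the unit used by the -timeout flag.
func newClient(timeoutMs int) *http.Client {
	return &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}
}

// crawlAll pushes urls through a buffered channel of capacity queueLen and
// processes them with `workers` goroutines. workers must be at least 1.
// It returns how many fetches succeeded and how many failed.
func crawlAll(urls []string, workers, queueLen int, fetch func(string) error) (success, failed int) {
	queue := make(chan string, queueLen)
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue {
				err := fetch(u)
				mu.Lock() // counters are shared between workers
				if err != nil {
					failed++
				} else {
					success++
				}
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		queue <- u // blocks while the queue is full
	}
	close(queue)
	wg.Wait()
	return success, failed
}

func main() {
	_ = newClient(10000) // a real fetch func would call client.Get(u)
	urls := []string{"https://monzo.com", "https://monzo.com/about", "bad"}
	fetch := func(u string) error {
		if u == "bad" {
			return errFake
		}
		return nil
	}
	ok, failed := crawlAll(urls, 5, 1000, fetch)
	fmt.Println("Success", ok, "Errors", failed)
}
```

A real crawler also feeds newly discovered links back into the queue while workers are running, which this sketch deliberately omits to keep the shutdown logic simple.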
### Example output
```
Summary
-------
Queued 46
In progress 5 https://monzo.com/blog/2018/08/30/manage-your-bills
Success 22
Errors 0

Latency
-------
0 - 100 ***
100 - 200
200 - 300
300 - 400 **************************
400 - 500 ******************************
500 - 600 ***************
600 - 700 ***
700 - 800 ***
800 - 900
900 - 1000
1000 - 1100
1100 - 1200
1200 - 1300
1300 - 1400
1400 - 1500
1500 - 1600
1600 - 1700
1700 - 1800
1800 - 1900
1900 - 2000
2000+

URLs
----
https://monzo.com
https://monzo.com/-play-store-redirect
https://monzo.com/about
https://monzo.com/blog
https://monzo.com/blog/2018/07/02/publishing-our-2018-annual-report
https://monzo.com/blog/2018/07/10/making-quarterly-goals-public
https://monzo.com/blog/2018/07/25/monzo-reliability-report
https://monzo.com/blog/how-money-works
https://monzo.com/blog/latest
...
```
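The latency section of the output groups response times into 100 ms rows from 0 to 2000 ms, with a final `2000+` row. A minimal sketch of that bucketing follows. The names `bucketIndex` and `histogram` are hypothetical, the boundary convention (lower bound inclusive) is an assumption, and the sketch prints one star per sample, which is not necessarily how the real tool scales its bars.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	bucketWidthMs = 100
	numBuckets    = 20 // rows 0-100 ... 1900-2000; one extra row holds 2000+
)

// bucketIndex maps a latency in milliseconds to its histogram row.
// Index numBuckets is the overflow ("2000+") row.
func bucketIndex(ms int) int {
	if ms < 0 {
		ms = 0
	}
	i := ms / bucketWidthMs
	if i > numBuckets {
		i = numBuckets
	}
	return i
}

// histogram returns one text row per bucket, with one '*' per sample.
func histogram(latenciesMs []int) []string {
	counts := make([]int, numBuckets+1)
	for _, ms := range latenciesMs {
		counts[bucketIndex(ms)]++
	}
	rows := make([]string, 0, len(counts))
	for i, c := range counts {
		label := "2000+"
		if i < numBuckets {
			label = fmt.Sprintf("%d - %d", i*bucketWidthMs, (i+1)*bucketWidthMs)
		}
		row := fmt.Sprintf("%-12s %s", label, strings.Repeat("*", c))
		rows = append(rows, strings.TrimRight(row, " ")) // no trailing spaces on empty rows
	}
	return rows
}

func main() {
	for _, row := range histogram([]int{50, 350, 350, 420, 2500}) {
		fmt.Println(row)
	}
}
```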