An open API service indexing awesome lists of open source software.

https://github.com/icio/crul

Python website-scraper
https://github.com/icio/crul

Last synced: 11 months ago
JSON representation

Python website-scraper

Awesome Lists containing this project

README

          

# Crul :see_no_evil: :hear_no_evil: :speak_no_evil:

A silly little web crawler, not dissimilar to [`gergle`](https://github.com/icio/gergle).

## Installation

```bash
# Prepare a virtual environment (optional)
virtualenv env
. env/bin/activate

# Install crul
pip install git+git://github.com/icio/crul\#egg=crul
```

## Usage

```text
$ crul
Crul website scraper.

Examples:

A human-readable summary of the (rather outdated) pages on my website:

$ crul http://www.paul-scott.com/

Record the results of scraping into a file:

$ crul http://www.paul-scott.com/ --json > me.scrape

Render a sitemap of what we just scraped:

$ crul --replay me.scrape --sitemap > sitemap.xml

Scrape only 3 pages deep with a single worker, leaving 2 seconds between
each subsequent request to the site, w/o anything under /developer/flash
or /forum/:

$ crul -d 3 -w 1 -t 2 https://www.kirupa.com/ \
-i developer/flash -i forum/

Ignore robots.txt and grab everything we can, as fast as we can, with 5
workers, from :

$ crul --yolo -w 5

Usage:
crul ( [options] | --replay=)
[--disallow=]...
[--dot | --sitemap | --text | --json]
[-v|-q] [--log-file=]
crul [--help | --version]

Options:
--dot Output in dot format.
--sitemap Output in XML sitemap format.
--json Output in JSON format. [default]
--text Output in human-readable text format.
-A --user-agent The user-agent sent from the client.
[default: Crul/1.0 (+https://github.com/icio/crul)]
-d --depth= Traverse n pages deep from the starting point.
[default: 100]
-h --help Print this help.
-i --disallow= Ignore/disallow file paths from being scraped.
-l --log-file= Log to the given file.
-q --quiet Quiet logging.
-r --replay= Load responses from a JSON file, instead of scraping.
-t --delay= Wait n seconds between requests to the site.
-v --verbose Verbose logging.
--version Print the version number.
-w --workers= Use n worker threads to make requests in parallel.
[default: 4]
--yolo Don't bother checking robots.txt.
```

## Implementation

`crul` is implemented as a threaded set of workers processing a queue of URLs (`crul.scrape.site_crawl`). When new pages have been collected the linked pages (`crul.parse.PageParser`) are appended to the queue (`crul.traverse.PageTraverser`) for the workers to continue processing.

```text
v site_crawl(init_url)
|
+----------------------------------------------------+ +--- worker_sentinel [thread x 1] ------------+
| | | >-------> Wait for all tasks to be completed. |
| +---< <---+ pending (Task queue) <---+ | | |
| | | | <-------* Append kill signals to all queues. |
+----------------------------------------------------+ +---------------------------------------------+
| | |
+----+ worker [thread x num_workers] +---------------+ |
| | | | |
| | + worker_request +---------------------+ | | |
| | | | | | |
| +-> | page = page_parser(session.get(url)) | | | |
| | page_traverser.follow(page) | >-+ | |
| +-< | return page | | |
| | | | | |
| | +--------------------------------------+ | |
| | | |
+----------------------------------------------------+ |
| |
+----------------------------------------------------+ |
| | | |
| +---> >---+ completed (Page queue) | <-------+
| | |
+----------------------------------------------------+
|
v iter([Page, ...])
```

## Limitations

* Query-strings being used to uniquely identify a page means that links to pages with junk query strings risk causing a lot of duplication.
* None of the output formats conflate pages based on their canonical-url.
* Python's GIL.