https://github.com/icio/crul

Python website-scraper
https://github.com/icio/crul

Last synced: 11 months ago
JSON representation

Python website-scraper

Host: GitHub
URL: https://github.com/icio/crul
Owner: icio
Created: 2016-02-06T18:08:17.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2016-02-15T13:49:03.000Z (over 10 years ago)
Last Synced: 2023-03-16T12:21:55.094Z (over 3 years ago)
Language: Python
Size: 16.6 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Crul :see_no_evil: :hear_no_evil: :speak_no_evil: 

A silly little web crawler, not dissimilar to [`gergle`](https://github.com/icio/gergle).

## Installation

```bash

# Prepare a virtual environment (optional)

virtualenv env

. env/bin/activate

# Install crul

pip install git+git://github.com/icio/crul\#egg=crul

```

## Usage

```text

$ crul

Crul website scraper.

Examples:

    A human-readable summary of the (rather outdated) pages on my website:

        $ crul http://www.paul-scott.com/

    Record the results of scraping into a file:

        $ crul http://www.paul-scott.com/ --json > me.scrape

    Render a sitemap of what we just scraped:

        $ crul --replay me.scrape --sitemap > sitemap.xml

    Scrape only 3 pages deep with a single worker, leaving 2 seconds between

    each subsequent request to the site, w/o anything under /developer/flash

    or /forum/:

        $ crul -d 3 -w 1 -t 2 https://www.kirupa.com/ \

            -i developer/flash -i forum/

    Ignore robots.txt and grab everything we can, as fast as we can, with 5

    workers, from :

        $ crul --yolo -w 5 

Usage:

    crul ( [options] | --replay=)

            [--disallow=]...

            [--dot | --sitemap | --text | --json]

            [-v|-q] [--log-file=]

    crul [--help | --version]

Options:

       --dot              Output in dot format.

       --sitemap          Output in XML sitemap format.

       --json             Output in JSON format. [default]

       --text             Output in human-readable text format.

    -A --user-agent       The user-agent sent from the client.

                          [default: Crul/1.0 (+https://github.com/icio/crul)]

    -d --depth=        Traverse n pages deep from the starting point.

                          [default: 100]

    -h --help             Print this help.

    -i --disallow=  Ignore/disallow file paths from being scraped.

    -l --log-file=  Log to the given file.

    -q --quiet            Quiet logging.

    -r --replay=    Load responses from a JSON file, instead of scraping.

    -t --delay=        Wait n seconds between requests to the site.

    -v --verbose          Verbose logging.

       --version          Print the version number.

    -w --workers=      Use n worker threads to make requests in parallel.

                          [default: 4]

       --yolo             Don't bother checking robots.txt.

```

## Implementation

`crul` is implemented as a threaded set of workers processing a queue of URLs (`crul.scrape.site_crawl`). When new pages have been collected the linked pages (`crul.parse.PageParser`) are appended to the queue (`crul.traverse.PageTraverser`) for the workers to continue processing.

```text

                  v site_crawl(init_url)

                  |

    +----------------------------------------------------+      +--- worker_sentinel [thread x 1] ------------+

    |             |                                      | >-------> Wait for all tasks to be completed.      |

    |  +---<  <---+   pending (Task queue)        <---+  |      |                                             |

    |  |                                              |  | <-------* Append kill signals to all queues.       |

    +----------------------------------------------------+      +---------------------------------------------+

       |                                              |            |

    +----+ worker [thread x num_workers] +---------------+         |

    |  |                                              |  |         |

    |  |   + worker_request +---------------------+   |  |         |

    |  |   |                                      |   |  |         |

    |  +-> | page = page_parser(session.get(url)) |   |  |         |

    |      | page_traverser.follow(page)          | >-+  |         |

    |  +-< | return page                          |      |         |

    |  |   |                                      |      |         |

    |  |   +--------------------------------------+      |         |

    |  |                                                 |         |

    +----------------------------------------------------+         |

       |                                                           |

    +----------------------------------------------------+         |

    |  |                                                 |         |

    |  +--->  >---+   completed (Page queue)             | <-------+

    |             |                                      |

    +----------------------------------------------------+

                  |

                  v iter([Page, ...])

```

## Limitations

* Query-strings being used to uniquely identify a page means that links to pages with junk query strings risk causing a lot of duplication.

* None of the output formats conflate pages based on their canonical-url.

* Python's GIL.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/icio/crul

Awesome Lists containing this project

README