https://github.com/daringer/lollygag

Simple base for web scrapers

Host: GitHub
URL: https://github.com/daringer/lollygag
Owner: daringer
License: mit
Created: 2017-10-01T20:19:20.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2017-11-12T04:00:36.000Z (over 8 years ago)
Last Synced: 2025-01-29T16:11:32.710Z (over 1 year ago)
Language: Python
Homepage:
Size: 168 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md


# Lollygag

## About

* Travis CI: [![Build Status](https://travis-ci.org/snorrwe/lollygag.svg?branch=master)](https://travis-ci.org/snorrwe/lollygag)

* Supported Python versions: 

    * Python 2.7

    * Python 3.6

    * PyPy

    * PyPy 3

## Installation

`pip install lollygag`

## Usage

1. Create a custom _Parser_ to define behaviour

1. Configure the _Crawler_ via either the _run_ method or command line arguments

1. Run your script: `python my_crawler.py`

### Sample code

Find the source code [here](https://github.com/snorrwe/lollygag/blob/master/examples/daringer_example.py)

```python
from lollygag import run, Services, LinkParser, DomainCrawler, Crawler


class MyParser(LinkParser):

    def __init__(self, *args, **kwargs):
        super(MyParser, self).__init__(*args, **kwargs)
        self.use_next_data = False

    # Called when a new page is found and is about to be processed;
    # `data` contains the full HTML source.
    def feed(self, data):
        return super(MyParser, self).feed(data)

    # Called for each start tag found in the HTML source.
    def handle_starttag(self, tag, attrs):
        # super() handles link (<a>) parsing; you can add() arbitrary links to self._links.
        super(MyParser, self).handle_starttag(tag, attrs)

        # For <script> the contents are needed, so set a flag to remember it.
        if tag == "script":
            self.use_next_data = True
        if tag == "img":
            self.log_service.info("found img: {}".format(dict(attrs).get("src", "<no src attr>")))

    # Called for each piece of data (the text between two tags).
    def handle_data(self, data):
        if self.use_next_data:
            self.log_service.info("script contents: {}".format(data))

    # Called for each end tag found.
    def handle_endtag(self, tag):
        if tag == "script":
            self.use_next_data = False


# `Services.site_parser_factory` defines how a single page is parsed.
Services.site_parser_factory = MyParser

# `Services.crawler_factory` defines the Crawler, and thus how links are handled (where to crawl).
# - By default `DomainCrawler` is used, which restricts crawling to _one_ domain.
# - You could put `Crawler` here instead, but this leads to endless crawling.
# Services.crawler_factory = Crawler

run()
```
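The sample combines two ideas: a parser that collects links from each page, and a crawler that only follows links on one domain. The standalone sketch below illustrates both using only the Python 3 standard library. It does not use lollygag and is not its implementation: `LinkCollector`, `same_domain` and `links_to_follow` are hypothetical names, and comparing the host part of the URL is an assumption about what "one domain" means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for a link parser; not lollygag's LinkParser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry links we want to follow.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the current page.
            self.links.add(urljoin(self.base_url, href))


def same_domain(url, base_url):
    """Assumption: 'same domain' means the host parts of both URLs are equal."""
    return urlparse(url).netloc == urlparse(base_url).netloc


def links_to_follow(html, base_url):
    """Parse one page and return the sorted links that stay on the base domain."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return sorted(link for link in parser.links if same_domain(link, base_url))


PAGE = '<a href="/about">About</a> <a href="https://other.example/x">X</a> <img src="a.png">'
FOUND = links_to_follow(PAGE, "https://example.com/index.html")
print(FOUND)
```

Dropping the `same_domain` filter corresponds to the unrestricted case the sample warns about: every discovered link is followed, so the crawl has no natural end.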

### Command line arguments

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Short</th>
            <th>Description</th>
            <th>Default</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>--help</td>
            <td>-h</td>
            <td>Show the help and exit</td>
            <td> - </td>
        </tr>
        <tr>
            <td>--url</td>
            <td>-u</td>
            <td>Base URL you wish to crawl.<br>
                <i>Note that if you pass the url argument to run() or crawl_domain() this option will be ignored.</i>
            </td>
            <td> None </td>
        </tr>
        <tr>
            <td>--threads</td>
            <td>-t</td>
            <td>Maximum number of concurrent threads</td>
            <td> 5 </td>
        </tr>
        <tr>
            <td>--loglevel</td>
            <td>-l</td>
            <td>Level of logging, possible values = [all, debug, info, warn, error, none]</td>
            <td> all </td>
        </tr>
    </tbody>
</table>
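To make the defaults in the table concrete, here is a minimal sketch of an `argparse` parser that mirrors the documented options. It is an illustration only: `build_parser` is a hypothetical helper, and nothing here is taken from lollygag's actual argument handling, which may differ.

```python
import argparse

# Log levels as listed in the table above.
LOG_LEVELS = ["all", "debug", "info", "warn", "error", "none"]


def build_parser():
    """Hypothetical parser mirroring the documented options (not lollygag's own)."""
    parser = argparse.ArgumentParser(description="Sketch of the crawler options")
    # argparse adds --help / -h automatically.
    parser.add_argument("--url", "-u", default=None,
                        help="Base URL you wish to crawl")
    parser.add_argument("--threads", "-t", type=int, default=5,
                        help="Maximum number of concurrent threads")
    parser.add_argument("--loglevel", "-l", choices=LOG_LEVELS, default="all",
                        help="Level of logging")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
ARGS = build_parser().parse_args(["-u", "https://example.com", "-t", "2"])
print(ARGS.url, ARGS.threads, ARGS.loglevel)
```

Going by the table, the matching invocation of a crawler script would look like `python my_crawler.py -u https://example.com -t 2`, with `--loglevel` left at its default of `all`.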

## Contributing

Please refer to the [contribution guidelines](https://github.com/snorrwe/lollygag/blob/master/.github/CONTRIBUTING.md).

## License

[MIT](https://github.com/snorrwe/Crawler/blob/master/LICENSE)
