https://github.com/CompileInc/hodor
🕷Configuration based html scraper
- Host: GitHub
- URL: https://github.com/CompileInc/hodor
- Owner: CompileInc
- License: mit
- Created: 2016-09-06T13:03:11.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-05-21T22:55:23.000Z (over 1 year ago)
- Last Synced: 2024-05-21T23:54:57.858Z (over 1 year ago)
- Topics: cssselect, hodor, html-scraper, lxml, pagination, python, scraping
- Language: Python
- Homepage: http://hodor.live
- Size: 53.7 KB
- Stars: 23
- Watchers: 5
- Forks: 3
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Hodor [](https://pypi.python.org/pypi/hodorlive/)
A simple HTML scraper using XPath or CSS selectors.
## Install
```pip install hodorlive```
## Usage
### As python package
***WARNING: By default this package does not verify SSL connections. See the [arguments](#arguments) to enable verification.***
#### Sample code
```python
from hodor import Hodor
from dateutil.parser import parse

def date_convert(data):
    return parse(data)

url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
```
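The `transform` hook in `CONFIG` is applied to each extracted string before it is returned. A minimal stdlib sketch of the same conversion (substituting `datetime.strptime` for `dateutil.parser.parse`, and assuming the page's `M/D/YYYY` date format):

```python
from datetime import datetime

def date_convert(data):
    # Stand-in for dateutil.parser.parse; the M/D/YYYY format is an assumption.
    return datetime.strptime(data, "%m/%d/%Y")

raw_dates = ["11/1/2016", "8/16/2016"]  # what the 'effective_date' css rule would extract
effective_dates = [date_convert(d) for d in raw_dates]
print(effective_dates[0])  # 2016-11-01 00:00:00
```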
#### Sample output
```python
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
'new_symbol': 'ARNC',
'old_symbol': 'AA'},
{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
'new_symbol': 'ARNC$',
'old_symbol': 'AA$'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALN8',
'old_symbol': 'AHUSDN2018'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALN9',
'old_symbol': 'AHUSDN2019'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ6',
'old_symbol': 'AHUSDQ2016'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ7',
'old_symbol': 'AHUSDQ2017'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ8',
'old_symbol': 'AHUSDQ2018'}]}
```

#### Arguments
- ```ua``` (User-Agent)
- ```proxies``` (check requesocks)
- ```auth```
- ```crawl_delay``` (crawl delay in seconds across pagination - default: 3 seconds)
- ```pagination_max_limit``` (max number of pages to crawl - default: 100)
- ```ssl_verify``` (default: False)
- ```robots``` (if set respects robots.txt - default: True)
- ```reppy_capacity``` (robots cache LRU capacity - default: 100)
- ```trim_values``` (if set, trims leading and trailing whitespace from output - default: True)

#### Config parameters
- By default, any key in the config is a rule to parse.
- Each rule is either an ```xpath``` or a ```css``` selector.
- Each rule extracts ```many``` values by default unless explicitly set to ```False```.
- Each rule can ```transform``` the result with a function, if one is provided.
- Extra parameters include grouping (```_groups```) and pagination (```_paginate_by```); ```_paginate_by``` also follows the rule format.
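A hypothetical sketch of what ```_groups``` does with the per-rule result lists: each group zips the named lists into one dict per row (the `group_rows` helper below is illustrative, not Hodor's actual internals):

```python
def group_rows(parsed, keys):
    """Zip the named per-rule result lists into one dict per row (assumption)."""
    return [dict(zip(keys, row)) for row in zip(*(parsed[k] for k in keys))]

# Per-rule results as Hodor would collect them before grouping.
parsed = {
    'old_symbol': ['AA', 'AA$'],
    'new_symbol': ['ARNC', 'ARNC$'],
}

rows = group_rows(parsed, ['old_symbol', 'new_symbol'])
print(rows)
# [{'old_symbol': 'AA', 'new_symbol': 'ARNC'},
#  {'old_symbol': 'AA$', 'new_symbol': 'ARNC$'}]
```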