https://github.com/CompileInc/hodor
🕷Configuration based html scraper
- Host: GitHub
- URL: https://github.com/CompileInc/hodor
- Owner: CompileInc
- License: mit
- Created: 2016-09-06T13:03:11.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-05-21T22:55:23.000Z (over 1 year ago)
- Last Synced: 2024-05-21T23:54:57.858Z (over 1 year ago)
- Topics: cssselect, hodor, html-scraper, lxml, pagination, python, scraping
- Language: Python
- Homepage: http://hodor.live
- Size: 53.7 KB
- Stars: 23
- Watchers: 5
- Forks: 3
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Hodor [](https://pypi.python.org/pypi/hodorlive/)
A simple HTML scraper using XPath or CSS selectors.
## Install
```pip install hodorlive```
## Usage
### As python package
***WARNING: By default this package does not verify SSL connections. See the [arguments](#arguments) to enable verification.***
#### Sample code
```python
from hodor import Hodor
from dateutil.parser import parse

def date_convert(data):
    return parse(data)

url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
```
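The `transform` hook in `CONFIG` is applied to each extracted string before it is returned. A minimal stdlib sketch of the same conversion (substituting `datetime.strptime` for `dateutil.parser.parse`, and assuming the page's `M/D/YYYY` date format):

```python
from datetime import datetime

def date_convert(data):
    # Stand-in for dateutil.parser.parse; the M/D/YYYY format is an assumption.
    return datetime.strptime(data, "%m/%d/%Y")

raw_dates = ["11/1/2016", "8/16/2016"]  # what the 'effective_date' css rule would extract
effective_dates = [date_convert(d) for d in raw_dates]
print(effective_dates[0])  # 2016-11-01 00:00:00
```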
#### Sample output
```python
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
'new_symbol': 'ARNC',
'old_symbol': 'AA'},
{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
'new_symbol': 'ARNC$',
'old_symbol': 'AA$'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALN8',
'old_symbol': 'AHUSDN2018'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALN9',
'old_symbol': 'AHUSDN2019'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ6',
'old_symbol': 'AHUSDQ2016'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ7',
'old_symbol': 'AHUSDQ2017'},
{'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
'new_symbol': 'MALQ8',
'old_symbol': 'AHUSDQ2018'}]}
```

#### Arguments
- ```ua``` (User-Agent)
- ```proxies``` (check requesocks)
- ```auth```
- ```crawl_delay``` (crawl delay in seconds across pagination - default: 3 seconds)
- ```pagination_max_limit``` (max number of pages to crawl - default: 100)
- ```ssl_verify``` (default: False)
- ```robots``` (if set respects robots.txt - default: True)
- ```reppy_capacity``` (robots cache LRU capacity - default: 100)
- ```trim_values``` (if set, trims leading and trailing whitespace from output - default: True)

#### Config parameters
- By default, any key in the config is a rule to parse.
- Each rule is either an ```xpath``` or a ```css``` selector.
- Each rule extracts ```many``` values by default unless explicitly set to ```False```.
- Each rule can ```transform``` the result with a function, if one is provided.
- Extra parameters include grouping (```_groups```) and pagination (```_paginate_by```); ```_paginate_by``` also follows the rule format.
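A hypothetical sketch of what ```_groups``` does with the per-rule result lists: each group zips the named lists into one dict per row (the `group_rows` helper below is illustrative, not Hodor's actual internals):

```python
def group_rows(parsed, keys):
    """Zip the named per-rule result lists into one dict per row (assumption)."""
    return [dict(zip(keys, row)) for row in zip(*(parsed[k] for k in keys))]

# Per-rule results as Hodor would collect them before grouping.
parsed = {
    'old_symbol': ['AA', 'AA$'],
    'new_symbol': ['ARNC', 'ARNC$'],
}

rows = group_rows(parsed, ['old_symbol', 'new_symbol'])
print(rows)
# [{'old_symbol': 'AA', 'new_symbol': 'ARNC'},
#  {'old_symbol': 'AA$', 'new_symbol': 'ARNC$'}]
```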