Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jamesturk/scrapelib
⛏ a library for scraping unreliable pages
- Host: GitHub
- URL: https://github.com/jamesturk/scrapelib
- Owner: jamesturk
- License: bsd-2-clause
- Created: 2010-07-06T19:34:01.000Z (over 14 years ago)
- Default Branch: main
- Last Pushed: 2024-08-20T15:46:48.000Z (3 months ago)
- Last Synced: 2024-10-15T03:30:55.786Z (29 days ago)
- Topics: http, python, scraper
- Language: Python
- Homepage: https://jamesturk.github.io/scrapelib/
- Size: 958 KB
- Stars: 207
- Watchers: 18
- Forks: 43
- Open Issues: 7
- Metadata Files:
    - Readme: README.md
    - Funding: .github/FUNDING.yml
    - License: LICENSE
Awesome Lists containing this project
- jimsghstars - jamesturk/scrapelib - ⛏ a library for scraping unreliable pages (Python)
README
**scrapelib** is a library for making requests to less-than-reliable websites.
Source: [https://github.com/jamesturk/scrapelib](https://github.com/jamesturk/scrapelib)
Documentation: [https://jamesturk.github.io/scrapelib/](https://jamesturk.github.io/scrapelib/)
Issues: [https://github.com/jamesturk/scrapelib/issues](https://github.com/jamesturk/scrapelib/issues)
[![PyPI badge](https://badge.fury.io/py/scrapelib.svg)](https://badge.fury.io/py/scrapelib)
[![Test badge](https://github.com/jamesturk/scrapelib/workflows/Test/badge.svg)](https://github.com/jamesturk/scrapelib/actions?query=workflow%3ATest)

## Features
**scrapelib** originated as part of the [Open States](http://openstates.org/)
project to scrape the websites of all 50 state legislatures, and was therefore
designed with features desirable when dealing with sites that have
intermittent errors or require rate-limiting.

Advantages of using scrapelib over using requests as-is (a configuration sketch follows this list):
- HTTP(S) and FTP requests via an identical API
- support for simple caching with pluggable cache backends
- highly-configurable request throttling
- configurable retries for non-permanent site failures
- All of the power of the superb [requests](http://python-requests.org) library.
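The throttling and retry behavior from the list above is configured on the `Scraper` constructor. A minimal sketch, assuming the `retry_attempts` and `retry_wait_seconds` parameters described in the scrapelib documentation (`requests_per_minute` also appears in the usage example below):

``` python
import scrapelib

# Make at most 30 requests per minute, and retry transient failures
# up to 3 times, waiting 10 seconds between attempts.
s = scrapelib.Scraper(requests_per_minute=30,
                      retry_attempts=3,
                      retry_wait_seconds=10)

# Scraper wraps requests, so responses have the familiar requests API.
response = s.get('https://example.com')
print(response.status_code)
```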
## Installation

*scrapelib* is on [PyPI](https://pypi.org/project/scrapelib/), and can be installed via any standard package management tool:
``` shell
poetry add scrapelib
```

or:

``` shell
pip install scrapelib
```
## Example Usage
``` python
import scrapelib

s = scrapelib.Scraper(requests_per_minute=10)

# Grab Google front page
s.get('http://google.com')

# Will be throttled to 10 HTTP requests per minute
while True:
    s.get('http://example.com')
```
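The pluggable caching mentioned in the feature list is attached to the scraper rather than passed per request. A minimal sketch, assuming the `FileCache` backend and the `cache_storage`/`cache_write_only` attributes described in the scrapelib documentation:

``` python
import scrapelib
from scrapelib.cache import FileCache

s = scrapelib.Scraper(requests_per_minute=10)

# Persist responses to a local directory so repeated runs can be
# served from disk instead of re-fetching unchanged pages.
s.cache_storage = FileCache('scrapelib-cache')
s.cache_write_only = False  # read from the cache as well as write to it

s.get('http://example.com')  # first call hits the network
s.get('http://example.com')  # may be served from the cache
```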