**scrapelib** is a library for making requests to less-than-reliable websites.

Source: [https://github.com/jamesturk/scrapelib](https://github.com/jamesturk/scrapelib)

Documentation: [https://jamesturk.github.io/scrapelib/](https://jamesturk.github.io/scrapelib/)

Issues: [https://github.com/jamesturk/scrapelib/issues](https://github.com/jamesturk/scrapelib/issues)

[![PyPI badge](https://badge.fury.io/py/scrapelib.svg)](https://badge.fury.io/py/scrapelib)
[![Test badge](https://github.com/jamesturk/scrapelib/workflows/Test/badge.svg)](https://github.com/jamesturk/scrapelib/actions?query=workflow%3ATest)

## Features

**scrapelib** originated as part of the [Open States](http://openstates.org/)
project, which scrapes the websites of all 50 state legislatures, and was
therefore designed with features desirable when dealing with sites that
have intermittent errors or require rate-limiting.

Advantages of using scrapelib over using requests as-is:

- HTTP(S) and FTP requests via an identical API
- support for simple caching with pluggable cache backends
- highly configurable request throttling
- configurable retries for non-permanent site failures
- all of the power of the superb [requests](http://python-requests.org) library
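For example, throttling, retries, and caching can all be configured on a single `Scraper`. The following is a minimal sketch based on scrapelib's documented options (`requests_per_minute`, `retry_attempts`, `retry_wait_seconds`, and the `FileCache` backend); consult the documentation for the exact behavior in your installed version:

``` python
import scrapelib
from scrapelib.cache import FileCache

# throttle to 60 requests per minute, and retry intermittent failures
# up to 3 times, waiting 10 seconds before the first retry
s = scrapelib.Scraper(
    requests_per_minute=60,
    retry_attempts=3,
    retry_wait_seconds=10,
)

# cache responses on disk so repeated runs don't re-fetch unchanged pages
s.cache_storage = FileCache('scrapelib-cache')
s.cache_write_only = False  # read from the cache too, not just write to it

response = s.get('http://example.com')
print(response.status_code)
```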

## Installation

*scrapelib* is on [PyPI](https://pypi.org/project/scrapelib/), and can be installed via any standard package management tool:

```
poetry add scrapelib
```

or:

```
pip install scrapelib
```

## Example Usage

``` python
import scrapelib

s = scrapelib.Scraper(requests_per_minute=10)

# grab Google front page
s.get('http://google.com')

# requests will be throttled to 10 per minute
while True:
    s.get('http://example.com')
```
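Because `Scraper` is a subclass of `requests.Session`, familiar session features such as shared headers carry over unchanged, and responses are ordinary `requests.Response` objects. A small sketch (the user-agent string here is purely illustrative):

``` python
import scrapelib

s = scrapelib.Scraper(requests_per_minute=10)
s.headers['User-Agent'] = 'my-scraper/0.1'  # hypothetical UA string

resp = s.get('http://example.com')
print(resp.status_code)  # a normal requests.Response
print(resp.text[:80])
```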