https://github.com/simonsolnes/webcache

Cache webpages when you are testing your web scrapers.

scraping web-scraping


Host: GitHub
URL: https://github.com/simonsolnes/webcache
Owner: simonsolnes
Created: 2018-04-29T20:43:05.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-04-29T22:58:20.000Z (about 8 years ago)
Last Synced: 2026-03-29T14:35:46.293Z (3 months ago)
Topics: scraping, web-scraping
Language: Python
Homepage:
Size: 10.7 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md


# webcache

Cache webpages when you are testing your web scrapers.

## Putting it in your project

Add this to your project with:

`$ git submodule add https://github.com/simonsolnes/webcache webcache`

and

```python
from webcache import WebCache
```

## Quick Intro

To download a webpage:

```python
with WebCache() as c:
	website = c.get('https://www.python.org')
```

## Methods

`get(url) ->` webpage-data `(str)`  

Gets the webpage data from the web or local cache.

`insert(*urls (string))`  

Adds one or more URLs to the cache's directory without downloading them. Meant for concurrent downloads.

`fetch()`  

Will download all webpages that are not local.

`update_url(*urls (string))`  

Will update the URLs that are passed.

`update_all()`  

Will redownload all webpages that the cache knows about.

`update_old(age (int, seconds))` 

Will update the URLs whose age is greater than the one specified.

`reset()`  

Will delete all local data.

## Downloading concurrently; `insert` and `fetch`

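```python
# Commas between the entries matter: without them Python joins
# adjacent string literals into one long, invalid URL.
urls = [
	'https://www.python.org',
	'https://duckduckgo.com',
	'https://www.wikipedia.org',
]

with WebCache() as c:
	c.insert(*urls)  # register the URLs without downloading
	c.fetch()        # download everything that is not local yet
	website = c.get('https://www.python.org')
```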
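To make the `insert`/`fetch`/`get` split concrete, here is a minimal sketch of the pattern. It is not webcache's actual implementation: `SketchCache`, its in-memory dictionary, and the `fake_download` stand-in are all hypothetical names chosen for this example, and the thread pool is just one way to download in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class SketchCache:
    """Toy in-memory cache; NOT webcache's actual implementation."""

    def __init__(self, downloader):
        self._download = downloader  # hypothetical callable: url -> str
        self._pages = {}             # url -> page data, or None if not fetched yet

    def insert(self, *urls):
        # Register URLs without downloading them.
        for url in urls:
            self._pages.setdefault(url, None)

    def fetch(self):
        # Download every registered URL that has no local data, in parallel.
        missing = [url for url, data in self._pages.items() if data is None]
        with ThreadPoolExecutor(max_workers=4) as pool:
            for url, data in zip(missing, pool.map(self._download, missing)):
                self._pages[url] = data

    def get(self, url):
        # Serve from the cache, downloading only on a miss.
        if self._pages.get(url) is None:
            self._pages[url] = self._download(url)
        return self._pages[url]


calls = []


def fake_download(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    calls.append(url)
    return f"<html>{url}</html>"


cache = SketchCache(fake_download)
cache.insert("https://www.python.org", "https://duckduckgo.com")
cache.fetch()
page = cache.get("https://www.python.org")  # served locally, no extra download
```

In this sketch each URL is downloaded once by `fetch()`, and the later `get()` call is answered from the local copy.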

## Updating webpages

Update a URL:

```python
with WebCache() as c:
	# one
	c.update_url('https://www.python.org')
	# or several (urls is a list of URL strings)
	c.update_url(*urls)
```

Update old URLs:

```python
with WebCache() as c:
	c.update_old(60 * 60)  # age in seconds, here one hour
```
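The age check behind a call like this can be pictured as a timestamp comparison. The sketch below assumes the cache keeps a download time per URL; the `stale_urls` helper and the `downloaded_at` dictionary are hypothetical and only illustrate the idea, not how webcache stores its data.

```python
import time


def stale_urls(downloaded_at, max_age, now=None):
    """Return the URLs whose last download is more than max_age seconds old.

    downloaded_at: dict mapping url -> Unix timestamp of the last download
    (a hypothetical bookkeeping structure for this sketch).
    """
    now = time.time() if now is None else now
    return [url for url, ts in downloaded_at.items() if now - ts > max_age]


downloaded_at = {
    "https://www.python.org": 1_000,  # downloaded at t = 1000 s
    "https://duckduckgo.com": 4_000,  # downloaded at t = 4000 s
}
# At t = 5000 s with a one-hour limit, only the first entry is stale.
old = stale_urls(downloaded_at, max_age=60 * 60, now=5_000)
```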

Update all URLs:

```python
with WebCache() as c:
	c.update_all()
```

## Avoiding a DoS

To avoid overloading a server, you can set the amount of time you are willing to wait while the cache downloads several webpages. The longer the wait, the longer the time between each request.

```python
with WebCache(60) as c:
	...
```
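One way to read "the longer the wait, the longer the time between each request" is that the requests are spread evenly over the wait time. The sketch below shows that interpretation only; `request_offsets` is a hypothetical helper, and webcache may space its requests differently.

```python
def request_offsets(n_requests, max_wait):
    """Start offsets (seconds) that spread n_requests evenly over max_wait."""
    if n_requests <= 1:
        # Zero or one request needs no spacing.
        return [0.0] * n_requests
    gap = max_wait / (n_requests - 1)  # time between consecutive requests
    return [i * gap for i in range(n_requests)]


short_wait = request_offsets(4, 30)  # gap of 10 seconds
long_wait = request_offsets(4, 60)   # gap of 20 seconds
```

Doubling the wait doubles the gap between consecutive requests, which halves the request rate the server sees.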

## Without context

It is possible to do:

```python
c = WebCache()
website = c.get('https://www.python.org')
```

This is not recommended when several webpages are needed, since the cache has to load its directory each time you create an instance.

The class is a singleton, so there is no need to worry about whether something else is using the cache at the moment.
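For readers unfamiliar with the pattern, a singleton in Python can be sketched by overriding `__new__`. This is a generic illustration, not webcache's own code: `SingletonSketch` and its `directory` attribute are hypothetical, and the library may enforce the single instance in another way.

```python
class SingletonSketch:
    """Every call to SingletonSketch() returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.directory = {}  # shared state, created once
        return cls._instance


a = SingletonSketch()
b = SingletonSketch()
a.directory["https://www.python.org"] = "cached"
# b sees the entry because a and b are the same object.
```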

## Reset

Reset the whole cache:

```python
WebCache().reset()
```
