https://github.com/simonsolnes/webcache
Cache webpages when you are testing your web scrapers.
- Host: GitHub
- URL: https://github.com/simonsolnes/webcache
- Owner: simonsolnes
- Created: 2018-04-29T20:43:05.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-04-29T22:58:20.000Z (about 8 years ago)
- Last Synced: 2026-03-29T14:35:46.293Z (3 months ago)
- Topics: scraping, web-scraping
- Language: Python
- Size: 10.7 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
# webcache
Cache webpages when you are testing your web scrapers.
## Putting it in your project
Add this to your project with:
`$ git submodule add https://github.com/simonsolnes/webcache webcache`
and
```python3
from webcache import WebCache
```
## Quick Intro
To download a webpage:
```python
with WebCache() as c:
    website = c.get('https://www.python.org')
```
## Methods
`get(url) ->` webpage-data `(str)`
Gets the webpage data from the web or local cache.
`insert(*urls (string))`
Adds one or several URLs to the directory without downloading them. Meant for concurrent downloads.
`fetch()`
Will download all webpages that are not local.
`update_url(*urls (string))`
Will update the URLs that are passed.
`update_all()`
Will redownload all webpages that the cache knows about.
`update_old(age (int, seconds))`
Will update the URLs that are older than the specified age.
`reset()`
Will delete all local data.
## Downloading concurrently; `insert` and `fetch`
```python
urls = [
    'https://www.python.org',
    'https://duckduckgo.com',
    'https://www.wikipedia.org',
]
with WebCache() as c:
    # register the URLs without downloading them
    c.insert(*urls)
    # download everything that is not cached locally
    c.fetch()
    website = c.get('https://www.python.org')
```
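The two-step pattern above separates registering URLs from downloading them, which is what lets the downloads run in parallel. The following is a minimal sketch of that idea, not webcache's actual implementation: `TinyCache` and `fake_download` are hypothetical names, and the downloader is a stand-in that makes no network requests.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    # Stand-in for a real HTTP request; returns placeholder page data.
    return f"<html>{url}</html>"


class TinyCache:
    """Hypothetical model of insert/fetch/get; not the webcache source."""

    def __init__(self, download=fake_download):
        self._download = download
        self._pages = {}  # url -> page data, or None if not downloaded yet

    def insert(self, *urls):
        # Register URLs without downloading them.
        for url in urls:
            self._pages.setdefault(url, None)

    def fetch(self):
        # Download every registered URL that has no local data, in parallel.
        missing = [u for u, data in self._pages.items() if data is None]
        with ThreadPoolExecutor(max_workers=8) as pool:
            for url, data in zip(missing, pool.map(self._download, missing)):
                self._pages[url] = data

    def get(self, url):
        # Serve from the cache, downloading only on a miss.
        if self._pages.get(url) is None:
            self._pages[url] = self._download(url)
        return self._pages[url]
```

In this sketch, calling `fetch()` a second time downloads nothing, because every registered URL already has local data.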
## Updating webpages
Update a URL:
```python
with WebCache() as c:
    # one
    c.update_url('https://www.python.org')
    # or several
    c.update_url(*urls)
```
Update old URLs:
```python
with WebCache() as c:
    c.update_old(60 * 60)
```
Update all URLs:
```python
with WebCache() as c:
    c.update_all()
```
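`update_old` takes an age in seconds, so the `60 * 60` above means one hour. A rough sketch of the staleness test, assuming the cache records a download timestamp per URL (the `stale_urls` helper and the `downloaded_at` mapping are hypothetical and not part of webcache):

```python
import time


def stale_urls(downloaded_at, max_age, now=None):
    """Return the URLs whose last download is older than max_age seconds.

    downloaded_at maps each URL to a Unix timestamp (assumed layout).
    """
    if now is None:
        now = time.time()
    return [url for url, ts in downloaded_at.items() if now - ts > max_age]


# Example with fixed timestamps so the result is deterministic.
downloaded_at = {
    'https://www.python.org': 1_000,  # downloaded 2 hours before `now`
    'https://duckduckgo.com': 7_900,  # downloaded 5 minutes before `now`
}
print(stale_urls(downloaded_at, max_age=60 * 60, now=8_200))
# prints ['https://www.python.org']
```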
## Avoiding a DoS
To avoid overloading a server, you can set the amount of time you are willing to wait while the cache downloads several webpages. The longer the wait, the longer the time between requests.
```python
with WebCache(60) as c:
    ...
```
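One way to picture this setting: if the wait budget is spread evenly, the pause between requests is the budget divided by the number of gaps between downloads. The sketch below assumes that even spacing. webcache may distribute the wait differently, and `request_delay` and `download_all` are hypothetical helpers.

```python
import time


def request_delay(total_wait, n_requests):
    """Seconds to pause between requests when total_wait is spread evenly."""
    if n_requests < 2:
        return 0.0  # a single request needs no pause
    return total_wait / (n_requests - 1)


def download_all(urls, download, total_wait, sleep=time.sleep):
    # Download sequentially, pausing between requests (not after the last).
    delay = request_delay(total_wait, len(urls))
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)
        pages[url] = download(url)
    return pages
```

In this sketch, a 60-second budget over 4 pages gives 20 seconds between requests.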
## Without context
It is possible to do:
```python
c = WebCache()
website = c.get('https://www.python.org')
```
However, this is not recommended when several webpages are needed, since the cache has to load its directory each time you create an instance.
The class is a singleton, so there is no need to worry about whether something else is using the cache at the moment.
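For readers unfamiliar with the pattern, a singleton class hands back the same instance every time it is constructed. The following is a generic sketch of one common way to do this in Python. `SharedCache` is a hypothetical name, and this is not necessarily how webcache implements it.

```python
class SharedCache:
    """Generic singleton sketch: every construction returns one instance."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.pages = {}  # state shared by all callers
        return cls._instance


a = SharedCache()
b = SharedCache()
a.pages['https://www.python.org'] = '<html>...</html>'
print(a is b)        # prints True
print(len(b.pages))  # prints 1
```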
## Reset
Reset the whole cache:
```python
WebCache().reset()
```