NOTICE
======

Development of these components has moved to https://github.com/ukwa/wren and related repositories.

Wauldock
========

An experiment aimed at building scalable, modular web archive components based on Docker containers.

Scale-out Archiving Web Proxy
-----------------------------

- The [warcprox](https://github.com/internetarchive/warcprox) Dockerfile sets up warcprox on Ubuntu 14.04/Python 3.4 with the necessary dependencies.
- The [Squid](http://www.squid-cache.org/) caching forward proxy is used to set up a [Cache Hierarchy](http://wiki.squid-cache.org/Features/CacheHierarchy), but instead of caching the results, the 'parent' proxies can be instances of warcprox (see the configuration sketch after this list).
- This should allow proxy-based web archiving to be used on large scale crawls.
- Note that it may be possible to use the caching feature of Squid to avoid hitting the original site too often when extracting transcluded URLs.
- HAProxy in HTTP mode can route based on `hdr(host)`, `uri`, etc. (but not in TCP mode); a routing sketch also follows this list.
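To make the cache-hierarchy arrangement concrete, here is a hypothetical `squid.conf` fragment; hostnames and ports are illustrative, not taken from this repository:

```
# Hypothetical squid.conf fragment: warcprox instances act as 'parent'
# peers in the cache hierarchy (hostnames and ports are illustrative).
cache_peer warcprox1 parent 8000 0 no-query round-robin
cache_peer warcprox2 parent 8000 0 no-query round-robin
# Always forward through a parent; never fetch directly from the origin.
never_direct allow all
```

And a hypothetical `haproxy.cfg` fragment showing Host-header routing in HTTP mode, again with illustrative names rather than the repository's actual configuration:

```
# Hypothetical haproxy.cfg fragment (names and ports are illustrative).
frontend archiving_proxy
    bind *:8000
    mode http
    default_backend warcprox_pool

backend warcprox_pool
    mode http
    # Hash on the Host header so every request for a given site
    # lands on the same warcprox instance.
    balance hdr(host)
    server warcprox1 warcprox1:8000 check
    server warcprox2 warcprox2:8000 check
```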

### Scaling out with Docker ###

To experiment with scaling out, first clean out any existing containers:

```
$ docker-compose rm
```

Then define how many warcprox instances you want and ask for them to be configured:

```
$ docker-compose scale warcprox=3
```

Then, when you run:

```
$ docker-compose up
```

the system starts up an HAProxy instance configured to balance the load across all the warcprox instances. The provided configuration divides the load using `hdr(host)`, which sends all requests relating to a particular host to the same warcprox instance. This ensures that URL-based de-duplication can work effectively. Further experimentation with the load-balancing parameters is recommended.
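The repository's actual Compose file is not reproduced here, but a sketch of the shape it might take (Compose v1 syntax, as used by `docker-compose scale`; service names, build paths, and ports are assumptions) is:

```
# Hypothetical docker-compose.yml sketch: one scalable warcprox service
# behind HAProxy. `scale warcprox=3` replicates the warcprox service,
# and HAProxy balances across the replicas.
warcprox:
  build: ./warcprox
  expose:
    - "8000"

haproxy:
  build: ./haproxy
  links:
    - warcprox
  ports:
    - "8000:8000"
```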

### TO DO ###

- Use a shared [data volume container](https://docs.docker.com/userguide/dockervolumes/#creating-and-mounting-a-data-volume-container) to hold the WARCs (see the sketch below).
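A minimal sketch of that pattern, with illustrative image and container names:

```
# Create a container whose only job is to own the /warcs volume:
$ docker create -v /warcs --name warc-store ubuntu:14.04 /bin/true

# Mount that volume into each warcprox container
# (the warcprox image name here is hypothetical):
$ docker run -d --volumes-from warc-store wauldock/warcprox
```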

CDX/Remote Resource Index Servers
---------------------------------

- Various web archiving components may benefit from having the [CDX index](https://archive.org/web/researcher/cdx_file_format.php) available as an independent, scalable service rather than as the usual flat files.
- If the CDX server also presents an API for updating its index, as well as reading it, it can act as a core, standalone component in a modular architecture (see the sketch after this list).
- Potential uses include: playback, de-duplication, 'last seen' state during crawls.
- The [tinycdxserver](https://github.com/nla/tinycdxserver) Dockerfile sets up NLA's read/writable Remote Resource Index server (based on RocksDB) for experimentation.
- The read-only CDX servers ([pywb](https://github.com/ikreymer/pywb/wiki/CDX-Server-API), [OpenWayback](https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp)) could be unified and extended in this direction.
- Note that [warcbase](http://warcbase.org/) and [OpenWayback](https://github.com/iipc/openwayback) can be [used together](https://github.com/lintool/warcbase#waybackwarcbase-integration) for very large indexes that are best stored in HBase.
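As an illustration of the read/write idea, an interaction with such a service might look like the following; the endpoint shapes are assumptions modelled on tinycdxserver's style, not a documented API, so consult its README for the real interface:

```
# Hypothetical sketch of talking to a read/write CDX service over HTTP.

# Write: POST CDX-format lines into an index called 'myindex':
$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex

# Read: look up captures of a URL (e.g. for playback or de-duplication):
$ curl 'http://localhost:8080/myindex?url=example.org/'
```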

See also
--------

* [Brozzler](https://github.com/nlevitt/brozzler), an experimental distributed browser-based web crawler built on Docker, which works along similar lines.
* Various Dockerised OpenWayback images from [LOCKSS](https://hub.docker.com/r/lockss/openwayback/), [UNB Libraries](https://github.com/unb-libraries/docker-openwayback), and [Sawood Alam](https://github.com/ibnesayeed/docker-wayback).