https://github.com/internetarchive/draintasker
a tool for continuously ingesting w/arc files into the archive
- Host: GitHub
- URL: https://github.com/internetarchive/draintasker
- Owner: internetarchive
- Created: 2012-05-09T00:44:47.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2024-11-13T15:17:25.000Z (about 1 year ago)
- Last Synced: 2025-05-05T06:17:07.558Z (8 months ago)
- Language: Python
- Size: 886 KB
- Stars: 10
- Watchers: 5
- Forks: 7
- Open Issues: 7
Metadata Files:
- Readme: README
- Changelog: Changelog.txt
README
DRAINTASKER - clears filling disks
================================================================
* Draintasker wiki:
https://webarchive.jira.com/wiki/display/WEBOPS/Draintasker
* Draintasker bugs:
https://launchpad.net/archivewidecrawl/+bugs
draintasker supports "draining" a running crawler along two paths:
1) dtmon: IAS3-to-petabox (paired storage)
2) th-dtmon: catalog-direct-to-thumpers (Santa Clara MD)
run like this:
$ ssh home
$ screen
$ ssh -A crawler
$ cd /path/draintasker
$ svn up
$ emacs dtmon.yml   # edit, then save as /path/drain.yml
$ dtmon.py /path/drain.yml | tee -a /path/drain.log
PROCESSING
monitor a job and drain it (with dtmon.py) - while the DRAINME file
exists, pack warcs (PACKED), make manifests (MANIFEST), launch
transfers (TASK), verify transfers (TOMBSTONE), and finally delete
verified warcs, then sleep before trying again. a bash sketch of this
loop follows the diagrams below.
under-the-hood:
dtmon.py config
|
'-> s3-drain-job job_dir xfer_job_dir max_size warc_naming
|
'-> pack-warcs.sh => PACKED
'-> make-manifests.sh => MANIFEST
'-> s3-launch-transfers.sh => LAUNCH TASK [RETRY] SUCCESS TOMBSTONE
'-> delete-verified-warcs.sh => poof, no w/arcs!
th-dtmon.sh config
|
'-> drain-job job_dir xfer_job_dir thumper max_size warc_naming
|
'-> pack-warcs.sh => PACKED
'-> make-manifests.sh => MANIFEST
'-> launch-transfers.sh => LAUNCH, TASK
'-> verify-transfers.sh => SUCCESS, TOMBSTONE
'-> delete-verified-warcs.sh => poof, no w/arcs!
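each cycle is one pass of the state machine above. as a rough bash
sketch of that loop - the real logic lives in dtmon.py, and the
variables and script arguments here are assumptions for illustration,
not the actual interfaces:

  # illustrative drain loop, NOT the real dtmon.py implementation.
  # JOB_DIR, XFER_JOB_DIR, MAX_SIZE, WARC_NAMING, SLEEP are assumed.
  while [ -e "$JOB_DIR/DRAINME" ]; do
      ./pack-warcs.sh "$JOB_DIR" "$XFER_JOB_DIR" \
                      "$MAX_SIZE" "$WARC_NAMING"    # => PACKED
      ./make-manifests.sh "$XFER_JOB_DIR"           # => MANIFEST
      ./s3-launch-transfers.sh "$XFER_JOB_DIR"      # => TASK ... TOMBSTONE
      ./delete-verified-warcs.sh "$XFER_JOB_DIR"    # => poof, no w/arcs!
      sleep "$SLEEP"                                # then try again
  done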
get status of prerequisites and disk capacity like this:
$ get-status.sh crawldata_dir xfer_dir
some advice:
1) if there are old draintasker procs, kill them.
2) if files are in the way, investigate and move them aside,
   eg mv LAUNCH.open LAUNCH.1, mv ERROR ERROR.1
   it is good to number each failure/error file
3) check the status of your disks
./get-status.sh
4) (optional) test the petabox-to-thumper path on a single series
./launch-transfers.sh
5) log into home and open a screen session
6) in screen, ssh crawler, cd /path/draintasker/, svn up
7) run dtmon.py to continuously drain each job+disk
[screen]
cd /path/draintasker
./dtmon.py /path/disk1.yml
[screen]
cd /path/draintasker
./dtmon.py /path/disk3.yml
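when draining several disks, naming the screen sessions makes it easy
to find the right dtmon later - plain screen(1) usage, not a
draintasker feature:

  $ screen -S disk1        # start a named session for disk 1
  $ screen -r disk1        # reattach to it later
  $ screen -ls             # list all sessions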
CONFIGURATION
directory structure
crawldata /{1,3}/crawling
rsync_path /{1,3}/incoming
job_dir /{crawldata}/{job_name}
xfer_job_dir /{rsync_path}/{job_name}
warc_series {xfer_job_dir}/{warc_series}
depending on config, your warcs might be written to, e.g.
/1/crawling/{crawljob}/warcs
/3/crawling/{crawljob}/warcs
and be "packed" into
/1/incoming/{crawljob}/{warc_series}/MANIFEST
/3/incoming/{crawljob}/{warc_series}/MANIFEST
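the drain job itself is described by the yaml file passed to dtmon.py
(dtmon.yml in the source tree). a rough sketch of such a file, built
only from the parameters named above - exact key names may differ, so
consult the template and the wiki:

  # hypothetical drain.yml for disk 1; key names are assumptions
  # based on the s3-drain-job arguments above - check dtmon.yml
  crawljob:    wide
  job_dir:     /1/crawling/wide
  xfer_dir:    /1/incoming/wide
  max_size:    10      # GB per warc series
  warc_naming: 1       # expected warc naming pattern
  sleep_time:  300     # seconds between drain cycles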
DEPENDENCIES
dtmon.py (IAS3-to-petabox)
+ $HOME/.ias3cfg with your IAS3 keys (when using dtmon.py)
+ add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
th-dtmon.sh (catalog-direct-to-thumper)
+ ~/.wgetrc with your archive.org user cookies (see wiki)
+ ensure the petabox user exists: /home-local/petabox
+ PETABOX_HOME=/home/user/petabox (codebase from svn)
+ get petabox authorized_keys from "draintasking" crawler
@crawling08:~$ scp /home-local/petabox/.ssh/authorized_keys \
root@ia400131:/home-local/petabox/.ssh/authorized_keys
+ add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
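the wiki has the canonical [incoming_x] stanzas; as a sketch, a module
for the /1 staging disk might look like this (module name and options
here are assumptions):

  # /etc/rsyncd.conf - hypothetical module for staging disk 1
  [incoming_1]
      path = /1/incoming
      comment = draintasker staging (disk 1)
      read only = false
      list = no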
PREREQUISITES
DRAINME {job_dir}/DRAINME
FINISH_DRAIN {job_dir}/FINISH_DRAIN
PACKED {warc_series}/PACKED
MANIFEST {warc_series}/MANIFEST
LAUNCH {warc_series}/LAUNCH
TASK {warc_series}/TASK
TOMBSTONE {warc_series}/TOMBSTONE
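each prerequisite is a marker file dropped in place as a step
completes, so ordinary shell tools show how far a series has gotten.
for example, to list series that launched a task but have no tombstone
yet (illustrative - adjust the staging path for your disks):

  shopt -s nullglob
  for task in /1/incoming/*/*/TASK; do
      series=$(dirname "$task")
      [ -e "$series/TOMBSTONE" ] || echo "$series"
  done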
if you see a RETRY file, e.g. RETRY.1284217446, the suffix is the
epoch time when a non-blocking retry was scheduled. if this file
exists, then the retry was attempted at some time after that. you can
get the human-readable form of that time with the date command, like
so:
date -d @1284217446
Sat Sep 11 15:04:06 UTC 2010
DRAIN DAEMON
dtmon.py run s3-drain-job periodically
th-dtmon.sh run drain-job periodically
drain-job.sh run draintasker processes in single mode
DRAIN PROCESSING
delete-verified-warcs.sh delete original (verified) w/arcs from each series
get-remote-warc-urls.sh report remote md5 and url for all filesxml in series
item-submit-task.sh submit catalog task for series
item-verify-download.sh wget remote w/arc and verify checksum for series
item-verify-size.sh verify remote size of w/arc series
launch-transfers.sh submit transfer tasks for series
make-manifests.sh compute md5s into series MANIFEST
pack-warcs.sh create warc series when available
s3-launch-transfers.sh invoke curl for series
task-check-success.sh check and report task success by task_id
verify-transfers.sh run task-check-success and item-verify for series
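since make-manifests.sh computes md5s into MANIFEST, a packed series
can also be spot-checked by hand - assuming MANIFEST is in the usual
md5sum output format (draintasker's own check is verify-transfers.sh):

  $ cd /1/incoming/{crawljob}/{warc_series}
  $ md5sum -c MANIFEST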
UTILS
get-status.sh report dtmons, prerequisites and disk usage
addup-warcs.sh report count and total size of warcs
bundle-crawl-artifacts.sh make tarball of crawldata for permastorage
check-crawldata-staged.sh report staged crawldata file count+size
check-crawldata.sh report source crawldata file count+size
copy-crawldata.sh copy all crawldata preserving dir structure
make-and-store-bundle.sh make bundles and scp to staging
----
siznax 2010