https://github.com/internetarchive/draintasker
a tool for continuously ingesting w/arc files into the archive
- Host: GitHub
- URL: https://github.com/internetarchive/draintasker
- Owner: internetarchive
- Created: 2012-05-09T00:44:47.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2024-11-13T15:17:25.000Z (about 1 year ago)
- Last Synced: 2025-05-05T06:17:07.558Z (8 months ago)
- Language: Python
- Size: 886 KB
- Stars: 10
- Watchers: 5
- Forks: 7
- Open Issues: 7
Metadata Files:
- Readme: README
- Changelog: Changelog.txt
README
DRAINTASKER - clears filling disks
================================================================
* Draintasker wiki:
https://webarchive.jira.com/wiki/display/WEBOPS/Draintasker
* Draintasker bugs:
https://launchpad.net/archivewidecrawl/+bugs
draintasker supports "draining" a running crawler along two paths:
1) dtmon: IAS3-to-petabox (paired storage)
2) th-dtmon: catalog-direct-to-thumpers (Santa Clara MD)
run like this:
$ ssh home
$ screen
$ ssh -A crawler
$ cd /path/draintasker
$ svn up
$ emacs dtmon.yml   # edit, then save as /path/drain.yml
$ dtmon.py /path/drain.yml | tee -a /path/drain.log
PROCESSING
monitor a job and drain it (with dtmon.py) - while the DRAINME file
exists, pack warcs (PACKED), make manifests (MANIFEST), launch
transfers (TASK), verify transfers (TOMBSTONE), and finally delete
verified warcs, then sleep before trying again. a bash sketch of this
loop follows the diagrams below.
under-the-hood:
dtmon.py config
|
'-> s3-drain-job job_dir xfer_job_dir max_size warc_naming
|
'-> pack-warcs.sh => PACKED
'-> make-manifests.sh => MANIFEST
'-> s3-launch-transfers.sh => LAUNCH TASK [RETRY] SUCCESS TOMBSTONE
'-> delete-verified-warcs.sh => poof, no w/arcs!
th-dtmon.sh config
|
'-> drain-job job_dir xfer_job_dir thumper max_size warc_naming
|
'-> pack-warcs.sh => PACKED
'-> make-manifests.sh => MANIFEST
'-> launch-transfers.sh => LAUNCH, TASK
'-> verify-transfers.sh => SUCCESS, TOMBSTONE
'-> delete-verified-warcs.sh => poof, no w/arcs!
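each cycle is one pass of the state machine above. as a rough bash
sketch of that loop - the real logic lives in dtmon.py, and the
variables and script arguments here are assumptions for illustration,
not the actual interfaces:

  # illustrative drain loop, NOT the real dtmon.py implementation.
  # JOB_DIR, XFER_JOB_DIR, MAX_SIZE, WARC_NAMING, SLEEP are assumed.
  while [ -e "$JOB_DIR/DRAINME" ]; do
      ./pack-warcs.sh "$JOB_DIR" "$XFER_JOB_DIR" \
                      "$MAX_SIZE" "$WARC_NAMING"    # => PACKED
      ./make-manifests.sh "$XFER_JOB_DIR"           # => MANIFEST
      ./s3-launch-transfers.sh "$XFER_JOB_DIR"      # => TASK ... TOMBSTONE
      ./delete-verified-warcs.sh "$XFER_JOB_DIR"    # => poof, no w/arcs!
      sleep "$SLEEP"                                # then try again
  done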
get status of prerequisites and disk capacity like this:
$ get-status.sh crawldata_dir xfer_dir
some advice:
1) if there are old draintasker procs, kill them.
2) if files are in the way, investigate and move them aside,
   eg mv LAUNCH.open LAUNCH.1, mv ERROR ERROR.1
   it is good to number each failure/error file
3) check the status of your disks
./get-status.sh
4) (optional) test the petabox-to-thumper path on a single series
./launch-transfers.sh
5) log into home and open a screen session
6) in screen, ssh crawler, cd /path/draintasker/, svn up
7) run dtmon.py to continuously drain each job+disk
[screen]
cd /path/draintasker
./dtmon.py /path/disk1.yml
[screen]
cd /path/draintasker
./dtmon.py /path/disk3.yml
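when draining several disks, naming the screen sessions makes it easy
to find the right dtmon later - plain screen(1) usage, not a
draintasker feature:

  $ screen -S disk1        # start a named session for disk 1
  $ screen -r disk1        # reattach to it later
  $ screen -ls             # list all sessions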
CONFIGURATION
directory structure
crawldata /{1,3}/crawling
rsync_path /{1,3}/incoming
job_dir /{crawldata}/{job_name}
xfer_job_dir /{rsync_path}/{job_name}
warc_series {xfer_job_dir}/{warc_series}
depending on config, your warcs might be written to, e.g.
/1/crawling/{crawljob}/warcs
/3/crawling/{crawljob}/warcs
and be "packed" into
/1/incoming/{crawljob}/{warc_series}/MANIFEST
/3/incoming/{crawljob}/{warc_series}/MANIFEST
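the drain job itself is described by the yaml file passed to dtmon.py
(dtmon.yml in the source tree). a rough sketch of such a file, built
only from the parameters named above - exact key names may differ, so
consult the template and the wiki:

  # hypothetical drain.yml for disk 1; key names are assumptions
  # based on the s3-drain-job arguments above - check dtmon.yml
  crawljob:    wide
  job_dir:     /1/crawling/wide
  xfer_dir:    /1/incoming/wide
  max_size:    10      # GB per warc series
  warc_naming: 1       # expected warc naming pattern
  sleep_time:  300     # seconds between drain cycles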
DEPENDENCIES
dtmon.py (IAS3-to-petabox)
+ $HOME/.ias3cfg with your IAS3 keys (when using dtmon.py)
+ add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
th-dtmon.sh (catalog-direct-to-thumper)
+ ~/.wgetrc with your archive.org user cookies (see wiki)
+ ensure the petabox user exists: /home-local/petabox
+ PETABOX_HOME=/home/user/petabox (codebase from svn)
+ get petabox authorized_keys from "draintasking" crawler
@crawling08:~$ scp /home-local/petabox/.ssh/authorized_keys \
root@ia400131:/home-local/petabox/.ssh/authorized_keys
+ add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
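the wiki has the canonical [incoming_x] stanzas; as a sketch, a module
for the /1 staging disk might look like this (module name and options
here are assumptions):

  # /etc/rsyncd.conf - hypothetical module for staging disk 1
  [incoming_1]
      path = /1/incoming
      comment = draintasker staging (disk 1)
      read only = false
      list = no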
PREREQUISITES
DRAINME {job_dir}/DRAINME
FINISH_DRAIN {job_dir}/FINISH_DRAIN
PACKED {warc_series}/PACKED
MANIFEST {warc_series}/MANIFEST
LAUNCH {warc_series}/LAUNCH
TASK {warc_series}/TASK
TOMBSTONE {warc_series}/TOMBSTONE
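each prerequisite is a marker file dropped in place as a step
completes, so ordinary shell tools show how far a series has gotten.
for example, to list series that launched a task but have no tombstone
yet (illustrative - adjust the staging path for your disks):

  shopt -s nullglob
  for task in /1/incoming/*/*/TASK; do
      series=$(dirname "$task")
      [ -e "$series/TOMBSTONE" ] || echo "$series"
  done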
if you see a RETRY file, e.g. RETRY.1284217446, the suffix is the
epoch time when a non-blocking retry was scheduled. if this file
exists, then the retry was attempted at some time after that. you can
get the human-readable form of that time with the date command, like
so:
date -d @1284217446
Sat Sep 11 15:04:06 UTC 2010
DRAIN DAEMON
dtmon.py run s3-drain-job periodically
th-dtmon.sh run drain-job periodically
drain-job.sh run draintasker processes in single mode
DRAIN PROCESSING
delete-verified-warcs.sh delete original (verified) w/arcs from each series
get-remote-warc-urls.sh report remote md5 and url for all filesxml in series
item-submit-task.sh submit catalog task for series
item-verify-download.sh wget remote w/arc and verify checksum for series
item-verify-size.sh verify remote size of w/arc series
launch-transfers.sh submit transfer tasks for series
make-manifests.sh compute md5s into series MANIFEST
pack-warcs.sh create warc series when available
s3-launch-transfers.sh invoke curl for series
task-check-success.sh check and report task success by task_id
verify-transfers.sh run task-check-success and item-verify for series
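since make-manifests.sh computes md5s into MANIFEST, a packed series
can also be spot-checked by hand - assuming MANIFEST is in the usual
md5sum output format (draintasker's own check is verify-transfers.sh):

  $ cd /1/incoming/{crawljob}/{warc_series}
  $ md5sum -c MANIFEST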
UTILS
get-status.sh report dtmons, prerequisites and disk usage
addup-warcs.sh report count and total size of warcs
bundle-crawl-artifacts.sh make tarball of crawldata for permastorage
check-crawldata-staged.sh report staged crawldata file count+size
check-crawldata.sh report source crawldata file count+size
copy-crawldata.sh copy all crawldata preserving dir structure
make-and-store-bundle.sh make bundles and scp to staging
----
siznax 2010