Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/replicon/fast-archiver

Alternative archiving tool with fast performance for huge numbers of small files
https://github.com/replicon/fast-archiver

Last synced: 2 days ago
JSON representation

Alternative archiving tool with fast performance for huge numbers of small files

Awesome Lists containing this project

README

        

fast-archiver
~~~~~~~~~~~~~

fast-archiver is a command-line tool for archiving directories, and restoring
those archives written in [Go](http://golang.org).

fast-archiver uses a few techniques to try to be more efficient than
traditional tools:

1. It reads a number of files concurrently and then serializes the output.
Most other tools use sequential file processing, where operations like
``open()``, ``lstat()``, and ``close()`` can cause a lot of overhead when
reading huge numbers of small files. Making these operations concurrent
means that the tool is more often reading and writing data than you would
be otherwise.

2. It begins archiving files before it has completed reading the directory
entries that it is archiving, allowing for a fast startup time
compared to tools that first create an inventory of files to
transfer.

How Fast?
---------

On a test workload of 2,089,214 files representing a total of 90 GiB of data,
fast-archiver was compared with tar and rsync for reading data files and
transfering them over a network. The test scenario was a PostgreSQL
database, with many of the files being small, 8-24kiB in size.

Compared with tar, fast-archiver took 33% of the execution time (27m 38s vs.
1h 23m 23s) to read the test workload and output the archive to /dev/null.
The tar output had to be redirected through cat to create a comparable
scenario, because tar recognized /dev/null and shortcuts the actual data file
reading and writing. Here's the raw timing output for some hard data::

$ time fast-archiver -c -o /dev/null /db/data
skipping symbolic link /db/data/pg_xlog
1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
0inputs+0outputs (0major+1732minor)pagefaults 0swaps

$ time tar -cf - /db/data | cat > /dev/null
tar: Removing leading `/' from member names
tar: /db/data/base/16408/12445.2: file changed as we read it
tar: /db/data/base/16408/12464: file changed as we read it
32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
0inputs+0outputs (0major+5163minor)pagefaults 0swaps

Compared with rsync, fast-archiver piped over ssh can transfer the database
from one machine to another in 1h 30m, vs. rsync in 3h.

These huge reductions in time may not be typical, but they happen to be the
workload that fast-archiver was designed for.

Examples
--------

Creates an archive (-c) reading the directory target1, and redirects the
archive to the file named target1.fast-archive::

fast-archiver -c target1 > target1.fast-archive
fast-archiver -c -o target1.fast-archive target1

Extracts the archive target1.fast-archive into the current directory::

fast-archiver -x < target1.fast-archive
fast-archiver -x -i target1.fast-archive

Creates a fast-archive remotely, and restores it locally, piping the data
through ssh::

ssh [email protected] "cd /db; fast-archive -c data --exclude=data/\*.pid" | fast-archiver -x

Installation
------------

The fast-archiver repository contains both a command-line tool (at the root)
and a package called ``falib`` which contains the archive reading and writing
code. To make the build work correctly with both the library and the
command-line tool, it's necessary to setup the correct GOPATH and directory
references.

Here's a quick set of steps to setup the build:

* Install [Go](http://golang.org).

* Setup ``$GOPATH``, for example: ``export GOPATH=$HOME/go-projects``. Probably
better to set it up in your ``.bash_aliases``.

* ``go get -u github.com/replicon/fast-archiver``

_or_

* ``go get -d github.com/replicon/fast-archiver && $GOPATH/src/github.com/replicon/fast-archiver/build.sh``

Command-line arguments
----------------------

-x
Extract archive mode.

-c
Create archive mode.

--multicpu
Allows concurrent activities to run on the specified number of CPUs. Since
the archiving is dominated by I/O, additional CPUs tend to just add
overhead in communicating between concurrent processes, but it could
increase throughput in some scenarios. Defaults to 1.

Create-mode only
================

-o
Output path for the archive. Defaults to stdout.

--exclude
A colon-separated list of paths to exclude from the archive. Can include
wildcards and other shell matching constructs.

--block-size
Specifies the size of blocks being read from disk, in bytes. The larger
the block size, the more memory fast-archiver will use, but it could result
in higher I/O rates. Defaults to 4096, maximum value is 65535.

--dir-readers
The maximum number of directories that will be read concurrently. Defaults
to 16.

--file-readers
The maximum number of files that will be read concurrently. Defaults to
16.

--queue-dir
The maximum size of the queue for sub-directory paths to be processed.
Defaults to 128.

--queue-read
The maximum size of the queue for file paths to be processed. Defaults to
128.

--queue-write
The maximum size of the block queue for archive output. Increasing this
will increase the potential memory usage, as (queue-write * block-size)
memory could be allocated for file reads. Defaults to 128.

Extract-mode only
=================

-i
Input path for the archive. Defaults to stdin.

--ignore-perms
Do not restore permissions on files and directories.

--ignore-owners
Do not restore uid and gid on files and directories.