https://github.com/karlicoss/arctee

Atomic tee
backup data-liberation export

#+EXPORT_EXCLUDE_TAGS: noexport

#+begin_src python :exports output :results replace raw
import arctee
return arctee.__doc__
#+end_src

#+RESULTS:

Helper script to run your data exports.
It works kind of like [[https://en.wikipedia.org/wiki/Tee_(command)][*tee* command]], but:

- *a*: writes output atomically
- *r*: supports retrying command
- *c*: supports compressing output

You can read more on how it's used [[https://beepb00p.xyz/exports.html#arctee][here]].

* Motivation
Many steps are common to all data exports, regardless of the source.
In the vast majority of cases, you want to fetch some data, save it in a file (e.g. JSON) along with a timestamp, and possibly compress it.

This script aims to minimize the common boilerplate:

- =path= argument allows easy ISO8601 timestamping and guarantees atomic writing, so you'd never end up with corrupted exports.
- =--compression= lets you compress the output simply by passing the extension. No more =tar -zcvf=!
- =--retries= gives you easy exponential backoff in case the service you're querying is flaky.

Example:

: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py

1. runs =/soft/export/rememberthemilk.py=, trying it up to three times in total if it fails

   The script is expected to dump its result to stdout; stderr is simply passed through.
2. once the data is fetched, it's compressed as =zstd=
3. the timestamp is computed and the compressed data is written to =/exports/rtm/20200102T170015Z.ical.zstd=
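
The atomic write in the last step can be illustrated with the standard library alone. This is a minimal sketch and not arctee's own code (the Installation section says arctee relies on the =atomicwrites= package); the helper name =atomic_write= is hypothetical. The idea is to write to a temporary file in the same directory and rename it over the destination only once the data is complete.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write data so that readers see either the old file or the complete new one.

    Hypothetical helper for illustration; not part of arctee.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file must live in the destination directory (same
    # filesystem), otherwise the final rename would not be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp_name, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the partial file; the destination is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies halfway through, only the temporary file is affected, so an earlier export at the same path is never replaced by a truncated one.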

* Do you really need a special script for that?

- why not use the =date= command for timestamps?

  Passing =$(date -Iseconds --utc).json= as =path= works; however, I need it for *most* of my exports, so it ends up polluting my crontabs.
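
For comparison, here is a rough sketch of what expanding the ={utcnow}= placeholder amounts to. The function name =expand_path= is hypothetical, and the timestamp layout is inferred from the example filename =20200102T170015Z= above rather than taken from arctee's source. Only ={utcnow}= is handled here; a template containing ={hostname}= or ={platform}= would raise =KeyError= in this sketch.

```python
from datetime import datetime, timezone
from typing import Optional


def expand_path(template: str, now: Optional[datetime] = None) -> str:
    """Fill the {utcnow} placeholder with a compact UTC timestamp.

    Hypothetical helper; the format is assumed from the README example.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing 'Z' is always truthful.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return template.format(utcnow=stamp)
```

With this in place the crontab entry only needs the template string, e.g. ='/exports/rtm/{utcnow}.ical.zstd'=, instead of an inline =$(date ...)= call.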

Next, I want to do several things one after another here.
That sounds like a perfect candidate for *pipes*, right?
Sadly, there are serious caveats:

- *pipe errors don't propagate*. If one part of your pipe fails, the pipeline as a whole doesn't fail.

  That's a major problem that often leads to unexpected behaviour.

  In bash you can fix this by running =set -o pipefail=. However:

  - the default cron shell is =/bin/sh=. OK, you can change it to ~SHELL=/bin/bash~, but
  - you can't set it to =/bin/bash -o pipefail=

    You'd have to prepend =set -o pipefail= to all of your pipes, which is a lot of boilerplate.

- you can't use pipes for *retrying*; you need some wrapper script anyway

  E.g. similar to how you need a wrapper script when you want to stop your program on timeout.

- it's possible to use pipes for atomically writing output to a file, however I haven't found any existing tools to do that

  E.g. I want something like =curl https://some.api/get-data | tee --atomic /path/to/data.json=.

  If you know of any existing tool, please let me know!

- it's possible to pipe compression

  However, due to the above concerns (timestamping/retrying/atomic writing), it has to be part of the script as well.

It feels like cron isn't a suitable tool for my needs because of its pipe handling and the need for retries; however, I haven't found a better alternative.
If you think any of these things can be simplified, I'd be happy to know and remove them in favor of more standard solutions!
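
The pipe problem described above is easy to reproduce. The sketch below is illustrative only, with hypothetical helper names, and assumes a POSIX =sh= is available. =pipeline_status= shows that a plain =sh= pipeline reports only the exit status of its last command, while =run_export= shows how a wrapper that runs the command directly can see the failure and refuse to keep the output.

```python
import subprocess
from typing import Sequence


def pipeline_status(script: str) -> int:
    """Run a shell snippet with plain sh and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


def run_export(cmd: Sequence[str]) -> bytes:
    """Run an export command and return its stdout, raising if it fails.

    stderr is not captured, so it passes through to the caller's terminal or log.
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout
```

Here =pipeline_status("false | cat")= reports success even though =false= failed, whereas =run_export= raises =subprocess.CalledProcessError= for any non-zero exit code, so partial output is never mistaken for a good export.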

* Installation

This can be installed with pip by running: =pip3 install --user git+https://github.com/karlicoss/arctee=

You can also install it manually: install =atomicwrites= (=pip3 install atomicwrites=), then download and run =arctee.py= directly.

** Optional Dependencies
- =pip3 install --user backoff=

  [[https://github.com/litl/backoff][backoff]] is a library to simplify backoff and retrying. Only necessary if you want to use =--retries=.
- =apt install atool=

  [[https://www.nongnu.org/atool][atool]] is a tool to create archives in any format. Only necessary if you want to use compression.
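
To give an idea of what retrying with exponential backoff involves, here is a dependency-free sketch. It is not how arctee or the =backoff= library implements it; the names =retry=, =base_delay= and =sleep= are hypothetical, and the doubling schedule is an assumption. As with =--retries=, =tries= is the total number of attempts, so =1= means no retrying.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], tries: int, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func up to `tries` times, doubling the wait after each failure."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(tries):
        try:
            return func()
        except Exception:
            if attempt == tries - 1:
                raise  # out of attempts: propagate the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The =sleep= parameter exists only so the delays can be replaced in tests; in normal use the default =time.sleep= applies.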

# end of autogenerated stuff

* Usage

#+begin_src sh :results output :exports output
arctee --help
#+end_src

# TODO ugh. seems that github chokes over #+RESULT: here
#+begin_example
usage: arctee [-h] [-r RETRIES] [-c COMPRESSION] path

Wrapper for automating boilerplate for reliable and regular data exports.

Example: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py --user "[email protected]"

Arguments past '--' are the actuall command to run.

positional arguments:
  path                  Path with borg-style placeholders. Supported: {utcnow}, {hostname}, {platform}.

                        Example: '/exports/pocket/pocket_{utcnow}.json'

                        (see https://manpages.debian.org/testing/borgbackup/borg-placeholders.1.en.html)

optional arguments:
  -h, --help            show this help message and exit
  -r RETRIES, --retries RETRIES
                        Total number of tries, 1 (default) means only try once. Uses exponential backoff.
  -c COMPRESSION, --compression COMPRESSION
                        Set compression format.

                        See 'man apack' for list of supported formats. In addition, 'zstd' is also supported.
#+end_example

* TODOs :noexport: