Backups and Monitoring

(tldr: Beware of pipes with set -e. And write more checks.)

At the Chaosdorf, we have an automated weekly backup of all servers and other hosts. The script uses set -e right at the start and reports its success with send_nsca just before quitting. A freshness threshold is used to produce an alert if a backup run does not report in time.

This sounds like nothing can go wrong without being noticed. However, there is a problem: backup_external uses pipes. And in a pipe, only the return value of the last command is actually evaluated:

descent ~ > ( set -e; false | true; echo foo )
foo

So, if something along the way (e.g. tar or gpg) has a problem, the script will happily run along and report its success at the end. Which will result in something like this:

flux ~ > sudo ls -l /chaosdorf/backups/09 | fgrep feedback
-rw-r--r-- 1 chaosdorf chaosdorf    0 Mar  4 00:03 feedback.chaosdorf.dn42_etc.tar.xz.gpg
-rw-r--r-- 1 chaosdorf chaosdorf  24K Mar  4 00:03 feedback.chaosdorf.dn42_packages
-rw-r--r-- 1 chaosdorf chaosdorf    0 Mar  4 00:03 feedback.chaosdorf.dn42_root.tar.xz.gpg
-rw-r--r-- 1 chaosdorf chaosdorf    0 Mar  4 00:03 feedback.chaosdorf.dn42_usr_local.tar.xz.gpg
-rw-r--r-- 1 chaosdorf chaosdorf    0 Mar  4 00:03 feedback.chaosdorf.dn42_var_local.tar.xz.gpg
-rw-r--r-- 1 chaosdorf chaosdorf    0 Mar  4 00:03 feedback.chaosdorf.dn42_var_log.tar.xz.gpg

In this case, it was likely GPG refusing to work on a readonly filesystem (it's an embedded host running on an SD card, so making it readonly makes sense).

The good thing about this is: The failed backups are all empty files, and finding empty files is as easy as running find -size 0. So now we have a second check on the receiving host to alert me whenever an obviously failed backup is transferred.

So:

Never, ever trust a single check
If you have the disk space, keep more than just the most recent three backups (I actually did this right)