As I've explained, one of the first things I did after installing Arch was to
set up a backup solution. This was done simply using a little bash script, which
itself uses rsync to do the actual backup.
But after using this system for a little while, I decided that it needed a rewrite. This was mainly due to two reasons:
- A "bug" in rsync
- A flaw in the script/system
A "bug" in rsync
I put "bug" in quotes because, while some could classify this as a bug, it isn't really one: the current behavior of rsync matches what the manual says. It's just easy to see how a change in that behavior would make a lot of sense.
It all comes down to the famous
--link-dest option that I'm using as a way to
get multiple backups of my system, where all identical files only need to be
copied/stored once on the backup disk, and then hard-linked as many times as
needed. This works great; however, here's what the manual says:
This option works best when copying into an empty destination hierarchy, as rsync treats existing files as definitive (so it never looks in the link-dest dirs when a destination file already exists)
So if a file already exists in the destination, rsync will not look into the
link-dest dirs. Which means that, in my case, when updating the week backup, it
could happen that a file already existed (from the previous week) but had changed
since then, and as a result rsync would copy the new file.
Ideally, it would have looked into the link-dest dir before, thus realizing that
there's already an up-to-date version of the file, and therefore simply creating
a hard-link is enough. It would make the whole process go faster, and save space
of course. Alas, that's not how rsync works, and this would be repeated for the
month backup as well.
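To see whether rsync actually hard-linked a file or made a fresh copy, one can check if the two paths point to the same file on disk. Here is a minimal sketch; the helper name `same_inode` and the example paths are mine, purely for illustration, and not part of the script:

```bash
# same_inode FILE1 FILE2
# Succeeds when both paths refer to the same file on disk (same device
# and inode), i.e. one is a hard link of the other rather than a copy.
same_inode() {
    [ "$1" -ef "$2" ]
}

# Example with hypothetical backup paths:
#   same_inode /backups/day/etc/fstab /backups/week/etc/fstab \
#       && echo "hard-linked" || echo "separate copies"
```

If the two copies of an unchanged file turn out to be separate, rsync copied the file instead of linking it, which is the behavior described above.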
This issue has been reported a few times, including here where someone even proposed a patch. Unfortunately, while the patch seemed to work as expected on a small test, when I tried it on a "real" run, rsync just crashed.
Two solutions at that point:
- I try to fix it myself; but I don't think I would be able to do that.
- Change the way I use rsync to bypass this "bug"
A flaw in the script/system
I went with the latter, because of something else that I knew about from the
start and was originally okay with, but changed my mind about. My script would
run each day (cron) to make a new backup.
In addition, every week it would make a backup
week and every month a backup
month, the idea being that at any given moment I have three backups: beginning
of the day, the week, and the month.
But the way it was done would result in two backups being pretty much the same, and every once in a while all three of them would actually be identical. So, I figured I might as well improve this, and also work it so that I use rsync in a way that will bypass the "bug" described earlier.
The new & improved backup script
So I made a new version of the script, which now works as follows:
- launch rsync to make a backup (in a new folder named after the current date,
e.g. 2011-09-23). It still uses the
--link-dest option (pointing to the latest backup), but since the destination is always a new folder, the problem no longer occurs.
- then it creates/updates a symlink
latest so that it points to the newly created backup. This symlink is what's used in the --link-dest option.
- the backup from the day before is removed, unless:
  - it's the 2nd day of the month, then last month's backup is removed instead
  - it's Tuesday, then:
    - if it's also the 2nd day of the month, nothing else is removed
    - if it's also the 9th day of the month, the backup from 2 weeks ago is also removed
    - else, last week's backup is also removed
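The first two steps can be sketched roughly like this. It's only an illustration, not the actual script: the function name `make_backup`, the `%Y-%m-%d` date format and the bare `--archive` argument are assumptions of mine.

```bash
# make_backup SOURCE DEST_ROOT [NAME]
# Copies SOURCE into a new folder DEST_ROOT/NAME (NAME defaults to today's
# date), hard-linking unchanged files against the previous backup, then
# repoints the "latest" symlink at the new backup.
make_backup() {
    local src="$1" root="$2" name="${3:-$(date +%Y-%m-%d)}"
    local dest="$root/$name"

    # The destination must be new, otherwise rsync treats existing files
    # as definitive and never looks into the link-dest dir for them.
    if [ -e "$dest" ]; then
        echo "refusing to reuse existing folder: $dest" >&2
        return 1
    fi

    if [ -e "$root/latest" ]; then
        # A relative --link-dest is resolved against the destination dir,
        # so ../latest means "$root/latest".
        rsync --archive --link-dest=../latest "$src/" "$dest/" || return 1
    else
        # Very first backup: nothing to link against yet.
        rsync --archive "$src/" "$dest/" || return 1
    fi

    # -n: replace the symlink itself instead of following it into its target
    ln -sfn "$name" "$root/latest"
}
```

The symlink is only updated once rsync has succeeded, so a failed run leaves `latest` pointing at the last good backup.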
The result is pretty much the same as before, except that now I never have 2 (or more) identical backups. There's even handling of the case where a new day, week and month all begin at the same time. In that case, on the 2nd I'll have my daily backup, the backup from the day before (as the new backup of the month), and the backup from the previous week, which will be removed on the 9th.
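To make the removal rules more concrete, here is a sketch of how they could be computed. This is my reading of the rules, not the script's actual code. It assumes GNU date, backups named `YYYY-MM-DD`, a weekly backup made on Monday and a monthly backup made on the 1st. The function name `backups_to_remove` is made up.

```bash
# backups_to_remove TODAY   (TODAY as YYYY-MM-DD; requires GNU date)
# Prints the name of each backup that can be removed once today's
# backup has been made, one per line.
backups_to_remove() {
    local today="$1" dom dow
    dom=$(date -u -d "$today" +%d) || return 1   # day of month, 01..31
    dow=$(date -u -d "$today" +%u)               # day of week, 1 = Monday

    if [ "$dom" = 02 ]; then
        # Yesterday (the 1st) becomes the monthly backup: drop the old one.
        date -u -d "${today%-*}-01 1 month ago" +%F
    elif [ "$dow" != 2 ]; then
        # Ordinary day: yesterday's backup is no longer needed.
        date -u -d "$today 1 day ago" +%F
    fi

    if [ "$dow" = 2 ] && [ "$dom" != 02 ]; then
        if [ "$dom" = 09 ]; then
            # Last week's backup doubles as the monthly one, so keep it
            # and drop the weekly backup from the week before instead.
            date -u -d "$today 15 days ago" +%F
        else
            # Yesterday (Monday) becomes the weekly backup: drop the old one.
            date -u -d "$today 8 days ago" +%F
        fi
    fi
}

# backups_to_remove 2011-09-23   # a Friday: prints 2011-09-22
# backups_to_remove 2011-08-02   # a Tuesday and the 2nd: prints 2011-07-01
```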
I also used this occasion to put some things out in a configuration file (and/or
as command-line options), because I realized the first version had a few too
many things hard-coded (for instance, I hadn't even realized it couldn't be used
to backup anything else than
Now the script relies on a configuration file, so you can define as many backup
schemes as you want, then simply specify which config file to use from the
command line (using
--config). Of course you can also simply define
all options on the command line, should you want to.
In case you define the same option both in a config file and on the command line, the latter takes precedence.
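One simple way to get that precedence is to apply the config file's settings first and the command-line ones second, so that the later value overwrites the earlier one. A small sketch follows; the `apply_settings` function, the `opt_*` variables and the `name=value` argument form are illustrative assumptions, not the script's real interface.

```bash
# apply_settings NAME=VALUE...
# Stores each known setting in a variable; a later call (or a later
# argument) overwrites whatever was stored before.
apply_settings() {
    local pair
    for pair in "$@"; do
        case ${pair%%=*} in
            source)      opt_source=${pair#*=} ;;
            date-format) opt_date_format=${pair#*=} ;;
            *) echo "unknown option: ${pair%%=*}" >&2; return 1 ;;
        esac
    done
}

apply_settings "source=/home" "date-format=%Y-%m-%d"   # as read from a config file
apply_settings "source=/etc"                           # as given on the command line

echo "$opt_source"        # prints /etc
echo "$opt_date_format"   # prints %Y-%m-%d
```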
The configuration file is a simple text file, where you can use comments (start
the line with
#). Values should not be put between quotes but directly
specified after the equal sign.
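Reading such a file takes only a few lines of shell. Below is a hedged sketch of a parser for this format, not the script's own code. The name `read_config` and the tab-separated output are choices of mine.

```bash
# read_config FILE
# Prints one "name<TAB>value" line per setting found in FILE.
read_config() {
    local line
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;   # skip blank lines and comments
            *=*)
                # Split at the first "="; the value is taken literally,
                # so any quotes would end up being part of it.
                printf '%s\t%s\n' "${line%%=*}" "${line#*=}"
                ;;
            *) echo "ignoring malformed line: $line" >&2 ;;
        esac
    done < "$1"
}
```

Since the value is everything after the first equal sign, it can itself contain spaces or further equal signs without any quoting.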
Backup folders are created in a destination root, set by option
Alongside the actual backups, a symlink will be automatically created/updated
after each backup, pointing to the latest backup. Its name can be set using
For the whole process to work, backups should be named after the date they were
run on. You can customize their names using option
date-format, defaulting to
man date for more about the format supported). You can also
run the script while specifying the name to use this time (instead of using the
date format), using option
The backup source is simply set using option
source and as before you can
define exclusions through option
exclude-from (will be sent to rsync's option
of the same name).
Speaking of, you can also define the arguments for rsync through option
Make sure not to use
--link-dest as they are
auto-added if needed. If not set, it defaults to
--archive --acls --xattrs
--human-readable -h --stats
A sample configuration file is included, and you can use
--help to get
Additionally, you can also download the latest version from this link.
It's all released under GPLv3, and of course bug reports, suggestions or any other form of constructive criticism are very much welcome.