One of the first things I did, after having installed Arch, was to think about a backup solution. Well, in fact I had been thinking about that before I even knew which distro I would use, and I had decided to set up a mirroring of my "main" drive, containing both the system (/ and its friends) and my data, which actually are on the same partition (for now?).
Logically, what I did was play with mdadm, and started setting up a RAID1 array. This went well, mdadm making things very easy, and since this was all done in a VM I could, without a problem, make a disk disappear, bring in a new one and see how rebuilding the array would work (in brief: it couldn't be easier, and it was pretty fast, too).
Side note: It's one thing I'm liking more and more each day with Linux: everything seems pretty easy to do (it's all relative, of course), and doable simply from the command line. Now I'll admit that I'm more of a GUI guy myself, but I can still appreciate how a couple of commands allow one to create/format a partition, resize it, move it, create an image of it, or set up encryption or pretty much any kind of RAID array of one's choice. And all that (once you start to understand how it works) with an obvious simplicity. And while this is a Linux thing in general, it feels even more true in Arch. Anyways...
But then I realized I had made a mistake, and a mirror wasn't what I was looking for. Because while this was easy enough to set up, and had the other great advantage of being a "set & forget" solution, it didn't actually provide what I needed. A mirror means that if one drive dies, you can still go on as if nothing happened: get a new drive, do the replacement and keep going without any trouble or data loss.
It's not nothing, but I wanted something else. I wanted to be able to stare at
my screen and go, "Alright, I screwed things up nicely. Now let's restore
things to a working situation again." I wanted to be able to realize that,
while I meant
mv, what I actually typed was
rm, for some odd reason, without
having to panic or cry.
In other words, I wanted to have actual backups of my data, and not just a(n always up-to-date) mirror. Backups that could be old (not too old, but definitely not an instant mirror). So I said goodbye to mdadm, and simply settled for a little bash script using rsync, that would run automatically at night as a cron job.
Multiple backups, with just one copy of the data
rsync is a great and powerful tool, one that allows you to create a whole lot of backup solutions, including over ssh and many more things I don't need. One of the things it does, though, that will be of use to me, is that it can basically create different backups of your data with only one full copy of the data, plus simply what changed (i.e. new/modified files).
I'm not talking about its algorithm that only transfers the parts of a file that changed, to minimize the amount of data to transfer and speed things up, but about how the data are stored. The idea is that you provide rsync with one (or more) additional location(s), besides the usual source & destination. When a file is missing in the destination, before copying it from the source rsync will check those additional locations. Then, it can do different things:
- with --compare-dest it will simply skip those files; thus only new/modified files are backed up.
- with --copy-dest it will do a local copy of the files; this doesn't actually save any space.
- with --link-dest it will create hard links to the files; this is the magic we want.
It's that third option that interests me, because it means you can end up with e.g. three backups, each folder containing a full backup of your drive/data at a given time, but you don't actually need three times the space for it. Only one, plus what's changed.
The magic of hard links
Quickly, for those not familiar with hard links: basically, when you have a file stored on your drive, there are two things: the actual data (the content of the file) and its name. Usually there's only one name per content, but you can actually have more. For example, two names, even in two different folders, can actually be the same file.
These aren't links, shortcuts or anything like it: both names represent the same data. Editing one file is the same as editing the other, since they "share" their content. When you remove one, you just remove its name, leaving the other unaffected. When you remove the last one, and there are no more names for the data, the data is "dropped" and the space on the drive is made available again.
Using the -i option of ls, one can have the inode number shown. This number is a unique index that represents the data linked to. Two files with the same inode are actually pointing to the same data, i.e. are two hard links for the same data. You can also use the -l option of cp to create a new hard link to a file, instead of copying its data.
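You can see all of this for yourself in a temp dir:

```shell
#!/bin/sh
set -e
d=$(mktemp -d)
echo "shared content" > "$d/one"

# cp -l creates a second name (hard link) for the same data.
cp -l "$d/one" "$d/two"

# Both names show the same inode number: one piece of data, two names.
ls -i "$d/one" "$d/two"

# Removing one name leaves the data reachable through the other.
rm "$d/one"
cat "$d/two"    # still prints: shared content
```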
And that's what rsync will do with the --link-dest option: create hard links. If a file from the source is found in that additional location (we're not talking same inodes here, of course; rsync uses size, dates and such attributes to determine the equality of files by default, though you can have it use checksums (e.g. MD5) as well), then a new hard link will be created and no data needs to be copied. This not only speeds things up quite a lot, but reduces the amount of space needed to keep multiple backups at once.
And since usually what regularly changes are small files, while the big ones tend to stay the same, this can allow one to keep a few backups for far less space than complete, separate copies would require. Yet each backup folder contains the full backup, and not just a partial/incremental copy. That's the beauty of it.
With that in mind, I decided to make myself a little bash script that would run each night, and update a backup in "day", which would represent the backup at the beginning of the day. It would also, every Sunday night/Monday morning, update the backup "week", representing the backup at the beginning of the week, using the "day" backup as reference (additional location in --link-dest, that is), and then lastly, on the first of each month, the backup "month" would be updated the same way.
So now I have three backups: "day", "week", and "month". Each one is a full backup of my system at a different time. There is about 5 GB of data in the source, and the three backups are using about 7.4 GB altogether, so less than what would be needed for two "full" backups. Pretty cool, isn't it?
In case you're interested, here are the two scripts I made. The first one is of course the one to create backups, the second one will be of use when needing to restore things.
I should mention that I wrote those for me, and poorly hard-coded some things(*), like the destination of the backups: they all go in /backups/ and are called, as we've seen, day, week and month.
(*) I should note that I am actually planning on doing a full rewrite of those scripts, for a couple of reasons, and I'll then try to make things a bit better. Meanwhile, those are easy enough to adapt, should you want to.
The first script is used to create a backup of the data. It supports two modes:
- auto is likely to be used in a cron job, to run automatically every night or something like that. It will always create/update /backups/day; if the day is the first of the month it will also create/update /backups/month (using day as reference, i.e. in --link-dest), and if the day is a Monday it will do the same in /backups/week (still using day as reference).
- manual allows one to create a backup whenever one wants. You can give it a name (i.e. the name of the folder where the backup will be, though it will be a subfolder of /backups/) or, if you don't, it defaults to the current date (e.g. 2011-09-23_15-42). The way it's done, it always uses day as "reference" and therefore cannot be run if it doesn't exist.
It also uses a file with a list of locations to exclude from the backup, /backups/backups.excludes. It is simply fed to rsync using its --exclude-from option. Note that this file is also used in the restore script.
The second script is just there to easily restore a full backup. Usually, one would imagine you will only need to go get a file or two, but should there be a major problem or something, and you want to restore everything, this is for you.
It will simply do two things:
- run rsync with all the required options
- read the /backups/backups.excludes file and, for each line in that file that starts with "- " (dash & space), i.e. each folder that was excluded from the backup, make sure said folder does exist in the restored location, creating it if not.
The point of this is that you'll probably want to exclude things like /proc/ or, of course, /backups/ itself. But it might be required to have those folders present if you want your system to boot (correctly), hence this second step.
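That second step can be sketched like this; again, not the actual script, just the logic it describes, with the function wrapper and variable names being mine:

```shell
#!/bin/sh
# Recreate (empty) the directories that were excluded from the backup.
recreate_excluded() {
    excludes=$1; root=$2
    # Each "- /path/" line names a folder that was skipped during
    # backup; the restored system may need it to exist, so create it.
    grep '^- ' "$excludes" | while read -r dash path; do
        mkdir -p "$root$path"
    done
}

# e.g. recreate_excluded /backups/backups.excludes /mnt/restored
```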
Both scripts (and a backups.excludes file) are available in this BitBucket repository. Or, you can simply download the latest version from this link.
It's all released under GPLv3, and of course bug reports, suggestions or any other form of constructive criticism is very much welcome.