Precious data, not to be lost

One of the first things I did, after having installed Arch, was to think about a backup solution. Well, in fact I had been thinking about that before I even knew which distro I would use, and I had decided to set up a mirroring of my "main" drive, containing both the system (/ and its friends) and my data (/home), which actually are on the same partition (for now?).

Logically, what I did was play with mdadm, and started setting up a RAID1 array. This went well, mdadm making things very easy, and since this was all done in a VM I could, without a problem, make a disk disappear, bring in a new one and see how rebuilding the array would work (in brief: couldn't be easier, and it was pretty fast, too).

Side note: it's one thing I'm liking more and more each day with Linux: everything seems pretty easy to do (it's all relative, of course), and doable simply from the command line. Now, I'll admit that I'm more of a GUI guy myself, but I can still appreciate how a couple of commands allow one to create/format a partition, resize it, move it, create an image of it, or set up encryption or pretty much any kind of RAID array of one's choice. And all that (once you start to understand how it works) with an obvious simplicity. And while this is a Linux thing in general, it feels even more true in Arch. Anyways...

But then I realized I had made a mistake, and a mirror wasn't what I was looking for. Because while this was easy enough to set up, and had the other great advantage of being a "set & forget" solution, it didn't actually provide what I needed. A mirror means that if one drive dies, you can still go on as if nothing happened: get a new drive, do the replacement and keep going without any trouble or data loss.

It's not nothing, but I wanted something else. I wanted to be able to stare at my screen and go, "Alright, I screwed things up nicely. Now let's restore things to a working situation again." I wanted to be able to realize that, while I meant mv, what I actually typed was rm, for some odd reason, without having to panic or cry.

In other words, I wanted actual backups of my data, and not just a(n always up-to-date) mirror. Backups that could be old (not too old, but definitely not an instant mirror). So I said goodbye to mdadm, and simply settled for a little bash script using rsync, that would run automatically at night as a cron job.

Multiple backups, with just one copy of the data

rsync is a great and powerful tool, one that allows you to create a whole lot of backup solutions, including over ssh and many more things I don't need. One thing it does, though, that will be of use to me, is that it can basically create different backups of your data with only one full copy of the data, plus whatever changed (i.e. new/modified files).

I'm not talking about its algorithm that only transfers the parts of a file that changed, to minimize the amount of data to transfer and speed things up, but about how the data are stored. The idea is that you provide rsync with one (or more) additional location(s), besides the usual source & destination. When a file is missing in the destination, before copying it from the source, rsync will check those additional locations. Then, it can do different things:

  • using --compare-dest it will simply skip those files; thus only new/modified files are backed up.
  • using --copy-dest it will do a local copy of the files; this doesn't actually save any space.
  • using --link-dest it will create hard links to the files; this is the magic we want.

It's that third option that interests me, because it means you can end up with e.g. three backups, each folder containing a full backup of your drive/data at a given time, but you don't actually need three times the space for it. Only once, plus what's changed.

The magic of hard links

Quickly, for those not familiar with hard links: basically, when you have a file stored on your drive, there are two things: the actual data (the content of the file) and its name. Usually there's only one name per content, but you can actually have more. For example, /home/user/foo.log and /var/log/foo could be the same file.

There are no links, shortcuts or anything like it: both names represent the same data. Editing one file is the same as editing the other, since they "share" their content. When you remove one, you just remove its name, leaving the other unaffected. When you remove the last one, and there are no more names for the data, the data is "dropped" and the space on the drive is made available again.

Using the -i option of ls, one can have the inode number shown. This number is a unique index that represents the data linked to. Two files with the same inode are actually pointing to the same data, i.e. are two hard links to the same data. You can also use the -l option of cp to create a new hard link to a file, instead of copying its data.
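A quick way to see all of this in action, in a scratch directory (the file names are just examples):

```shell
#!/bin/bash
# Demonstrate hard links: several names, one set of data
tmp=$(mktemp -d)
echo "some content" > "$tmp/foo.log"

ln "$tmp/foo.log" "$tmp/foo"       # a second name for the same data
cp -l "$tmp/foo.log" "$tmp/copy"   # cp -l links instead of copying

ls -i "$tmp"                       # all three names show the same inode

rm "$tmp/foo.log"                  # removes one name only...
cat "$tmp/foo"                     # ...the data is still there
```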

And that's what rsync will do with the --link-dest option: create hard links. If the same file from the source is found in that additional location (we're not talking same inodes here, of course; by default rsync uses size, dates and similar attributes to determine the equality of files, though you can have it use checksums (e.g. MD5) as well), then a new hard link will be created and no data needs to be copied. This not only speeds things up quite a lot, but reduces the amount of space needed to keep multiple backups at once.

And since what changes regularly is usually small files, while the big ones tend to stay the same, this can allow one to keep a few backups for far less space than complete, separate copies would require. Yet each backup folder contains a full backup, and not just a partial/incremental copy. That's the beauty of it.

With that in mind, I decided to make myself a little bash script that would run each night, and update a backup in "day", which would represent the backup at the beginning of the day. It would also, every Sunday night/Monday morning, update the backup "week", representing the backup at the beginning of the week, using the "day" backup as reference (as additional location in --link-dest, that is), and then lastly, on the first of each month, the backup "month" would be updated the same way.

So now I have three backups: "day", "week", and "month". Each one is a full backup of my system at a different time. There is about 5 GB of data in the source, and the three backups are using about 7.4 GB altogether, so less than what would be needed for two "full" backups. Pretty cool, isn't it?
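The rotation described above boils down to one small decision each night: which backups to refresh. A sketch of that decision (the function is mine, not from the actual scripts):

```shell
#!/bin/bash
# Which backups should a nightly run refresh? "day" always; "month" on
# the first of the month; "week" on Mondays -- the latter two using
# "day" as the --link-dest reference. The two arguments mirror what
# date +%d (day of month) and date +%u (1 = Monday) print.
rotate() {
    local dom=$1 dow=$2
    echo day
    [ "$dom" = "01" ] && echo month
    [ "$dow" = "1" ] && echo week
    true  # don't let a failed date test become the function's exit status
}

# a cron job would use: rotate "$(date +%d)" "$(date +%u)"
```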

Scripting time

In case you're interested, here are the two scripts I made. The first one is of course the one to create backups, the second one will be of use when you need to restore things.

I should mention that I wrote those for me, and poorly hard-coded some things(*), like the destinations of the backups: they all go in /backups/ and are called, as we've seen, day, week and month.

(*) I should note that I am actually planning on doing a full rewrite of those scripts, for a couple of reasons, and I'll then try to make things a bit better. Meanwhile, those are easy enough to adapt, should you want to.

The first script is used to create a backup of the data. It supports two modes: auto and manual.

The auto mode is likely to be used in a cron job, to run automatically every night or something like that. It will always create/update /backups/day; then, if the day is the first of the month, it will also create/update /backups/month (using day as reference, i.e. in --link-dest), and if the day is a Monday it will do the same in /backups/week (still using day as reference).

The manual mode allows one to create a backup whenever one wants. You can give it a name (i.e. the name of the folder where the backup will be, though it will be a subfolder of /backups/) or, if you don't, it defaults to the current date (e.g. 2011-09-23_15-42). The way it's done, it always uses day as "reference" and therefore cannot be run if day doesn't exist.

It also uses a file with a list of locations to exclude from the backup, /backups/backups.excludes. It is simply passed to rsync using its --exclude-from option. Note that this file is also used in the restore script.
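The article doesn't show that file's contents, but given that its lines start with "- " (dash & space), it presumably looks something like the following; the actual entries here are my guesses, and the file is written to a scratch location so the example is self-contained:

```shell
#!/bin/bash
# Hypothetical reconstruction of backups.excludes; the real one lives
# at /backups/backups.excludes
excludes=$(mktemp)
cat > "$excludes" <<'EOF'
- /dev/
- /proc/
- /mnt/
- /media/
- /backups/
EOF

# the backup script would then pass it along as:
#   rsync -a --delete --exclude-from="$excludes" ...
```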

The second script is just there to easily restore a full backup. Usually, one would imagine, you will only need to go get a file or two, but should there be a major problem or something, and you want to restore everything, this is for you.

It will simply do two things:

  • start rsync with all the required options
  • read the /backups/backups.excludes file and, for each line in that file that starts with "- " (dash & space), i.e. each folder that was excluded from the backup, make sure said folder exists in the restored location, creating it if not.

The point of this is that you'll probably want to exclude things like /dev/, /media/, /mnt/, /proc/ or, of course, /backups/ itself. But those folders might be required for your system to boot (correctly), hence this second step.
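That second step can be sketched as a small loop; the function name and the exact parsing are my assumption of what the script does:

```shell
#!/bin/bash
# Recreate folders that were excluded from the backup, so the restored
# system can boot. Usage: restore_excluded_dirs EXCLUDES_FILE TARGET_ROOT
restore_excluded_dirs() {
    local line
    while IFS= read -r line; do
        case $line in
            "- "*) mkdir -p "$2${line#- }" ;;  # strip "- ", recreate folder
        esac
    done < "$1"
}

# e.g.: restore_excluded_dirs /backups/backups.excludes /mnt/restored
```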


Both scripts (and a backups.excludes) are available on this BitBucket repository.

Or, you can simply download the latest version from this link.

It's all released under GPLv3, and of course bug reports, suggestions or any other form of constructive criticism are very much welcome.
