Get off my lawn.

Tuesday, October 31, 2006

viva la rsync

I've had automated backup processes running on my Linux boxen for a few years now. Being a developer with a fondness for automated processes that work hard while I'm asleep, I reasoned that the best way to back up a machine automatically was to write a script that's smart about how it does things, and just let it take care of itself. It worked pretty well, saving my bacon a few times.

My backup routine was thus written as a shell script that would make a list of all the files in each user's home directory, omitting files in directories listed in an "exclusion list" file for each user. Then it would zip all of the files in the list into a zip file named after the user, machine, and the current date. Each night when it ran, it would remove any backups more than 3 days old.
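The routine above can be sketched roughly like this. Everything here is illustrative (the user, paths, and file names are made up), and `tar -czf` stands in for the `zip` step so the sketch runs anywhere:

```shell
set -eu
SANDBOX=$(mktemp -d)
HOME_DIR="$SANDBOX/home/alice"     # stands in for a real home directory
BACKUP_DIR="$SANDBOX/backups"
mkdir -p "$HOME_DIR/docs" "$HOME_DIR/tmp" "$BACKUP_DIR"
echo "keep me" > "$HOME_DIR/docs/notes.txt"
echo "skip me" > "$HOME_DIR/tmp/cache.dat"

# Per-user exclusion list: directories whose contents don't get backed up.
printf '%s\n' "$HOME_DIR/tmp" > "$SANDBOX/exclusions"

# Build the file list, omitting anything under an excluded directory.
find "$HOME_DIR" -type f | grep -v -f "$SANDBOX/exclusions" > "$SANDBOX/filelist"

# Archive the list into a file named after the user, machine, and date.
STAMP=$(date +%Y%m%d)
ARCHIVE="$BACKUP_DIR/alice-$(uname -n)-$STAMP.tar.gz"
tar -czf "$ARCHIVE" -T "$SANDBOX/filelist" 2>/dev/null

# Expire backups more than 3 days old (simulate one for the demo).
OLD="$BACKUP_DIR/alice-$(uname -n)-old.tar.gz"
touch -d '5 days ago' "$OLD"
find "$BACKUP_DIR" -name '*.tar.gz' -mtime +3 -delete
```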

It worked, but it had some problems:

First of all, if I had it back up more than 2GB of data, it would fail. Because the "zip" program uses 32-bit file offsets, it can't address an archive bigger than the max value of a signed 32-bit integer, roughly 2GB. So I was always thinking of ways to prevent the backups from being "too big". Every time I copied a file around, the possibility that it could hose up my backup was something I thought about. Lame!

Second, the process itself was resource-intensive. Creating a gigantic zip file on a machine with 1GB of memory puts everything in RAM into swap. So in the morning, I'd wiggle the mouse to bring up the desktop, and wait... and wait... and... wait... while the machine thrashed around and swapped its consciousness back into RAM. Sometimes, I'd find that the zip program was still running, 3 hours after the backup was scheduled to run. Lame!

I finally got sick enough of these annoyances to do something about it, so I tried out rsync, the "standard" way of backing up Unix boxes. It's purpose-built to copy large numbers of files and sync directories. I used it to create mirrors of the backed-up directories on a removable USB drive. After the initial sync, subsequent rsync runs only back up changed or added files.

Additionally, you can use hidden .rsync-filter files to control how files get backed up. You can omit certain types of files, files with specific names, directory trees, and so on. You can sprinkle these files all over the place and rsync will read them wherever they are and follow the rules they describe.

So now I'm using rsync. I'm backing up over 60GB of files, the process typically takes about 20 seconds per machine, and the resource load is so low it's not even noticeable. If it were a product you could buy, I would recommend buying it, but since it's free, I guess I can't. There's always a downside.

