Solving Consumer Backups

Since 2006 I’ve been managing my own network-attached backup solution for the family. We wanted a way to have all our data in a single accessible location instead of writing multiple DVDs. This is both because accessing old data is useful and because you never know when an old backup DVD will deteriorate or stop being readable in your current hardware. For this and other uses, we ended up building a basic rack-mount server with a small UPS, connected to a fairly slow residential ADSL connection.

Over the last 4 years, between my dad and me taking photos and all of us backing up random files, we filled up the initial 1TB RAID5 array. There was (and still is) some data cleaning to do, but I spent ~400€ on four new 1.5TB disks and upgraded the array to 4.5TB. It’s working nicely again.

(Before someone complains, this uses RAID only for redundancy in the underlying drives; on top of that it is an actual backup solution. I use rsync’s --link-dest feature to create versioned backup directories that use hard links for the unchanged files. rsnapshot is another implementation of the same concept. If I erroneously delete or corrupt a file, the next backup will have the corruption but won’t overwrite the previous backup.)
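To make the hard-link idea concrete, here is a minimal Python sketch of the concept. It is not how rsync is implemented: the function name `snapshot` and its parameters are made up for illustration, and it compares file contents byte by byte, whereas rsync by default decides based on size and modification time.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, dest: Path, previous: Optional[Path] = None) -> None:
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`.

    Hypothetical helper illustrating the idea behind --link-dest.
    """
    dest.mkdir(parents=True, exist_ok=True)
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(src_file, old, shallow=False):
            # Unchanged file: the new snapshot shares the data with the old one.
            os.link(old, target)
        else:
            # New or changed file: store a fresh copy, leaving older snapshots intact.
            shutil.copy2(src_file, target)
```

Every snapshot directory looks like a full copy, but an unchanged file only occupies disk space once. With rsync itself the equivalent is roughly `rsync -a --link-dest=<previous snapshot> <source> <new snapshot>`.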

After I had the single point of consolidation set up, I looked for a way to back up the contents of that system to have a second layer of security. I even bought two different LTO tape drives off eBay that never worked. I then looked at cloud solutions like tarsnap, which would be perfect if not for the 3600$/TB a year in storage cost alone. For that kind of money I would just build a second server, host it on a second residential ADSL connection and run a daily rsync between the two. These systems are all built atop something like Amazon’s S3, which can handle far more transactional performance than backups need.
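To put that price in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes the cost scales linearly with the amount stored and ignores bandwidth charges; the constant and function names are just for illustration.

```python
USD_PER_TB_YEAR = 3600  # storage-only figure quoted above
ARRAY_TB = 4.5          # capacity of the upgraded array


def annual_storage_cost(tb_stored: float, usd_per_tb_year: float = USD_PER_TB_YEAR) -> float:
    """Yearly storage cost in dollars, assuming simple linear pricing."""
    return tb_stored * usd_per_tb_year


print(annual_storage_cost(1.0))       # 3600.0, the original 1TB array when full
print(annual_storage_cost(ARRAY_TB))  # 16200.0, if the new array ever fills up
```

Compare that yearly bill with the ~400€ one-off I paid for the four new disks.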

While thinking about a Hacker News post about fixing your backups I came up with a possible solution that could make for an interesting startup:

  • Create a simple NAS-like RAID box filled with normal SATA drives, like the Drobo ones, to function as the consumer-friendly version of my rack-box.
  • Attach the box to the customer’s Wi-Fi and run a Dropbox-like local file-share service so that every computer’s hard drive is just a cache for the larger array.
  • Continuously sync the contents of the array to an equivalent RAID set of inexpensive SATA drives in the cloud. The drives could be in the same NAS hardware the customer has or packed more densely like the Backblaze guys did.

That’s basically it. If one of the user’s computers loses a drive he restores automatically from the NAS. If the NAS loses a drive a replacement with the data in it gets shipped to replace it. This is better than cloud-based backups because the local NAS adds a level of fast cache that allows you to treat all your machines as different views over the same total data. It is better than using Time Machine/Drobo because you get seamless off-site backups.
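The continuous sync between the NAS and its cloud twin could start from something as simple as comparing file manifests. The sketch below is only one possible approach under my own assumptions: `manifest` and `plan_sync` are hypothetical names, files are identified by a SHA-256 digest of their contents, and actually moving the data over a slow ADSL uplink is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to a SHA-256 digest of its contents."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def plan_sync(local: Dict[str, str], remote: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to upload, paths to delete remotely) so that remote mirrors local."""
    # Upload anything the remote side lacks or holds with different contents.
    upload = sorted(p for p, digest in local.items() if remote.get(p) != digest)
    # Remove anything that no longer exists on the local array.
    delete = sorted(p for p in remote if p not in local)
    return upload, delete
```

Running `plan_sync(manifest(nas_root), remote_manifest)` on a schedule would give the list of files to push. Since the array itself holds versioned snapshots, mirroring it keeps the old versions available off-site too.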

These days even amateur photographers with a basic DSLR can easily generate tens of gigabytes of data in a day, especially now with HD video. If they have backups at all, they’re usually on a USB drive kept close to the PC. It’s common to have a laptop to lug around and a fixed setup with a better monitor for serious work. Using both to process images is a pain. If broadband were much faster, all storage could just move to the cloud and every computer could be a thin client to it. Until then, a local cache makes sense. Here’s a user report of a successful use of a Drobo:

Last Sunday, I lost the hard drive that contained the upcoming book - everything - images, portraits, texts, even the list of photographers. Fortunately I use a combination of Time Machine and a redundant array Drobo 4X 1 terabyte backup system. Although it took 24 hours to restore all 500 gig. from the lost drive, it came through perfectly. This morning, my fourth drive on the Drobo gave up the ghost, but since the backup is redundant, I simply pulled out the drive and plugged in a spare one

This is close to the functionality I described but doesn’t give you multiple-machine support or offsite backups. Backblaze and Dropbox have pieces of this already implemented.

Build it and I will buy it.