So here's evidently what happened:
Some time around February 27, Gmail was affected (mid-upgrade) by a bug that effectively deleted the mail data associated with about 40,000 email accounts. Google maintains multiple copies of users' data, but this bug affected all the available copies of the data for these users. Google had the foresight to back up its data rather than relying on data replication as its sole protection against data loss, but that backup data resides on tape, which clearly takes time to restore.
To get an idea of how much time, you need an idea of the scale of the data loss. If each of those users had 5GB of data in their mailboxes, the restore operation involves about 200TB of data - not unmanageable, but clearly something that would take on the order of days to weeks to restore unless something really very cool is used. One of the interesting aspects of the restore process is that users report having no access to the email service while their data is being restored. By contrast, an Exchange administrator would have the ability to spin up some dial tone databases and use something like recover-mailbox, recovery storage groups or a more robust tool like Ontrack PowerControls to merge data from the backup sets back into the dial tone databases.
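To put rough numbers on "days to weeks," here's a back-of-the-envelope sketch. The 40,000 accounts and 5GB per mailbox come from the estimate above; the per-drive throughput of 100 MB/s and the drive counts are purely my assumptions for illustration, not anything Google has published.

```python
def restore_volume_tb(accounts: int, gb_per_account: float) -> float:
    """Total data to restore, in decimal terabytes (1 TB = 1000 GB)."""
    return accounts * gb_per_account / 1000


def restore_days(total_tb: float, drives: int, mb_per_sec_per_drive: float) -> float:
    """Days needed to stream total_tb back, assuming every drive sustains
    the given rate in parallel with no seek, mount, or verification overhead."""
    total_mb = total_tb * 1_000_000
    seconds = total_mb / (drives * mb_per_sec_per_drive)
    return seconds / 86_400


if __name__ == "__main__":
    volume = restore_volume_tb(accounts=40_000, gb_per_account=5)
    print(f"Data to restore: {volume:.0f} TB")
    # ASSUMPTION: 100 MB/s sustained per tape drive; real rates vary widely.
    for drives in (1, 4, 16):
        days = restore_days(volume, drives, mb_per_sec_per_drive=100)
        print(f"{drives:>2} drive(s): {days:.1f} days")
```

With these assumed numbers, a single stream works out to roughly 23 days and sixteen parallel streams to about a day and a half, which lines up with the "days to weeks" estimate.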
Now, this is not a schadenfreude post. I have a lot of respect for Google, especially around how they've transformed messaging and provided consumers with a very viable and attractive alternative to what was a pretty miserable corner of IT when Gmail was first introduced in 2004 (my, how time flies, huh?). They've delivered a remarkably reliable infrastructure for a massive number of users at an incredible price.
However, as a technologist, I'd like to look at what happened, as well as the users' reactions, to get an idea of how I can architect messaging systems so that when stuff inevitably hits the fan, the impact is minimized.
- Avoid backups at your own risk. It's tempting, especially with three, four, or five copies, to think to yourself "Well, how many copies do I need before I don't need to back up?" The fact is that all of those copies are in a single failure domain. As my friend and colleague Jim Cordes says, "This will work, up to the point where it won't." In this case, a storage bug (likely associated with the Google File System) created data loss. But it could as easily have been an application bug, administrator error, or a security breach. In this case, the data also resides outside of the failure domain (on tape). It's generally advisable that critical data be available outside the context of the application.
- Users don't just care about service availability - they care about their data too. Many people live in their email accounts. Whether we like it or not, their email account is where they keep their most important data. So make sure your SLAs (either internal or with a service provider) cover data availability and not just service availability.
- Users don't just care about service and data availability - they care about metadata too. Complaints about the loss of starred emails and labels abound. This shouldn't surprise us. If people live in their email accounts, then they'll organize them. Think of it like a filing cabinet. If your "backup" of your filing cabinet entails copying everything and putting it all in a fireproof canister, then "restoring" those files to a usable state where you can actually find something is going to be a problem. This could be a problem for folks who use a compliance archive as a last-ditch resort for data restore.
- Set distinct SLAs for service and data/metadata availability. "Distinct" doesn't mean "different" in this context. With a robust email solution you can get service availability up quickly and cheaply after a disaster. Getting the actual data back is the longer pole in the tent, and where the bulk of investment is required. If you tier your data through the use of archives, cost can be mitigated by assigning different SLAs for the active and archive data.
- Make sure your backup/restore solution meets SLAs for data/metadata availability. As we've seen, even the best-run organizations running top-notch software can experience data loss, even when multiple copies of the data are deployed. If the copy of the data outside of the application failure domain takes 100 hours to restore, then that's the SLA you can sign. If a disaster requires that you restore many terabytes of data and you have a data availability SLA of under a day, then it's advisable to look at hardware-based snapshots or bookmarks as a solution.
- Fast backups ≠ Fast restores. There are many solutions out there that help people meet aggressive backup windows (incremental forever with synthetic fulls are widely available, for example). Administrators and managers are well-advised to examine the restore speeds of these solutions. Basically, if the solution calls for the full dataset to be moved from one place to another in order for it to be used, then you need to examine the restore speed.
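The last two points boil down to a sanity check you can run before signing anything: does the time to move the full dataset back fit inside the data availability SLA? Here's a minimal sketch of that check. The `RestorePlan` class, its field names, and the example figures are all hypothetical, made up for illustration; plug in restore rates you've actually measured rather than the ones on a data sheet.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    """Hypothetical description of a restore path that must move the full dataset."""
    name: str
    dataset_tb: float                  # decimal terabytes to bring back
    throughput_mb_per_sec: float       # sustained restore rate (measure it!)
    fixed_overhead_hours: float = 0.0  # e.g. tape recall, catalog rebuild

    def restore_hours(self) -> float:
        transfer_seconds = self.dataset_tb * 1_000_000 / self.throughput_mb_per_sec
        return self.fixed_overhead_hours + transfer_seconds / 3600


def meets_sla(plan: RestorePlan, sla_hours: float) -> bool:
    """True if the full restore finishes within the data availability SLA."""
    return plan.restore_hours() <= sla_hours


def required_throughput_mb_per_sec(dataset_tb: float, sla_hours: float,
                                   fixed_overhead_hours: float = 0.0) -> float:
    """Sustained rate needed to restore dataset_tb inside sla_hours."""
    transfer_hours = sla_hours - fixed_overhead_hours
    if transfer_hours <= 0:
        raise ValueError("fixed overhead alone already exceeds the SLA")
    return dataset_tb * 1_000_000 / (transfer_hours * 3600)


if __name__ == "__main__":
    # ASSUMPTION: example figures only, not taken from any real product.
    plan = RestorePlan(name="tape", dataset_tb=10, throughput_mb_per_sec=100)
    print(f"{plan.name}: {plan.restore_hours():.1f} h, "
          f"meets 24 h SLA: {meets_sla(plan, 24)}")
    print(f"needed for 24 h: {required_throughput_mb_per_sec(10, 24):.0f} MB/s")
```

With these made-up figures, a 10TB restore at 100 MB/s takes a little under 28 hours and misses a 24-hour SLA; you'd need roughly 116 MB/s sustained, or a solution that doesn't have to move the full dataset at all.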