Over a long weekend recently, I experienced what every computer user fears most: a hard drive crash. For the next few hours, I hoped that it was some sort of operating error and I did not have to worry about restoring the data. I turned out to be wrong and needed a complete restore.
Not to worry, my system administrator said. All you have to do is install your recovery CD and our backup/restore software and, like magic, the system will be restored.
My hopes were high until I got the message "Cannot Restore System Please Reinstall." I called our system administrator at home (even though it was 4:45 a.m.; after all, there's nothing more worthless than a consultant without data), and he said he didn't understand why it didn't work, since he had tested the process a few months earlier as part of the final decision to buy the package. He said he would get back to me ASAP, but in the meantime I could get access to my data via a Web interface.
What do all of these problems with my small company have to do with you? This is not an isolated problem; it happens all the time to all sorts of users, and while you usually can recover your data after some effort, it always takes far more time than expected. That's why it's important to make sure that your backup system works before you need it.
Where We Went Wrong
The problem we found even in my small company is that testing restoration of data is difficult and costly. It is usually done once and then forgotten.
In our case, we were evaluating different backup/restoration options for employees who travel. We did some significant backup and restore testing, but when we installed the final version of the software, we did not test it again. It appears that a simple parameter was not set correctly, so we could not do an automatic restoration. We could get our physical data back, but we could not restore the machine state.
In my case, we kept beating our heads against the wall trying to restore the machine state, but it wasn't going to happen. It took more than two weeks to get answers from the company handling our backup/restore environment. Fortunately, once the new disk drive showed up, I restored my system and my data myself.
So what did I learn from this experience, from both a policy and a professional point of view?
I already knew the following:
- Backups are only as good as your restoration.
- Restores are only as good as the media they are written to.
- You should architect backup from the perspective of restoring the data, not from the perspective of backing it up. Restoration is the requirement.
- Testing backup policy needs to be done after every single change to the backup/restore environment. This means that even changes that seem meaningless need to be tested.
- Very few organizations build into the cost of a backup/restore environment the cost of testing that environment regularly, with or without changes. This is especially true for smaller organizations, because developing a backup/restore environment is already expensive for them.
- Some of the companies that develop backup/restore software and provide off-site support for small and mid-sized businesses have a good sales story and good demos, but how good is the support? Find out as best you can before you need to know; regular testing will help. In our case, the company we dealt with was involved in configuring the software used to back up my system, yet they were not able to figure out the problem for more than two weeks.
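To make the first point concrete, here is a minimal Python sketch of one way to check a test restore: compare checksums of every file in the original directory tree against the restored copy. The function names (`file_digest`, `tree_digests`, `compare_restore`) are my own for illustration and do not come from any backup product. Note the limit of this check: it only tells you whether the file data came back intact, and says nothing about machine state, which is exactly the part that failed in my case.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or different after a restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),   # in source, not restored
        "extra": sorted(set(dst) - set(src)),     # restored, but not in source
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

A restore test passes this particular check only when all three lists come back empty. In practice you would point `source` at a snapshot taken at backup time, since a live system keeps changing after the backup runs.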
While it would be nice to blame vendors for everything, we have to take some responsibility ourselves. So here is a checklist of items to consider for backup/restore environments and why they should be considered:
- Like Environments: In most cases, I have found that people test a few desktops and a laptop or two, but they do not test any operational systems because these systems are generally in use and testing is disruptive. Wrong answer. Go out and buy an extra disk drive or two and test real running systems over a weekend. This will give you a far greater level of confidence in the company and your procedures.
- Testing Changes: If you follow the previous point, you will be able to test like environments and have a level of confidence that the systems work in an operational environment. So if any change is made to that environment from the status quo (and I mean any change at all), it should be re-tested. And this is in addition to regular testing. This means any software updates from the backup/restore vendor, Microsoft patches, Linux patches, antivirus updates, firewall updates: any and every patch. This might lead to a change in site patch policy, but getting your data back is important enough to warrant it.
- Vendor Restoration: A number of SMB packages support off-site backup methods. This is often done via the Internet, but regardless of which of the following methods you use, each method should be tested at least once a year. These are the common SMB methods:
  - Block-based and kept on site so you can restore a whole system block by block;
  - File-based and kept on site so you can restore your important data;
  - Block-based and kept off site so you can restore via the Internet or by contacting the vendor and getting your data on media overnight; and
  - File-based and kept off site so you can restore your important data via the Internet or media.
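The checklist above boils down to a simple rule that is easy to track, even in a small shop. The following Python sketch is one hypothetical way to do it: a restore method is due for testing if it has never been tested, if it was last tested on or before the day of the most recent change to the environment, or if its last test is more than a year old. The `RestoreMethod` class, the `methods_due` function and the one-year default are my own assumptions for illustration, not features of any vendor's package.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RestoreMethod:
    name: str
    last_tested: Optional[date]  # None means the method was never tested


def methods_due(
    methods: List[RestoreMethod],
    last_change: date,
    today: date,
    max_age: timedelta = timedelta(days=365),  # assumed "at least once a year"
) -> List[str]:
    """Return the names of restore methods that need a fresh restore test."""
    due = []
    for m in methods:
        never_tested = m.last_tested is None
        # A test on the same day as the change is treated as stale, because
        # we cannot tell whether it ran before or after the change.
        stale_after_change = not never_tested and m.last_tested <= last_change
        too_old = not never_tested and today - m.last_tested > max_age
        if never_tested or stale_after_change or too_old:
            due.append(m.name)
    return due
```

You would call `methods_due` with one entry for each of the four methods listed above, plus the date of the last patch or configuration change of any kind. Anything it returns goes on the list for the next test weekend.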
I was down for the better part of a week, and for a consultant that can be a lifetime. Imagine you were a tax accountant and you crashed on April 10 and lost a week, or faced some other timing-based business disaster. The restoration process and procedures must be tested no matter what the cost, since the alternative could threaten the survival of your business.
Henry Newman, a regular contributor to EnterpriseStorageForum.com (a sister site to SmallBusinessComputing.com), is an industry consultant with 25 years of experience in high-performance computing and storage.