Creating recovery plans? Sounds like a tremendous task. Well, it is one, really. Creating a recovery plan for an entire data centre is no piece of cake. In the back of your mind, you hope that these plans will rarely be needed, ideally never. This is what makes recovery plans, like contingency plans in general, so annoying: you put a great deal of time and knowledge into the documentation, yet you would prefer that it never be used. That’s not exactly motivating.
However, there is no way around it: recovery plans are essential to your business. It may help to break a recovery plan down into smaller steps so that it is easier to handle. Let’s take a look at a single IT system, such as a server. Here, too, it helps if the person responsible for the system documents how to start it and how to restore it. And this plan is likely to be required more frequently. Maybe this could raise your motivation to create one?
Recovery plans not only for IT emergencies
It is not uncommon for a single system to need a restart, especially if it runs a Windows OS. Thanks to the monthly Patch Tuesday, admins have to get themselves into gear and restart their systems at regular intervals. So why not check and apply your recovery plan on these occasions? There is no better way to prepare for an emergency. This means that a recovery plan can also be used to improve quality during normal IT operation. It really does not matter whether you only have to restart your system after an update or must restore it after a failure. Many of the tasks involved are the same. Over time, less complex maintenance tasks can be handed over to other members of the IT staff, making it easier for you to go on holiday. Your substitutes can then simply take the recovery plan and go through the required steps, ensuring they do not forget anything.
For this reason, I would like to distinguish two different types of recovery plan. They share many features and require many of the same tasks. However, their scope and the probability that they will be used differ – hopefully!
- Restart during normal operation
- Restart after an emergency
Who can promise that a system will restart properly after an update? Don’t worry, you will need your recovery plan sooner than you think. After all, you shouldn’t get bored. Hard disks may crash, or the power supply of a switch (which, unfortunately, has not been set up redundantly) might fail after years of trouble-free operation. A configuration change made some time ago suddenly launches a new “non-availability” function. Your colleagues and bosses will be delighted by the new record achieved in the failure statistics.
Additional steps in an emergency situation
The basic plans for both types can be identical. For the emergency plan, however, you need to add system rebuilding and data restore tasks. These can be skipped during normal operation. The final checks after a successful system start are the same in both cases: check whether all services are up and running, whether there are new entries in the event log, and whether dependent systems have access again. All this must be checked, or at least, it should be checked. In practice, these checks are performed rather rarely. The Principle of Hope dominates, or the belief that users will report malfunctions anyway.
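The relationship between the two plan types can be sketched in a few lines of Python. This is only an illustration, not part of any product: the `Step` class, the `BASE_PLAN` list and the `build_plan` function are hypothetical names, and the step wording is taken from the description above. The idea is that both plans come from one shared list, with the rebuild and restore tasks marked as emergency-only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry in a recovery plan (hypothetical structure)."""
    name: str
    emergency_only: bool = False


# Assumed step list, based on the tasks named in the text.
BASE_PLAN = [
    Step("Rebuild the system", emergency_only=True),
    Step("Restore the data", emergency_only=True),
    Step("Start the system"),
    Step("Check that all services are up and running"),
    Step("Review new entries in the event log"),
    Step("Confirm that dependent systems have access again"),
]


def build_plan(emergency: bool) -> list[str]:
    """Return the step names for a normal restart or an emergency restart."""
    return [
        step.name
        for step in BASE_PLAN
        if emergency or not step.emergency_only
    ]


if __name__ == "__main__":
    # Print the emergency plan as a numbered checklist.
    for number, name in enumerate(build_plan(emergency=True), start=1):
        print(f"{number}. [ ] {name}")
```

Keeping a single list means that a check added after the next Patch Tuesday restart automatically becomes part of the emergency plan as well.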
The document for an emergency restart is a plan that, hopefully, will never be needed. But adequate preparation and test runs during contingency planning can help you look at such a scenario with less horror. At the very least, you will find the plan faster because you have accessed it more often. Just think of it: many managers do not even know where these plans are stored. This is another benefit, as all members of the IT staff will be familiar with the storage location of these recovery plans.
If you combine all these plans and synchronise them, you are approaching the goal of creating a comprehensive recovery plan for your server room or data centre. Of course, some more tasks are required for this purpose. Take power supply and HVAC into account and make sure that they will operate flawlessly. Other people or technical crews may need to be involved. In the event of an emergency, a replacement room at another site might have to be equipped and put into operation.
Rely on your IT documentation in emergencies
You will be off the hook if you can access up-to-date IT documentation at any time. Thanks to the Docusnap documentation suite, you can always rely on current configuration data, provided that you have set up all required actions beforehand so that your data is constantly updated. Store your recovery plans with the Docusnap IT Concepts module and create graphical dependency overviews in the IT Relations module. This makes access easier for everyone involved.
Make life as easy as you can – in an emergency situation, things will be hard enough to handle anyway.