How can a recovery plan be defined? A recovery plan documents the chronological order in which the necessary tasks have to be carried out on the IT systems in an emergency. In which order are the separate IT systems switched on? When is it necessary to pause so that systems can replicate? Which functions might have to be checked during start-up? Does a script perhaps have to be started manually? Have all the services of a specific IT system come up? This matters because systems that depend on these services must not be started before them. The plan may also have to cover IT systems in a server room or data centre.
Make a note of the start-up process of your IT systems in chronological order. Network switches, storage systems (SAN or NAS) and a directory service will appear at the top of the list, because in most cases the whole environment depends on them. Maybe a management server responsible for the storage system needs to be started even earlier. Who knows this apart from the administrator?
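A start-up order like this can be derived from the dependencies themselves. The following is only a minimal sketch: the system names and their dependencies are hypothetical, and it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to put every system after the systems it needs.

```python
from graphlib import TopologicalSorter

# Hypothetical example: each system maps to the systems that must be
# running before it is started. Replace with your own environment.
dependencies = {
    "network-switches": set(),
    "storage-mgmt-server": {"network-switches"},
    "san-storage": {"storage-mgmt-server"},
    "directory-service": {"network-switches", "san-storage"},
    "erp-database": {"san-storage", "directory-service"},
    "email-server": {"san-storage", "directory-service"},
}

def startup_order(deps):
    """Return an order in which every system follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())

order = startup_order(dependencies)
for step, system in enumerate(order, start=1):
    print(f"{step}. {system}")
```

Systems that do not depend on each other (here the ERP database and the email server) may come out in either order, and a circular dependency raises `graphlib.CycleError`, which is a useful hint that the documented dependencies need another look.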
Classify your IT systems for a structured recovery
Business-critical systems such as ERP databases, email servers such as Lotus Notes or Microsoft Exchange, and security gateways are next. Less critical systems such as management servers, VPN, etc. follow. It doesn’t take an emergency for these plans to be used. They can also be used during maintenance, if only to run through them and verify their content. Depending on the company size, I think it makes sense to have three criticality levels: high, normal, and low. This enables you to divide the plan and structure it more easily. For example, feedback could be sent to the managers as soon as all systems of a level are running again. That won’t mean the work is done, but the relevant managers will at least be reassured.
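To illustrate the idea of reporting per level, here is a small sketch. The assignment of systems to levels is hypothetical (including the invented `test-server`), and `completed_levels` is just an illustrative helper name, not part of any tool.

```python
from collections import defaultdict

LEVELS = ("high", "normal", "low")  # the three criticality levels, in plan order

# Hypothetical classification; your own assignment will differ.
criticality = {
    "erp-database": "high",
    "email-server": "high",
    "security-gateway": "high",
    "management-server": "normal",
    "vpn": "normal",
    "test-server": "low",
}

def completed_levels(criticality, running):
    """Return the levels whose systems are all running again."""
    by_level = defaultdict(set)
    for system, level in criticality.items():
        by_level[level].add(system)
    # A level counts as complete only if it has systems and all of them run.
    return [lvl for lvl in LEVELS if by_level[lvl] and by_level[lvl] <= running]

running = {"erp-database", "email-server", "security-gateway", "vpn"}
print(completed_levels(criticality, running))  # ['high']
```

In this example only the level "high" is reported as complete, because the management server is still down. That is exactly the kind of interim status the managers can be given.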
Make use of maintenance works on building systems
These plans can also become necessary if all devices have to be switched off due to maintenance on the power supply network or HVAC. If there is no system redundancy, this maintenance work will be interesting, to say the least. After all, who is fortunate enough to have 100% redundant HVAC? Checklists should be created for recovery plans. Ideally, each task should include the time it needs, so that you can estimate how long it will take before normal operation resumes. Your bosses will thank you if you can provide a prompt update. Saying “I don’t know” won’t help you. Who else is supposed to know?
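If every task on the checklist carries an estimated duration, the update for your bosses is simple arithmetic. The sketch below assumes the tasks run strictly one after another; the tasks and minutes are made-up example values.

```python
from datetime import datetime, timedelta

# Hypothetical checklist: (task, estimated minutes). The figures are invented.
checklist = [
    ("Switch on network switches", 5),
    ("Start storage management server", 10),
    ("Start SAN/NAS storage", 15),
    ("Start directory service and wait for replication", 20),
    ("Start ERP database", 15),
]

def remaining_minutes(checklist, completed_steps):
    """Sum the estimates of all tasks that have not been completed yet."""
    return sum(minutes for _task, minutes in checklist[completed_steps:])

def expected_resume(checklist, completed_steps, now):
    """Estimate when normal operation resumes (tasks run sequentially)."""
    return now + timedelta(minutes=remaining_minutes(checklist, completed_steps))

now = datetime(2024, 1, 15, 8, 0)  # fixed example time; use datetime.now() in practice
print(remaining_minutes(checklist, 2))     # 50
print(expected_resume(checklist, 2, now))  # 2024-01-15 08:50:00
```

With the first two tasks done, 50 minutes remain, so the answer at 08:00 is "around 08:50" rather than "I don’t know".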
Conversely, you can make use of scheduled test runs on your IT systems to carry out maintenance on your building systems, whether it’s the power supply network, HVAC or cleaning. This is also a good time to carry out maintenance work on server hardware and operating systems. The maintenance window will be longer, but a number of tasks can be carried out at the same time. That way you’ll create a regular schedule for updating and maintaining your systems and technology, which simplifies your change management.
Recovery plans also help if only a single system has to be restored after a failure, or if it has to be restarted due to maintenance work. In such cases, it isn’t necessary to process the whole plan, but only the relevant part for the affected IT system.
Up-to-date plans
It is extremely important that recovery plans are up-to-date, so it’s not enough to revise them once a year. They have to be revised every time a new device is put into operation: that is the right time to adapt and extend the recovery plan, or to shorten it if a device is decommissioned. Otherwise, have fun searching when the recovery plan is actually needed. The durations of the individual tasks have to be estimated at the very least, but it would be even better to measure them during one complete run of the IT systems’ start-up process.
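One simple way to notice an outdated plan is to compare it with the current inventory. This is only a sketch with hypothetical system names; `plan_drift` is an illustrative helper, and where the two lists come from (an export, a spreadsheet, an inventory tool) is left open.

```python
def plan_drift(plan_systems, inventory_systems):
    """Compare the systems in the recovery plan with the current inventory."""
    plan, inventory = set(plan_systems), set(inventory_systems)
    return {
        # In operation, but not yet covered by the recovery plan.
        "missing_from_plan": sorted(inventory - plan),
        # Still in the plan, but no longer in operation.
        "obsolete_in_plan": sorted(plan - inventory),
    }

# Hypothetical example data.
plan_systems = ["network-switches", "san-storage", "directory-service", "old-file-server"]
inventory_systems = ["network-switches", "san-storage", "directory-service", "new-backup-server"]

drift = plan_drift(plan_systems, inventory_systems)
print(drift)
```

Both lists should be empty after every change. Anything that shows up here is a task that someone would otherwise have to figure out in the middle of a recovery.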
Once a year, or every other year at the latest, the whole recovery plan should be tested as part of a practical exercise. The steps for system and data recovery don’t have to be part of this exercise. They should be tested separately during the year, using suitable test systems.
It’s important to document the results of all the recovery plan tasks that have been carried out, so the plans can be optimised. This means setting up a feedback loop that improves the tasks and their estimated durations, which in turn starts a continuous improvement process (CIP). The test runs also provide insights into how the network as a whole is configured, so that potential improvements can be made.
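As a sketch of such a feedback loop, the measured durations from test runs can be fed back into the estimates. The task names and figures are hypothetical, and taking the average rounded up to the next full minute is just one possible rule, chosen here to stay on the cautious side.

```python
from math import ceil
from statistics import mean

def revised_estimates(estimates, runs):
    """Replace each estimate with the rounded-up mean of the measured durations.

    Tasks that were never measured keep their original estimate.
    """
    revised = {}
    for task, estimate in estimates.items():
        measured = [run[task] for run in runs if task in run]
        revised[task] = ceil(mean(measured)) if measured else estimate
    return revised

# Hypothetical estimates and documented test runs, all in minutes.
estimates = {
    "Start SAN/NAS storage": 15,
    "Start directory service": 20,
    "Start ERP database": 15,
}
runs = [
    {"Start SAN/NAS storage": 12, "Start directory service": 25},
    {"Start SAN/NAS storage": 14, "Start directory service": 30},
]

print(revised_estimates(estimates, runs))
```

Here the storage estimate drops to 13 minutes, the directory service rises to 28 minutes, and the ERP database keeps its 15 minutes because it has not been measured yet.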
Support provided by Docusnap
The premium documentation suite Docusnap helps you to create a recovery plan. A template can be found under IT Concepts. It enables you to create an electronic recovery plan within Docusnap, enrich the plan with graphics and inventory data from your IT systems, and generate reports from it on an ongoing basis.
Don’t forget where these plans are saved or where the print-outs can be found. In case of an emergency you won’t have time to search for them. You could also synchronise the plans with a tablet.