Preventing 3 Common Disaster Recovery Situations
Do you believe in the old adage “prevention is better than cure”? Consider these prevention and recovery methods for three common disaster recovery situations.
The success of your disaster recovery strategy is often judged by whether it is actually implemented. Although no prevention method is 100% foolproof, avoiding risk and preparing proactively are essential elements of the disaster recovery process.
On the flip side, despite all the measures we take to avoid a disaster, we must assume that a disaster will happen. Having this mindset will help shape our decisions when it comes to planning for IT disaster recovery scenarios.
Here we look at prevention and recovery methods for three such common scenarios.
Scenario 1: The Operating System Becomes Corrupt
This scenario considers a situation where the operating system (OS) of one of your servers becomes corrupt, but the underlying data is still there. This might be caused by a failed Windows update, malware, or a non-graceful shutdown.
How do I prevent this?
Always test Windows updates and new software before deployment, ideally on a test system that runs the same image as the original server. Antivirus applications should also be kept up to date on every node in your network.
Be sure to restrict access to the server room to prevent malicious or accidental modifications to the physical aspects of the servers. Have an uninterruptible power supply (UPS) in place to cater for power outages and handle graceful shutdowns.
How do I recover from this?
- Use a standby image. The idea here is to restore from a snapshot of your system prior to when the OS became corrupt.
- Perform a bare-metal restore. Once the OS is back up and running, use a backup application that supports delta recovery to quickly bring the underlying data back online.
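Delta recovery is fast because it restores only the files that are missing or changed rather than everything. The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular backup product works: the function names (file_digest, build_manifest, files_to_restore) are hypothetical, and a "manifest" here is simply a mapping of relative file paths to SHA-256 digests.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_to_restore(backup: dict[str, str], live: dict[str, str]) -> list[str]:
    """List files that are missing from the live system or differ from the backup."""
    return sorted(
        name for name, digest in backup.items() if live.get(name) != digest
    )
```

Comparing a manifest taken at backup time with one built from the rebuilt server gives the short list of files that actually need to be copied back.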
Scenario 2: The Hard Drive Dies
This scenario considers a situation where the hard drive of one of your servers suddenly fails. This might be caused by a broken RAID set, overheating, or a mechanical fault.
How do I prevent this?
The occasional hardware fault is accepted as being part and parcel of the manufacturing of modern IT equipment. However, there are things we can do to limit the impact of a failed drive. These include having the correct RAID configuration in place and/or a replication mechanism with auto-failover. Keeping IT equipment cool (e.g. using air conditioning) is also vital: equipment that operates in a sub-optimal physical environment is at greater risk of overheating.
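What counts as the "correct" RAID configuration depends on how much capacity you are willing to trade for fault tolerance. The sketch below works through that arithmetic for the standard RAID levels. It assumes identical drives and two-way mirrors for RAID 10, and the function name raid_summary is hypothetical. Real controllers reserve some space for metadata, so treat the capacities as upper bounds.

```python
def raid_summary(level: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (usable capacity in TB, drive failures always survived)."""
    if level == "0":      # striping only, no redundancy
        minimum, usable, tolerated = 2, disks, 0
    elif level == "1":    # every drive holds a full copy
        minimum, usable, tolerated = 2, 1, disks - 1
    elif level == "5":    # one drive's worth of parity
        minimum, usable, tolerated = 3, disks - 1, 1
    elif level == "6":    # two drives' worth of parity
        minimum, usable, tolerated = 4, disks - 2, 2
    elif level == "10":   # striped two-way mirrors (assumption)
        minimum, usable, tolerated = 4, disks // 2, 1
    else:
        raise ValueError(f"unsupported RAID level: {level}")

    if disks < minimum or (level == "10" and disks % 2 != 0):
        raise ValueError(f"RAID {level} cannot be built from {disks} drives")
    return usable * disk_tb, tolerated


# Example: compare options for four 2 TB drives.
for lvl in ("0", "5", "6", "10"):
    capacity, failures = raid_summary(lvl, 4, 2.0)
    print(f"RAID {lvl}: {capacity} TB usable, survives {failures} failed drive(s)")
```

Note that RAID 10 can survive more than one failure if the failed drives sit in different mirror pairs. The figure returned here is only the number of failures that is always survivable.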
How do I recover from this?
- Use a backup solution that has continuous recovery and preconfigure a virtual machine (VM) to be on standby. When the production machine goes down, all you need to do is press play on the preconfigured VM.
- Perform a bare-metal restore from local storage to new hardware.
- Recover from the cloud using your managed service provider’s (MSP’s) infrastructure, Azure, or Amazon Web Services (AWS).
- Restore from a mountable virtual hard disk (VHD). If you previously created a standby image of your computer, you can quickly open it in your virtual environment using Hyper-V or VirtualBox.
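Whether a standby VM is started by hand or by an auto-failover mechanism, something has to decide that the production machine is really down and not just slow for a moment. A minimal sketch of one common rule, assuming periodic health checks recorded as True (healthy) or False (failed): fail over only after several consecutive failures. The function name should_fail_over and the default threshold of 3 are illustrative assumptions, not settings from any specific product.

```python
def should_fail_over(checks: list[bool], threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` health checks all failed."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = checks[-threshold:]
    # Require a full window of results so a short history never triggers failover.
    return len(recent) == threshold and not any(recent)


# Example: one healthy check followed by three failures in a row.
history = [True, False, False, False]
if should_fail_over(history):
    print("Production looks down: start the standby VM")
```

A higher threshold reduces false alarms from brief network blips, at the cost of a slightly longer outage before the standby takes over.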
Scenario 3: The Roof Caves In
This scenario considers a situation where physical damage occurs to the building or room where your data resides. This is usually caused by a natural or environmental disaster. Hurricanes, tornadoes, floods, and excessive snowfall are just some examples.
How do I prevent this?
This is probably the toughest scenario to prevent. ‘Acts of God’ (as defined by insurance companies) are unpredictable and can happen without any forewarning. The only surefire way to lessen the impact of a physical disaster is to have a replicated copy of the data and IT environment in a completely separate geographical location (e.g. in the cloud, at another regional office, or at your MSP).
How do I recover from this?
- Recover from the cloud using your MSP’s infrastructure, Azure, or AWS.
- Use a backup solution that has continuous recovery and preconfigure a virtual machine in a remote location to be on standby. When the production machines go down, all you need to do is press play on the preconfigured VM.
Conclusion
We are fans of the old adage “prevention is better than cure.” This holds true for disaster recovery scenarios as well. Being proactive is a better approach than being reactive. However, when we are dealing with unpredictable situations beyond our control, we also need to ensure we have the people, processes, and methods in place to react as quickly as possible and help bring things back to normal. Furthermore, if the COVID-19 shutdown taught you anything, it is that your business continuity plan should also cover how people can work from remote locations (e.g. their home).
With this in mind, it would be wise to have a system in place that allows for speedy and reliable recovery of data. If you do this and run routine restore tests, you will be in the best position to minimize the impact of any IT disaster scenario that you are faced with.
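A routine restore test can be as simple as restoring a sample of data to a scratch location and confirming that it matches the source byte for byte. The sketch below shows one way to do that comparison using Python's standard filecmp module. The function name verify_restore and the two directory arguments are hypothetical, and the sketch assumes the source files do not change while the test runs.

```python
import filecmp
from pathlib import Path


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = restored / rel
        # shallow=False forces a byte-by-byte comparison of file contents.
        if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every source file was restored intact. Anything else is worth investigating before a real disaster forces the issue.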