Server Recovery Nightmares
The on-call IT tech is jolted awake from a terrible dream – his heart pounding. Lightning crashes overhead as he glances at the clock – 2:59 a.m. The server isn’t down, it was just a dream.
3:00 a.m. The IT on-call pager goes off. This could mean any number of things: a fire, a break-in, a failed air-conditioner in the server room, or even a main business server crash.
3:25 a.m. The on-call IT tech arrives at the site and evaluates the situation. There is no fire, no evidence of a break-in, and the server room temperature reads a cool 18°C. A quick check of the servers shows that most of them are at a login screen. After checking two or three machines, it is obvious that the room lost power at some point. The UPS units verify a failure; all three massive battery units are showing failures and heavy load percentages.
3:40 a.m. The on-call IT tech calls the lead technician and department manager and informs them of the situation; both are on their way to the site. They leave instructions to check the main business application servers; one of them holds the company’s customer database, payroll, and accounting system, and the other is the company’s messaging server.
3:55 a.m. The on-call IT tech discovers that the RAID array for the business database server is not coming back online. The messaging server has rebooted but the messaging application is returning errors when it starts up. The tech realizes that the messaging server was performing incremental backups during the time of the outage. The on-call IT tech decides to leave that to the lead technician when he arrives.
4:00 a.m. The lead tech and manager arrive. Assessments of the other servers are made. The lead tech begins working with the messaging server. The on-call tech works with the failed RAID array. The firmware shows the array has failed; the controller only recognizes three of the ten drives. After a complete power down and restart of the server and drive enclosure, the firmware shows the drives are back online; however, the array is still shown as ‘Failed’.
4:30 a.m. The on-call tech calls the RAID array manufacturer’s technical support. The choices in the firmware menu are vague, and the tech wants to know if forcing the drives online will get their array back. The manufacturer’s technical support says that the array will come back; however, there is a slight possibility that the data on the volume may be corrupted. The manufacturer’s technical support asks how recent their latest backup is. The tech responds that the latest backup is one week old, which is unacceptable; they cannot lose a week of transactions. The tech hesitates in deciding what to do next...
Business system disasters like this happen every day. Despite the redundancy in backup systems or storage array systems, failures occur. Some failures can be hardware related, others can be due to software, and still others are the result of human error or natural disaster.
More and more businesses rely on their corporate server structure and document storage volumes. Some businesses rely completely on their database system, which may be financial data, job tracking data, or customer contact data. Other businesses may rely wholly on their messaging database and that is a critical business system. Some telephone systems actually convert voice messages to email notifications, thereby using the email-messaging server as part of the communication system. Today’s systems are also storage systems for all of the documents that users create.
Common Scenarios of Server Data Disasters
Ontrack Data Recovery has been the undisputed leader in the industry with the most technologically advanced data recovery solutions available. We have been serving customers globally for nearly 20 years with offices, cleanrooms, engineers, and employees located around the world. During that time, we have seen many data loss situations ranging from commonplace to unique. Here is a sampling of specific types of disasters accompanied with actual engineering notes from recent Remote Data Recovery jobs (Evaluation time represents the time it takes to evaluate the problem, make necessary file system changes to access data, and to report on all of the directories and files that can be recovered):
Causes of Partition/Volume/File System Corruption Disasters
- Corrupted File System due to system crash
- File system damaged due to automatic volume repair utilities
- File system corruption due to partition/volume resizing utilities
- Corrupt volume management settings
Severe damage to partition/volume information on a Windows 2000 workstation. Had used 3rd party recovery software, which didn't work; reinstalled OS but was looking for 2nd partition/volume. Found it; 100% recovery. Evaluation Time: 46 minutes
Causes of Specific File Error Disasters
- Corrupted business system database; file system is fine
- Corrupted message database; file system is fine
- Corrupted user files
Windows 2000 server, volume repair tool damaged file system; target directories unavailable. Complete access to original files critical. Remote Data Recovery safely repaired volume; restored original data, 100% recovery. Evaluation Time: 20 Minutes
Exchange 2000 server, severely corrupted Information store; corruption cause unknown. Scanned Information Store file for valid user mailboxes, results took up to 48 hours due to the corruption. Backup was one month old/not valid for users. Evaluation Time: 96 Hours (1.5 days)
Possible Causes of Hardware Related Disasters
- Server hardware upgrades (Storage Controller Firmware, BIOS, RAID Firmware)
- Expanding Storage Array capacity by adding larger drives to controller
- Failed Array Controller
- Failed drive on Storage Array
- Multiple failed drives on Storage Array
- Storage Array failure but drives are working
- Failed boot drive
- Migration to new Storage Array system
Netware volume server, Traditional NWFS, failing hard drive made volume inaccessible; Netware would not mount volume. Errors on hard drive were not in the data area and drive was still functional. Copied all of the data to another volume; 100% recovery. Evaluation Time: 1 hour
Causes of Software Related Disasters
- Business System Software Upgrades (Service Packs, Patches to Business system)
- Anti-virus software deleted/truncated suspect file in error and data has been deleted, overwritten or both
Partial drive copy overwrite using third party tools; overwrite started and then crashed 1% into the process. Found a large portion of the original data. Rebuilt file system and provided reports on recoverable data; customer will be requiring that we test some files to verify quality of recovery.
Evaluation Time: 1 hour
Causes of User Error Disasters
- During a data loss disaster, restored backup data to the original location, thereby overwriting the lost data
- Deleted files
- Overwritten operating system with reinstall of OS or application software
User's machine had the OS reinstalled; Restore CD was used. User looking for Outlook PST file. Searched the drive for PST data because the original file system was completely overwritten. Found three potential files that might contain the user's data; after using PST recovery tools, we found that one of those files contained the user's email. Some messages were missing, but the majority of the messages/attachments came back.
Evaluation Time: 5 hours
Causes of Operating System Related Disasters
- Server OS upgrades (Service Packs, Patches to OS)
- Migration to different OS
Case Study Netware traditional, 2TB volume, damage to file system when trying to expand size of volume, repaired on drive, volume mountable.
Evaluation Time: 4 hours
From legacy systems and post-mainframe storage devices to the latest high-end SANs, Ontrack Data Recovery works on them all. More important is the validity of the recovered data: the data must be usable to the client when we have completed the recovery.
Server Recovery Tips
Data disasters will happen; accepting that reality is the first step in preparing a comprehensive disaster plan. Time is always against an IT team when a disaster strikes, so the details of a disaster plan are critical for success.
Here are some suggestions from Ontrack Data Recovery engineers of what to do and what not to do:
- In a disaster recovery, never restore data to the server that has lost the data—always restore to a separate server or location.
- In Microsoft Exchange or SQL failures, never try to repair the original Information Store or database files—work on a copy.
- In a deleted data situation, turn off the machine immediately by cutting power rather than shutting down Windows; a normal shutdown writes to the disk and risks overwriting the deleted data.
- Use a volume defragmenter regularly.
- If a drive fails on RAID systems, never replace the failed drive with a drive that was part of a previous RAID system—always zero out the replacement drive before using.
- If a drive is making unusual mechanical noises, turn it off immediately and get assistance.
- Have a valid backup before making hardware or software changes.
- Label the drives with their position in a RAID array.
- Do not run volume repair utilities on suspected bad drives.
- Do not run defragmenter utilities on suspected bad drives.
- In a power loss situation with a RAID array, if the file system looks suspicious, or is un-mountable, or the data is inaccessible after power is restored, do not run volume repair utilities.
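The advice above to work on a copy, and never on the original Information Store or database files, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an Ontrack tool or procedure: the function names `sha256_of` and `make_working_copy` and the file paths in the usage comment are hypothetical, and the sketch assumes the source drive is healthy enough to be read (a drive making mechanical noises should be powered off instead). The original file is opened read-only and read once; the copy is written to a different directory and then verified by hash before anyone runs a repair tool against it.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read and write in 1 MiB pieces


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original: Path, work_dir: Path) -> Path:
    """Copy `original` into `work_dir` and verify the copy by hash.

    Hypothetical helper: the original is only opened for reading, and
    the function refuses to write next to it or over an existing file.
    """
    original = original.resolve()
    work_dir = work_dir.resolve()
    if work_dir == original.parent:
        raise ValueError("working directory must not be the original's directory")
    work_dir.mkdir(parents=True, exist_ok=True)
    copy_path = work_dir / original.name

    source_digest = hashlib.sha256()
    # Mode "xb" fails with FileExistsError instead of overwriting a file.
    with open(original, "rb") as src, open(copy_path, "xb") as dst:
        for chunk in iter(lambda: src.read(CHUNK_SIZE), b""):
            source_digest.update(chunk)  # hash the source during the single read
            dst.write(chunk)

    if sha256_of(copy_path) != source_digest.hexdigest():
        copy_path.unlink()
        raise IOError("copy does not match the original; do not use it")
    return copy_path


# Example (hypothetical paths): repair tools are then pointed at the copy only.
# working = make_working_copy(Path("D:/exchsrvr/mdbdata/priv1.edb"), Path("E:/recovery_work"))
```

The working directory should ideally sit on a separate server or volume, in line with the first tip in the list; the sketch only checks that it is not the original's own directory.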
Ontrack Data Recovery should be part of your disaster planning and your key personnel should be aware of our recovery capabilities. During an outage, it is common to have multiple recovery efforts going on at the same time. This makes sense because the goal is to get the company back to its data. The key to success is to get Ontrack Data Recovery involved as soon as possible.
One client early last year gave Ontrack Data Recovery this challenge, "We have a backup restoration going on right now and we need the data available as soon as possible. If you want the job, you have to beat the tape." Recovery engineers worked the entire weekend to get the more than 2TB of data recovered and available before the start of the work week.
Summary and Conclusion
The fictional, true-to-life IT scenario at the beginning of this article illustrates the situations and decisions that IT staff must face. Businesses and institutions like yours, without access to their data, run the risk of losing millions in revenue every day. The fact is, today’s systems are relied on more than ever for consistent and available data.
Ontrack Data Recovery recognizes the importance of the speed and quality of recovery, especially on large servers. As your partner, we are continually researching new data recovery tools, improving our existing data recovery software tools, and expanding our recovery capabilities to meet your needs for immediate recovery of lost data, including data on large server systems. Successful disaster planning includes having Ontrack’s emergency number (1800 872 259) near your computer systems.
Ontrack Data Recovery is the largest, most experienced and technologically advanced provider of data recovery products and services worldwide. Ontrack Data Recovery is able to recover lost or corrupted data from virtually all operating systems and types of storage devices through its do-it-yourself, remote and in-lab capabilities, using its hundreds of proprietary tools and techniques.