In our previous blog post, MTBF: Hard drive failure prediction?, we noted that the best-known methods for predicting hard disk drive (HDD) failure aren’t exactly scientific. Grinding and thrashing noises are a reliable indicator that an HDD is about to fail, for example, but that’s not very comforting when your drives are sitting out of earshot in a remote data centre.
The metric known as Mean Time Between Failures (MTBF) often proves misleading when used to estimate the longevity of hard drives. MTBF is calculated from running large numbers of drives for weeks and months at a time, which can give readings as high as 1.5 million hours – roughly 170 years – for enterprise-grade HDDs. The methodology is sound, but the outcome has little in common with the average lifespan of a hard drive in the field.
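To see why such a large MTBF figure says little about how long a single drive will last, it can help to run the numbers. The sketch below is illustrative only: it converts 1.5 million hours into years (about 171) and then, assuming a constant failure rate (the simple exponential model, which is our assumption and not something drive vendors guarantee), estimates the share of a large fleet that would be expected to fail in any one year.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF figure as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate share of a large fleet expected to fail within one year.

    Assumes a constant failure rate (exponential model). This is a
    simplification: real drives also see early-life and wear-out failures.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf = 1_500_000  # hours, the enterprise-grade figure quoted above
print(f"{mtbf_to_years(mtbf):.0f} years")                          # 171 years
print(f"{annualized_failure_rate(mtbf):.2%} of drives per year")   # 0.58%
```

Read this way, MTBF is a statement about failure rates across a large population of drives, not a forecast of how long any one drive will keep spinning.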
Most manufacturers do, however, also offer a more sophisticated method for predicting HDD failure. Specifically, their devices come preinstalled with a set of firmware tools called Self-Monitoring, Analysis and Reporting Technology (SMART), which communicates metrics on hard drive performance back to the operating system. This data can then be viewed and analysed via software, giving IT administrators greater insight into the health of their storage equipment than would otherwise be possible.
The metrics tracked by SMART tools – called attributes – vary from manufacturer to manufacturer, but typical examples include the number of hours the drive has been switched on, the time it takes for the spindle to reach operational speed and the count of reallocated sectors.
Checking your SMART data
Checking your storage devices’ SMART data is generally pretty simple. It’s possible to buy software expressly designed for the purpose, which might be judicious if you’re looking to gain meaningful insights from that data, but it’s not a prerequisite: if you’re using Windows, you can get a quick and dirty rundown of your HDD’s SMART attributes and their readings via the command prompt.
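If you would rather work with the readings in a script, one option is to parse the attribute table that a command-line utility prints. The sketch below assumes the column layout produced by `smartctl -A` from the open-source smartmontools package; the sample rows and their values are invented for illustration, and the helper name `parse_smart_table` is our own.

```python
def _to_int(text):
    """Convert a numeric column to int, or None if the drive reports no value."""
    return int(text) if text.isdigit() else None


def parse_smart_table(text: str) -> dict:
    """Parse the attribute rows of `smartctl -A`-style output.

    Returns {attribute_id: {"name", "value", "worst", "thresh", "raw"}}.
    """
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes[int(fields[0])] = {
            "name": fields[1],
            "value": _to_int(fields[3]),
            "worst": _to_int(fields[4]),
            "thresh": _to_int(fields[5]),
            # The raw value may contain spaces, so rejoin the tail of the row.
            "raw": " ".join(fields[9:]),
        }
    return attributes


# Invented sample data in the assumed layout, not readings from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   178   175   021    Pre-fail  Always       -       4083
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       21345
"""

for attr_id, attr in sorted(parse_smart_table(SAMPLE).items()):
    print(attr_id, attr["name"], attr["raw"])
```

The three sample rows correspond to the typical attributes mentioned earlier: spin-up time, reallocated sectors and power-on hours.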
Of course, if you’re after a way to track and analyse SMART data more proactively, there are various tools available on multiple platforms and at different price points. One example is Ontrack EasyRecovery 11, and if you’re serious about using SMART tools to monitor the health of your hard drives and plan replacements, this is the way to go.
The reliability of SMART tools
You may have noticed that we’ve yet to discuss whether or not SMART tools are, in fact, a reliable indicator of hard drive health. So, are they? The answer is yes and no. While some SMART attributes are widely felt to be useful in predicting HDD failure, it’s also commonly accepted that the system is not without its limitations.
Most notably, SMART can’t predict 100 per cent of hard drive crashes, because not all hard drive crashes are predictable in the first place. While errors that arise from regular mechanical wear and tear tend to show up as abnormal SMART readings, sudden electronic malfunctions and component failures do not. To put this in perspective, a 2007 Google study of 100,000 consumer-grade HDDs found 64 per cent of failures over a nine-month period were not flagged up in their SMART tools beforehand.
Another factor that undermines the reliability of SMART attributes is that they vary from manufacturer to manufacturer, even in the ways that common attributes are measured. A Seagate device and a Western Digital device of equivalent health may give completely different readings for their seek error rates, for example.
Last November, cloud backup provider Backblaze published a fascinating study on the wildly varying usability of different SMART attributes. Based on readings from almost 40,000 hard drives storing 100 petabytes of customer data, it concluded that out of 70 available attributes, only five were actually reliable indicators of HDD failure. “We would love to use more – ideally the drive vendors would tell us exactly what the SMART attributes mean,” wrote engineer Brian Beach.
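A monitoring script built on this idea might watch only a short list of attributes and raise a flag when any of them moves off zero. The sketch below is an assumption-laden illustration, not Backblaze's code: the five attribute IDs are the ones we understand the Backblaze write-up to name (check them against the original study before relying on them), and the "raw value above zero" rule and the function name `warning_signs` are ours.

```python
# Attribute IDs as we recall them from Backblaze's published write-up.
# Treat this list as an assumption and verify it against the source.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable",
}


def warning_signs(raw_values: dict) -> list:
    """Return the names of watched attributes whose raw value is above zero.

    `raw_values` maps SMART attribute IDs to integer raw readings.
    Attributes the drive does not report are treated as zero.
    """
    return [
        name
        for attr_id, name in WATCHED_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > 0
    ]


# Invented readings for illustration: attribute 9 (power-on hours) is ignored
# because it is not on the watch list.
readings = {5: 0, 9: 21345, 187: 2, 197: 8}
print(warning_signs(readings))
```

An empty list here means only that none of the watched counters has moved, not that the drive is guaranteed to be healthy.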
In reality, then, SMART tools are able to predict some types of hard drive failure, but they cannot tell you with 100 per cent accuracy how and when one of your HDDs is about to meet its maker. As we stated before, not all hard drive failures are predictable.
As such, no savvy storage user would ever rely on SMART alone – or any other predictive system – to prevent the loss of their data and to plan for business continuity. The nature of electromechanical devices means it’s always best to implement a mix of defences: SMART, redundancy, back-up and data recovery.