When looking into storage reliability, it’s all too easy to get caught up in the “hard disk drives are unreliable” melodrama created by some of the All-Flash Array vendors. Arguing that one medium is more reliable than another is like arguing that cars are more reliable than trucks – they’re two different tools for two different jobs.
The good news is that when it comes to traditional hard drives, there’s plenty of statistical evidence out there to help organisations architect solutions that balance cost and risk. One more piece of evidence is a study by Backblaze, a cloud backup provider that has been tracking the lifecycles of its 25,000-odd disk drive estate. In the past we’ve seen studies such as those from Google and Carnegie Mellon University; Backblaze, however, looked at both consumer and enterprise grade drives.
What they initially saw was no great surprise to those with real-world HDD experience. Their first study showed that their consumer grade SATA drives had an Annual Failure Rate (AFR) of 5.1% in the first 18 months, and that drive failures followed a “bathtub” curve, with a dramatic increase in years four and five of the drives’ lifecycle. Why do you think the vast majority of storage vendors are happy to give a three year warranty but get a little jumpy when you ask for an inclusive five year one? The Backblaze data gives you the answer: drives are much more likely to fail once they get into later life.
Now your first thought may well be, “Ahhh, but these are consumer grade drives – things will be much better with enterprise grade!” Well, this is where things get interesting. Backblaze went on to look at their enterprise grade drive implementation, plotting drive years of service against failure rates, and found an AFR of 4.6% for enterprise drives against 4.2% for consumer drives. It’s not a direct comparison, as the enterprise drives are few in number and have only been installed for two years. There’s also an interesting point raised by Seagate on their blog: Backblaze created the “perfect storm” with their use case and physical mounting. Which rather proves, “It ain’t what you do, it’s the way that you do it”.
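The AFR figures above come from simple arithmetic: failures divided by cumulative drive-years of service, expressed as a percentage. A minimal sketch of that calculation (the function name and the sample numbers are illustrative, not Backblaze’s raw counts):

```python
def annual_failure_rate(failures: int, drive_days: int) -> float:
    """Return AFR as a percentage: failures per cumulative drive-year of service."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Hypothetical example: 46 failures across 1,000 drive-years gives a 4.6% AFR,
# matching the shape of the enterprise-drive figure quoted above.
print(round(annual_failure_rate(failures=46, drive_days=365_000), 1))  # -> 4.6
```

Note why drive-years matter: a fleet installed for only two years simply hasn’t accumulated enough service time to show the late-life end of the bathtub curve, which is why the enterprise/consumer comparison isn’t yet conclusive.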
Anyone can build a storage array. Pop down to your local PC supplies company, grab some drives and a server, get an OEM drive shelf enclosure, pop ‘em in, load up some open source software and hey presto – you’ve got an Enterprise Grade Storage Array. Well, that’s what some manufacturers would have you believe, anyway. The truth is that hard disk drives are sensitive little creatures. Take a look at the excellent video Sun Microsystems produced a few years back. It was made to show off their new software for analysing drive latency, but it proved the point that drives are sensitive to vibration – in this case, an Australian engineer shouting at them. Vibration and noise aren’t the only drive killers – heat and density are big factors too. Add in the more limited error correcting capabilities of consumer grade drives and you start to see some of the AFRs that Backblaze saw.
So how can some providers afford to offer an inclusive five year warranty on their arrays?
The key here is good old fashioned hardware engineering, some simple applied logic and, most importantly, some very clever patented software. Firstly, acknowledge that drives are sensitive to these elements and deal with them. Stop them vibrating, keep them cool and treat them with a little respect by mounting them evenly and horizontally. If I held you vertically, jiggled you about and kept you at high temperatures for a few years, you’d probably feel a little poorly too.
Secondly, apply some logic. What happens when a drive goes back to base to be replaced? Most of the time it just needs a software recondition – a low-level format, realigning the heads, re-laying out the tracks, and so on. So why does the drive need to fly around the world for that to happen? Why not just run it inside the array? These things are supposed to be intelligent, right? So why don’t all storage arrays do that? Instead of risking a flat-footed, clumsy engineer swapping the wrong drive, knocking a cable out, or hitting the EPO instead of the exit button in your data centre, some providers just let the array deal with the issue without affecting any workloads. Should there still be a physical defect on the drive, the array works around it, and a drive pack is only swapped when spare space runs low.
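The decision flow just described can be sketched as a simple policy: try a software recondition first, work around physical defects while spare capacity allows, and only then call for a human. This is a hedged illustration of the logic, not any vendor’s actual firmware; the names and the 10% threshold are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Drive:
    serial: str
    bad_sectors: set = field(default_factory=set)

def handle_failed_drive(drive, spare_capacity_pct, recondition, remap_sectors,
                        low_space_threshold_pct=10.0):
    """Decide what the array does with a misbehaving drive; returns the action."""
    # Step 1: soft faults get the in-array software recondition (the low-level
    # format / realignment that would otherwise happen back at base).
    if recondition(drive):
        return "reconditioned"
    # Step 2: a genuine physical defect is worked around by remapping the
    # affected sectors, as long as the array still has spare capacity.
    if spare_capacity_pct > low_space_threshold_pct:
        remap_sectors(drive)
        return "remapped"
    # Step 3: only when spare space runs low does anyone swap a drive pack.
    return "swap_drive_pack"
```

A usage example with stub callbacks: `handle_failed_drive(Drive("ZA123"), 50.0, lambda d: True, lambda d: None)` returns `"reconditioned"`, while the same call with 5% spare capacity and a failed recondition returns `"swap_drive_pack"`. The point of the ordering is that human intervention, the riskiest step, is the last resort rather than the default.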
So the next time someone says to you, “Oh hard disk drives, they’re mechanical devices and therefore fail – here’s the proof,” remind them that not all storage is created equal.