Scientific study of why hard drives fail



TruckStuff
02-19-2007, 07:15 AM
From Google, one of the largest users of cheap, off-the-shelf hard drives:
Massive Google hard drive survey turns up very interesting things

Posted Feb 18th 2007 9:47PM by Ryan Block

When your server farm is in the hundreds of thousands and you're using cheap, off-the-shelf hard drives as your primary means of storage, you've probably got a pretty damned good data set for looking at the health and failure patterns of hard drives. Google studied a hundred thousand SATA and PATA drives with between 80 and 400GB storage and 5400 to 7200rpm, and while unfortunately they didn't call out specific brands or models that had high failure rates, they did find a few interesting patterns in failing hard drives. One of those we thought was most intriguing was that drives often needed replacement for issues that SMART drive status polling didn't or couldn't determine, and 56% of failed drives did not raise any significant SMART flags (and that's interesting, of course, because SMART exists solely to survey hard drive health); other notable patterns showed that failure rates are indeed definitely correlated to drive manufacturer, model, and age; failure rates did not correspond to drive usage except in very young and old drives (i.e. heavy data "grinding" is not a significant factor in failure); and there is less correlation between drive temperature and failure rates than might have been expected, and drives that are cooled excessively actually fail more often than those running a little hot. Normally we'd recommend you go on ahead and read the document, but be ready for a seriously academic and scientific analysis.

http://www.engadget.com/2007/02/18/massive-google-hard-drive-survey-turns-up-very-interesting-thing/

Link to study: http://labs.google.com/papers/disk_failures.pdf

renovation
02-19-2007, 07:25 AM
Good find TruckStuff. :) So we should save our money and buy better ide drives! So from this study I have to agree keeping a drive cool may be your best hope for long life!

I try to always put a cooling fan in line with a harddrive.

johnnymk
02-19-2007, 09:13 AM
Good find TruckStuff. :) So we should save our money and buy better ide drives! So from this study I have to agree keeping a drive cool may be your best hope for long life!

I try to always put a cooling fan in line with a harddrive.

If I read correctly, there is no correlation between a cool drive and a moderately hot drive.

InfiniteNothing
02-19-2007, 09:16 AM
I'm betting the actual temperature is less important than how many cold starts the drive has.

johnnymk
02-19-2007, 09:20 AM
I'm betting the actual temperature is less important than how many cold starts the drive has.

Wouldn't you think that at Google the drives are being accessed continuously, 24 hours a day?

zippyjuan
02-19-2007, 11:41 AM
BBC's report on it:

Hard disk test 'surprises' Google

The impact of heavy use and high temperatures on hard disk drive failure may be overstated, says a report by three Google engineers.
The report examined 100,000 commercial hard drives, ranging from 80GB to 400GB in capacity, used at Google since 2001.

The firm uses "off-the-shelf" drives to store cached web pages and services.

"Our data indicate a much weaker correlation between utilisation levels and failures than previous work has suggested," the authors noted.

A wide variety of manufacturers and models were included in the report, but a breakdown was not provided.

Widely-held belief

There is a widely held belief that hard disks which are subject to heavy use are more likely to fail than those used intermittently. It was also thought that hard drives preferred cool temperatures to hotter environments.

The authors wrote: "We expected to notice a very strong and consistent correlation between high utilisation and higher failure rates.

"However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behaviour."

A hard disk was described as having "failed" if it needed to be replaced.

The report was compiled by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso, and was presented to a storage conference in California last week.

In the report the authors said Google had developed an infrastructure which collected "vital information" about all of the firm's systems every few minutes.

'Essentially forever'

The firm then stores that information "essentially forever".

Google employs its own file system to organise the storage of data, using inexpensive commercially available hard drives rather than bespoke systems.


Hard drives less than three years old and used a lot are less likely to fail than similarly aged hard drives that are used infrequently, according to the report.

"One possible explanation for this behaviour is the survival of the fittest theory," said the authors, speculating that drives which failed early on in their lifetime had been removed from the overall sample leaving only the older, more robust units.

The report said that there was a clear trend showing "that lower temperatures are associated with higher failure rates".

"Only at very high temperatures is there a slight reversal of this trend."

But hard drives which are three years old and older were more likely to suffer a failure when used in warmer environments.

"This is a surprising result, which could indicate that data centre or server designers have more freedom than previously thought when setting operating temperatures for equipment containing disk drives," said the authors.

The report also looked at the impact of scan errors - problems found on the surface of a disc - on hard drive failure.

"We find that the group of drives with scan errors are 10 times more likely to fail than the group with no errors," said the authors.

They added: "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."
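
For anyone wondering how a "10 times more likely to fail" figure is arrived at, here is a minimal sketch of the arithmetic: compare the failure rate of one group of drives against another. The function names and the drive counts below are made up for illustration; they are not taken from the Google paper, and the paper's own method may differ.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of drives in a group that failed during the observation window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def relative_risk(failed_a: int, total_a: int, failed_b: int, total_b: int) -> float:
    """How many times more likely a drive in group A is to fail than one in group B."""
    rate_b = failure_rate(failed_b, total_b)
    if rate_b == 0:
        raise ValueError("baseline group has no failures; ratio is undefined")
    return failure_rate(failed_a, total_a) / rate_b


# Hypothetical counts (failed, total), NOT real data from the study.
with_scan_errors = (200, 1_000)        # 20% of these drives failed
without_scan_errors = (1_000, 50_000)  # 2% of these drives failed

ratio = relative_risk(*with_scan_errors, *without_scan_errors)
print(f"Drives with scan errors are about {ratio:.1f}x as likely to fail")
```

Note that a ratio like this says nothing about how common the risky group is: most failed drives can still come from the "no errors" group simply because it is much larger.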

johnnymk
02-20-2007, 02:50 AM
It would be nice if Google would note which brands failed more frequently. But being Google, they would probably offend many companies and lose advertising revenue.

redcolours
02-20-2007, 07:22 PM
It would be nice if Google would note which brands failed more frequently. But being Google, they would probably offend many companies and lose advertising revenue.

maxtor

:shifty:

redcolours
02-20-2007, 07:24 PM
I'm betting the actual temperature is less important than how many cold starts the drive has.

unless the drive is glowing red hot already...

lesson of the story: keep your PCs on 24/7 (i know i do...)

MikeD
02-20-2007, 07:39 PM
maxtor

You got a cold? Me too...I've had the SAME THING for awhile now.

Just got over the WD bug, too... :hihi:

stufine
02-20-2007, 08:30 PM
maxtor

what? maxtor a bad drive? i've only had 2 go bad in the past 2 yrs.. hehe well 1 (120GB) was from the wife pushing my pc off the table. 1 WD died recently in an external usb box.. think it overheated.. now it likes to click.. :( Been thinking about the WD with the 5yr warranty for a raid box..

DarkFury
02-21-2007, 06:39 AM
maxtor

:shifty:
I must be kinda lucky...

My Maxtors have been fairly bulletproof.... however, I've sent in at least 2 of my Western Digitals over the past 10 years.