I recently listened to episode 252 of the TWiCH (This Week in Computer Hardware) podcast, which talked about the red drive. They discussed the difference between green and red drives and pointed to an article on pcper.com. This wasn’t the first time I had heard about the red drive, but it was the first time I decided to go read the article. I presume this is the article that Patrick Norton and Ryan Shrout, the hosts of TWiCH, talked about.
The article lays out a very easy to understand series of events in which a RAID array built from green drives fails:
- Array starts off operating as normal, but drive 3 has a bad sector that cropped up a few months back. This has gone unnoticed because the bad sector was part of a rarely accessed file.
- During operation, drive 1 encounters a new bad sector.
- Since drive 1 is a consumer drive it goes into a retry loop, repeatedly attempting to read and correct the bad sector.
- The RAID controller exceeds its timeout threshold waiting on drive 1 and marks it offline.
- Array is now in degraded status with drive 1 marked as failed.
- User replaces drive 1. RAID controller initiates rebuild using parity data from the other drives.
- During rebuild, RAID controller encounters the bad sector on drive 3.
- Since drive 3 is a consumer drive it goes into a retry loop, repeatedly attempting to read and correct the bad sector.
- The RAID controller exceeds its timeout threshold waiting on drive 3 and marks it offline.
- Rebuild fails.
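The sequence above can be played out in a few lines of Python. This is only a toy model to make the timing argument concrete: the `Drive` class, the 8 second controller timeout, and the 60 second and 7 second recovery times are illustrative assumptions of mine, not figures from the article or from any specific controller.

```python
from dataclasses import dataclass, field

CONTROLLER_TIMEOUT_S = 8.0  # assumed controller patience, illustrative only


@dataclass
class Drive:
    name: str
    max_recovery_s: float  # how long the drive keeps retrying a bad sector
    bad_sectors: set = field(default_factory=set)
    online: bool = True

    def read(self, sector):
        """Return the seconds spent on this read; a bad sector costs the full retry time."""
        return self.max_recovery_s if sector in self.bad_sectors else 0.01


def read_stripe(drives, sector):
    """Read one sector from each online drive; drop any drive that exceeds the timeout."""
    for drive in drives:
        if drive.online and drive.read(sector) > CONTROLLER_TIMEOUT_S:
            drive.online = False  # controller gives up and marks the drive as failed


def simulate(recovery_s, sectors=1000):
    """Replay the scenario on a 3-drive RAID 5 and report how it ends."""
    d1, d2, d3 = (Drive(f"drive{i}", recovery_s) for i in (1, 2, 3))
    d3.bad_sectors.add(500)  # old bad sector nobody noticed
    d1.bad_sectors.add(100)  # new bad sector hit during normal operation

    read_stripe([d1, d2, d3], 100)
    if d1.online:
        # The drive reported the error in time, so parity can repair the sector.
        return "healthy"

    # Drive 1 was dropped and replaced; the rebuild must read every sector
    # from the two survivors and cannot lose another drive.
    survivors = [d2, d3]
    for sector in range(sectors):
        read_stripe(survivors, sector)
        if not all(d.online for d in survivors):
            return "rebuild failed"
    return "rebuilt"


print(simulate(recovery_s=60.0))  # long retry loop, as with a consumer drive
print(simulate(recovery_s=7.0))   # drive gives up before the controller does
```

With the long retry time the model ends in a failed rebuild, exactly as in the list. With a recovery time below the controller timeout, drive 1 is never dropped in the first place.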
Before I continue, I should point out that the article outlines a legitimate problem that anyone building, or planning to build or buy, a NAS box should be aware of. The gist of the problem is that the number of drives with errors (bad sectors) exceeds the fault tolerance provided by the RAID level, e.g. one drive in RAID 5 and two drives in RAID 6. Simple enough concept. Some people already wrote in the comments on the article that it’s still possible to run some sort of hard drive utility like SpinRite on the drives before bringing the NAS back online and rebuilding. Technically there is nothing wrong with this solution, except that it is not ideal for business. What I mean is that, generally speaking, consumers and businesses don’t necessarily have the same objectives. Yes, both want some sort of backup/redundancy system, but a business can spend more money (at least in the US, it’s tax deductible because it’s considered a cost of doing business) and at the same time needs the recovery process to be quick. For consumer use, it is almost the opposite.
Another thought I had is that this problem is preventable. Looking at the first bullet point, it’s easy to see that this is where the problem starts: a bad sector sits in a file that is rarely used and therefore goes undetected. But why does it have to be like that? Imagine a home-built RAID using something like FreeNAS, NAS4Free, unRAID or whatever system you want to use. Why can’t they put in a feature to read those files when the NAS isn’t active? Especially on a Linux-based NAS solution, people could write a script to read the least accessed files when CPU or IO usage is low. Ideally, this could be built into the NAS distro as a feature. It should also be possible for manufacturers to do this in off-the-shelf upgradable NAS systems like Synology or Drobo. Even if the files or the filesystem are encrypted, a read would simply succeed or fail, and a failure would generate some sort of log entry or trigger a SMART event. If the unreadable sector can be discovered early, then we wouldn’t run into this problem. I’m sure this blog isn’t exactly a popular one, nor is it the most useful one. I just wonder why nobody has brought this up before.
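To make the idea a bit more concrete, here is a rough sketch of such a script in Python. It is only an illustration under my own assumptions, not a feature of any NAS distro: the function names, the load threshold of 0.5, and the `/mnt/array` path are all hypothetical, and it uses the system load average as a stand-in for "CPU or IO usage is low".

```python
import os
import time


def least_recently_accessed(root):
    """Return the file paths under root, oldest access time first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((os.stat(path).st_atime, path))
            except OSError:
                continue  # file vanished or cannot be stat'ed; skip it
    entries.sort()
    return [path for _atime, path in entries]


def read_fully(path, chunk_size=1 << 20):
    """Read every byte of a file; return False if the read hits an I/O error."""
    try:
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except OSError:
        return False


def system_is_idle(max_load=0.5):
    """Assumed idleness test: 1-minute load average below max_load."""
    try:
        return os.getloadavg()[0] < max_load
    except (AttributeError, OSError):
        return True  # load average not available here; do not block forever


def scrub(root, max_files=100, max_load=0.5, wait_s=60, is_idle=system_is_idle):
    """Read the least recently accessed files while idle; return the unreadable ones."""
    unreadable = []
    for path in least_recently_accessed(root)[:max_files]:
        while not is_idle(max_load):
            time.sleep(wait_s)  # back off while the NAS is busy
        if not read_fully(path):
            unreadable.append(path)
    return unreadable


if __name__ == "__main__":
    # "/mnt/array" is a hypothetical mount point for the RAID volume.
    for bad in scrub("/mnt/array"):
        print("unreadable:", bad)
```

Two caveats with a sketch like this: access times are not updated on filesystems mounted with `noatime`, so the ordering may be unreliable there, and a read can be served from the operating system’s cache without ever touching the disk. It also only covers sectors that belong to files, not parity data or free space.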