Thanks! I had either missed that paper or come away mainly with the message that these errors are more likely to come from events like misdirected writes, cache-flush problems (hence the high correlation with system resets and not-ready conditions), and firmware bugs (on-drive and further up the stack) than from bit rot.
Still:
> On average, each disk developed 0.26 checksum mismatches.
> The maximum number of mismatches observed for any single drive is 33,000.
Considering the latter can represent roughly 132,000 KiB (about 129 MiB) on a modern, 4K-sectored drive, that's a remarkable amount of data loss, enough to warrant checksumming higher up (such as in the filesystem), in theory [1].
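To spell out the arithmetic: a rough sketch, assuming every checksum mismatch costs exactly one sector (my simplification, not something the paper claims). The function name is made up for illustration.

```python
# Worst-case drive from the paper: 33,000 checksum mismatches.
MISMATCHES = 33_000

def lost_kib(mismatches: int, sector_bytes: int) -> float:
    """Data lost in KiB, assuming one whole sector per mismatch (an assumption)."""
    return mismatches * sector_bytes / 1024

# Compare legacy 512-byte sectors with 4K sectors.
for sector_bytes in (512, 4096):
    kib = lost_kib(MISMATCHES, sector_bytes)
    print(f"{sector_bytes}-byte sectors: {kib:,.0f} KiB ({kib / 1024:.1f} MiB)")
```

That comes out to 16,500 KiB with 512-byte sectors and 132,000 KiB with 4K sectors.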
However, the fact that this was NetApp using their custom hardware as the testbed makes me wonder if the data are skewed, and if the numbers would be nearly this bad from a more "commodity" setup, such as at Google. The paper alludes to this when referring to the extra hardware for the "nearline" disks, and I'm always suspicious of huge discrepancies in statistics between "enterprise" and other disks, even more so when there's a drastic difference in comparison methodology.
It would be interesting to see if there are any numbers for more modern drives, especially as the distinction between "enterprise" and "consumer" drives is disappearing, if only because demand for the latter is disappearing.
[1] In practice, an individual isn't aware of the risk of losing roughly 16 MiB (512-byte sectors) or 129 MiB (4K sectors), which is vanishingly small compared to other risks anyway, and businesses don't tend to care and have survived OK regardless.