I think it's helpful to put on our statistics hats when looking at data like this... We have some observed values and a number of available covariates which, perhaps, help explain the observed variability. Some legitimate sources of variation (e.g., proximity to cooling in the NFS box, whether the hard drive was dropped as a child, stray cosmic rays) will remain obscured to us - we cannot fully explain all the variation. But when we average over more instances, those unexplainable sources of variation are captured as a residual to the explanations we can make, given the available covariates. The averaging acts as a kind of low-pass filter over the data, which helps reveal meaningful trends.
Meanwhile, if we slice the data up three ways to hell and back, /all/ we see is unexplainable variation - every point is unique.
This is where PCA is helpful - given our set of covariates, what combination of variables best explains the variation, and how much of the residual remains? If there's a lot of residual, we should look for other covariates. If it's a tiny residual, we don't care, and can work on optimizing the known major axes.
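A minimal sketch of what I mean, in Python - the file name and covariate columns here are made up (not Backblaze's actual schema); it just runs PCA over a standardized covariate matrix and reports how much variance the leading components capture, with the remainder as residual:

    # Sketch: how much of the covariate variation do a few components capture?
    # File and column names are hypothetical, not Backblaze's real schema.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("drive_stats.csv")  # hypothetical input
    covariates = ["age_days", "power_on_hours", "avg_temp_c", "workload_tb_written"]
    X = StandardScaler().fit_transform(df[covariates])

    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    for k, frac in enumerate(cumulative, start=1):
        print(f"first {k} components explain {frac:.1%} of the variance")
    # A large leftover (1 - cumulative) after a few components suggests
    # the covariates we have aren't the whole story.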
Exactly. I used to pore over the Backblaze data but so much of it is in the form of “we got 1,200 drives four months ago and so far none have failed”. That is a relatively small number over a small amount of time.
On top of that it seems like by the time there is a clear winner for reliability, the manufacturer no longer makes that particular model and the newer models are just not a part of the dataset yet. Basically, you can’t just go “Hitachi good, Seagate bad”. You have to look at specific models and there are what? Hundreds? Thousands?
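As a rough illustration of the first point: the rule of three says that zero failures out of n drives gives an approximate 95% upper bound of 3/n on the failure probability over the observation window, which annualized is still in the same ballpark as ordinary drive failure rates - so "none have failed yet" barely narrows anything down. A quick back-of-the-envelope, using only the numbers from the comment above:

    # Rule of three: with 0 failures observed among n drives over some window,
    # an approximate 95% upper bound on the per-drive failure probability is 3/n.
    n_drives = 1200
    months_observed = 4

    upper_bound_window = 3 / n_drives                                   # ~0.25% over 4 months
    upper_bound_annualized = upper_bound_window * 12 / months_observed  # ~0.75%/yr, crudely scaled

    print(f"95% upper bound: {upper_bound_window:.2%} over {months_observed} months, "
          f"roughly {upper_bound_annualized:.2%} annualized")
    # Annualized failure rates for healthy models are often quoted around 1%,
    # so this observation can't separate a good model from a mediocre one yet.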
> On top of that it seems like by the time there is a clear winner for reliability, the manufacturer no longer makes that particular model and the newer models are just not a part of the dataset yet.
That's how things work in general. Even if it is the same model, the parts have likely changed anyway. For data storage, you can expect all devices to fail, so redundancy and backup plans are key, and once you have that set, reliability is mostly just an input into your cost calculations. (Ideally you also do something to mitigate correlated failures from bad manufacturing or bad firmware.)
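e.g. a back-of-the-envelope version of that cost input, with every number made up:

    # Hypothetical numbers: once redundancy is in place, treat the failure
    # rate purely as an expected-cost input.
    n_drives = 100
    afr = 0.015              # assumed 1.5% annualized failure rate
    drive_price = 250        # assumed replacement cost per drive, USD
    labor_per_swap = 50      # assumed handling cost per failure, USD

    expected_failures_per_year = n_drives * afr
    expected_annual_cost = expected_failures_per_year * (drive_price + labor_per_swap)
    print(f"~{expected_failures_per_year:.1f} failures/yr, ~${expected_annual_cost:.0f}/yr expected")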
"Actually HGST was better on average than WD"is probably about the only kind of conclusion you can make. As you have noted, looking at specific models doesn't get you anything useful because by the time you have enough data the model is already replaced by a different one - but you can make out trends for manufacturers.
> if we slice the data up three ways to hell and back, /all/ we see is unexplainable variation
It's certainly true that you can go too far, but this is a case where we know a priori that the manufacture date could be biasing the numbers they're showing: the estimated failure rate at 5 years cannot include data from any drive newer than 2020, whereas the failure rate at 1 year can. At a minimum you might want to exclude newer drives from the analysis - e.g., exclude anything made after 2020 - if you want to draw conclusions about how the failure rate changes up to the 5-year mark.
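Something like this, assuming a hypothetical per-drive table with manufacture dates and observed drive-years (not Backblaze's exact schema):

    # Restrict the cohort so every drive could in principle have reached 5 years,
    # avoiding the bias where only pre-2020 drives contribute to the 5-year estimate.
    import pandas as pd

    df = pd.read_csv("drive_days.csv", parse_dates=["mfg_date"])  # hypothetical file/columns
    cohort = df[df["mfg_date"] < "2020-01-01"]

    # Failure rate by year of life, computed only within the consistent cohort
    by_year = cohort.groupby("age_years").agg(
        failures=("failed", "sum"),
        drive_years=("observed_years", "sum"),
    )
    by_year["annualized_failure_rate"] = by_year["failures"] / by_year["drive_years"]
    print(by_year)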