Avoid OCZ. I had a Vertex 2 SSD fail after about two weeks. What's really a big ...

moe · on March 5, 2011

I had a Vertex 2 SSD fail after about two weeks.

Anecdote != data. We have had 8 of them in a database server for over 6 months without problems. Yes, we abuse them with a server workload.

justinsb · on March 5, 2011

Sure, just sharing my personal experience. The pathological failure behaviour due to a software bug is not something I'd previously considered when choosing a drive.

I hope you've got good backups.

reeses · on March 5, 2011

It's a very common problem with certain firmware versions. Fortunately, OCZ has been very good about replacing bricked drives for us, almost no questions asked, so yes, keep backups.

Drives fail. SSDs are no exception. That doesn't mean it doesn't drive me crazy when they do, however. :-)

moe · on March 5, 2011

The pathological failure behaviour

This is not specific to SSDs. Drives fail, usually very early (infant mortality) or very late; google for 'bathtub curve'.
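For anyone who hasn't seen the bathtub curve, here is a rough sketch of its shape: a falling early-failure term, a flat random-failure term, and a rising wear-out term added together. Every number below is made up purely for illustration (not measured from any drive), and the function names are hypothetical.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate at time t > 0: falls over time if shape < 1, rises if shape > 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_hours):
    """Illustrative failure rate per hour; all parameters are assumptions."""
    infant = weibull_hazard(t_hours, shape=0.5, scale=1000.0)    # early failures, decreasing
    random_failures = 0.0001                                     # constant background rate
    wearout = weibull_hazard(t_hours, shape=5.0, scale=40000.0)  # late failures, increasing
    return infant + random_failures + wearout


if __name__ == "__main__":
    for t in (10, 100, 1000, 10000, 30000, 50000):
        print(f"{t:>6} h  hazard = {bathtub_hazard(t):.6f}")
```

The rate starts high, drops to a floor, then climbs again as parts wear out, which is the "very early or very late" pattern.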

Spreading FUD about a specific vendor isn't fair unless you can back it up with data. The failure rate that is commonly cited for SandForce drives is around 2%, regardless of vendor. If you have different data then I'm curious to read about it.
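To put the 2% figure in perspective, here is the back-of-the-envelope arithmetic for a small fleet. This assumes 2% is a per-drive probability over whatever period the figure covers and that drives fail independently; both are assumptions, and a shared firmware bug would break the second one.

```python
def p_at_least_one_failure(p_single, n_drives):
    """Chance that one or more of n independent drives fails, given a per-drive probability."""
    return 1.0 - (1.0 - p_single) ** n_drives


if __name__ == "__main__":
    p = 0.02  # the commonly cited rate from the comment above
    for n in (1, 8):
        print(f"{n} drive(s): {p_at_least_one_failure(p, n):.1%} chance of at least one failure")
```

With 8 drives that works out to roughly 15%, so a single failed drive (or a single healthy fleet) says little about the underlying rate either way.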

justinsb · on March 5, 2011

I think there's a bigger problem here - this isn't just normal bathtub-curve component failure. What seems to be happening is that the firmware panics due to an assertion being hit, and what could be a hiccup that requires a reboot becomes a total data loss event.

"gamble" suggested above that this is related to power-saving modes. That fits with our different experiences (laptop vs server). Think about how you would feel if you rebooted your server, it went through a different power mode as part of shutdown, and when it came back all eight of your presumably nicely RAIDed drives were dead. Of course a lightning strike could do the same thing, but you have a surge protector that guards against that. How are you protecting against your SSD's firmware? That's my real concern.

As to whether it's SandForce or OCZ, I'm just going to go Intel next time. You're free to go with OCZ or a non-battery-backed ramdisk - it's all the same to me :-)

moe · on March 6, 2011

this isn't just normal bathtub-curve component failure

What part of "unless you can back it up with data" didn't you understand?

How are you protecting against your SSD's firmware?

Just like against any other hardware fault: by having backups and redundancy. Bugs happen. You're making it sound as if this was somehow specific to OCZ - which remains bullshit until you provide data beyond anecdotal evidence.

And FWIW I'm neither affiliated with OCZ, nor emotionally tied to the brand. We also run X-25s in production, my Mac Mini has an X-25 and my Macbook has an OCZ.

justinsb · on March 7, 2011

Straight from the horse's mouth, here's OCZ's patch log for v129, which describes a fix for a software-induced data-loss issue: "Fixed rare corner case that could cause the drive to reset and clear user data" http://www.ocztechnology.com/files/ssd_tools/OCZ_SSD_v129_Fi...

That was less than a month ago. Maybe this was the last bug; I'd guess not. I'm sure all the SSD vendors have firmware bugs, but my personal choice is not to go with OCZ in future. As you point out, it seems likely that other SandForce vendors will share similar issues, so my understanding is that basically leaves Intel (?)

In future, before aggressively going on the attack, I think you should consider the possibility that you may not be as correct as you believe everyone else to be wrong. Let's keep things civilized around here.

moe · on March 7, 2011

In future, before aggressively going on the attack, I think you should consider the possibility that you may not be as correct as you believe everyone else to be wrong. Let's keep things civilized around here.

I don't see where I went "aggressive". On the contrary, I consider it aggressive to spread FUD about a vendor based solely on anecdotal evidence.

I'm sure all the SSD vendors have firmware bugs, but my personal choice is not to go with OCZ in future.

Exactly. All vendors hit these bugs from time to time. That's why you should refrain from writing posts "Avoid $FOO" when all you have to contribute is a single datapoint.

"I had bad luck with OCZ" would have been more appropriate.

dablue · on March 5, 2011

+1 for good backups

My strategy from now on is to mirror the SSD to a traditional HDD. That way, in the case of an SSD failure, we can accept the performance hit and still be able to run smoothly from the platter.
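As a toy model of that idea (not a real storage stack): every write goes to both devices, reads come from the SSD while it is alive, and the HDD takes over if it dies. `MirroredVolume` and its methods are hypothetical names, and plain dicts stand in for the drives. On Linux, md RAID1 with the HDD member marked write-mostly is one way this kind of setup is commonly done.

```python
class MirroredVolume:
    """Toy SSD+HDD mirror: write to both, read from the fast side while it works."""

    def __init__(self, fast, slow):
        self.fast = fast  # dict standing in for the SSD; set to None once it has failed
        self.slow = slow  # dict standing in for the HDD

    def write(self, block, data):
        if self.fast is not None:
            self.fast[block] = data
        self.slow[block] = data  # the HDD always receives every write

    def read(self, block):
        if self.fast is not None:
            return self.fast[block]
        return self.slow[block]  # degraded mode: slower, but no data lost

    def fail_fast(self):
        """Simulate the SSD bricking itself."""
        self.fast = None


if __name__ == "__main__":
    vol = MirroredVolume(fast={}, slow={})
    vol.write(0, b"hello")
    vol.fail_fast()
    print(vol.read(0))
```

The cost is that write speed is bounded by the HDD, since a write only counts once both copies have it.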

justinsb · on March 5, 2011

Check out Facebook's FlashCache: they use SSDs as a cache in front of traditional HDDs. You can have a read cache to keep 'hot' data fast, and (if you want to) you can have a write cache that will buffer writes on SSD before flushing to HDD.

I don't know whether it's ready for general (non-Facebook) use yet, but it is definitely one to watch.
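To make the read-cache / write-cache distinction concrete, here is a minimal sketch of the general technique. This is emphatically not FlashCache's implementation or API; `WriteBackCache` is a hypothetical toy with dicts standing in for the devices and simple LRU eviction as an assumed policy.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy block cache: a small fast tier (the 'SSD') in front of a slow store (the 'HDD')."""

    def __init__(self, backing, capacity):
        self.backing = backing      # dict standing in for the HDD
        self.capacity = capacity    # maximum number of cached blocks
        self.cache = OrderedDict()  # block -> data, least recently used first
        self.dirty = set()          # blocks written to the cache but not yet on the HDD

    def read(self, block):
        if block in self.cache:     # hit: serve 'hot' data from the fast tier
            self.cache.move_to_end(block)
            return self.cache[block]
        data = self.backing[block]  # miss: fetch from the HDD and remember it
        self._insert(block, data)
        return data

    def write(self, block, data):
        self._insert(block, data)   # buffer the write on the fast tier only
        self.dirty.add(block)

    def flush(self):
        """Push every buffered write down to the HDD."""
        for block in list(self.dirty):
            self.backing[block] = self.cache[block]
        self.dirty.clear()

    def _insert(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        while len(self.cache) > self.capacity:
            old, old_data = self.cache.popitem(last=False)  # evict least recently used
            if old in self.dirty:   # never drop unflushed data on eviction
                self.backing[old] = old_data
                self.dirty.discard(old)
```

Relevant to this thread: in write-back mode, anything still in `dirty` exists only on the SSD, so an SSD that wipes itself takes those writes with it. A read-only cache does not have that exposure.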