Non-ECC memory corrupted my hard drive image [video] (youtube.com)
129 points by zeristor on Dec 25, 2022 | hide | past | favorite | 152 comments


We don't have ECC mainly because Intel has long been hostile to "consumer" access to ECC.

Apparently this was conceived as a market segmentation scheme: people outfitting servers could get ECC when they pay a huge premium. They would thereby not be tempted to cheap out and buy consumer-grade equipment, otherwise wholly adequate to meet all their needs at a radically cheaper price.

That we cannot get laptops or even desk machines with ECC, and so have them crash frequently, is seen as a trivial side effect of the strategy. If you did not hate Intel enough before, you may increase your hatred accordingly. Intel doesn't hate you back; they simply care not even a little how you feel.

(Historically, just running Microsoft software was overwhelmingly more likely to be the cause of a crash than a memory bit-flip; and there were orders of magnitude fewer RAM bits at risk. Microsoft succeeded in getting customers to accept and even expect frequent crashes; before MS, a program crashing was grounds for a refund.)


I see this a lot but I really don’t think it was/is that simplistic.

It’s added complexity and cost for something that rarely would benefit most consumers. Now, you can argue that the complexity and cost is a nonissue on modern setups and I would probably agree.

But Intel has long had desktop grade hardware with ECC support. The 440GX chipset supported ECC and I ran a Dell GX1 SFF with 768MB of ECC PC100 for yeeeears with a 450 MHz P3, later upgraded to a 1.4 GHz Tualatin-256 via a slotket adapter.

The 430HX /Socket 7/ chipset supported ECC. And that's a Pentium 1 chipset.

The 440BX/GX and 450NX supported ECC and that’s with desktop pentium 2 and 3 chips.

The 820/820E/840 supported ECC with desktop celeron and pentium2/3 chips

845/845e/850/850e/860 pentium4 chipsets support ECC

875/e7205/e7221/e7230 did with desktop pentium 4 and pentium d chips

925/925xe/955x/975x did with desktop pentium 4/pentium d/core 2

It’s more sparse now that they moved to the IMC, granted. But Intel has long had multiple chipsets per generation with ECC support for desktop grade hardware.


Just because the chipset supported it doesn't mean the motherboard did. There's little incentive for a motherboard company that makes server boards (OEM, under their own brand, or both) to sell ECC motherboards for the desktop consumer market, and integrators won't want to harm their "workstation" sales if "SOHO" systems are capable of ECC.

BTW, AMD's Ryzen CPUs and "Pro" APUs (APU = cpu with integrated graphics) will all do ECC, but not "officially", and not all motherboards have BIOS that supports it. You need to google around or just try it.


Right, but the motherboard for a lot of systems from that era were a nonissue. If the /chipset/ supported it like with the 440 line then most boards would handle it.

And, by your own suggestion, this isn’t a fault of Intel and more of OEMs sucking.

I mean, I guess Intel could have tried mandating it or something but meh


I miss the BX chipset. The last PC that wasn't a complete piece of shit I had was a 440BX chipset P3.


Me too. I would be running my 440BX today, but all the motherboards died from bad electrolytic capacitors that were endemic in that era.


Nothing you can't fix with Mouser and a soldering iron :)


It's not even that complicated with hardware of that era.

Let's say you are afraid to destroy the through holes of the multilayer PCB by de-soldering the broken parts, or heating bordering parts up too much, whatever.

In most cases you can pull the caps just off from their pins, clean the pins, (even with paper tissue only) and simply solder the new ones to the old pins still anchored in the board.

Looks weird, but works :-)


Once I discovered "desoldering needles" I basically stopped being scared of desoldering through hole parts altogether. You can get a set cheaply on aliexpress or amazon or whatever.

You slide the needle over the component leg and then melt the solder. The needle slips over the component leg. Wait for the solder to cool and pull the needle out. Hey presto the leg is separated from the PCB pad. Also works great for cleaning solder out of a hole after traditional desoldering.


Indeed, don't throw out old boards due to bad caps. I'm working on building a recapping skill so I can keep my old hardware alive as well. It's pretty cool to have the ability to take old non-working or marginal hardware, apply heat and modern capacitors to it, and make it stable again.


I loved my 440* chipset systems. I miss them, too, lol. My last one died due to a damn screw driver slip. sobs


> That we cannot get laptops or even desk machines with ECC, and so have them crash frequently, is seen as a trivial side effect of the strategy

I’m not sure what you mean by “frequently”, but my non-ECC machines definitely do not crash “frequently”.

> before MS, a program crashing was grounds for a refund

Source?


The problem with untrustworthy memory (or any other component) is not that your system crashes, it's that it doesn't.


I don't doubt that non-ECC hardware experiences some non-zero number of bitflips per year. I'm just doubting the parent commenter's claim that non-ECC ram is causing computers to crash "frequently".


And the parent is pointing out that not crashing on bit flips is exactly the problem.


Actually crashing on bit flips is a second problem, and an indicator of the first problem.


Crashing is "not the problem" because crashing prevents the far worse problem of incorrect data or operation that you don't know is incorrect.

The difference in significance is so great that by comparison a mere crash is no problem at all.

In fact you design systems with hair triggers to 'crash' on purpose as readily as possible. Trying desperately to crash at all times every millisecond all day every day.

I.e., halting all operation of some subsystem, or the whole thing, the second a single wrong bit is detected. Better to kill the HD or the entire machine than let it keep running one second after getting any hint it might not be 100% trustworthy.

'crash' in quotes because really all you want to do is halt, and you're doing it on purpose, but that is still a crash, in the sense that your application does not want to halt and it isn't necessarily halted gracefully with any chance to conclude anything or save anything. Those are just more operations you can no longer trust to be correct, and so should not be allowed to do.


Frequently, for me, means at all. Computers are designed as deterministic state machines. Any unintended indeterminism is a failure.

Microsoft, uniquely exercising monopoly power, was able to institute a "no refunds, nohow" policy and make it stick: Windows 95 machines commonly crashed several times a day, enough so that "crash" came to mean, instead, having to reinstall the OS, which happened at least once every several months.

And here we are.


I guess you can use the term however you want, but FWIW if you say "frequently" and mean "at all", it makes it a lot harder to understand the point you're trying to make.


If it happens to me, it is happening to untold others, too. Thus, frequently.

"Infrequently" would mean I had not heard of it, or that I heard only of isolated instances, well publicized.


I feel like I just fell into a bucket of logical fallacies.


We all have no choice but to sample imperfect information. You can do no more than pretend otherwise.


Horrifying, can you imagine an alternate reality where we have to actually write working software? Microsoft, for once, did a Good Thing in a way.


I see a ton of replies here. I am a tech user with thousands of terabytes stored locally. Most of those terabytes are mirrored in the cloud. In all of those situations I've never actually had any corruption related to memory or storage. I have 932 GB in PICTURES ffs.

I get the claimed need for ECC, and yet I don't. I own a number of web properties and regularly update logos, software packages, HTML files, PHP files, Ruby files, etc. No corruption.


How do you confirm that there's no corruption? My understanding was a silent bit flip would be difficult to detect. I ended up using ZFS and ECC memory to avoid silent corruption, but I'm largely relying on the motherboard to do the right thing when an ECC error is detected. It'd be nice to verify after the fact.
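One cheap way to verify after the fact is a hash manifest: record a SHA-256 per file once, then re-check on a schedule and see which files changed under you. A minimal sketch; the function names and manifest shape are illustrative, not from any particular tool:

```python
# Sketch of after-the-fact silent-corruption detection: build a SHA-256
# manifest of a directory tree, then re-verify it later. Any file whose
# hash differs (without an intentional edit) is a corruption candidate.
import hashlib
import os

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    manifest = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            manifest[os.path.relpath(p, root)] = sha256_of(p)
    return manifest

def verify(root, manifest):
    """Return files whose current hash differs from the recorded one."""
    return [rel for rel, digest in manifest.items()
            if sha256_of(os.path.join(root, rel)) != digest]
```

This only detects corruption; pairing it with a second copy of the data (as ZFS does with redundancy plus scrubs) is what lets you repair it.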


A bit flip being hard to detect - true for RAW, false for .jpg or other compressed files.


git fsck


On 932 GB of binary data? I'm probably operating on old impressions of git, but that sounds like it'd be a nightmare to work with. Ideally, I'd like something that runs in a reasonable amount of time so it can be run on a regular schedule. I currently run a ZFS scrub that takes about 90 hours to finish on 32 TB of data, so it's only run monthly.


If your pictures are not RAW but compressed (say jpgs) then the corrupted photos will really jump out at you - half the picture is there, but then noise. I used to make sure I was storing jpgs not RAW, but then learned that wasn't always a good thing, several times over. Dunno whether damage happened in RAM or on HD, but I strongly suspect RAM.

The bad habit that nailed me was probably this one: I would ball up a whole bunch of photos in a big .zip (to enhance copying speed and maybe privacy), then next year unzip that, add that year's selected photos, and zip it again. Year after year. That added up to a lot of trips through RAM in which one bit flip could take out an image. Bugs in compression software aren't impossible as the villain, but that's not my bet.

Solution: I now zip up a few years together and copy each chunk; I don't ever rezip old large chunks. No new errors noticed.


You can be pretty sure that you have some corrupt bits.


No, we really can't be sure. I'm annoyed when people are sure about it without posting their ECC correction rate stats from live systems.

I'm continuously writing metrics on a non-ECC system. There are 3 places that the bitflip can affect the data: pre-writing (checksum will be correct), when flushing to the drive (checksum mismatch), calculating the checksum (mismatch again). There may be some silent corruption (in 1 out of 3 cases), but I've been scrubbing the drive every day for ages and there's not a single error.

Non-ECC seems to be doing well. Other claims could really use actual numbers.


"frequently" is very subjective or relative in this context. 25 years ago I had a crash per hour on almost any regular computer, but zero crashes per month on servers with ECC. In the past couple of years I think I had a few cases of frozen apps, but I don't remember of any OS level problem. At the same time, on servers I see from time to time ECC fixing a bit, but on the desktop or laptop I have no idea how many times corrupted bits went undetected and what is the consequence.


>25 years ago I had a crash per hour on almost any regular computer, but zero crashes per month on servers with ECC

If it's crashing once per hour, it's probably unstable drivers/software or flaky hardware that needs to be RMAed, not random bitflips.


I think something else is wrong if you've had a "crash" per hour.


For this reason for my first truly made-from-scratch home NAS I went AMD64 with ECC UDIMMs. It was some very basic Athlon64, but it COULD do ECC. Since then I moved to Opterons and Xeons but I still remember that choice.


> That we cannot get laptops or even desk machines with ECC

The Xeon series of laptop processors does support ECC just at a quite large premium.


Yet, actually providing ECC costs practically nothing. RAM could cost something less than 12.5% more, for the extra bits. If end-to-end, the bus needs a few more traces.
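The ~12.5% figure falls out of the standard side-band ECC layout, where each 64-bit word carries 8 extra check bits (a 72-bit bus, i.e. one extra DRAM chip per eight):

```python
# Where the ~12.5% figure comes from: a conventional ECC DIMM stores
# 72 bits per 64-bit word, the extra 8 bits being SECDED check bits
# (single-error-correct, double-error-detect).
data_bits = 64
check_bits = 8
overhead = check_bits / data_bits
print(f"{overhead:.1%}")  # 12.5%
```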


I used to be a big fan of Intel, up until the latest chips from other companies that seem to have beat them on performance/watt. My next laptop will probably be AMD if the situation hasn't changed.


I'm happy with my AMD based laptop. But I haven't seen any that support ECC.

But I did see a Lenovo model, IIRC, that had some kind of Xeon and ECC. Not sure what the noise and battery life situations on that thing were, though.


I realise that's blurrier when it comes to laptops, but AIUI it's more a case of whether the motherboard supports it than about the AMD chip. I.e. given a desktop CPU, as far as I know you can put it in a motherboard that either does or does not support ECC RAM.


Ryzen APUs, which include almost all AMD laptops, actually have ECC fused off in silicon unless you buy the "Pro" variant.


Huh. I didn't know that.

My particular laptop does have a "Pro" CPU. However, I would be surprised to no end to learn that it supports ECC. This particular model sports an MBP-level price tag [0], but is absurdly cheaply built, even for "customer-facing components" that are easy to compare, such as the screen (terrible colors) and case (creaks if you look at it wrong). HP doesn't offer ECC RAM, not even as an upgrade, so I really don't think the additional lines are physically present.

---

[0] I don't remember the specific number, but it was within 100 € of a 14" M1 MBP with 32 GB RAM and 512 GB SSD. That's counting a RAM (8 -> 32) and SSD (256 -> 512) upgrade which were made with components bought separately (though they were rather high-end).


I have a Ryzen 2600 (CPU not APU) and ECC works. I wonder what generation ECC fuses became a thing?


ECC fused off was never a thing on the CPU-only dies, just on the APUs.


That's right, but seeing how laptops seem to do the bare minimum, I would be really surprised to learn that a random model, which doesn't advertise it, actually supports it.


I'm surprised nobody has made RAM with the ECC logic built into the ram itself, that just looks like normal ram to the CPU.


Having ECC being checked inside the CPU is actually useful as data loss may be induced by EMI (and other factors) on PCB data lines.


It is called DDR5 - it has ECC built into the module itself. Making such a module does not make much sense if you cannot report the rate of errors; if it just hides the fact that you have a bad RAM stick, there is only so much value in having ECC.


It is my understanding that the ECC that you're talking about only protects data-in-flight between the module and whatever is reading or writing the data. It does not protect against corruption of data-at-rest, which is what is protected with ECC in DDR4 and older.

It's also my understanding that the DDR5 data-in-flight ECC is a mandatory feature because the link between the memory modules and everything else is so error-prone that the system would simply not function without it.


Taking a quick glance at the articles, I think it's the opposite: DDR5 protects data at rest only, because they want to make the chips so unreliable that they can't work without it - not the bus.

But in practice, it will probably be more reliable than DDR4 without ECC, since now you need 2 cosmic ray flips, or 1 plus a manufacturing defect flip, and the defect flips will probably be uncommon-ish.

It's too bad data in flight isn't protected without old fashioned ECC on top of that, but it will probably be a big step up, the same way that flash memory is now very reliable even though the actual uncorrected errors are probably worse under the hood.


The problem with the DDR5 approach is there's no reporting mechanism, so while it will reduce the error rate of a marginal module, it doesn't let you know so you can replace it. In my experience with ECC modules, a module with some errors is a lot more likely to get more errors than one that's operating with zero errors.


According to this post on Reddit there is reporting: https://www.reddit.com/r/hardware/comments/qjhvjg/are_ddr5_e...

But even without it, it would be no worse than DDR4 with no ECC at all, where the only indication of bad ram is just that it crashes or memtest fails.

It sounds like they intend to make the chips all slightly bad and then correct it, so a few errors here and there might not indicate a true bad stick that needs replacement anymore.


> That we cannot get laptops or even desk machines with ECC, and so have them crash frequently, is seen as a trivial side effect of the strategy.

How frequently would you say you encounter a crash that you can pin down to a lack of ECC memory in your laptop or desktop?


You can't, that's the thing, right?

I have a Ryzen desktop with ECC, and it registers about one bit-flip per week. I don't know how many of those would become crashes, but I'm more worried about the ones that wouldn't.


> I have a Ryzen desktop with ECC, and it registers about one bit-flip per week.

This isn’t normal. You have a bad memory module.

A non-ECC machine should be able to run memtest86 (a rigorous memory testing tool) for a week straight without a single bitflip.

Having a bit flip per week is so far away from normal that it’s definitely a bad memory part.


> > I have a Ryzen desktop with ECC, and it registers about one bit-flip per week.

> This isn’t normal. You have a bad memory module.

> Having a bit flip per week is so far away from normal that it’s definitely a bad memory part.

It's hard to know what's normal. If you're somewhere like northern Sweden, the chance of cosmic-ray strikes goes up because you're less shielded there (hey, but northern lights!), and likewise at higher altitude; along with less interesting things such as how shielded the RAM is (server chassis, whether you removed the rack doors or not).

Memory density has a lot to do with it too; as density goes up, the chance of bit flips increases a lot.

Anyway; regardless of the theory behind it: I've run global operations with multiple-thousands of physical machines with high density ram (and 16-32 DIMMS in each server) and noticed at least 1 bit flip a week on average for each machine. But some locations were worse impacted than others.

Annoyingly, ECC is not created equal, there are some correctable bit-flips that are hidden from the iDRAC/iLO: this is the same type of ECC that the rPI is using (referred to as on-die): https://datasheets.raspberrypi.com/rpi4/raspberry-pi-4-produ...

As with all things like this YMMV, but ECC is doing a lot of heavy lifting.


Memtest86 will only notice a bit-flip by sheer luck: the flip has to happen between when it wrote and when it reads back a short time later. Any flip in bits it is not immediately looking at will slip by.

Memtest86 is not magic. It is designed to look for systematic failures, even if intermittent. It does that pretty well given it has only bus-level access.


I'd hate to think what all those neutrinos would be doing to you. You are very much larger than a memory cell, for every bit flip you must be catching a trillion neutrinos.

Maybe OP's house is in the Van Allen Belt or something?


> Ryzen desktop with ECC, and it registers about one bit-flip per week.

yeah, you have bad ram. Or you overclocked it.

I have a Ryzen server with 32GB ECC RAM running 24/7 for a year. 0 bit flips. Not a single one.


I did overclock it. It's a 3200 MHz module running at 3600; pretty typical for a gaming machine.

Of course, overclocked or not, I don't get memory errors. Just the occasional EDAC event.


interesting to see the coincidence with solar flares


My current PC has ECC memory and I will never go back to non-ECC memory. The stability increase is noticeable, especially when using RAM as a cache drive.


This problem has no ultimate solution. I've seen all components flip bits (CPUs, networking cards, RAM); most often you just can't know for sure what did it. You can remedy it a bit (like with ECC), but ultimately there will always be corruption if you process hundreds of petabytes of data. Get used to it: your computer executes an instruction correctly with a probability extremely close to 1, but not equal to 1.

Deep in the archives of a well known tech company is a very well documented case of a bit flip that caused the wrong function to be executed in a C++ v-table. The big oof was that this function was the equivalent of an SQL "drop table", and just happened to be 32 bytes off of a very benign function that did something like stat(). Really funny stuff once the crisis is over :)


ECC isn't a terribly complicated technology, and can be used in all those cases.

In limited cases, a checksum is good enough. If you checksum outgoing data, and verify it on reception, then it being corrupted in transit whether on the network card or the cable can be detected and transparently compensated for.

Really, we can do much better than to "get used to it".
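As a rough sketch of that checksum-in-transit idea: append a CRC-32 to each outgoing payload and verify it on reception, so a bit flipped on the wire (or in a NIC buffer) is detected and the frame can be re-requested. The framing format here is made up purely for illustration:

```python
# Checksum-in-transit sketch: sender appends a CRC-32, receiver verifies
# it and rejects corrupted frames instead of silently consuming bad data.
import struct
import zlib

def frame(payload: bytes) -> bytes:
    # Payload followed by its CRC-32 as a big-endian 32-bit integer.
    return payload + struct.pack(">I", zlib.crc32(payload))

def unframe(data: bytes) -> bytes:
    payload, (crc,) = data[:-4], struct.unpack(">I", data[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: frame corrupted in transit")
    return payload
```

This is the same pattern Ethernet and TCP already use at lower layers; doing it end-to-end in the application catches corruption those layers miss.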


You are under the impression that CPUs and other chips always perform the same instructions as are written in the code, and only RAM can flip bits because DRAM is DRAM :)

It can (and should! whenever possible) be improved, not fixed. There's always that pesky gamma that can hit a specific transistor, even if it is deep underground. Gamma cannot be fully stopped. At certain scales data corruption becomes directly measurable. And yes, corruption levels vary between pieces of hardware.


Sure, but once your registers, cache, data bus and address bus have ECC, you have a vastly smaller area that can flip.

You can even just buy (well, chipageddon aside) ARM chips that have 2 cores running in parallel and faulting when the result is different


> You can even just buy (well, chipaggedon aside) ARM cores that have 2 chips running in parallel and faulting when the result is different

See dual-core lock-step Arm chips (used for automotive).


Of course not. I'm not saying we can have perfection. I'm saying that we can do much better, using methods and technologies that are very old at this point.

The reason why we don't is laziness and market segmentation, mostly.


> You are under the impression that CPUs and other chips always perform the same instructions as are written in the code, and only RAM can flip bits because DRAM is DRAM :)

I thought L2/L3 is ECC (at least on Intel, though L1 I think is parity only)


The probability of a bit flip depends on the size of the transistor used. RAM tends to pack many small transistors.


OP said: We can do better

You replied: it should be improved, not (it cannot be completely 100%) fixed

Your error: replying to a strawman


> ultimately there will always be corruption if you process hundreds of petabytes of data.

Your statement is in practical terms false.

Error correction exists, and adding enough redundancy to make any corruption have such a tiny chance of being undetected that the whole history of the universe could go by without it happening is a problem that was solved over 60 years ago with the invention of the Reed-Solomon code.
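Reed-Solomon itself is heavier machinery, but the underlying trick can be shown with the much simpler Hamming(7,4) code, which locates and corrects any single flipped bit in a 7-bit block. A toy sketch with bit lists, not production code:

```python
# Hamming(7,4): 4 data bits plus 3 parity bits; the parity checks form a
# "syndrome" that directly names the position of a single flipped bit.
def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4 (parity at powers of 2).
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    c1, c2, c3, c4, c5, c6, c7 = c
    # Each syndrome bit re-checks the positions whose index has that bit set.
    s1 = c1 ^ c3 ^ c5 ^ c7
    s2 = c2 ^ c3 ^ c6 ^ c7
    s3 = c4 ^ c5 ^ c6 ^ c7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = clean, else 1-based error position
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1         # correct the flipped bit in place
    return [c[2], c[4], c[5], c[6]]  # recover d1, d2, d3, d4
```

Real memory ECC uses SECDED variants of the same construction over 64-bit words; Reed-Solomon extends the idea to correct bursts of symbol errors.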


Is this something that's documented publicly? I'd love to read more.


If this is the same issue at the same well known tech company that I recall, it wasn't a hardware failure but a case of undefined behaviour. The existing code used memset to initialize a struct with zeroes, then someone else added a virtual method (to it? to a base? or member?) which caused the struct to grow a vptr. Now the memset was clobbering the vptr.

They compiled this program with some form of link time optimization, and the compiler emitted assembly that tested the vptr against one of two known values; when it didn't match (because it was NULL), it must be the other known value! Either way it didn't use the vtable, it just made direct calls to the appropriate member functions, which led to a "drop table".

As part of the resolution it also led to https://reviews.llvm.org/rL130299 for compile-time checking and to UBSan's `-fsanitize=vptr` for runtime checking.


One can process petabytes without bit flips if one uses proper checksums and error correction codes. While that does impose overhead, it is not big and, thanks to Shannon's theorem, can be made arbitrarily small with sufficiently big blocks.


Yup, been there.

Way back I had a Pentium 133 doing firewall duty in a closet. It did approximately nothing besides iptables, but of course any machine has logs, updates and so on going on.

After running fine for months one day it suddenly died. I rebooted it. A few days later it died again. Another reboot. Then it died for the last time and failed to boot at all. Examination showed the disk was corrupt and couldn't be mounted. Further examination showed that one of the memory modules was loose for some reason, could be that it was never firmly in and I just bumped the box when messing with something else.

Then came the wasted weekend of dealing with that my normal internet connection relied on the thing that was now completely broken.

And that was the luckiest case I can imagine, when the broken machine contains no data of actual value. Since then I'm very paranoid, always run memtest on any new RAM I buy overnight, and have ECC where it's possible to have it.


> Since then I'm very paranoid, always run memtest on any new RAM I buy overnight, and have ECC where it's possible to have it.

Yeah, I do the same, but I've learned that you have to do it regularly.

In one of my desktop machines, the RAM ran fine for like two years. Then, all of sudden, random Firefox segfaults, etc.

Whipped up a memtest ISO, and sure enough, one of the sticks was bad.


That's the nice thing about ECC, it acts like an always running memory test.

You normally have a scrub time that can be configured in the BIOS, which also adds a regular verification of the entire RAM at regular intervals, just in case something goes wrong in some rarely used part of the memory.


Unfortunately, background scrubbing significantly increases power consumption and impacts performance as well.


Do you want (1) a higher rate of stable and correct computations at a slightly higher energy cost, or (2) a demonstrably less reliable device at a slightly lower energy cost?

I'll go for #1 in most cases, as long as the system is to be relied upon for anything deemed important.


Me too, of course. I'm just highlighting downsides so people know what to expect.


Weirdly enough I had the same case, but the memory turned out to be fine; replacing the power supply fixed the issue. I ran the test, saw the memory was bad, replaced the sticks, same problem, put the old sticks back in and decided to just run it (it was a gaming PC).

A few months later the power supply outright died (it was ~8 years old at that point). Replaced it with a good one: no more memory errors.


In my case it was clearly a bad RAM stick. Took it out, OK. Switched them around: errors. Replaced it with a new one, back to OK.

In this particular case, a bad PSU would be the end of the PC. It's an HP desktop mini, basically a laptop without a screen, powered by an external adapter that puts out a single 12V line. All further conversions are done on the motherboard somehow.


Badly socketed ram was one of the reasons my PC started failing after being on for a while. When everything was cool it was all fine, when the case and everything heated up a bit it failed eventually. Re-seating the ram fixed it and ran for quite a time without issues. This was in the early Athlon days though.


It's good that with DDR5, consumer memory gets some super basic ECC on die, so hopefully the next generation of memory will make problematic sticks more obvious (or at the very least prevent damage). ECC won't save you from every memory corruption, but it'll save your data at least.

Personally, I would've just checksummed the individual failing files rather than the disk image and only backed up the bad files separately. There are all kinds of ways for a disk image to fail and I wouldn't spend a second longer on it than absolutely necessary. The whole memtest permutation setup also would've been too much work for me; I would just declare the motherboard faulty when two sticks that otherwise pass the test fail in specific configurations. A new motherboard is cheaper than super specific RAM sticks.


I watched the full video. It was long but very informative. The humor at times made up for the length, and the presenter showed a lot of deep knowledge that most people won't have. My biggest gripe is that they just didn't try replacing the RAM sticks in the first place. I get that they wanted to do a root cause analysis, but geez, the time and patience they had to do all those memory tests. No wonder they did a video about it, because otherwise that lost time would have been painful. I was baffled as well that dd and ddrescue work differently in how they utilize the RAM. Caught me off guard.

Onto the discussion of ECC RAM. In a perfect world, all memory would be ECC... but try finding some high performance 16GB sticks of ECC DDR4 RAM like what you'll see in gaming computers. I don't even think they make anything comparable in terms of speed, and definitely not cost. I guess you don't really know that you needed ECC until it's too late.


> " I guess you don't really know that you needed ECC until it's too late."

I spent many years doing hardware consultation and was amazed at all the times I had to explain it was just what-if insurance, like the other risks their business was mitigating against. Sometimes they'd even decide they needed to save costs with non-ECC RAM when the difference was $4 a GB, or (during the FB-DIMM era) when there wasn't even an option to avoid it.

Never really understood the resistance towards it.

Maybe the lack of evidence before the Google study and people thinking RAM manufacturers were trying to rip them off or something.

The "never had a problem so why would I need" it attitude with no way to know if an issue was caused by a bit flip was most baffling.


> ...but try finding some high performance 16GB sticks of ECC DDR4 RAM like what you'll see on gaming computers.

Here ya go:

https://nemixram.com/16gb-ddr4-3200-pc4-25600-ecc-udimm-2rx8...

It doesn't have pretty lights on it, but it does seem to be in the same speed class that gets called "gaming RAM" by a _whole_ bunch of retailers.


For real. I am going to have to look into this more. While it isn't by a global retail manufacturer, the price is like a fraction of most other RAM so I am curious. Does NewEgg or Amazon sell it directly, and if not why not?


NEMIX sells through Newegg, but it's one of those "Newegg routes the sale through the seller" things, rather than "Newegg buys the thing from the seller and handles the shipping and everything" thing. In my experience, the price is the same whether you order through Newegg's website, or through NEMIX's website.

Unless you have a particular reason to keep your order history in Newegg, just buy direct from NEMIX.

(IDK about NEMIX's relationship with Amazon, as I don't buy things from Amazon.)


There are no adequately priced ECC UDIMMs because the market is so small. Servers use registered DIMMs, so it's a very small niche. But for the gaming overclock application it would actually be a great fit. Because ECC gives you a great early indication of failure, verifying that an overclock is stable enough to use becomes much easier. My hope for ECC availability is that one of the gaming brands decides to upsell it as a feature, and then we can build nice workstations with fast ECC RAM. I'll even put up with RGB for it if I must. AMD board manufacturers have been in a great position to do this since the beginning of Ryzen. Too bad none of them have.


Another important aspect of ECC memory is whether your machine can properly respond to the interrupt. ECC without reporting and handling is not working ECC. I have been using Ryzen machines with ASRock motherboards, and the Linux EDAC driver reports errors correctly (verified by overclocking the memory sticks to the edge). On the other hand, I owned a Dell Precision mobile workstation with a Xeon W-11955M processor and ECC SODIMM sticks. However, since its EDAC driver never really landed in the mainline kernel, I won't get any ECC reports, so the process that has a bit flipped will not get killed.
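On Linux, the EDAC counters mentioned here are exposed through sysfs, so you can poll them yourself. A small sketch reading the standard EDAC layout (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`); it returns an empty result when no EDAC driver is loaded:

```python
# Read corrected (ce_count) and uncorrected (ue_count) error counters for
# each memory controller from the Linux EDAC sysfs tree. A nonzero and
# growing ce_count on one controller is the early warning ECC gives you.
import glob
import os

def edac_counts(root="/sys/devices/system/edac/mc"):
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if os.path.isfile(path):
                with open(path) as f:
                    entry[name] = int(f.read().strip())
        if entry:
            counts[os.path.basename(mc)] = entry
    return counts
```

Tools like `rasdaemon` do the same thing more thoroughly (per-DIMM attribution, logging), but this is enough for a cron job that alerts when a counter moves.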


This is especially true for NAS, and is why you need BTRFS or preferably ZFS. Unfortunately none of the consumer NASes offer ZFS, and BTRFS is still not a default option. Neither Synology nor QNAP seems to care.


I had RAM go bad after running an 18T (5 HDDs in RAID1) btrfs system in a closet for years. Btrfs of course noticed and fixed most of the corrupted blocks automatically. But eventually the system failed: the tree that contains the checksums for all the other trees got corrupted in both copies of one node. I fixed the hardware problem and then had to use a hex editor to set the checksum manually to the correct value (I modified the kernel to print the expected value). Now the system has been stable again for 3 years.


You should do a write up. May help someone else in the future.


Synology offers ECC in 2023 consumer models such as the 923.

Still, the btrfs experience Synology provides is nowhere near as good as ZFS (due to a lot of limitations).


I bought a 6 disk Synology a few months ago and it came with ECC by default. I did a cursory web search about this just now and ECC support appears to be the norm for 22 (as in the year 2022) model revisions and newer (thankfully!).


It’s because they use AMD CPUs now (already in the 21 models). The trade-off is that those CPUs have worse hardware codec support than the Intel ones they previously used, if you want to do video transcoding.


Intel sold i3s for years that had ECC unlocked. I run an i3-8100 on an X11 mobo in my TrueNAS box.


For anyone curious (as I was once, and looked into it): Dell and Lenovo both ship laptops with Xeons and ECC memory, though they are very expensive.


I have a Lenovo P52; it came with 16 GB of ECC. The DIMM is not super-expensive, but it's hard to find, and only comes in half the size of regular DIMMs.

I replaced it with non-ECC memory and never had an issue. In fact you can combine ECC and non-ECC memory without any problem.


Yep, you can get it on the higher end trims of the P-series Thinkpads.


Near the end, the video mentions his (old) computer "apparently can't take 'high density' memory".

Then the author goes through a process of randomly buying memory and hoping it works.

Anyone know what that "high density" memory problem is about? Maybe it's a misunderstanding of memory channels and ranks?


Not sure if it's what he means, but in my experience 'high density' refers to fewer chips for a given capacity.

When I used to sling PC hardware, as memory densities increased you would run into things like 'this motherboard will only take a 256MB module if it has 8 chips on both sides (16 chips total); it will not take a 256MB module with 8 chips on one side (8 chips total)'. Depending on the board, it might register only part of the capacity and be stable, might register part of the capacity and be unstable, or might not boot at all.


Interesting. Have heard mention of similar over the years, but never seen it in person myself.

Sounds strange, as - without knowing any better - I'd expect the memory interface to be agnostic to the chips implementing it. eg "address 0x11111111" on 'high density' ram would be seen exactly the same as "address 0x11111111" on other ram

The mind boggles. :)


Data written but not read back may as well go straight to /dev/null. Always read back and verify integrity.


We've got a server that keeps rebooting due to a bad ECC DIMM chip. I thought the whole point of ECC was to keep the server going until we can replace the DIMM?


ECC can typically correct 1 bit errors and usually detect (and fault) on 2 bit errors.

Your server is faulting and preventing itself from corrupting data.


ECC corrects 1-bit errors and detects 2-bit errors. It does not handle all hardware failures, however.

An actual bad chip may fall into a different category, I think that's when technologies like Chipkill[0] might come into play.

[0] - https://en.wikipedia.org/wiki/Chipkill
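The SEC-DED behavior described in these comments can be sketched as a toy Hamming(72,64) code: 64 data bits plus 8 check bits equals 72 bits, which is where the ninth chip on an ECC DIMM comes from. This is a simplified illustration of the principle, not the exact layout real memory controllers use.

```python
def encode(data: int) -> list[int]:
    """Return 72 bits: index 0 is overall parity, 1..71 is the Hamming layout."""
    bits = [0] * 72
    d = 0
    for pos in range(1, 72):
        if pos & (pos - 1):              # not a power of two: data position
            bits[pos] = (data >> d) & 1
            d += 1
    for p in (1, 2, 4, 8, 16, 32, 64):   # check bit p covers positions with bit p set
        bits[p] = 0
        for pos in range(1, 72):
            if pos != p and pos & p:
                bits[p] ^= bits[pos]
    bits[0] = sum(bits[1:]) % 2          # overall parity enables double-error detection
    return bits

def decode(bits: list[int]) -> tuple[int, str]:
    """Return (data, status): status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 72):
        if bits[pos]:
            syndrome ^= pos              # XOR of set-bit positions = error location
    parity_bad = sum(bits) % 2 == 1      # odd number of flips somewhere
    bits = bits[:]
    if syndrome and parity_bad:          # single-bit error: fix it in place
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                       # two flips: detectable but not fixable
        return 0, "uncorrectable"
    else:                                # at most the overall parity bit flipped
        status = "corrected" if parity_bad else "ok"
    data = 0
    d = 0
    for pos in range(1, 72):
        if pos & (pos - 1):
            data |= bits[pos] << d
            d += 1
    return data, status
```

Flipping any one of the 72 bits decodes back to the original word with status "corrected"; flipping any two is flagged "uncorrectable", matching the 1-bit-correct / 2-bit-detect behavior described above.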


IIUC it's the kernel EDAC subsystem rebooting the server for you. If you turn it off you'll get non-ECC behaviour, but in this case you'll also get really interesting system issues because as the sibling comments have said your RAM is actually faulty :P


It sounds like it's working as intended. Redundancy doesn't help you with reliability once your system is in a degraded state. Running a server with a known bad ram stick is like driving a car on run-flat tires. It will probably get you where you need to go, but you really should be driving to a tire shop.


Ok, thanks, I understand things better now. I'll pester my boss to get a new DIMM asap.


Got a Gen 10 HPE MicroServer for my NAS running some AMD dual-core SoC - factory-fitted with ECC memory, running Unraid.

Think it mysteriously crashed once or twice in the 4 years I've had it, and the HP diagnostic light came on.


It would be so nice if dd and ddrescue could calculate hashes while copying.
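In the meantime, a common workaround is to fork the stream through tee so the checksum covers exactly the bytes that were copied. A sketch with a throwaway source file; in real use `src` would be the device, e.g. /dev/sdX:

```shell
# Sketch: hash while copying in a single pass, since dd won't do it itself.
src=$(mktemp)
dd if=/dev/urandom of="$src" bs=64K count=4 status=none   # throwaway source
dd if="$src" bs=64K status=none | tee /tmp/disk.img | sha256sum
sha256sum "$src" /tmp/disk.img   # both should match the hash printed above
```

This reads the source only once; re-hashing the image afterwards (last line) additionally verifies the write path, per the "always read back" advice elsewhere in this thread.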


I'm curious what the actual manufacturing costs for ECC DRAM is compared to regular DRAM. Is it considerably more expensive, or just the usual over-charge because it's better?


It has an extra error-correcting chip on the DIMM, 9 total instead of 8. Don't be fooled by the on-die ECC marketing that's come out recently; you need the 9th chip for parity.

https://community.fs.com/blog/ecc-vs-non-ecc-memory-which-on...


It’s exactly 1/8th more.


Afaik ECC memory is slower than normal memory, so it does not impress the folks who base their purchase decisions on benchmark scores rather than utility and best bang for the buck.


I have 128 GB of non-ecc memory in my notebook, never detected a single error, and has been on 24/7 for more than 4 years.

Unless you live over 4000 meters above sea level, like to compile while flying, or live close to an unshielded nuclear reactor, you don't need ECC.

And most memory problems you can fix by better cooling, and better shielding.


> I have 128 GB of non-ecc memory in my notebook, never detected a single error

How would you know?


I know that I never detected them. Also it never crashed nor malfunctioned in any way (thanks in part to Linux), so any undetected error didn't cause any problems.


There have been plenty of studies that measure bit flips in memory. If you really wanted to measure, you'd find them without much difficulty.

Yes, though: when you have gobs of memory, you're unlikely to see the effects of bit flips. That's not the same as them not occurring.


My 4-year-old Mac Mini had been showing some flakiness, and finally I got an error about "hash mismatch detected on volume" - I ran memtest and sure enough the RAM had gone bad. I rely on this machine for all my media, backups, etc.; goodness knows what data I may have lost while the RAM was bad. It's a moderately common problem - I found plenty of people with the same issue.


I’ve had 128GB of ECC memory with error reporting. Never logged a bit flip as far as I can tell.

Memory errors are heavily concentrated to bad modules. They’re not evenly distributed across all RAM. Always test the stability of your memory when doing a new build.


> Why ECC Memory Is So Important

Except that it's not for many use cases. It's great for servers but for people on their personal and/or work computer, it's simply not that useful.

Seriously: which percentage of developers have ECC on their development machine(s)?

As developers we live in a world of SSH, cryptographic hashes, checksums everywhere, Git repositories (that is a big one), Merkle trees, digital signatures, reproducible builds (which are gaining traction), etc.

Heck, I'm torrenting the latest Debian or Devuan .iso image. My torrent client is using every known trick under the sun to make sure that should anything go wrong, the broken data shall be discarded and re-downloaded. Download is done, I dd the image to some installation medium. I can then verify its checksum matches the official one. A bit flip didn't slip by unnoticed.

All the music I carefully ripped from my audio CDs? They're all cross-checked with an online DB of known bit-perfect rips. There's an accompanying file containing each song's hash and I can verify at anytime that all my files are 100% correct.

But really most of all I live in a world of Git repositories. My entire Emacs config is versioned under Git (I know YMMV but I like it that way). Some people version under Git their entire user dir.

Tell me how my lack of ECC is going to really make life miserable here?

I have nothing against ECC... But if I want to upgrade my AMD 3700X to a 7700X, apparently I cannot get ECC.

And that's totally fine: I certainly won't discard the 7700X because I cannot get ECC for it.

And if anything looks suspicious, running Memtest is the first thing you should do.

I've had bad RAM at times. I'm still there.


> But really most of all I live in a world of Git repositories. My entire Emacs config is versioned under Git (I know YMMV but I like it that way). Some people version under Git their entire user dir.

> Tell me how my lack of ECC is going to really make life miserable here?

Git will break if RAM is bad just like anything else. That it checksums everything won't save you from checking in corrupt data, the filesystem itself being corrupt, or some internal git structure becoming corrupt. Losing your repo because something in it was written wrong is very much a possibility.

Having multiple machines involved helps, but it's not a complete fix, because the possibility exists of something damaged being transmitted from a broken machine to a good one, ensuring there's no good copy anywhere.

There's really nothing software can do to operate correctly with bad RAM all of the time. Instructions for the software are in RAM. The OS that the software expects to behave right is in RAM. Various buffers used for disk access and networking are in RAM. An application like git assumes all of that is performing correctly, and can't compensate for every possible malfunction that could happen.


How trustworthy is git-fsck for detecting random-bit corruption?


Depends on how you look at it.

For actual verification, unless there's a bug, my understanding is that it's very trustworthy. But by that point it's already too late. Okay, you know something is broken, but that won't give you your good data back.

But that only tells you that Git data is intact and that all the hashes match. If git got a corrupted file to start with, then correctly hashed it, everything will verify 100% and still be broken.
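As a quick sanity check of the first point, here's a throwaway-repo sketch (paths and contents made up) showing git fsck flagging a loose object that was corrupted after it was stored - while, per the second point, it can say nothing about a file that was already bad before `git add` hashed it:

```shell
# Sketch: git fsck detects on-disk object corruption after the fact.
repo=$(mktemp -d)
cd "$repo" && git init -q
echo "good data" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm init

git fsck --full && echo "clean"

# Overwrite a few bytes inside a loose object (the on-disk analogue of a flip):
obj=$(find .git/objects -type f | head -n 1)
chmod u+w "$obj"
printf 'XXXX' | dd of="$obj" bs=1 seek=4 conv=notrunc status=none
git fsck --full || echo "fsck detected the corruption"
```

Note fsck only tells you the repo is broken; recovering the data still depends on having an uncorrupted clone somewhere else.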


> Seriously: which percentage of developers have ECC on their development machine(s)?

Every single one of my non-mini computers uses ECC RAM except for my laptop. If someone would release a Framework laptop motherboard that supports ECC RAM (and preferably RISC-V) I'd finally be able to close the reliability gap. It blows my mind that we say "Well sure, we COULD make infallible RAM, but that would cost a tiny bit extra, so instead let's just hope nothing bad happens." That's right up there with not wearing a seat-belt because I haven't needed one yet.


I develop on VMs on my server (with ECC and RAIDZ) which i access using SSH or VDI protocols. I would love to have ECC on all my machines but that isn't feasible until the status quo changes, so I stick with the remote approach. To me that's an acceptable tradeoff as the non-ECC desktops/laptops are just used as dumb terminals while the real work happens on the reliable server.

I can't speak for ECC at the moment, but ZFS has definitely saved me from data corruption that would otherwise have been left to manifest.


"If someone would release a framework laptop motherboard that supports ECC ram..."

This. If I could, I would. The only reason my laptop doesn't have ECC is because the manufacturer doesn't offer the option in any machines I otherwise want.

That comment was very misguided in trying to suggest that there is any valid excuse to tolerate unreliable execution hardware. git and ssh and md5sums do not mean that it's ok if your very brain can't be trusted to deliver data from one part to another within itself, or spit back the same data that was put in a cell. Everything else is built upon that!


But if I open up an editor and write a document then that could be corrupted in RAM and then the corrupted data saved to disk. The document is likely to be more important than some ripped CDs or a git repo that I can download again.

The CPU, RAM and mobo manufacturers need to get together and make ECC RAM mandatory. It's absurd that we have machines with gigabytes of storage using microscopic (nanoscopic??) capacitors that doesn't have this basic protection. And honestly this should have happened years ago.

(Edit: And before anyone says DDR5 is ECC by default, that's not quite true although the difference is a bit subtle: https://en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...)


>But if I open up an editor and write a document then that could be corrupted in RAM and then the corrupted data saved to disk. The document is likely to be more important than some ripped CDs or a git repo that I can download again.

Devil's advocate: if it's just some bits in a character flipped, it's entirely recoverable, while flipping some bits in a compressed video stream would corrupt much more.


Agreed. One problem with RAM errors (which I've actually experienced) is they are insidious. You probably won't notice them immediately so the error can be propagated, and they're very very difficult to diagnose if you do notice them.

Back in the 90s we had a database server which had a stuck bit in memory normally mapped to the page cache. This caused sectors to be written to the backing software RAID which couldn't be read back in (because I think some checksum was corrupted when written and then failed when read back). It took an absolute age to diagnose this. I think I only worked it out by eliminating everything else.


I went out of my way for ECC because losing files can mean losing days of work. With how much a sweng gets paid it's worth the ECC premium if it saves me a few days of work over the life of the machine.

Recently I discovered that one of my SSDs was quietly failing without setting off any warnings; doing a chkdsk showed that some files had already gotten corrupted. One of them was my backblaze backup index!

Even though I have automated backups (backblaze + macrium backups to a NAS), recovering files from them is non-trivial. If I were to lose work to non-ECC ram who knows how long it would take me to reconstruct a known-good work environment and file set. Imagine if you're working on something huge and hard to validate like neural net weights where corruption can occur silently and be hard to detect after it's happened?


My favorite is when data silently corrupts, and then is happily propagated to your off-site recovery :/


Yeah, I don't know when the drive started failing, so in practice my backups are probably all screwed too. My only option would be to get a spare drive, restore a backup to it, then do a block level diff of the current and backup volumes and try to figure out whether any of the differences are file corruption.


Probably I'm saying nothing you haven't thought of, but ZFS is great for preventing this. If you also pair it with ECC, then you've eliminated most ways to cause corruption.


Checksums don't save you from memory corruption. If your data gets corrupted in memory, you will just end up checksumming and committing bad data. Or your checksum could get corrupted, and you commit a checksum that doesn't match your data. Checksums are more useful for safeguarding against disk or network corruption (although you shouldn't have network corruption issues over TLS or SSH).

Apparently Ryzen 7000 cpus can use ECC. I've heard reports that AMD needs to release an AGESA update, though, and ECC DDR5 memory availability is terrible. I'm hopeful that the situation will improve, because I also want to update my desktop. I've been using ECC memory since losing a filesystem on a desktop when a DIMM went bad.


When people complain about these random OS crashes and freezes it's usually RAM corruption at fault.

Per 1gb of RAM, you can expect to see 266 bit errors per month[1-2] if you are using your PC 16h per day. Multiply that by 64GB or 128GB of RAM and it's crazy to think that you won't run into any of the stability issues.

[1] https://static.googleusercontent.com/media/research.google.c... [2] https://en.wikipedia.org/wiki/ECC_memory#Research


> Per 1gb of RAM

The Google study you linked says "25,000 to 70,000 errors per billion device hours per Mbit".

Assuming bits, {25000 to 70000}/1e9 * 30 * 16 * 1000 = 12 to 33.6 bit errors per month. Assuming bytes, it's 96 to 268 bit errors per month.

Apparently you meant bytes, not bits. (I'm a pedant, but was also just unsure and interested in the numbers.)

FWIW, a comment I ran across that confirms toast0's view[1]: "It's a bimodal distribution - you either have many errors (due to a defect somewhere) or basically zero. If you're on the good side of the distribution, with only extremely rare errors, then you probably don't need ECC. But without ECC, you don't know whether you need ECC!"

[1] https://www.realworldtech.com/forum/?threadid=198497&curpost...
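That arithmetic can be reproduced in a few lines (a sketch using the same assumptions: the study's rate, 16 hours/day over a 30-day month, and "1gb" read as either gigabit or gigabyte):

```python
# Google study rate: 25,000-70,000 correctable errors
# per billion device-hours per Mbit.
hours = 30 * 16  # 16h/day for a 30-day month

for rate in (25_000, 70_000):
    as_gigabit = rate / 1e9 * hours * 1_000    # "1gb" = 1,000 Mbit
    as_gigabyte = rate / 1e9 * hours * 8_000   # "1gb" = 8,000 Mbit
    print(f"{rate}: {as_gigabit:.1f} or {as_gigabyte:.1f} bit errors/month")
```

This lands on 12 to 33.6 errors/month for the gigabit reading and 96 to 268.8 for the gigabyte reading, so the parent's "266" figure only works if "gb" meant gigabytes at the top end of the study's range.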


I don't think this rate is reasonable to use in this manner. I've run thousands of servers with lots of ECC RAM (quite a few servers had 768GB of RAM, but most were more like 32 or 64). The vast majority ran for years with zero reported errors. A small handful would have major failures of thousands of errors in an hour, but we would replace at hundreds in an hour. A couple of servers developed a periodic report of one or two (correctable) errors per day. If 266 bit errors per month per 1Gb were a usable rate, all of our servers would have been throwing correctable ECC errors all the time.

But I didn't have time or desire to publish a study on our experience, so there's no hard numbers.


>266 bit errors per month

That seems like an overestimate. How can memtest ever pass on non-ECC RAM if errors are that frequent?


Because memtest only looks at values it wrote out very recently, before they have had a chance to flip.

Memtest is looking for reliable failures, not evanescent one-off events.


SDRAM is continuously refreshing all cells. How long ago data was written doesn't make a big difference (aside from the case where you're reading data immediately after writing or reading it).


This does not, of course, make any sense. Any given memory cell will be read once per, say, millisecond, and the value written back. Once it has flipped, the wrong value will be written back, and after that the wrong value is read back out and rewritten again, indefinitely. Errors are sticky, and accumulate. (Flipping back again is negligibly unlikely.)

With ECC in the refresh path, such an error could be corrected and the right value would be overwritten over top of the bad one. Then errors would not accumulate, but would instead be "scrubbed". Mainframe machines scrub their RAM. Disks too.


> SDRAM is continuously refreshing all cells

...so? We're not talking about slow deterioration (which is why refresh is needed); we're talking about a bit flip from cosmic rays, where the cell changes state completely.


> which percentage of developers have ECC on their development machine(s)?

That question won't help you evaluate the demand for ECC simply because the supply is strangled. Those who want it have to make compromises to get ECC: get a Xeon, or get one of few AMD motherboards with matching CPUs and overpriced RAM without a guarantee that it will end up working.


> ...get one of few AMD motherboards with matching CPUs and overpriced RAM...

I mean, if you want a _guarantee_, then sure, get one of those certified-for-ECC motherboards.

But, like, as far as I know, going _at least_ as far back as the Phenom II (released in 2008), AMD desktop processors have always supported ECC RAM. And -as far as I know- ASUS motherboards for said processors have always supported dropping in ECC RAM (and Linux and memtest and friends have always agreed that ECC was enabled and functioning in such a system).

Source: Personal experience with Phenom II, Threadripper, and Ryzen 5 CPUs and ASUS motherboards, and looking-from-a-distance at the rest of the AMD CPUs between the Phenom and the Ryzen 5.


And condoms are only 99.8% effective, so let's not use it? You can always pull out in time or verify checksu^W^W Plan B?

> But really most of all I live in a world of Git repositories

That's great for you, but 99% don't even know what Git is, just like checksums, cryptographic hashes, Merkle trees, digital signatures and reproducible builds.

You just miss the one important thing: ECC isn't that helpful where you have the means to check and verify the data. ECC is the only way to at least know that something is happening to your data when there is no way to check.

To give you a slight idea, I'll tell you an anecdote from my L1 support days almost two decades ago:

I visited a client who claimed that the PC was working erratically and constantly threw weird error messages.

Welp, the usual deal: just some ugly software or a virus. In the first 3 minutes I got like 5 errors about failing to load a .dll from C:\WINDOWS\SYSTEM33\USER32.DLL. Nothing unusual, just need to.. WAIT. Why System33? A stupid virus masquerading as a well-known folder? Doubtful. So I go to C:\Windows and I see:

  System32
  System33
  System34
If you are a smart fellow, you have probably already figured out that this was a bit-flip error which somehow made it into the in-memory copy of the MFT of the system drive. And that is the only reason the user noticed it - because sometimes programs wouldn't start and sometimes there were weird error messages. If that error had been in the data area of some program, it wouldn't have been discovered at all.


Hang on... to go from "System32" to "System33" is one bit flip, but to go from "System33" to "System34" is three bit flips at once. Doesn't that seem astronomically more likely to be a malicious program than bad RAM?
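The flip counts are easy to check by XOR-ing the differing ASCII bytes and counting the one-bits:

```python
def flips(a: str, b: str) -> int:
    """Hamming distance in bits between two equal-length ASCII strings."""
    return sum(bin(ord(x) ^ ord(y)).count("1") for x, y in zip(a, b))

print(flips("System32", "System33"))  # 1 bit:  '2' (0x32) -> '3' (0x33)
print(flips("System33", "System34"))  # 3 bits: '3' (0x33) -> '4' (0x34)
print(flips("System32", "System34"))  # 2 bits: so two independent flips
                                      # of "System32" would also explain it
```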


It's astronomically more likely to be a manufacturing defect that affected multiple cells than a malicious program that ... wrote to System34? And how did replacing the DRAM module 'cure' that malicious program?



