Why ECC RAM matters, but probably not for most CAD design

When we put together a self-build kick-ass CAD workstation a couple of weeks ago, we managed to raise a few eyebrows with our seemingly reckless disregard for random access memory. We didn’t bother to spec a system with ECC RAM, and some of you questioned why we’d populate a CAD workstation with “the cheap stuff.” To get to the bottom of this we’re going to have to talk about billions and billions of bits, uncorrectable errors, hyperactive bananas, and supernovae. Quick, someone tweet at Neil deGrasse Tyson.

The universe wants to kill your CAD workstation

What causes memory errors? Technically speaking, everything. Computer memory errors fall into two classes: soft errors and hard errors. Hard errors are simply explained: some kind of physical damage causes one or more bits in memory to misbehave permanently. Soft errors are more esoteric: they are transient, momentary upsets caused by the surrounding environment. In their simplest incarnation, both hard and soft errors manifest as a bit flip, meaning a single bit of binary information in memory is altered, either from 0 to 1 or vice versa. Depending on which particular bits get flipped among the 64 billion available on the average 8GB machine, the flip could mean a catastrophic crash or nothing at all.
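To make "catastrophic or nothing at all" concrete, here is a minimal Python sketch of a single bit flip in a 64-bit floating-point number, the kind of value a CAD package might use for a dimension. The `flip_bit` helper and the `dimension` value are made up for illustration; nothing here comes from any particular CAD package.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one of its 64 bits inverted (0 = lowest mantissa bit, 63 = sign)."""
    # Reinterpret the double as a 64-bit unsigned integer, XOR one bit, convert back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))[0]

dimension = 1.0  # hypothetical model dimension

print(flip_bit(dimension, 0))   # 1.0000000000000002 (lowest mantissa bit: harmless)
print(flip_bit(dimension, 52))  # 0.5                (lowest exponent bit: part is half size)
print(flip_bit(dimension, 62))  # inf                (highest exponent bit: geometry is gone)
```

Same physical event in all three cases; only the position of the bit decides whether anyone notices.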

Hard errors are attributed to internal component failure, power surges, or physical abuse, such as somehow managing to pummel your workstation with a particularly vengeful spinning helicopter kick. Soft errors, however, originate from more fantastic sources such as:

  • Cosmic rays: A long, long time ago in a galaxy far, far away, a supernova ejects energetic protons that careen across the cosmos all the way to your office on planet Earth, where one happens upon a DIMM in your workstation, temporarily freaking a bit or two out.
  • Radioactivity: Trace amounts of radioactive isotopes like uranium-238 or thorium-232 occur naturally in the earth, and consequently are in pretty much everything, including the material used to make the memory chip itself. The alpha particles produced by decay of these trace materials can also flip bits. Fortunately, you needn’t worry about the 0.0001 g of potassium-40 in that pair of bananas you had for breakfast, because it decays mainly via beta radiation and emits no alpha particles. In that case, your memory is probably safe.

Here comes ECC to save the day… mostly.

Error Checking and Correction (ECC) RAM is a step above your friendly, neighborhood memory. ECC technology can’t prevent memory errors, but it can both detect and correct memory errors when they do happen, within certain limitations. Most ECC memory is engineered to detect and correct single-bit errors, meaning one errant bit in a 64-bit word of memory. While typical ECC memory can detect two-bit and some multi-bit errors, it can’t repair them. Such errors are uncorrectable. Certain exotic variations of ECC like IBM’s Chipkill can wrangle multi-bit errors, but they are a rather uncommon proprietary solution. One of the primary advantages of ECC is that at least you know when and how many bit flips are occurring; with regular memory you haven’t a clue. ECC allows you to truck along happily immune to single-bit errors, and in this respect is clearly superior to non-ECC RAM. However, understanding ECC’s value requires understanding how often memory errors occur and what causes them.
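The "correct one, detect two" behavior can be sketched with a toy code. The Python below is an extended Hamming code protecting 4 data bits with 4 check bits. It is an illustration only: real ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are invented for this example rather than taken from any memory controller.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values).

    Positions 1, 2 and 4 hold Hamming check bits, positions 3, 5, 6 and 7 hold
    data, and position 0 holds an overall parity bit for double-error detection.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1s even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1  # odd total parity means an odd number of flips
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        fixed[syndrome] ^= 1         # single flip: the syndrome names the bad position
        status = "corrected"
    else:
        return "uncorrectable", None  # two flips: detected, but cannot be located
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble

word = encode(0b1011)
word[6] ^= 1                 # one cosmic ray
print(decode(word))          # ('corrected', 11)
word[2] ^= 1                 # a second hit in the same word
print(decode(word))          # ('uncorrectable', None)
```

The second result is the important one: the code knows something is wrong but cannot say where, which is exactly the uncorrectable case described above.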

Abort, retry, fail

How often do bits flip? For some time, the most often quoted benchmark was an old IBM study that claimed approximately one flipped bit per 256 MB of RAM per month of runtime. The more memory you have, the higher the chance you’ll experience a bit flip. For someone working 40 hrs/wk on a workstation with 8GB of RAM, that translates to about 7 or 8 flipped bits a month. More recently, Google conducted an exhaustive 2-and-a-half-year study on their own server hardware that revealed some interesting insight into memory error rates. Some of the findings include:

  • Error rates were highly dependent on hardware configuration, with some platforms showing errors in 20% of the DIMMs while other platforms exhibited errors in only 4% of the DIMMs. Google conveniently omitted naming any specific vendors, unwilling to throw any suppliers under the bus.
  • Heavily utilized systems have considerably more errors, 2 to 3 times as many as less utilized systems. Google treated its specific server utilization figures as sensitive, but you can bet these machines are being hammered pretty hard 24/7.
  • Overall 8% of the DIMMs experienced at least 1 error per year. The rest didn’t. At all.
  • A DIMM that has experienced a correctable error is 9 to 400 times more likely to suffer from an uncorrectable error in the future.
  • Because error rates had such a strong correlation with utilization, hard errors are likely the dominant root cause over soft errors.
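For reference, the "7 or 8 flipped bits a month" figure derived from the older IBM rule of thumb is simple arithmetic. The sketch below reproduces it, assuming flips scale linearly with both memory size and powered-on hours; the function name and the 730-hour average month are choices made for this example.

```python
HOURS_PER_MONTH = 8760 / 12  # average calendar month, 730 hours

def expected_flips_per_month(ram_mb, hours_per_week, flips_per_256mb_month=1.0):
    """Scale the 'one flip per 256 MB per month of runtime' rule to a given machine."""
    blocks = ram_mb / 256                         # number of 256 MB blocks
    runtime_hours = hours_per_week * 52 / 12      # powered-on hours in an average month
    duty_cycle = runtime_hours / HOURS_PER_MONTH  # fraction of the month spent running
    return blocks * flips_per_256mb_month * duty_cycle

print(round(expected_flips_per_month(8 * 1024, 40), 1))   # 7.6 for a 40 hrs/wk workstation
print(round(expected_flips_per_month(8 * 1024, 168), 1))  # 31.9 for the same box running 24/7
```

The Google numbers above suggest the real picture is far lumpier than this linear model: most DIMMs saw no errors at all, while a minority saw many.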

The price of eternal memory vigilance

ECC modules are priced higher due to the extra error-correcting bits onboard and the fact that they are generally produced in lower volume than their non-ECC consumer brethren. Depending on the size and particular speeds involved, the ECC premium can be anywhere from 5% to 100%. The adoption cost exceeds the RAM price differential, as you will also need an ECC-capable motherboard, which in turn often requires a server-class processor. Overall, you’re looking at spending several hundred dollars to benefit from ECC memory protection with otherwise similarly performing hardware.

To ECC or not to ECC?

When it comes to most desktop CAD design, ECC largely doesn’t make economic sense for a self-build. Right off the bat, you’re spending money on an issue (correctable memory errors) that statistically will affect only 8% of your DIMMs in a year, and only if the hardware sees server-like utilization. At lower utilization, as is the case for most CAD workflows, the Google data suggests error counts are 2 to 3 times lower.

But perhaps that’s not enough justification for you. Know then that ECC is not a magic bullet, and requires a server-style maintenance philosophy to utilize effectively; otherwise it’s a waste. The Google data indicates that modules with correctable errors are up to 400 times more likely to suffer from uncorrectable errors. You should have one of two reactions to this:

  • Holy crap, I should be monitoring my ECC memory! Then you better read up on Windows Hardware Error Architecture (WHEA) and keep some spare sticks around. Get ready to spend both time and money.
  • Wait, I have to monitor my ECC memory? If you haven’t bothered, and think ECC will save your bacon on its own, you’re deluding yourself. You may have already suffered uncorrectable errors without noticing. If you’re happy with uncorrectable errors, then you would likely be happy with non-ECC RAM. You just spent your money for nothing. You’re doing it wrong.
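What "monitoring" means in practice is reading the error counters the platform exposes and acting when they move. WHEA is the Windows route; the Python sketch below swaps in the Linux equivalent, the kernel's EDAC counters under `/sys/devices/system/edac/mc`, because they are plain text files. It assumes a Linux machine with an EDAC driver loaded for the memory controller, and the function names are invented for this example.

```python
from pathlib import Path

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} from Linux EDAC sysfs counters."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = (ce, ue)
    return counts

def controllers_needing_attention(counts):
    """List controllers that have logged any error at all since boot."""
    return [name for name, (ce, ue) in counts.items() if ce > 0 or ue > 0]

if __name__ == "__main__":
    for name in controllers_needing_attention(read_edac_counts()):
        print(f"{name}: errors logged, time to find that spare stick")
```

Run from a scheduled job, something like this turns a nonzero correctable count into the early warning the Google data says it is. On an empty or missing directory it simply reports nothing.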

Finally, all of this assumes perfect software. While it seems really unpleasant to have a system crash because a star on the other side of the universe farted a million years ago, it’s peanuts compared to how many crashes and problems you’re going to experience because your CAD software is broken. Even in the case of a system crash, most file versioning and backup strategies are a more cost-effective investment. If you don’t mind rebooting, you don’t need ECC.

ECC only makes sense in server-like workflows such as FEM analyses or rendering, where a bump in the road costs hours of time. Well, unless you plan to design in space. But then you have a whole other set of problems, like how to keep George Clooney from eating the only piece of lettuce available.

Disclaimer: This article was written using non-ECC RAM and could be subject to eror. Oops.

  • goblin072

    I agree with most of this. But you might want to do some research on error rates above sea level. You do not have to be in space. Error rates go way up with altitude. Someone in the Rockies, Pakistan, etc. is going to get more bit flips than at sea level, and it's not a minor percentage.

    I think ECC is OK if you can actually monitor it, like on a Dell server. People that put it in a game system using a Xeon have no way to monitor it. They just have to go on faith that it is correcting errors, since it has no way to inform the user. If you get a bad stick, ECC can only do so much. If I could have ECC that would alert me, then I would have no issue putting it in a workstation. Those features seem to be reserved for server-class motherboards only. So get used to reboots, Joe Consumer.