
RAM Failure modes [long -- whiners don't read]

Started by Don Y April 26, 2016
On 4/26/2016 12:41 PM, Don Y wrote:
>> I have never noticed a memory failure in the last 30 years which could >> not be tracked down to something external to the memory, e.g. bad board >> connection, missing/bad bypass caps etc. > > How do you know you've had a memory failure? Or, have they been > "catastrophic" (hard to ignore)? Without ECC -- and runtime > tools that monitor and log those errors -- you can't say whether > you're experiencing none... or MANY! And, where the threshold lies > between "none", "some" and "many".
Here's a good starting point: <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf> Note you can argue that google's environment is more *benign* (their machines don't have motor controls colocated on the same PCB; they have a bigger budget for hardware/maintenance/monitoring; they actively control the environment; etc.). Or, you could argue that they are relying on commodity hardware/software instead of things designed for a specific task/application. <shrug> The same is true of much of the other literature...
On 26.4.2016 22:41, Don Y wrote:
> Hi Dimiter, >
Hi Don,
> On 4/26/2016 11:08 AM, Dimiter_Popoff wrote: > >> I think nowadays David's attitude is both the obvious and the correct >> one. > > Would you leave one of your instruments running (with code executing > out of RAM) for a year and expect the program image to be intact?
Of course, happens all the time. For many months at least.
> Would you leave 50 of them, side by side, and expect the same?
Yes (64 or 128 M DDRAM, no ECC).
> Errors *do* occur. How vulnerable you are to them is a different > issue. If a soft/hard error causes a light to blink at 3Hz instead > of 2Hz... <shrug>
Of course errors do occur. My point is "do you take preventive measures to not be hit by a meteor while crossing a rush hour street?".
> >> Leave memory testing to the silicon and board manufacturers, they have >> better means to test it than the CPU which uses it. If you need the >> feeling of some extra reliability use a part with ECC (and populate >> the chips for it....). > > But there aren't many parts that *do* support ECC, natively.
Well, silicon makers must have good risk assessment strategies for deciding when to put in ECC, which underscores my point. If a chip does not have it, in all likelihood it is because it does not need it. What is the point of ECC for a part which costs $10 or less and will be programmed in C or some other HLL where the programmer will never know exactly what the code does? The larger parts from Freescale/NXP do have ECC on their DDRAM controllers; IIRC those I have seen correct single bit errors and signal larger ones. How justified this is technically I just don't know, as I don't have their testing data collected over the years, but it clearly is economically justified (your system won't be discarded at some decision-making point because it lacks ECC - and at this size ECC is no big part of the cost).
> How do you know you've had a memory failure? Or, have they been > "catastrophic" (hard to ignore)? Without ECC -- and runtime > tools that monitor and log those errors -- you can't say whether > you're experiencing none... or MANY! And, where the threshold lies > between "none", "some" and "many".
Well, like I said, many systems running for many months without being reset is nothing special for me, so if there were some memory problem I'd have noticed it. I have noticed things much harder to imagine, like interference on an I2C line which could cause a hang (software correctable once I realized this occurred), and other events which occur very rarely and are very hard to detect. Memory issues have not been one of them. Then again, I speak only of a few systems I have designed and manufactured. Dimiter ------------------------------------------------------ Dimiter Popoff, TGI http://www.tgi-sci.com ------------------------------------------------------ http://www.flickr.com/photos/didi_tgi/
Hi Dimiter,

>> Would you leave one of your instruments running (with code executing >> out of RAM) for a year and expect the program image to be intact? > > Of course, happens all the time. For many months at least.
Then I suspect you are just not aware of the errors that are occurring -- or, that they are masked by "expectations", etc.

Using the error rates predicted in google's paper:

25,000 FiT/Mb * 64MB * 8 = 12,800,000 FiT
12,800,000 / 1,000,000,000 hrs = 12.8 per 1000 hrs
or, one every ~80 hours.

Using their high figure (75,000 FiT/Mb) cuts that to one error every ~1 day!

For a 128MB system, that's a range of 1 error every 12 - 40 hours. If you're not seeing these (have you verified the code image in RAM is unchanged?), then there is some difference in the embedded product (e.g., soldered-down RAM devices?), or a difference in the components you're using or the conditions under which they are operated (e.g., larger device geometries -- though some studies claim smaller geometries are not responsible for increases in error rates).
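For anyone who wants to check the arithmetic, here is a minimal sketch of the unit conversion (the only inputs taken from the paper are the two FIT rates; everything else is just scaling):

#include <stdio.h>

/* FIT = failures per 1e9 device-hours; DRAM rates are usually quoted
   per megabit, so scale by the array size in megabits. */
static double hours_per_error(double fit_per_mbit, double megabytes)
{
    double megabits  = megabytes * 8.0;
    double fit_total = fit_per_mbit * megabits;   /* errors per 1e9 hours */
    return 1.0e9 / fit_total;                     /* mean hours per error */
}

int main(void)
{
    printf("64 MB  @ 25,000 FiT/Mb: one error every ~%.0f h\n",
           hours_per_error(25000.0, 64.0));       /* ~78 h                */
    printf("64 MB  @ 75,000 FiT/Mb: one error every ~%.0f h\n",
           hours_per_error(75000.0, 64.0));       /* ~26 h (~1 day)       */
    printf("128 MB: one error every ~%.0f to ~%.0f h\n",
           hours_per_error(75000.0, 128.0),       /* ~13 h                */
           hours_per_error(25000.0, 128.0));      /* ~39 h                */
    return 0;
}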
>> Would you leave 50 of them, side by side, and expect the same? > > Yes (64 or 128 M DDRAM, no ECC).
With 50 units running concurrently (and independently distributed errors), you should see one of those machines experiencing an error every 15 - 60 minutes.
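That figure is just the per-unit interval divided by the number of units, assuming the errors are independent. A quick check against the 12 - 40 hour range above:

#include <stdio.h>

int main(void)
{
    /* 50 independent units, each seeing one error every 12-40 hours:
       the pool as a whole sees errors 50 times as often. */
    printf("fleet of 50: one error every ~%.0f to ~%.0f minutes\n",
           12.0 / 50.0 * 60.0,     /* ~14 min */
           40.0 / 50.0 * 60.0);    /* ~48 min */
    return 0;
}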
>> Errors *do* occur. How vulnerable you are to them is a different >> issue. If a soft/hard error causes a light to blink at 3Hz instead >> of 2Hz... <shrug> > > Of course errors do occur. My point is "do you take preventive measures > to not be hit by a meteor while crossing a rush hour street?".
There were 63 reported meteorites in the 2001-2012 period, WORLD-WIDE. Let's extrapolate that rate to an average lifetime -- say 500 (regardless of where you might be at the time). The planet's surface area is ~200 million square miles. Let's assume I am a one square mile target -- even when I'm indoors! So, in my lifetime, I stand a 500/200,000,000 chance of being hit by a meteorite. Spread over an 80 year span (roughly 700,000 hours... call it a million hours), that's a 500/(200M * 1M) chance of getting hit in any given hour -- about 2.5E-12, or the equivalent of ~0.0025 FiT. [lots of handwaving here to give a relative sense of scale]
>>> Leave memory testing to the silicon and board manufacturers, they have >>> better means to test it than the CPU which uses it. If you need the >>> feeling of some extra reliability use a part with ECC (and populate >>> the chips for it....). >> >> But there aren't many parts that *do* support ECC, natively. > > Well silicon makers must have good risk assessment strategies for > the decision when to put in ECC, which underscores my point.
No, they also have MARKETING strategies involved! PC's have been sold with ECC "optional" for many years -- despite the sizes of the memory complements installed! Because adding 15% to the cost of a DIMM would make the product "too expensive"?
> If a chip does not have it it is because in all likelihood it > does not need it. What is the point of ECC for a part which > costs $10 or less and will be programmed in C or some other > HLL where the programmer will never know what exactly does the > code do. > > The larger parts from Freescale/NXP do have ECC on their DDRAM > controllers, IIRC those I have seen correct single bit errors > and signal larger ones. How justified is this technically I > just don't know, don't have their testing etc. data collected > over the years, but it clearly is economically justified (your > system won't be discarded at some decision making point because > it does not have ECC - and at this size ECC is no big part of > the cost). > >> How do you know you've had a memory failure? Or, have they been >> "catastrophic" (hard to ignore)? Without ECC -- and runtime >> tools that monitor and log those errors -- you can't say whether >> your experiencing none... or MANY! And, where the threshold lies >> between "none", "some" and "many". > > Well like I said many systems running for many months without being > reset is nothing special for me so if there were some memory problem
I disagree. There are numerous ways for a memory error to slip through without disturbing operation in a noticeable (*verifiable*) way. Will your customers notice if the LSB in a raw datum is toggled? Have a read of: <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf> pay attention to the "not manifested" results -- cases where a KNOWN error was intentionally injected into the system but the system appeared to not react to it. As I say, I suspect errors *are* happening (the FiT figures suggest it and the experiment above shows how easily errors can slip through)
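The "not manifested" category is easy to reproduce on a small scale. Here is a toy illustration (not the methodology of the cited paper; the buffer and the range check are made up for the example): flip one random bit in a block of acquired data and see whether a downstream check ever notices.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Flip one random bit of a buffer, the way a DRAM soft error would. */
static void inject_bit_flip(uint8_t *buf, size_t len)
{
    size_t byte = (size_t)rand() % len;
    int    bit  = rand() % 8;
    buf[byte] ^= (uint8_t)(1u << bit);
}

int main(void)
{
    uint8_t samples[1024] = {0};      /* stand-in for raw acquired data */
    int     manifested    = 0;

    srand((unsigned)time(NULL));
    inject_bit_flip(samples, sizeof samples);

    /* An LSB flip changes a datum by one count -- inside the noise floor
       of most instruments -- so this crude "customer" check never trips
       on it; only flips of higher-order bits are noticed. */
    for (size_t i = 0; i < sizeof samples; i++)
        if (samples[i] > 1)
            manifested = 1;

    printf("fault %s\n", manifested ? "manifested" : "not manifested");
    return 0;
}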
> I'd have noticed it. I have noticed things much harder to imagine; > like interference into an I2C line which could cause some hang (was > software correctable once I realized this occurred) etc., and other > events which occur very rarely and are very hard to detect. Memory > issues have not been one of them. Then again, I speak only of a > few systems I have designed and manufactured.
On 27.4.2016 03:35, Don Y wrote:
> Hi Dimiter, > >>> Would you leave one of your instruments running (with code executing >>> out of RAM) for a year and expect the program image to be intact? >> >> Of course, happens all the time. For many months at least. > > Then I suspect you are just not aware of the errors that are > occurring -- or, that they are masked by "expectations", etc. > > Using the error rates predicted in google's paper: > > 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit > 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs > or, one every ~80 hours. > > Using their high figure (75000 FiT/Mb) cuts that to one error > every ~1 day! > > For a 128MB system, that's a range of 1 error every 12 - 40 hours.
Hi Don, I would first question the basic data you are using. Having never seen the google paper, I doubt they can produce a result on memory reliability judging by the memories on their servers. Knowing what a mess the software they distribute is, I would say just about all the errors they have attributed to memory failure must have been down to their buggy software. Again, I have not seen their paper and I won't spend time investigating, but I'll choose to stay where my intuition/experience has led me; I have more reason to trust these than to trust google.
> Have a read of: > <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf> > pay attention to the "not manifested" results -- cases where a KNOWN > error was intentionally injected into the system but the system appeared > to not react to it. > > As I say, I suspect errors *are* happening (the FiT figures suggest > it and the experiment above shows how easily errors can slip through)
Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital to survive months without being reset, there are measurements and experiments which just last very long. While damage to the data memory would be unnoticed - the data themselves are random enough - a few megabytes of code and critical system data are constantly in use, damage something there and you'll just see a crash or at least erratic behaviour. So my "mind the meteors while crossing a rush hour street in a big city" still holds as far as I am concerned. I have never looked at memory maker data about bit failures, I might pay more attention to these if available than I would to some google talk. Dimiter ------------------------------------------------------ Dimiter Popoff, TGI http://www.tgi-sci.com ------------------------------------------------------ http://www.flickr.com/photos/didi_tgi/
On Tue, 26 Apr 2016 13:05:56 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>> If you're just doing monitoring, and then preventive maintenance, >> based on an accumulated soft error rate, there again has been a fair >> bit of literature, but they all come to approximately the same >> conclusion - soft errors are pretty rare for most devices, and on a >> handful they tend to be much more common. So the exact threshold is >> actually not that important. So a DIMM getting a soft error every few >> months in ignorable, several per day is not, and there's little in the >> real world between those. > >Some studies show hard errors are more prevalent than soft; others >show the exact opposite. A google study (big data farms) claimed >~50,000 FiT/Mb. Further, it appeared to correlate error rates with >device age -- as if cells were "wearing out" from use. > >And, its not "a soft error every few months" but, rather, several thousands >per year (per GB so figure I'm at 1/4 of that -- per device node!)
On a per-DIMM basis, the Google paper has 8.2% of DIMMs experiencing one or more correctable errors per year, and of those 8.2%, the median number of errors is 64 per year (with the overall average being 3751!). They go on to mention that of the DIMMs with errors, 20% account for 94% of the errors.
On 4/26/2016 9:28 PM, Robert Wessel wrote:
> On Tue, 26 Apr 2016 13:05:56 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >>> If you're just doing monitoring, and then preventive maintenance, >>> based on an accumulated soft error rate, there again has been a fair >>> bit of literature, but they all come to approximately the same >>> conclusion - soft errors are pretty rare for most devices, and on a >>> handful they tend to be much more common. So the exact threshold is >>> actually not that important. So a DIMM getting a soft error every few >>> months in ignorable, several per day is not, and there's little in the >>> real world between those. >> >> Some studies show hard errors are more prevalent than soft; others >> show the exact opposite. A google study (big data farms) claimed >> ~50,000 FiT/Mb. Further, it appeared to correlate error rates with >> device age -- as if cells were "wearing out" from use. >> >> And, its not "a soft error every few months" but, rather, several thousands >> per year (per GB so figure I'm at 1/4 of that -- per device node!) > > > On a per-DIMM basis, the Google paper has 8.2% of DIMMs experiencing > one or more correctable errors per year, and of those 8.2%, the median > number of errors is 64 per year (with the overall average being > 3751!). The go on to mention that for the DIMMs with errors, 20% of > those account for 94% of the errors.
There are LOTS of holes in the study -- you'd need access to all the raw data *plus* things they probably haven't even considered recording (e.g., physical locations of the individual servers in their racks -- esp if you consider SEU's from cosmic rays probably affecting those at the top of their racks more than those "shaded" by the machines above). If you were doing this sort of thing for yourself, you'd try moving DIMMs, moving servers, etc. to try to identify the cause of the unresolved ambiguities reported.

Regardless, the takeaway is: can *you* predict what sort of error rate YOUR device will experience "in the lab"? What about "in the wild"? Do you know if your customer will be operating it at sea level or a mile (or more) up? Do you know how the design wears with age? etc.

Ages ago, you could build a DRAM controller out of discrete logic. Now, the complexity of timing signals for the various DDR technologies suggests you have to rely on the MCU vendor's implementation; is it guaranteed to be "bug free" in all possible combinations of scheduled cycles?

I can't see how you can rely on a one-time QUICK check of RAM to express any sort of confidence in the CONTINUING operation of a device -- short of catching permanent "stuck at" or "decode" faults. And, if you have to restart a device to get that information, then you're relying on implicit down time as part of your normal operating procedure -- like MS's "reboot windows" approach to reliability!
Hi Dimiter,

On 4/26/2016 6:21 PM, Dimiter_Popoff wrote:

>> Using the error rates predicted in google's paper: >> >> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit >> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs >> or, one every ~80 hours. >> >> Using their high figure (75000 FiT/Mb) cuts that to one error >> every ~1 day! >> >> For a 128MB system, that's a range of 1 error every 12 - 40 hours. > > I would first question the basic data you are using. Having never > seen the google paper I doubt they can produce a result on memory > reliability judging by the memories on their servers.
There have been other papers looking at other "processor pools" (workstations, other "big iron", etc.). Their data vary but all suggest memory can't be relied upon (without ECC -- or some other "assurance method"). Of course, bigger arrays see more errors. "Even using a relatively conservative error rate (500 FIT/Mbit), a system with 1 GByte of RAM can expect an error every two weeks" (note that's a 100 times lower error rate than google's study turned up, and 10 times lower than what other surveys have concluded).

And, if you treat your population of products as if it were a single collection of memory, that means SOMEONE, SOMEWHERE is seeing an error (and the thing they all have in common is the vendor from whom they purchased the product). Sun apparently had some spectacular failures traced to some memory manufactured by IBM.

Of course, SRAM is also subject to the same sorts of "upset events". And, SRAM is increasingly found in large FPGA's (e.g., XCV1000). "If a product contains just a single 1 megagate SRAM-based FPGA and has shipped 50,000 units, there is a significant risk of field failures due to firm errors. Even for such a simple system, the manufacturer can expect that within his customer base, there will be a field failure due to a firm error every 17 hours." And, of course, an SRAM error in an FPGA can cause the hardware to be configured in a "CAN'T HAPPEN" state (like turning on a pullup AND a pulldown, simultaneously).
> Knowing what > a mess the software they distribute is I would say about all the > errors they have attributed to memory failure must have been > down to their buggy software.
One of the researchers was not affiliated with google. Note that other similar experiments (conducted by other firms on other hardware) have yielded FiT's in the 20,000 range. It's not like google's numbers are an isolated report.
> Again, I have not seen their paper and I won't spend time investigating > but I'll choose to stay where my intuition/experience has lead me, I > have more reason to trust these than to trust google.
<frown> I don't like relying on intuition when it comes to product design. Just because you haven't seen (or, perhaps, RECOGNIZED) an error, doesn't mean it doesn't exist.
>> Have a read of: >> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf> >> pay attention to the "not manifested" results -- cases where a KNOWN >> error was intentionally injected into the system but the system appeared >> to not react to it. >> >> As I say, I suspect errors *are* happening (the FiT figures suggest >> it and the experiment above shows how easily errors can slip through) > > Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital > to survive months without being reset, there are measurements and > experiments which just last very long. While damage to the data memory > would be unnoticed - the data themselves are random enough - a few > megabytes of code and critical system data are constantly in use, > damage something there and you'll just see a crash or at least > erratic behaviour.
No, that's not a necessary conclusion. *READ* the papers cited. Or, do you want to dismiss their software/techniques ALSO? In that case, INSTRUMENT one of your NetMCA's and see what *it* reports for errors over the course of months of operation. The takeaway, for me, is that I should actually LOG any observed errors knowing they would represent just the tip of the iceberg in terms of what must be happening in normal operation -- but undetected in the absence of ECC hardware! Let my devices gather data.
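Absent ECC hardware, the cheapest instrumentation is a software scrub: checksum the (supposedly immutable) code image at boot and re-verify it periodically, logging any mismatch. A minimal sketch; the region symbols and the logging are hypothetical (in a real build the linker script would provide the boundaries, and a real scrubber would use a CRC32):

#include <stdint.h>
#include <stdio.h>

/* Boundaries of the RAM-resident code image (hypothetical linker symbols). */
extern const uint8_t __text_start[], __text_end[];

static uint32_t reference;            /* checksum captured once at boot */

static uint32_t image_sum(const uint8_t *p, const uint8_t *end)
{
    uint32_t sum = 0;
    while (p < end)
        sum = (sum << 1 | sum >> 31) ^ *p++;    /* rotate-and-xor */
    return sum;
}

void scrub_init(void)
{
    reference = image_sum(__text_start, __text_end);
}

/* Call from a low-priority task or a timer, e.g. once an hour. */
void scrub_check(void)
{
    uint32_t now = image_sum(__text_start, __text_end);
    if (now != reference)
        printf("scrub: code image changed! boot=%08lx now=%08lx\n",
               (unsigned long)reference, (unsigned long)now);
}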
> So my "mind the meteors while crossing a rush hour street in a big > city" still holds as far as I am concerned. > > I have never looked at memory maker data about bit failures, I might > pay more attention to these if available than I would to some google > talk.
Their silence is deafening. Given the "buzz" in the literature questioning the integrity of their products (after all, the sole purpose of MEMORY is to REMEMBER, *accurately*!), you would assume an organization with access to virtually unlimited amounts of memory would conduct and publish a comprehensive study refuting these claims!
On 27.4.2016 16:36, Don Y wrote:
> Hi Dimiter, > > On 4/26/2016 6:21 PM, Dimiter_Popoff wrote: > >>> Using the error rates predicted in google's paper: >>> >>> 25,000FiT/Mb * 64MB * 8 = 12,800,000 Fit >>> 12,800,000 / 1,000,000,000 hrs = 12.8/1000 hrs >>> or, one every ~80 hours. >>> >>> Using their high figure (75000 FiT/Mb) cuts that to one error >>> every ~1 day! >>> >>> For a 128MB system, that's a range of 1 error every 12 - 40 hours. >> >> I would first question the basic data you are using. Having never >> seen the google paper I doubt they can produce a result on memory >> reliability judging by the memories on their servers. > > There have been other papers looking at other "processor pools" > (workstations, other "big iron", etc. Their data vary but all > suggest memory can't be relied upon (without ECC -- or some other > "assurance method"). Of course, bigger arrays see more errors. > "Even using a relatively conservative error rate (500 FIT/Mbit), > a system with 1 GByte of RAM can expect an error every two weeks" > (note that's 100 times lower error rate than google's study turned up; > and 10 times lower than what other surveys have concluded)
Hi Don,

The more papers you read on the topic, the wider the interval of results will get. Apparently these have been done by people who have had some problem with their memory - or thought they had one and could not discover the true source of the error, typically a bug. And I am not saying there are no faulty memories and poorly designed boards where memory errors do occur - but the solution is just to have good silicon on properly designed boards. That is much more efficient than chasing errors which nobody knows when, if and why they occur.

Our testing here is running a newborn unit for 72 hours, measuring continuously with its HV at maximum (usually 5kV); we have never had a memory issue during this test and have never had one with devices in the field, some of which run for months without being reset, and that while being on a network.

Then again, at different densities the probability of an error might well be different. At DDR1 densities - I typically use 2 x16 chips to get 64 or 128 megabytes - I have not seen one error for years, and I am pretty good at spotting things if they are not right. Perhaps at gigabytes-per-chip densities things get worse. But then the controllers meant for such chips have ECC, which all but eliminates the probability of an error hitting you (if the silicon/board are good); even if you get 1 bit per hour, the probability of getting two at the same time at the same address is vanishingly small. Once I have a system with ECC to port DPS on, I'll probably be able to see how many memory errors the ECC sees and corrects; I strongly suspect they will still be 0, but we'll see. Anyway, at a few G of memory ECC makes sense I suppose.
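For what it's worth, the "two hits in the same word" intuition can be put on the back of an envelope. The numbers below are assumptions picked only for illustration (one raw single-bit error per hour, a daily scrub pass, 64-bit ECC words over 128 MB), not anything measured:

#include <stdio.h>

int main(void)
{
    /* Birthday-style estimate: an uncorrectable error needs two bit
       flips landing in the same ECC word before a scrub rewrites it.
       Assumes independent, uniformly spread flips; clustered or hard
       faults would do considerably worse. */
    double flips_per_hour = 1.0;            /* assumed raw error rate   */
    double scrub_hours    = 24.0;           /* assumed scrub interval   */
    double words          = 128e6 / 8.0;    /* 64-bit words in 128 MB   */

    double n     = flips_per_hour * scrub_hours;   /* flips per interval */
    double pairs = n * (n - 1.0) / 2.0;
    double p     = pairs / words;    /* chance some pair shares a word   */

    printf("P(uncorrectable per interval) ~ %.1e, one every ~%.0f years\n",
           p, 1.0 / p / 365.0);
    return 0;
}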
>> Again, I have not seen their paper and I won't spend time investigating >> but I'll choose to stay where my intuition/experience has lead me, I >> have more reason to trust these than to trust google. > > <frown> I don't like relying on intuition when it comes to product > design. Just because you haven't seen (or, perhaps, RECOGNIZED) an > error, doesn't mean it doesn't exist.
Well I said intuition/experience, we all use both all the time simply because we don't have many other options.
> >>> Have a read of: >>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf> >>> pay attention to the "not manifested" results -- cases where a KNOWN >>> error was intentionally injected into the system but the system appeared >>> to not react to it. >>> >>> As I say, I suspect errors *are* happening (the FiT figures suggest >>> it and the experiment above shows how easily errors can slip through) >> >> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital >> to survive months without being reset, there are measurements and >> experiments which just last very long. While damage to the data memory >> would be unnoticed - the data themselves are random enough - a few >> megabytes of code and critical system data are constantly in use, >> damage something there and you'll just see a crash or at least >> erratic behaviour. > > No, that's not a necessary conclusion. *READ* the papers cited. Or, > do you want to dismiss their software/techniques ALSO?
Well I am not sure I'll read it any time soon, have other things to do. But I may get back to it if at some point I feel I have a related problem or something. Dimiter ------------------------------------------------------ Dimiter Popoff, TGI http://www.tgi-sci.com ------------------------------------------------------ http://www.flickr.com/photos/didi_tgi/
On 28.4.2016 02:47, Dimiter_Popoff wrote:
> On 27.4.2016 16:36, Don Y wrote: >.... >>>> Have a read of: >>>> <http://www.cse.psu.edu/~mtk2/guw_DSN04.pdf> >>>> pay attention to the "not manifested" results -- cases where a KNOWN >>>> error was intentionally injected into the system but the system >>>> appeared >>>> to not react to it. >>>> >>>> As I say, I suspect errors *are* happening (the FiT figures suggest >>>> it and the experiment above shows how easily errors can slip through) >>> >>> Oh come on, for nuclear spectrometry gadgets - e.g. an MCA - it is vital >>> to survive months without being reset, there are measurements and >>> experiments which just last very long. While damage to the data memory >>> would be unnoticed - the data themselves are random enough - a few >>> megabytes of code and critical system data are constantly in use, >>> damage something there and you'll just see a crash or at least >>> erratic behaviour. >> >> No, that's not a necessary conclusion. *READ* the papers cited. Or, >> do you want to dismiss their software/techniques ALSO? > > Well I am not sure I'll read it any time soon, have other things to do. > But I may get back to it if at some point I feel I have a related > problem or something. >
Hi Don, I had a look at the paper (just the abstract). It is not really relevant; they compare the error immunity of different processors, but this has little to do with RAM errors and my not seeing them. They try to crash linux by injecting errors - well, given that it is written in C and bloated by at least a factor of 10 (on occasions 100+ times) relative to its DPS equivalent (written in vpa), no wonder there is plenty of wasted room in RAM which can be damaged with no consequences. I am quite sure dps won't survive a fraction of their intentional memory damage - remember, I am programming in it all day and I know what happens if my code does something stupid... I am not saying their results are invalid, just not applicable to how I estimate the possibility of memory errors; the bloat factor difference is just way too big. Dimiter