Memory testing
Started by ●June 3, 2015

Hi,

For moderately large to very large memory (DRAM) subsystems, what sorts of policies are folks using to test RAM in POST? And in BIST (presumably more involved than POST)?

The days of device-specific test patterns seem long gone, so cruder tests seem just as effective and considerably faster. E.g., I typically use three passes of a "carpet" pattern: seed an LFSR -- or any other PRNG -- write the byte to the current address, kick the RNG, rinse, lather, repeat; then reseed the LFSR, reset the address, read the byte at the current address, compare it to the RNG state, kick the RNG, rinse, lather, repeat. I expect any problems to manifest as gross failures (rather than checking for disturb patterns, etc.). [Of course, the period of the PRNG is chosen to be long and relatively prime with respect to any of the addressing patterns.]

BIST just changes the number of iterations, with protections on certain key parts of the address space.

The tougher issue is testing "live" memory in systems that are "up" 24/7/365...
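As a concrete sketch of the above (not the poster's actual code): a C version of the carpet test, assuming a 32-bit maximal-length Galois LFSR and byte-wide accesses. The pass count and region are parameters, so POST and BIST can share the routine -- BIST just iterates more, and the caller carves out any protected parts of the address space.

#include <stdint.h>
#include <stddef.h>

/* Galois LFSR step.  0x80200003 selects taps 32,22,2,1, a maximal-length
 * polynomial, so the period (2^32 - 1) is relatively prime to any
 * power-of-two addressing pattern.  Seed must be nonzero.              */
static inline uint32_t lfsr_next(uint32_t s)
{
    return (s >> 1) ^ (-(s & 1u) & 0x80200003u);
}

/* One invocation = `passes` carpet passes over [base, base+len).
 * Fill from the PRNG stream, then re-seed, re-walk, and compare.
 * Returns the first failing address, or NULL if the region passed.    */
static volatile uint8_t *carpet_test(volatile uint8_t *base, size_t len,
                                     uint32_t seed, unsigned passes)
{
    while (passes--) {
        uint32_t s = seed;
        for (size_t i = 0; i < len; i++) {      /* write phase */
            base[i] = (uint8_t)s;
            s = lfsr_next(s);
        }
        s = seed;
        for (size_t i = 0; i < len; i++) {      /* verify phase */
            if (base[i] != (uint8_t)s)
                return &base[i];
            s = lfsr_next(s);
        }
        seed = lfsr_next(seed);                 /* new pattern each pass */
    }
    return NULL;
}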
Reply by ●June 4, 2015
Systems running 24/7/365 are much more likely to have disc failure before memory failure. But for embedded systems there may not be any disc storage. Still, I think the lifetime of the RAM is likely about the same as the CPU's, so it is not worth the effort, since the device may be replaced before failure starts.

Even a pacemaker has a finite lifetime.

Given that, you must have some control over the system. You're working at the BIOS level, right? So can you tell what RAM is unused? That's the easy case. Can you force the higher-level software to yield control briefly for random checks?

Otherwise, the only way I can see is to somehow watch the higher-level execution and check its reads and writes. But short of single-stepping, I don't know how to do it. Custom memory hardware? A dual-ported memory management unit where you can swap RAM pages at will without disturbing the application execution? If your system needs that level of reliability, then it may be worth the money and effort.

  ed
Reply by ●June 4, 2015
Hi Ed,

On 6/4/2015 4:11 AM, Ed Prochak wrote:
> Systems running 24/7/365 are much more likely to have disc failure before
> memory failure.

No rotating media.

> But for embedded systems there may not be any disc storage. Still, I think
> the lifetime of the RAM is likely about the same as the CPU's, so it is not
> worth the effort, since the device may be replaced before failure starts.
>
> Even a pacemaker has a finite lifetime.

A memory subsystem can "fail" (i.e., not be reliable in maintaining the data it is charged with preserving) without being "worn out". E.g., problems with the power supply can manifest as memory errors long before the system itself "fails".

> Given that, you must have some control over the system. You're working at
> the BIOS level, right? So can you tell what RAM is unused? That's the easy
> case.

Consider the different cases:

- POST: You essentially have the entire system at your disposal for some amount of time (ideally, you want to keep this period short so your bootstrap doesn't become a noticeable event -- perhaps have different methods of bringing up different levels of POST so you can exercise more comprehensive tests when you feel you may have more time available).

- BIST: You probably have "a good portion" of the system at your disposal -- and probably for a considerably longer period of time. I.e., you are *deliberately* engaged in testing, not "operating".

- Run-time: You probably have a severely restricted portion of the system at your disposal, and probably for very short periods of time (lest your efforts start to interfere with concurrent operations).

During POST, I think the time constraints mean you can really only perform gross tests of functionality. Comprehensive/exhaustive testing would just take way too long. Hence my use of a simple test ("carpet") that hopefully catches *some* gross errors if any exist.

In BIST, I think you can be far more methodical in applying patterns to the subsystem to try to draw marginal portions of the array out.

At run-time, I think the constraints are so severe that you can really only look for gross errors in very localized portions of the array (as they become available for testing).

> Can you force the higher-level software to yield control briefly for random
> checks?

I have many nodes in the system. If I don't need the I/Os on a particular node (i.e., if I am just using it as a compute server), then I can migrate the executing tasks to another node (possibly bringing a new "cold" node on-line just for that purpose) while the majority of the memory is tested on the "original" node.

If the (or some) I/Os are *required*, then I have to leave some services running on the node to make that hardware available -- even if I migrate the tasks that are interfacing to those I/Os off to another node (as above).

Of course, the I/Os on certain nodes will tend to be "needed" more often than the I/Os on other nodes. But, I can always schedule "bulk testing" to take advantage of even brief periods where the I/Os are expected to be idle. E.g., if the HVAC node has *just* brought the house up/down to the desired setpoint temperature, it is likely that the furnace/ACbrrr will not be needed for "a few minutes". So, I could move everything off of that node (even the "drivers" for the I/Os) for a short time while the node is placed in "test mode" -- the assumption being that the testing will take less time than the house's thermal time constant necessitating a reactivation of the furnace/ACbrrr. Other nodes may be less predictable.
E.g., the security system needs to remain on-line even if the house is "occupied" and the "alarm" doesn't need to be armed (e.g., consider the role of "panic switches", fire/smoke detectors, etc.).

> Otherwise, the only way I can see is to somehow watch the higher-level
> execution and check its reads and writes. But short of single-stepping, I
> don't know how to do it.
>
> Custom memory hardware? A dual-ported memory management unit where you can
> swap RAM pages at will without disturbing the application execution?

I have a demand-paged virtual memory system. As a matter of security, every swapped-out page is scrubbed before being placed back into use (as this is a potential bridging of a protection domain). As part of that scrubbing, I could briefly test the physical memory represented by that page. But this is HIGHLY localized. E.g., a decode failure would never (well, incredibly rarely!) be detectable, as you deliberately can't see the entire memory subsystem; there's no way of knowing if your actions "here" are manifesting "there".

I can "silently" arrange for in-use pages to be replicated and then swapped without affecting the application (esp. for TEXT pages). But it still leaves me with just a tiny, local region of memory that I can examine and play with -- hard to imagine a failure showing up there that hasn't already caused a failure elsewhere!

> If your system needs that level of reliability, then it may be worth the
> money and effort.

Note that there is a subtle difference between *ensuring* the system is reliable and detecting when it is prone to failure. Especially in my architecture, a faulty node can just be kept out of service -- making all or some of the I/Os that it handles unavailable (e.g., maybe you can't irrigate) -- without compromising everything (or increasing the cost to ensure the "lawn can always be watered").
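To make the scrub-time idea concrete: a sketch of a scrub hook that tests the physical frame while it zeroes it, assuming 4 KiB pages and word-wide access (the function name and pattern set are illustrative, not from the thread).

#include <stdint.h>
#include <stdbool.h>

#define PAGE_WORDS (4096u / sizeof(uint32_t))   /* assumes 4 KiB pages */

/* Scrub a just-freed page (security: no data leaks across protection
 * domains) and opportunistically test the frame behind it.  As noted
 * above, this is highly localized -- it can catch stuck cells in this
 * frame, but never a decode fault elsewhere in the array.             */
static bool scrub_and_test_page(volatile uint32_t *page)
{
    static const uint32_t patterns[] = { 0xAAAAAAAAu, 0x55555555u,
                                         0xFFFFFFFFu, 0x00000000u };
    bool ok = true;

    for (unsigned p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        for (unsigned i = 0; i < PAGE_WORDS; i++)
            page[i] = patterns[p];
        for (unsigned i = 0; i < PAGE_WORDS; i++)
            if (page[i] != patterns[p])
                ok = false;
    }
    /* The final pass leaves the page zeroed -- the scrub we needed anyway. */
    return ok;
}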
Reply by ●June 4, 2015
Boudewijn Dijkstra wrote:
> On Wed, 03 Jun 2015 10:15:09 +0200, Don Y <this@is.not.me.com> wrote:
>> The tougher issue is testing "live" memory in systems that
>> are "up" 24/7/365...
>
> If the memory has a fixed block of 24/32bpp video memory, then you can
> borrow 2-3 bits of each band without much visible disturbance.
>
> But indeed even if you can find a block of free memory, it is easy to
> saturate the bus and cause deadlines to be missed.

Well, what kind of errors do you want to detect? I'd guess the most common failure pattern for a factory test / power-on self test would be individual lines shorted against ground / Vcc / each other, or disconnected, due to bad soldering, dirt, corrosion, etc. Therefore, I'd just try every line and its inverse. This doesn't need to be particularly fast, and it only needs some strategically placed memory pages. But it would need all bit lines, so borrowing unused bits of an array of 32-bit words would not work.

  Stefan
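A sketch of the "every line and its inverse" approach in C -- essentially the classic walking-ones bus test, with names and details assumed rather than taken from the post.

#include <stdint.h>
#include <stddef.h>

/* Walk a single 1 (and its inverse, a single 0) across the data bus at
 * one location: catches data lines stuck at or shorted to Vcc/ground. */
static int test_data_lines(volatile uint32_t *addr)
{
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        *addr = bit;
        if (*addr != bit) return -1;
        *addr = ~bit;
        if (*addr != ~bit) return -1;
    }
    return 0;
}

/* Write a marker at each power-of-two offset, then check that a write
 * at offset 0 disturbs none of them: catches address lines stuck high
 * or low.  `len` is the region size in words, assumed a power of two.
 * (The full version of this test also walks the inverted marker across
 * each offset, to catch address lines shorted to each other.)         */
static int test_address_lines(volatile uint32_t *base, size_t len)
{
    const uint32_t marker = 0xAA55AA55u;
    for (size_t off = 1; off < len; off <<= 1)
        base[off] = marker;
    base[0] = ~marker;
    for (size_t off = 1; off < len; off <<= 1)
        if (base[off] != marker) return -1;
    return 0;
}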
Reply by ●June 4, 2015
Hi Boudewijn,

On 6/4/2015 6:08 AM, Boudewijn Dijkstra wrote:
> On Wed, 03 Jun 2015 10:15:09 +0200, Don Y <this@is.not.me.com> wrote:
>> The tougher issue is testing "live" memory in systems that
>> are "up" 24/7/365...
>
> If the memory has a fixed block of 24/32bpp video memory, then you can
> borrow 2-3 bits of each band without much visible disturbance.

I only have displays on 3 nodes. And those are "optional". But, regardless, you'd want to test all bit positions (without having to "wait" for the right data to *happen* to be "displayed").

> But indeed even if you can find a block of free memory, it is easy to
> saturate the bus and cause deadlines to be missed.
>
> High-reliability systems often employ Hamming codes (for booleans and enums)
> and inverted shadow copies for other values (which are checked on each access).

These are SoCs (augmented with external memory), so ECC isn't usually supported. I don't think I need to pay for "live" error detection. I expect to catch most failures either because a node misbehaves in the course of its normal operation (emits some faulty data, "goes offline", trips a deadline handler for one of its tasks, etc.) *or* because a node that is being brought on-line fails its POST (i.e., "died in its sleep").

BIST is a necessary evil to troubleshoot any system: the node is misbehaving -- why? (Removing and replacing a node can be expensive, mainly labor.) Being able to put a node into a diagnostic mode for an indeterminate amount of time means you can have considerable control over what it is doing during that time (letting it run some "random" collection of apps is less predictable).

Run-time testing is an attempt to bridge the two -- catching failures before they manifest. E.g., knowing that an irrigation solenoid is shorted/opened *before* you need to energize it, thus allowing you to report the needed repair before it has consequences.
Reply by ●June 4, 2015
Hi Stefan,

On 6/4/2015 9:06 AM, Stefan Reuther wrote:
> Well, what kind of errors do you want to detect? I'd guess the most
> common failure pattern for a factory test / power-on self test would be
> individual lines shorted against ground / Vcc / each other, or
> disconnected, due to bad soldering, dirt, corrosion, etc.

I'm not concerned with factory test -- that can be as comprehensive as needed because the costs are external to the device(s) in question. And because there are far more things that need to be tested than can (affordably) be accommodated with recurring dollars.

> Therefore, I'd just try every line and its inverse. This doesn't need to be
> particularly fast, and it only needs some strategically placed memory
> pages. But it would need all bit lines, so borrowing unused bits of an
> array of 32-bit words would not work.

I suspect most "memory failures" won't be "hard" failures. Nor will they be directly related to the memory subsystem itself. Rather, I expect things like excessive power supply ripple (because a filter is failing over time/temperature) or other issues on which the memory relies for its proper operation (ventilation/cooling/etc.).

I have nodes installed in a wide range of environments, so it's not reasonable to expect them all to be operating at a comfortable ambient, etc. And some "less knowledgeable" user might fail to realize the consequences of his choice of siting. ("Um, sure, it's only 115F outside; but the sun shining directly on that nice black ABS casing in which you've mounted that node probably has the internal temperature up 50F higher!" E.g., car interiors, here, easily and OFTEN attain temperatures in excess of 140F. With outside temps above 100F for ~70-100 days each year, that's not an "exception" but, rather, a *rule*!)
Reply by ●June 5, 2015
On Wed, 03 Jun 2015 01:15:09 -0700, Don Y <this@is.not.me.com> wrote:
> The tougher issue is testing "live" memory in systems that
> are "up" 24/7/365...

Of course, POST is done only once, at the first power-up (and hopefully the only one) for a few decades.

For such systems, typically ECC memory is used. In such systems you can perform "flushing", i.e., read-writeback sequences to all memory locations at regular intervals -- perhaps every few minutes if strong radiation is present. If a memory word contains a bit error, the ECC will correct it and the writeback will write clean data+ECC into that memory word.

Of course, you should log the location and frequency of ECC corrections, and when the need for correction is high at some location, you should declare that memory page dead and use some bad-block replacement scheme, which is easy to implement on any virtual memory operating system.
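A hedged sketch of such a scrub loop. The ECC hooks are hypothetical -- real controllers expose correctable-error counts (if at all) through device-specific registers -- and the retirement threshold is an arbitrary policy knob.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical hooks: correctable-error count since last cleared, and
 * the VM layer's bad-block list.  Both are placeholders for whatever
 * the actual memory controller and OS provide.                        */
extern unsigned ecc_correctable_count(void);
extern void     ecc_clear_count(void);
extern void     mark_page_bad(uintptr_t addr);

#define SCRUB_RETIRE_THRESHOLD 3   /* arbitrary policy knob */

/* Periodic scrub: read each word and write it back, so a single-bit
 * upset corrected by the ECC logic on the read gets re-encoded cleanly
 * in the array instead of lingering until it meets a second upset.    */
static void scrub_region(volatile uint32_t *base, size_t words,
                         unsigned *errs_per_page, size_t page_words)
{
    for (size_t i = 0; i < words; i++) {
        ecc_clear_count();
        uint32_t v = base[i];          /* controller corrects on read */
        base[i] = v;                   /* write back corrected data   */
        if (ecc_correctable_count() > 0) {
            size_t page = i / page_words;
            if (++errs_per_page[page] >= SCRUB_RETIRE_THRESHOLD)
                mark_page_bad((uintptr_t)&base[page * page_words]);
        }
    }
}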
Reply by ●June 5, 2015
On Thu, 04 Jun 2015 18:06:47 +0200, Stefan Reuther <stefan.news@arcor.de> wrote:
> Boudewijn Dijkstra wrote:
>> On Wed, 03 Jun 2015 10:15:09 +0200, Don Y <this@is.not.me.com> wrote:
>>> The tougher issue is testing "live" memory in systems that
>>> are "up" 24/7/365...
>>
>> If the memory has a fixed block of 24/32bpp video memory, then you can
>> borrow 2-3 bits of each band without much visible disturbance.
>>
>> But indeed even if you can find a block of free memory, it is easy to
>> saturate the bus and cause deadlines to be missed.
>
> Well, what kind of errors do you want to detect? I'd guess the most
> common failure pattern for a factory test / power-on self test would be
> individual lines shorted against ground / Vcc / each other, or
> disconnected, due to bad soldering, dirt, corrosion, etc. Therefore, I'd
> just try every line and its inverse. This doesn't need to be
> particularly fast, and it only needs some strategically placed memory
> pages. But it would need all bit lines, so borrowing unused bits of an
> array of 32-bit words would not work.

If it takes too long to test each individual memory cell, on a DRAM at least test every row driver and every column sense amplifier.

For a single memory page, test every memory location. This exercises the column sense amplifiers as well as the input/output multiplexer. In addition, test one memory word from each memory page, which exercises the row decoder and row driver lines.

For RAS/CAS DRAMs this will also check all external address and data lines for shorts and Vcc/Gnd issues, since all lines get exercised anyway.
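A sketch of that reduced-coverage scheme, assuming a 1024-word row; the row length is device-specific and the function name is illustrative.

#include <stdint.h>
#include <stddef.h>

#define DRAM_ROW_WORDS 1024u   /* assumed row length; device-specific */

/* Reduced-coverage DRAM test per the scheme above: every word of one
 * row (columns, sense amps, I/O mux), then one word from every row
 * (row decoder and drivers).  Pattern: a value and its inverse.      */
static int test_dram_geometry(volatile uint32_t *base, size_t words)
{
    const uint32_t pat = 0x5A5AA5A5u;

    /* Pass 1: every column of the first row. */
    for (size_t i = 0; i < DRAM_ROW_WORDS; i++) {
        base[i] = pat;
        if (base[i] != pat) return -1;
        base[i] = ~pat;
        if (base[i] != ~pat) return -1;
    }
    /* Pass 2: one word per row, stepping a whole row each time. */
    for (size_t i = 0; i < words; i += DRAM_ROW_WORDS) {
        base[i] = pat;
        if (base[i] != pat) return -1;
        base[i] = ~pat;
        if (base[i] != ~pat) return -1;
    }
    return 0;
}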
Reply by ●June 5, 2015
On 6/4/2015 11:42 PM, upsidedown@downunder.com wrote:
> On Wed, 03 Jun 2015 01:15:09 -0700, Don Y <this@is.not.me.com> wrote:
>> The tougher issue is testing "live" memory in systems that
>> are "up" 24/7/365...
>
> Of course, POST is done only once, at the first power-up (and hopefully
> the only one) for a few decades.

As stated elsewhere, individual nodes are powered up and down routinely within the normal operation of the system. So, it is possible for POST _on_a_specific_node_ to be run often (i.e., as often as power is cycled to that particular node).

> For such systems, typically ECC memory is used. In such systems you can
> perform "flushing", i.e., read-writeback sequences to all memory
> locations at regular intervals -- perhaps every few minutes if strong
> radiation is present. If a memory word contains a bit error, the ECC
> will correct it and the writeback will write clean data+ECC into that
> memory word.

SoC implementation, so ECC is not in the cards. Even if I added the syndrome management in an external ASIC, there's no way to fault the CPU to rerun a bus cycle. So, WYSIWYG as far as DDR memory is concerned.

> Of course, you should log the location and frequency of ECC corrections,
> and when the need for correction is high at some location, you should
> declare that memory page dead and use some bad-block replacement scheme,
> which is easy to implement on any virtual memory operating system.

This is actually an amusing concept. Ask folks when they consider their ECC memory system to be "compromised" and you'll never get a firm answer. E.g., how many bus errors do you consider sufficient to leave you wondering whether the ECC is actually *detecting* all errors (let alone *correcting* "some")? How do you know that (detected) errors are completely localized and have no other consequences? <shrug>

In my case, I treat errors as indicative of a failure -- most probably something in the power conditioning, and not a "wear" error in a device. Leaving it unchecked will almost certainly result in more errors popping up, some of which I will likely NOT be able to detect.

E.g., a POST error in DRAM causes me to fall back to recovery routines that operate out of (internal) SRAM. A failure in SRAM similarly causes DRAM to be used to the exclusion of SRAM. A failure in both means SoL! Regardless, in these degraded modes, the goal is only to *report* errors and support some limited remote diagnostics -- not to attempt to *operate* in the presence of a known problem.
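That fallback policy might reduce to something as simple as the following sketch (names hypothetical; the actual recovery code isn't shown in the thread):

#include <stdbool.h>

typedef enum { MODE_NORMAL, MODE_SRAM_ONLY, MODE_DRAM_ONLY, MODE_DEAD } run_mode_t;

/* Degraded-mode selection per the policy above: in any degraded mode
 * the firmware only reports errors and serves remote diagnostics; it
 * never tries to operate normally with a known-bad array.            */
static run_mode_t select_mode(bool dram_ok, bool sram_ok)
{
    if (dram_ok && sram_ok) return MODE_NORMAL;
    if (sram_ok)            return MODE_SRAM_ONLY;   /* DRAM failed POST */
    if (dram_ok)            return MODE_DRAM_ONLY;   /* SRAM failed POST */
    return MODE_DEAD;                                /* both gone: SoL   */
}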
Reply by ●June 8, 2015
On Thu, 04 Jun 2015 19:34:41 +0200, Don Y <this@is.not.me.com> wrote:
> On 6/4/2015 6:08 AM, Boudewijn Dijkstra wrote:
>> On Wed, 03 Jun 2015 10:15:09 +0200, Don Y <this@is.not.me.com> wrote:
>>> The tougher issue is testing "live" memory in systems that
>>> are "up" 24/7/365...
>>
>> High-reliability systems often employ Hamming codes (for booleans and
>> enums) and inverted shadow copies for other values (which are checked on
>> each access).
>
> These are SoCs (augmented with external memory), so ECC isn't usually
> supported.

I wasn't talking about ECC. I meant in software. Which is overkill for most applications.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
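To make the software shadow-copy idea concrete: a minimal C sketch, where each critical value is stored with its bitwise inverse and every read checks the pair for consistency. The types and names are illustrative; a Hamming-coded boolean/enum would follow the same store-then-check-on-access discipline.

#include <stdint.h>
#include <stdbool.h>

/* A critical value stored alongside its bitwise inverse; any single
 * stuck-at or flipped bit makes the pair inconsistent on read.       */
typedef struct {
    uint32_t value;
    uint32_t shadow;    /* always ~value */
} guarded_u32;

static void guarded_write(guarded_u32 *g, uint32_t v)
{
    g->value  = v;
    g->shadow = ~v;
}

static bool guarded_read(const guarded_u32 *g, uint32_t *out)
{
    if (g->value != ~g->shadow)
        return false;   /* corruption detected; caller decides policy */
    *out = g->value;
    return true;
}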