On 9/23/20 10:03 AM, David Brown wrote:> On 23/09/2020 14:51, Richard Damon wrote: >> On 9/23/20 8:13 AM, David Brown wrote: >>> On 23/09/2020 12:57, Don Y wrote: >>>> On 9/23/2020 3:13 AM, pozz wrote: >>>>> I'd like to implement a Power On Self Test to be sure all (or many) >>>>> parts of the electronics are functioning well. >>>>> >>>>> The tests to be done to check external hardware depends on the actual >>>>> hardware that is present. >>>>> >>>>> What about the MCU that features an internal Flash (where is the code) >>>>> and RAM and some peripherals? Are there any strategies to test if the >>>>> internal RAM or Flash are good? Do you think these kind of tests could >>>>> be useful? >>>>> >>>>> What about a test of the clock based on an external crystal? >>>> >>> <snip> >>>> Think hard about the types of failures you EXPECT to see (i.e., many are >>>> USER errors!) and don't invest too much time detecting things that will >>>> likely never fail OR whose failure you won't be able to do much about. >>> >>> This last bit is crucial. >>> >>> A lot of testing "requirements" that are specified are completely >>> pointless - or far worse than useless, as they introduce real points of >>> failure in their attempts to cover everything. >>> >>> First, figure out what you should /not/ test. >>> >>> Don't bother testing something unless you can usefully handle the >>> failure. If the way you communicate errors is through a UART, there is >>> no point in trying to check that the UART is working. If you have a >>> single microcontroller in the system, there is no point in trying to >>> check the cpu or the on-chip ram. There is no point in checking that >>> you can write to flash or on-chip eeprom - all you do is reduce its >>> lifetime and make it more likely to fail. >>> >>> Don't write any test code which cannot itself be tested. If you cannot >>> induce a failure, or at least simulate it reasonably, do not write code >>> to check or handle that failure. 
The reality is that the untested code >>> will have a higher risk of problems than the thing you are testing. >>> >>> Don't check the ram or the flash of the microcontroller - there's >>> nothing you can do if there is a failure. (You can check that you have >>> successfully loaded a new software update, or that there hasn't been a >>> reset during an update - a CRC for that kind of thing is a good idea.) >>> If you have a system that is important enough that ram or flash failures >>> need to be checked and handled, use a safety-qualified microcontroller >>> with ECC ram, flash, cache, etc., and perhaps even redundant cores (you >>> get these with PowerPC and Cortex-R cores). >>> >>> And think about what can reasonably go wrong, how it can go wrong, and >>> what can be done about it. Other than for devices susceptible to >>> current surges (like filament light bulbs), most hardware failures are >>> in usage, not while power is off - checking on power-up (rather than >>> while the system is in use) usually only makes sense if it is likely for >>> a user to see there is a problem and try to "fix" it by turning power >>> off and on again. >>> >> >> Testing RAM can be useful, letting the system fail gracefully rather >> than acting flaky, perhaps just locking up into a tight loop flashing a >> LED as a fault indicator. > > Have you ever seen microcontroller RAM that failed? It's a possibility > for dynamic ram on PC's that is pushed to its limits for power and > speed, and made as cheaply as possible. But for static RAM in a > microcontroller, the risk of failures is pretty much negligible. The > exception is if a bit is hit by a cosmic ray (or other serious > radiation), which can flip a bit, but that won't be detected by any RAM > test of this kind. I think I have seen it once, the part had gotten electrically stressed in debugging and one of the banks of internal ram failed.
We had only put the test in because the unit was going to be in critical infrastructure where certain types of malfunctions could present dangers to people in the area.> > Testing RAM is useful /if/ it can fail, and /if/ you can do something > useful when it fails. (I agree that "a tight loop flashing an LED" > might count as something useful, depending on the situation.) > > I've seen "safety standards requirements" that included regular ram > tests. Such requirements generally originate decades ago, and are not > appropriately nuanced for real-life systems. I've seen resulting code > used to implement such tests, added solely to fulfil such requirements. > And I've seen such code written in a way that is untested and > untestable, and in a way that has risks that /hugely/ outweigh those of > a fault occurring in the on-board RAM. > > If the OP is in the situation where there are customer requirements for > fulfilling certain safety requirements that include ram tests, and where > "mindlessly obeying these rules no matter how pointless they are in > reality" is the right choice to please arse-covering lawyers, then go > for it. If not, then think long and hard about the realism of such a > failure and such a test, and whether it is truly a positive contribution > to the project as a whole. > >> Similarly, you could CRC check the program >> flash, and fail on an error, preferably falling into a minimal system >> that allows a user reflash, but it might mean just bricking. >> > > The possibility of a flash failure is a great deal higher than that of a > RAM failure. Flash writes are analogue - a bit can be written in such a > way that it reads back correctly at programming time, but goes outside > the margins over time or at different temperatures or voltages. So yes, > sometimes a CRC of the flash is worth doing.
But remember that the > program doing the check is just as much at risk of such failures > (perhaps even more so, if you have a "boot" program that does the check > of the "main" program, as the boot program is less likely to be updated > and thus its bits will have decayed over a longer time). > > If flash failures are a real risk, and the system is important enough, it's > better to pick a microcontroller with ECC flash. > >> Note, that as you say, most faults will happen while powered up, but >> many faults will cause a system crash, that the user is likely to power >> cycle to try and clear, so power up is a good time to check (since many >> things are a lot harder to check while the system is running in operation). >> > > Yes, I mentioned that. (It assumes the embedded system has a user that > can do such a power-cycle.) >It is also possible that many failures might trip a watchdog that forces a reset, and the unit then finds the fault and locks itself 'safe'.
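[Editorial note: the flash CRC check discussed above might look something like this minimal sketch. The image layout and the idea of the build system appending the expected CRC after the image are assumptions for illustration, not anything the posters specified; the CRC-32 itself is the standard IEEE 802.3 polynomial in reflected form.]

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form 0xEDB88320).
 * Slow but tiny - fine for a one-shot check at boot. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Compare the CRC of the application image against a value the build
 * system stored at a known location (hypothetical layout).
 * Returns nonzero if the image appears corrupt. */
static int flash_image_corrupt(const uint8_t *image, size_t len,
                               uint32_t expected_crc)
{
    return crc32_update(0, image, len) != expected_crc;
}
```

A boot program would typically run this over the main application region and fall into a minimal recovery/reflash loop on mismatch, as suggested in the thread.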
Power On Self Test
Started by ●September 23, 2020
Reply by ●September 23, 2020
Reply by ●September 24, 2020
On 24/09/2020 00:13, Mike Perkins wrote:> On 23/09/2020 11:13:43, pozz wrote: >> I'd like to implement a Power On Self Test to be sure all (or many) >> parts of the electronics are functioning well. >> >> The tests to be done to check external hardware depends on the actual >> hardware that is present. >> >> What about the MCU that features an internal Flash (where is the code) >> and RAM and some peripherals? Are there any strategies to test if the >> internal RAM or Flash are good? Do you think these kind of tests could >> be useful? >> >> What about a test of the clock based on an external crystal? > > I've done this with the STM32 variety of MCUs. The device itself has a > Flash checksum and if this fails it won't start. > > ST also provided some example code and libraries for POST. These are more > comprehensive than just checking RAM. > > Might be worth having a look. >Could you give me a link to this code? Thanks
Reply by ●September 24, 2020
On 23/09/2020 21:32, Don Y wrote:> On 9/23/2020 12:02 PM, David Brown wrote: >> On 23/09/2020 18:36, Don Y wrote: >> >>> If you don't understand your hardware and how it can fail, you shouldn't >>> be the one who is designing the test suite! >> >> That bit is correct. The rest - well, I don't want to get into a long >> and protracted argument. >> >> Any system is made up of layers. Higher level layers assume that lower >> level layers work according to specification (which may include >> indicating an error for some kinds of detectable fault). If you think >> the higher level part can fully verify the lower level parts - "prove" >> that the assumptions hold - you are fooling yourself. When you design a >> system based on a microcontroller, you pick a device that is as reliable >> as you need it to be - so that you /can/ assume the core parts (cpu, >> ram, flash, interrupts, etc.) work well enough for your needs. If you >> are not sure it is reliable enough, pick a different device or make a >> redundant system. >> >> No amount of testing can /ever/ prove that something works - it can only >> prove that something does /not/ work. > > A system is not a static entity. It changes over time (even if the > design is > frozen). So, while your RAM (or any other component) may not be LIKELY to > fail, the rest of the system that enables the RAM to function as intended > can change in ways that manifest as RAM (or other resources) failures. Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million possibility. This is not the 70's any more, and we are not talking about dynamic ram. The onboard ram is static ram - it is a sea of simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they are small enough, but they don't suddenly stop working.
If "the rest of the system" fails in a way that stops the ram bits working, you can be confident that the cpu core and other critical parts have stopped too, as the problem is your clock, your voltage supply, or overheating. If you are going to try to make sensible decisions about what can fail, and where it is useful to test, you need to understand how devices work - devices that you are using /today/, not systems from 50 years ago. Otherwise your testing is counter-productive as the tests have higher risks of failures than the thing you are testing.> > [Picking the "world's most reliable MCU" won't guarantee that it won't > throw > RAM errors in a deployed product.] >/Nothing/ will give you guarantees like that. But if you pick a microcontroller with ECC on its onboard ram (and cache, if it has it), you reduce, by many orders of magnitude, the risk of single-event upsets (such as cosmic rays) leading to failures of the system. Anything else you can do in software is pointless in comparison. "Testing" your ram can't possibly detect such issues. Not many products justify the extra expense of such microcontrollers, but they are available for those that need them.> Simply assuming it "can't fail" is naive. Of course. Simply assuming that you can do a test at startup and think that makes the system more reliable is at least equally naïve.> > And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather > than late gives you a better idea of what to report to the user/customer > because you are closer to the problem's manifestation. You don't end > up misbehaving and wondering "why?" > > [And, all of this assumes "bugfree software" so any errors are entirely a > result of hardware faults] And there is perhaps your biggest invalid assumption. Software is always a risk. Software that can't be properly tested is a significantly higher risk. Software designed to handle situations that cannot possibly be reproduced for testing purposes, cannot be properly tested.
So writing software test routines for something that has no realistic chance of happening in the field, /reduces/ the reliability of the product.
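[Editorial note: for concreteness, the kind of power-on RAM test being argued over in this thread might look like the sketch below. It is a destructive test (it overwrites the region), so it only makes sense before the RAM is in use; as David Brown points out, it can catch stuck-at bits and gross address-decode faults but says nothing about later transient upsets. The region passed in is up to the caller.]

```c
#include <stddef.h>
#include <stdint.h>

/* Cursory destructive power-on RAM test: alternating bit patterns plus
 * an address-in-address pass.  Returns 0 on pass, nonzero on the first
 * mismatch.  Cannot detect transient (single-event) upsets. */
static int ram_post_test(volatile uint32_t *base, size_t words)
{
    static const uint32_t patterns[] = { 0x55555555u, 0xAAAAAAAAu };

    for (size_t p = 0; p < 2; p++) {
        for (size_t i = 0; i < words; i++)
            base[i] = patterns[p];
        for (size_t i = 0; i < words; i++)
            if (base[i] != patterns[p])
                return 1;          /* stuck or shorted data bit */
    }
    /* Write each word's own index: catches shorted/open address lines
     * that the uniform patterns above would miss. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i;
    for (size_t i = 0; i < words; i++)
        if (base[i] != (uint32_t)i)
            return 1;
    return 0;
}
```

Note that even this simple routine illustrates the thread's point about testability: the failure branches can only be exercised on genuinely faulty hardware or via fault injection.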
Reply by ●September 24, 2020
On 23/09/2020 22:05, Paul Rubin wrote:> David Brown <david.brown@hesbynett.no> writes: >> When you design a system based on a microcontroller, you pick a device >> that is as reliable as you need it to be > > That might not exist. E.g. it's common for security processors and > software to continuously self-test while running, since the user might > be trying to tamper with them. "Differential fault analysis" is a > relevant search string. The attacker does stuff like intentionally > overclock the processor in the hope of introducing errors, so they can > observe the difference between the error result and the normal result, > and infer stuff about the supposedly-secured info inside the processor. > There is no magic way to defeat these attacks, but the cpu designers do > what they can. >Security against deliberate attacks is a completely different ballgame. If you have made a system where an attacker can cause processor overclocking, and such attacks are realistic, then you need to put whatever checks, tests and mitigations are needed to deal with that situation. If there are no feasible scenarios where the processor clock can be suddenly increased to the point where hardware or software becomes unreliable, then any tests or handling of such a situation is worse than useless. You are just adding more stuff that can go wrong (or be attacked), without any benefits.
Reply by ●September 24, 2020
On 24/09/2020 03:55, Richard Damon wrote:> On 9/23/20 10:03 AM, David Brown wrote:>> Have you ever seen microcontroller RAM that failed? It's a possibility >> for dynamic ram on PC's that is pushed to its limits for power and >> speed, and made as cheaply as possible. But for static RAM in a >> microcontroller, the risk of failures is pretty much negligible. The >> exception is if a bit is hit by a cosmic ray (or other serious >> radiation), which can flip a bit, but that won't be detected by any RAM >> test of this kind. > > I think I have seen it once, the part had gotten electrically stressed > in debugging and one of the banks of internal ram failed. We had only > put the test in because the unit was going to be in critical > infrastructure where certain types of malfunctions could present dangers > to people in the area. Presumably you are careful about keeping the systems that developers have potentially broken separate from the systems that get delivered to customers. (Another possible cause of this kind of failure is ESD damage. Production departments are usually a lot more meticulous about ESD than developers.) If you have a system that is safety critical, you have to do an analysis of the risks of things going wrong, the consequences of those failures, and how these (risks and consequences) can be reduced or mitigated. If you figure out that static failure of the memory is a risk, then testing can be worth doing. You might also decide that ECC ram, or redundant devices, or external monitors are a better solution. There's no fixed answer.> > It is also possible that many failures might trip a watchdog that forces > a reset, and the unit then finds the fault and locks itself 'safe'. >That is definitely possible. But again, be very careful with watchdogs - watchdog handling code is rarely properly tested, because it is handling situations that don't occur. (Usually it /can/ be tested, but that does not mean it /is/ tested.)
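[Editorial note: the "watchdog trips, the reset handler finds the fault and locks itself safe" pattern mentioned above can be reduced to a small, host-testable decision function, which makes the watchdog-handling logic itself testable in the way this post asks for. The reset-cause value would come from an MCU-specific register; all names and the retry threshold here are made up for illustration.]

```c
#include <stdbool.h>

/* Hypothetical reset causes and boot decisions; on real hardware the
 * cause comes from a reset-status register and the watchdog-reset
 * counter from a byte of noinit RAM or backup register. */
enum reset_cause { RESET_POWER_ON, RESET_WATCHDOG, RESET_SOFTWARE };
enum boot_action { BOOT_NORMAL, BOOT_SAFE_LOCK };

static enum boot_action decide_boot(enum reset_cause cause,
                                    unsigned wdt_reset_count,
                                    bool post_failed)
{
    /* Any POST failure latches the unit safe regardless of cause. */
    if (post_failed)
        return BOOT_SAFE_LOCK;
    /* Repeated watchdog resets suggest a persistent fault rather than
     * a one-off glitch: stop retrying and hold the safe state. */
    if (cause == RESET_WATCHDOG && wdt_reset_count >= 3)
        return BOOT_SAFE_LOCK;
    return BOOT_NORMAL;
}
```

Keeping the decision pure like this lets ordinary unit tests cover every branch, addressing the objection that watchdog paths "can be tested but rarely are".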
Reply by ●September 24, 2020
On 9/24/2020 2:10 AM, David Brown wrote:> On 23/09/2020 21:32, Don Y wrote: >> On 9/23/2020 12:02 PM, David Brown wrote: >> A system is not a static entity. It changes over time (even if the >> design is >> frozen). So, while your RAM (or any other component) may not be LIKELY to >> fail, the rest of the system that enables the RAM to function as intended >> can change in ways that manifest as RAM (or other resources) failures. > > Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million > possibility. This is not the 70's any more, and we are not talking > about dynamic ram. The onboard ram is static ram - it is a sea of > simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they > are small enough, but they don't suddenly stop working. If "the rest of > the system" fails in a way that stops the ram bits working, you can be > confident that the cpu core and other critical parts have stopped too, > as the problem is your clock, your voltage supply, or overheating.By your argument, you should test NOTHING and just wait for the user to complain that the device "isn't working". And, hope that this manifests in a spectacular -- but not costly -- way. AND, hope it doesn't piss off the user who now has a product that isn't performing as he had hoped (and you had ADVERTISED) it would. The whole point of BIST/POST is to provide a point in time where failures will hopefully manifest -- instead of SILENTLY affecting the operation of the device in question, in typically unpredictable ways.> If you are going to try to make sensible decisions about what can fail, > and where it is useful to test, you need to understand how devices work > - devices that you are using /today/, not systems from 50 years ago. > Otherwise your testing is counter-productive as the tests have higher > risks of failures than the thing you are testing.How is a RAM test going to fail post deployment that didn't happen prior to release? 
POST/BIST are considerably easier to "get right" than application code. Their goals are much more concretely defined and implementation verified. "50 years ago" you didn't have SRAM suffering from disturb errors. Yet, now this is a fact of life for even caches. Technology advances and, with it, come new "challenges". I suggest you've been basing your assumptions on SRAM reliability on 50 year old anecdotes and not the consequences of more modern implementations, shrinking device geometries and lower operating voltages. Have a run through the literature to see...>> [Picking the "world's most reliable MCU" won't guarantee that it won't >> throw >> RAM errors in a deployed product.] > > /Nothing/ will give you guarantees like that. But if you pick a > microcontroller with ECC on its onboard ram (and cache, if it has it), > you reduce, by many orders of magnitude, the risk of single-event upsets > (such as cosmic rays) leading to failures of the system. Anything else > you can do in software is pointless in comparison. "Testing" your ram > can't possibly detect such issues. > > Not many products justify the extra expense of such microcontrollers, > but they are available for those that need them.Few designs have the features that they require, let alone DESIRE. Unless you're working in a market where customers will pay "whatever it takes", most designs have to live with some subset of what they would LIKE to have in their product.>> Simply assuming it "can't fail" is naive. > > Of course. Simply assuming that you can do a test at startup and think > that makes the system more reliable is at least equally na�ve.You miss the point of POST. It doesn't MAKE a system more reliable. Instead, it tells you when a system is not meeting your expectations. This is true of ALL testing. You have a defined point in time -- and operating conditions -- in which you hope to catch a failure so that you can report on it. 
A user (customer) is more willing to accept "there's a flashing red light on the device" than "the &*^($^& thing doesn't work worth a sh*t -- but I can't provide Tech Support with any information beyond the fact that I'm frustrated and UNHAPPY WITH MY PURCHASE" (and, even if they are willing to replace the device for me -- at no charge and only minor inconvenience to me -- I still don't feel confident that the next device won't have exactly the same problem!) "The worst thing you can do to a system is power it up; the SECOND worst thing you can do to a system is power it DOWN!" Both of these bad things have happened just before you run POST. Few systems can afford to test RAM (regardless of technology) while the system is actively running. And, few can defer power off to run such a test just prior to shutdown (where it will be LESS useful as it will miss the consequences of that impending shutdown and the subsequent powerup). [OTOH, there are systems that don't see regular/periodic power cyclings. Do you just let defects grow until the system resets itself (and THEN invokes POST)?] BUT, the cost and ease of testing RAM (regardless of technology) at power up is typically easy to bear in a product's design. It costs me a fraction of a second to give a cursory test of 500MB. Chances are, I'm going to find failures THERE instead of "dubious behaviors" in the running product.>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather >> than late gives you a better idea of what to report to the user/customer >> because you are closer to the problem's manifestation. You don't end >> up misbehaving and wondering "why?" >> >> [And, all of this assumes "bugfree software" so any errors are entirely a >> result of hardware faults] > > And there is perhaps your biggest invalid assumption. Software is > always a risk. Software that can't be properly tested is a > significantly higher risk. 
Software designed to handle situations that > cannot possibly be reproduced for testing purposes, cannot be properly > tested. So writing software test routines for something that has no > realistic chance of happening in the field, /reduces/ the reliability of > the product. YOUR biggest invalid assumption is that it has no realistic chance of happening. Your SECOND biggest assumption is thinking that folks who are qualified to write application software (for often ill-defined scenarios) are NOT capable of developing reliable test programs (for very WELL-DEFINED scenarios). Do you think *all* MCU-device failures are simply attributable to software bugs? Why test anything? ASSUME the power supply and power conditioning circuitry will never fail. Assume the various I/Os will never fail. Blame every failure on "it must be a bug". Never scrap returned product cuz all it needs -- along with every unit coming off the line, TODAY -- is a reflash! Are all of your products short-lived and in inconsequential applications? Naive. Please DO the research regarding TODAY's SRAM implementations. Understand why they fail and why folks are now adding EDAC to SRAM ("50 years ago" you wouldn't see EDAC and SRAM discussed in the same sentence). Then: 10um process. Now: < 30nm. Then: "5V". Now: < 2V. Then: 60W/Mb. Now: 20nW/Mb. Then: ~200ns. Now: ~1ns. Then: fixed power/speed. Now: dynamically variable power vs. speed. None of these things were heard of "50 years ago". Do you think there are no consequences of these changes? The elusive "win-win"?? Better yet, convince your employer/client to let you design a full custom. Make sure it has SRAM onboard. Then, notice how much attention the fab pays to TESTING that SRAM vs. junk logic -- as well as HOW they test it. Ask them what to expect from your component after a year operating in "typ" conditions; two years; five years. Ask them if they can quantifiably predict the effects of electromigration on the component in those periods.
Changes in power supply sensitivity. Etc. "Chips" age. When you commit to purchasing your parts from their fab, ask them what sort of guarantees they'd be willing to offer on the devices that THEY will be producing for you. Will they defend its operational status 5 years down the road? 10? 20?? ("Hey, SRAM doesn't fail so you should be willing to extend GENEROUS warranty terms to me, right? BEYOND just the cost of replacing the component! After all, there's no REALISTIC CHANCE of it failing!!") Do some reading. You'll learn something.
Reply by ●September 24, 2020
On 24/09/2020 13:11, Don Y wrote:> On 9/24/2020 2:10 AM, David Brown wrote: >> On 23/09/2020 21:32, Don Y wrote: >>> On 9/23/2020 12:02 PM, David Brown wrote: >>> A system is not a static entity. It changes over time (even if the >>> design is >>> frozen). So, while your RAM (or any other component) may not be >>> LIKELY to >>> fail, the rest of the system that enables the RAM to function as >>> intended >>> can change in ways that manifest as RAM (or other resources) failures. >> >> Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million >> possibility. This is not the 70's any more, and we are not talking >> about dynamic ram. The onboard ram is static ram - it is a sea of >> simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they >> are small enough, but they don't suddenly stop working. If "the rest of >> the system" fails in a way that stops the ram bits working, you can be >> confident that the cpu core and other critical parts have stopped too, >> as the problem is your clock, your voltage supply, or overheating. > > By your argument, you should test NOTHING and just wait for the user to > complain that the device "isn't working". And, hope that this manifests in > a spectacular -- but not costly -- way. AND, hope it doesn't piss off > the user who now has a product that isn't performing as he had hoped > (and you had ADVERTISED) it would. I can't see how you came to that bizarre conclusion.> > The whole point of BIST/POST is to provide a point in time where failures > will hopefully manifest -- instead of SILENTLY affecting the operation > of the device in question, in typically unpredictable ways. >Failures rarely occur when a device is switched off. They happen when the device is running. (They also happen during production or putting together a system, and it's worth doing checks then.)
If you think that failures might realistically occur, and the tradeoffs between costs, reliability, safety, etc., warrant it, then you put in the appropriate level of failure detection and mitigation at /runtime/ in the system. There's little help in the failure leading to operation problems, and then saying afterwards that you could have spotted that problem in a POST.>> If you are going to try to make sensible decisions about what can fail, >> and where it is useful to test, you need to understand how devices work >> - devices that you are using /today/, not systems from 50 years ago. >> Otherwise your testing is counter-productive as the tests have higher >> risks of failures than the thing you are testing. > > How is a RAM test going to fail post deployment that didn't happen > prior to release? POST/BIST are considerably easier to "get right" > than application code. Their goals are much more concretely defined > and implementation verified. >Never underestimate the complexity of these things, nor the ability of software developers to get things wrong.> "50 years ago" you didn't have SRAM suffering from disturb errors. > Yet, now this is a fact of life for even caches. Technology advances > and, with it, come new "challenges". >Yes, "disturb errors" as you call them - "single-event upsets", bit-flips, etc., are a possibility with ram. They are more likely in dynamic ram, but can occur in small, fast static ram cells. And POSTs and other ram checks are totally and completely /useless/ at identifying them or dealing with them. That is why I say you need to understand the hardware and the possible failure modes in order to make reliable systems.
Are you sure you understand what POSTs can do, and the difference between transient failures and static failures?> I suggest you've been basing your assumptions on SRAM reliability on > 50 year old anecdotes and not the consequences of more modern > implementations, > shrinking device geometries and lower operating voltages. Have a run > through > the literature to see... You are the one that was discussing 50 year old anecdotes!> >>> [Picking the "world's most reliable MCU" won't guarantee that it won't >>> throw >>> RAM errors in a deployed product.] >> >> /Nothing/ will give you guarantees like that. But if you pick a >> microcontroller with ECC on its onboard ram (and cache, if it has it), >> you reduce, by many orders of magnitude, the risk of single-event upsets >> (such as cosmic rays) leading to failures of the system. Anything else >> you can do in software is pointless in comparison. "Testing" your ram >> can't possibly detect such issues. >> >> Not many products justify the extra expense of such microcontrollers, >> but they are available for those that need them. > > Few designs have the features that they require, let alone DESIRE. > Unless you're working in a market where customers will pay "whatever it > takes", most designs have to live with some subset of what they would > LIKE to have in their product. >In a safety-critical system, the cost of using a microcontroller with ECC ram is negligible. These are used all the time in the automotive industry.>>> Simply assuming it "can't fail" is naive. >> >> Of course. Simply assuming that you can do a test at startup and think >> that makes the system more reliable is at least equally naïve. > > You miss the point of POST. It doesn't MAKE a system more reliable. I know it doesn't do that - I've been saying this all along.> Instead, it tells you when a system is not meeting your expectations.
> This is true of ALL testing. You have a defined point in time -- and > operating conditions -- in which you hope to catch a failure so that > you can report on it. A user (customer) is more willing to accept > "there's a flashing red light on the device" than "the &*^($^& thing > doesn't work worth a sh*t -- but I can't provide Tech Support with > any information beyond the fact that I'm frustrated and UNHAPPY WITH > MY PURCHASE" For /some/ devices, some kind of POST can be useful. For many, it is pointless - it does not detect the failures that actually matter, and can only detect ones that have negligible chances of occurring. If you have a device that is regularly restarted, and where the hardware is so fault-prone that you really are finding problems with a POST, then yes - go for it. All I am arguing for is that people /think/ before making a POST, and do some analysis and investigation to see if it really is a useful feature.> > BUT, the cost and ease of testing RAM (regardless of technology) at > power up > is typically easy to bear in a product's design. It costs me a fraction of > a second to give a cursory test of 500MB. Chances are, I'm going to find > failures THERE instead of "dubious behaviors" in the running product. Do you understand the concept of cost/use analysis? If something is useless, or worse than useless, it doesn't help if it is cheap. Well, it helps for the marketing folks.> >>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather >>> than late gives you a better idea of what to report to the user/customer >>> because you are closer to the problem's manifestation. You don't end >>> up misbehaving and wondering "why?"
>>> >>> [And, all of this assumes "bugfree software" so any errors are >>> entirely a >>> result of hardware faults] >> >> And there is perhaps your biggest invalid assumption. Software is >> always a risk. Software that can't be properly tested is a >> significantly higher risk. Software designed to handle situations that >> cannot possibly be reproduced for testing purposes, cannot be properly >> tested. So writing software test routines for something that has no >> realistic chance of happening in the field, /reduces/ the reliability of >> the product. > > YOUR biggest invalid assumption is that it has no realistic chance of > happening. >Again, in your enthusiasm you have failed to notice what I have written repeatedly. If there is a /realistic/ chance of a failure, then it will often make sense to test for it. If there is no such chance - or negligible chance of it failing without some other major failure, or nothing you can do about a failure, then there is no point in trying to test.> Your SECOND biggest assumption is thinking that folks who are qualified > to write application software (for often ill-defined scenarios) are > NOT capable of developing reliable test programs (for very WELL-DEFINED > scenarios). That is often a realistic assumption - different people specialise in different things. However, it was not an assumption I made - again, you seem to prefer to make things up rather than read my posts. Software is always a risk. It might be low risk, but it is always a risk.> > Do you think *all* MCU-device failures are simply attributable to software > bugs? Why test anything? ASSUME the power supply and power conditioning > circuitry will never fail. Assume the various I/Os will never fail.
> Blame every failure on "it must be a bug". Never scrap returned product > cuz all it needs -- along with every unit coming off the line, TODAY -- is > a reflash!

Another wild idea all of your own.

> > Are all of your products short-lived and in inconsequential applications?

I've made systems that are buried in concrete in oil installations, working for decades. Do I do that by relying on POSTs, memory tests and perhaps a watchdog? No.

> > Do some reading. You'll learn something.

Try it yourself. You could start by reading what I wrote. Then, when you have learned a bit about this stuff, you can start applying a bit of /thought/ to the process. And when you look at my posts here, you'll see that what I have been advocating is that people /think/ about what they are doing with tests - what are they actually trying to achieve, what use it is, what the risks are. And stop making pointless code just because you can.
Reply by ● September 24, 2020
On 9/24/20 5:26 AM, David Brown wrote:> On 24/09/2020 03:55, Richard Damon wrote: >> On 9/23/20 10:03 AM, David Brown wrote: > >>> Have you ever seen microcontroller RAM that failed? It's a possibility >>> for dynamic ram on PC's that is pushed to its limits for power and >>> speed, and made as cheaply as possible. But for static RAM in a >>> microcontroller, the risk of failures is pretty much negligible. The >>> exception is if a bit is hit by a cosmic ray (or other serious >>> radiation), which can flip a bit, but that won't be detected by any RAM >>> test of this kind. >> >> I think I have seen it once, the part had gotten electrically stressed >> in debugging and one of the banks of internal ram failed. We had only >> put the test in because the unit was going to be in critical >> infrastructure where certain types of malfunctions could present dangers >> to people in the area. > > Presumably you are careful about keeping the systems that developers > have potentially broken separate from the systems that get delivered to > customers. (Another possible cause of this kind of failure is ESD > damage. Production departments are usually a lot more meticulous about > ESD than developers.) > > If you have a system that is safety critical, you have to do an analysis > of the risks of things going wrong, the consequences of those failures, > and how these (risks and consequences) can be reduced or mitigated. If > you figure out that static failure of the memory is a risk, then testing > can be worth doing. You might also decide that ECC ram, or redundant > devices, or external monitors are a better solution. There's no fixed > answer.

I wasn't saying that such a test does make sense, but that such a test CAN be done reasonably, if for some legal/political reason it is introduced as a requirement. I brought up the example to show that this type of error CAN occur. 
Yes, unless some externally imposed requirement says to test internal ram, I am unlikely to add such a test for a production system (I have at times done it in development, mostly to confirm that I understand the limitations and operation of the device).

> >> >> It is also possible that many failures might trip a watchdog that forces >> a reset, and the unit then finds the fault and locks itself 'safe'. >> > > That is definitely possible. But again, be very careful with watchdogs > - watchdog handling code is rarely properly tested, because it is > handling situations that don't occur. (Usually it /can/ be tested, but > that does not mean it /is/ tested.) >

Yes, testing watchdogs is tricky.
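For what it's worth, the "cursory" production RAM test being discussed in this exchange is conventionally built from two small checks: a walking-bit pass that catches stuck data bits, and a power-of-two address probe that catches decode faults. A minimal sketch in C follows - the function names are mine, not from any post in this thread, and like any such test it is destructive (run it before the region holds live data) and blind to transient upsets:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Walk a one and a zero through every data bit of one cell. */
static bool ram_test_data_bits(volatile uint8_t *cell)
{
    for (uint8_t walk = 1; walk != 0; walk = (uint8_t)(walk << 1)) {
        *cell = walk;
        if (*cell != walk)
            return false;              /* a data bit is stuck */
        *cell = (uint8_t)~walk;
        if (*cell != (uint8_t)~walk)
            return false;
    }
    return true;
}

/* Mark every power-of-two offset, then look for aliasing that would
 * indicate a stuck or shorted address line. */
static bool ram_test_addr_lines(volatile uint8_t *base, size_t len)
{
    const uint8_t marker = 0xAA, probe = 0x55;

    base[0] = marker;
    for (size_t off = 1; off < len; off <<= 1)
        base[off] = marker;

    base[0] = probe;                   /* stuck-high address line check */
    for (size_t off = 1; off < len; off <<= 1)
        if (base[off] != marker)
            return false;
    base[0] = marker;

    for (size_t tst = 1; tst < len; tst <<= 1) {  /* stuck-low check */
        base[tst] = probe;
        if (base[0] != marker)
            return false;
        for (size_t off = 1; off < len; off <<= 1)
            if (off != tst && base[off] != marker)
                return false;
        base[tst] = marker;
    }
    return true;
}
```

On healthy hardware both functions return true; either returning false is the "lock up and flash an LED" case Richard describes.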
Reply by ● September 24, 2020
On 9/24/2020 4:58 AM, David Brown wrote:> On 24/09/2020 13:11, Don Y wrote: >> On 9/24/2020 2:10 AM, David Brown wrote: >>> On 23/09/2020 21:32, Don Y wrote: >>>> On 9/23/2020 12:02 PM, David Brown wrote: >>>> A system is not a static entity. It changes over time (even if the >>>> design is >>>> frozen). So, while your RAM (or any other component) may not be >>>> LIKELY to >>>> fail, the rest of the system that enables the RAM to function as >>>> intended >>>> can change in ways that manifest as RAM (or other resources) failures. >>> >>> Ah, nonsense. Sure, it is /conceivable/, but it's a one in a million >>> possibility. This is not the 70's any more, and we are not talking >>> about dynamic ram. The onboard ram is static ram - it is a sea of >>> simple flip-flops. These /can/ be bit-flipped by cosmic rays, if they >>> are small enough, but they don't suddenly stop working. If "the rest of >>> the system" fails in a way that stops the ram bits working, you can be >>> confident that the cpu core and other critical parts have stopped too, >>> as the problem is your clock, your voltage supply, or overheating. >> >> By your argument, you should test NOTHING and just wait for the user to >> complain that the device "isn't working". And, hope that this manifests in >> a spectacular -- but not costly -- way. AND, hope it doesn't piss off >> the user who now has a product that isn't performing as he had hoped >> (and you had ADVERTISED) it would. 
> > I can't see how you came to that bizarre conclusion.

Well, if the "clock, voltage supply or overheating" is the problem -- and you can't DIRECTLY test for any of those -- then why are you testing ANYTHING (except as secondary evidence that some ASSUMPTION your design relies upon has been violated -- clock, volts, temp)?

>> The whole point of BIST/POST is to provide a point in time where failures >> will hopefully manifest -- instead of SILENTLY affecting the operation >> of the device in question, in typically unpredictable ways. > > Failures rarely occur when a device is switched off. They happen when > the device is running. (They also happen during production or putting > together a system, and it's worth doing checks then.)

Failures rarely occur when the device IS off. But, the act of removing power to a device is just as hazardous as APPLYING power. Power supplies rarely are designed to cleanly go up and down without inflicting transients on the devices they power. Many designers fail to note, carefully, how power transitions are expected to be managed (in ages past, with many supplies per device, this was more "in your face" and less easy to ignore). Of course, to a typical user, the failure will only manifest when the device is NEXT powered up. You can't test while it's powered down!

> If you think that failures might realistically occur, and the tradeoffs > between costs, reliability, safety, etc., warrant it, then you put in > the appropriate level of failure detection and mitigation at /runtime/ > in the system. There's little help in the failure leading to operation > problems, and then saying afterwards that you could have spotted that > problem in a POST.

POST provides a reassurance that "all appears well". It can't be thorough because it is a serial activity with "bringing the system on-line" -- and few people are willing to wait for exhaustive tests to complete when they will typically not uncover errors. 
But, systems/devices *routinely* fail POST -- for a variety of reasons. Some may be misapplication (the user has done something he shouldn't). Some hardware faults (the system hasn't endured as expected). Some from tampering (nowadays, you can rest assured that folks WILL open your product and try to tinker with it... to increase memory, enable an unused feature, patch the firmware, access "hidden" capabilities, etc.) Your code, however, is based on a set of assumptions -- some formally codified and some simply internalized. Before it runs, it should verify that those assumptions are valid, NOW (or, just shrug if the product misbehaves). I designed a device used in performing blood assays. It had socketed DRAM (DIPS) to allow the data store to be increased in 6KB increments (replace a 16Kx1 DRAM with a 64Kx1 DRAM and you've got 6KB more capacity). Of course, I had to "size" and "query" the data store's complexion on startup (which devices are 16Kb and which are 64Kb). But, I also had to address the fact that the technician in the hospital may have removed ALL of the devices (shame on him! but, maybe he simply forgot to install the new set?) *or* left one "bit lane" empty (I used a portion of the lower 16KB as "system RAM" so can't do much without it) Do I just wait until someone tries to use the device and then <cough>... while they have a micropipette loaded with a blood sample in their hand? I've got no writeable memory -- how can I tell the user that this has happened? Do I just start "squealing" to induce a panic?? Similarly, the "sensor array" onto which the assayed samples were placed was connected by a detachable cord. What if it is not present? What if it IS present but one of the conductors in the cord has failed? What if the cord is connected and intact but the array has been "soiled" by a sample (rendering portions of it unusable)? 
(these are actions that the USER -- not the technician -- could initiate). IME, it's foolish to blindly rely on anything being as you hope. If you NEED something to be a certain way, then you have to do whatever it takes to gain confidence that it IS that way. [Think about how much happens inside a PC that the manufacturers likely didn't INTEND in creating their designs. Overclocking processors, replacing CPUs and active coolers, adding daughter cards (does anyone actually verify that their system can electrically -- not just mechanically -- support all of these things? or, do they just plug them in and "let's see if it works"??)]

>>> If you are going to try to make sensible decisions about what can fail, >>> and where it is useful to test, you need to understand how devices work >>> - devices that you are using /today/, not systems from 50 years ago. >>> Otherwise your testing is counter-productive as the tests have higher >>> risks of failures than the thing you are testing. >> >> How is a RAM test going to fail post deployment that didn't happen >> prior to release? POST/BIST are considerably easier to "get right" >> than application code. Their goals are much more concretely defined >> and implementation verified. > > Never underestimate the complexity of these things, nor the ability of > software developers to get things wrong.

As I said, there is a difference between POST/BIST and "diagnostics". The former provide a basic reassurance of expected operating condition. The latter provide (often exhaustive) analysis to QUANTIFY the operating condition. How many ECC errors do you tolerate in your product? Do you try to recover/self-heal from problems -- or, just illuminate "check engine"? How do you handle a checksum error in your ROM/FLASH -- do you reload a backup copy or panic()? Do you keep track of how OFTEN you are doing this? Or, do you just do it open-loop? What costs have you ADDED to your product (and passed along to the customer) to support these "fixes"? 
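The ROM/FLASH checksum question just raised has a conventional shape: a CRC-32 computed over the image at startup and compared against a value the build tools append. A hedged sketch - the polynomial is the standard reflected 0xEDB88320, while how `start`, `len`, and the expected value are obtained (linker symbols, an image trailer) is toolchain-specific and left out here:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Bitwise CRC-32 (reflected, poly 0xEDB88320), chainable: pass 0 as
 * the initial crc, or a previous result to continue over more data.
 * A table-driven version is faster; this one is small and obvious. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Startup image check: does the stored image still sum to the value
 * recorded at build time?  What to do on failure (reload a backup,
 * panic, count occurrences) is exactly the policy question above. */
static bool image_ok(const uint8_t *start, size_t len, uint32_t expected)
{
    return crc32_update(0, start, len) == expected;
}
```

The standard check value (CRC-32 of the ASCII string "123456789" is 0xCBF43926) makes a convenient self-test of the implementation.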
How costly is it to your customer (and, by extension, YOU!) to encounter an error and have to take some remedial action (even if that is just an irate phone call)? How long do you expect your customer to keep the device in service? How reluctant will he be to "upgrade" (for enhanced functionality OR to fix a fault)? Does he already bear the cost of maintaining kit similar to yours? Or, is this a cost he's going to be unhappy with bearing? In the 80's, I designed a bit of medical kit that cost a few hundred dollars to produce. A firmware upgrade/fix cost $600 in labor to perform if the device was sited "just down the road". You can imagine there was a big emphasis on NOT having to update the firmware and to be able to provide an indication of machine faults that the user could convey to support staff over the phone (instead of requiring a visit). The same sort of costs were present if I had to replace (swap out, repair at depot) a display board, power supply, backup battery, etc. Much consumer kit places the cost of maintenance on the consumer. Worst case, he returns the product for a refund. This is a costly proposition because you've lost more than you would have made on the sale (handling the return) AND have likely annoyed a customer who MIGHT have represented repeat business -- as well as performing in an advertising role (word of mouth). Industrial kit often has local support staff on hand that can diagnose problems (IF your product and documentation provide a means for them to do so). But, the cost of that staff is figured into the "burden" your product imposes; if they are spending inordinate amounts of time fixing YOUR problems, then your products suffer in their eyes (cuz management is always under pressure to "do more with less" -- staff) My experience has been that providing MORE information to a user always works to the manufacturer's advantage. 
A user confronted with a flashing red light will cost you more (even if you don't lose the sale) than a user who is told to "check connection at J1". Anything that removes a potential "issue" from his thought process is an improvement ("How do I know that the cache memory isn't defective? Is he testing that, too? Am I going to spend hours tracking down a problem that's buried in a place that I can't access/test?")

>> "50 years ago" you didn't have SRAM suffering from disturb errors. >> Yet, now this is a fact of life for even caches. Technology advances >> and, with it, come new "challenges". > > Yes, "disturb errors" as you call them - "single-event upsets", > bit-flips, etc., are a possibility with ram. They are more likely in > dynamic ram, but can occur in small, fast static ram cells. And POSTs > and other ram checks are totally and completely /useless/ at identifying > them or dealing with them. That is why I say you need to understand the > hardware and the possible failure modes in order to make reliable systems.

Please tell me where I indicated that pozz should be checking for disturb errors in SRAM, DRAM or FLASH (where all can occur -- as well as in "junk logic"). You can't just run a simple, quick test to determine if you have a problem with these. OTOH, if you have a system that is running and can "do this on the side" (with or without hardware EDAC), then you can compile statistics regarding their likely frequency. If you DON'T have a closed system, you can also use these observations as indicators of possible "attacks" or poorly coded applications (that, left to their own BENIGN devices, could compromise your system). 
If you notice WHEN they occur, you can also take actions to thwart them (e.g., if TryToGainRoot() is the active process when a statistically greater frequency of such events occurs, then you might want to blacklist TryToGainRoot() so that it never runs, again.)

> Are you sure you understand what POSTs can do, and the difference > between transient failures and static failures?

You do understand that there are differences between truly transient (i.e., self-healing) errors and persistent consequences of things like SEUs? Are you sure the code in your FLASH (ROM) is intact, NOW (assuming XIP)? Are you sure the code that you loaded from that FLASH into (S/D)RAM hasn't been corrupted, NOW (ignore the effects of bugs)? Will your customer notice if it has been corrupted? Will the consequences of the corruption be masked (by whatever)? Or, will it manifest in a spectacular way? [There have been several studies of how resilient various applications are to memory errors. Given that they can occur "anywhere", it's easy to see how some can be masked or contribute to "system noise". But, that's not a given for all...] What are you doing about this, besides hoping to catch it at the next POST (assuming you even bother to test for it)?

>> I suggest you've been basing your assumptions on SRAM reliability on >> 50 year old anecdotes and not the consequences of more modern >> implementations, >> shrinking device geometries and lower operating voltages. Have a run >> through >> the literature to see... > > You are the one that was discussing 50 year old anecdotes!

I'm showing how YOUR confidence in SRAM is rooted in 50 year old anecdotes and not "modern practices".

>>>> [Picking the "world's most reliable MCU" won't guarantee that it won't >>>> throw >>>> RAM errors in a deployed product.] >>> >>> /Nothing/ will give you guarantees like that. 
But if you pick a >>> microcontroller with ECC on its onboard ram (and cache, if it has it), >>> you reduce, by many orders of magnitude, the risk of single-event upsets >>> (such as cosmic rays) leading to failures of the system. Anything else >>> you can do in software is pointless in comparison. "Testing" your ram >>> can't possibly detect such issues. >>> >>> Not many products justify the extra expense of such microcontrollers, >>> but they are available for those that need them. >> >> Few designs have the features that they require, let alone DESIRE. >> Unless you're working in a market where customers will pay "whatever it >> takes", most designs have to live with some subset of what they would >> LIKE to have in their product. > > In a safety-critical system, the cost of using a microcontroller with > ECC ram is negligible. These are used all the time in the automotive > industry.

So, only safety critical products need to work, reliably? It must be really easy designing with a bar set that low! You don't need to rely on hardware EDAC to improve your confidence in the retentive powers of the RAM (any RAM). That just provides a more immediate indication of a particular detected/corrected fault. It's not uncommon for me to have running checksum processes that continually scan the program store looking for "disturbances". I can't necessarily point to a specific location. Or, an exact time at which the disturbance crept into the system. But, I *do* know that the contents of that memory region are no longer what they SHOULD be. If I have hardware protecting write access to that region, then I can deduce that the error is caused by a fault in a device (even if I can't point to a specific device). In either case, I can't vouch for my product's "output"/functionality. (Or, I can stick my head in the sand and assume that memory is never corrupted) Hardware EDAC also only tells you about errors in REFERENCED locations. 
So, if your code doesn't reference every location "frequently" (for some value of "frequently"), you may not discover the corruption until hours after it occurred. And, the single error may have become a multiple-bit error -- now your EDAC (SECDED) is useless. [This is the same false sense of security that folks using RAID rely on; if you aren't looking at EVERYTHING periodically, then you have no idea as to whether or not it's been corrupted and/or is recoverable (hence the reason for patrol reads)]

>>>> Simply assuming it "can't fail" is naive. >>> >>> Of course. Simply assuming that you can do a test at startup and think >>> that makes the system more reliable is at least equally naïve. >> >> You miss the point of POST. It doesn't MAKE a system more reliable. > > I know it doesn't do that - I've been saying this all along.

Then why are you assuming *I* am professing that?

>> Instead, it tells you when a system is not meeting your expectations. >> This is true of ALL testing. You have a defined point in time -- and >> operating conditions -- in which you hope to catch a failure so that >> you can report on it. A user (customer) is more willing to accept >> "there's a flashing red light on the device" than "the &*^($^& thing >> doesn't work worth a sh*t -- but I can't provide Tech Support with >> any information beyond the fact that I'm frustrated and UNHAPPY WITH >> MY PURCHASE" > > For /some/ devices, some kind of POST can be useful. For many, it is > pointless - it does not detect the failures that actually matter, and > can only detect ones that have negligible chances of occurring.

You install POST/BIST *before* you release the product. You likely discover hardware reliability problems AFTER the design is complete (potentially after it has been released to manufacturing). Few people intentionally design with poor reliability as a goal, implied or otherwise. You don't know what your problems will be -- until you start doing /post mortems/ on returned product. 
This is the WORST time to find out because you likely have lots of product in the field before you can see a pattern in their failures. Now you throw away profit and reputation in trying to compensate for those shortcomings.

> If you have a device that is regularly restarted, and where the hardware > is so fault-prone that you really are finding problems with a POST, then > yes - go for it. > > All I am arguing for is that people /think/ before making a POST, and do > some analysis and investigation to see if it really is a useful feature.

An engineer should always be "thinking" (not necessarily true of a "programmer"). But, there are costs to "omissions" that can be sizeable.

>> BUT, the cost and ease of testing RAM (regardless of technology) at >> power up >> is typically easy to bear in a product's design. It costs me a fraction of >> a second to give a cursory test of 500MB. Chances are, I'm going to find >> failures THERE instead of "dubious behaviors" in the running product. > > Do you understand the concept of cost/use analysis? If something is > useless, or worse than useless, it doesn't help if it is cheap. Well, > it helps for the marketing folks.

Again, you're assuming it IS "useless". Most memory failures that I've encountered are caught in a POST -- stuck at faults, decode faults or problems with "external factors". By catching them, there, before the application runs, I avoid annoying the user. (Yeah, he may be disappointed that the device won't run -- or will only run with reduced capabilities -- but he won't be annoyed that he produced $30,000 of stainless steel parts that are out of tolerance. Or, that 8 hours' production of pharmaceuticals have to be scrapped -- cuz you can't test millions of individual tablets!)

>>>> And, identifying faulty "can't happen" behavior EARLY (e.g. POST) rather >>>> than late gives you a better idea of what to report to the user/customer >>>> because you are closer to the problem's manifestation. 
You don't end >>>> up misbehaving and wondering "why?" >>>> >>>> [And, all of this assumes "bugfree software" so any errors are >>>> entirely a result of hardware faults] >>> >>> And there is perhaps your biggest invalid assumption. Software is >>> always a risk. Software that can't be properly tested is a >>> significantly higher risk. Software designed to handle situations that >>> cannot possibly be reproduced for testing purposes, cannot be properly >>> tested. So writing software test routines for something that has no >>> realistic chance of happening in the field, /reduces/ the reliability of >>> the product. >> >> YOUR biggest invalid assumption is that it has no realistic chance of >> happening. > > Again, in your enthusiasm you have failed to notice what I have written > repeatedly. If there is a /realistic/ chance of a failure, then it will > often make sense to test for it. If there is no such chance - or > negligible chance of it failing without some other major failure, or > nothing you can do about a failure, then there is no point in trying to > test.

But you dismiss this testing as being targeted at something that "won't happen". I contend that it will and does. (though I can't speak re: the OP's specific product)

>> Your SECOND biggest assumption is thinking that folks who are qualified >> to write application software (for often ill-defined scenarios) are >> NOT capable of developing reliable test programs (for very WELL-DEFINED >> scenarios). > > That is often a realistic assumption - different people specialise in > different things. However, it was not an assumption I made - again, you > seem to prefer to make things up rather than read my posts.

You've stated that adding the test(s) decreases reliability. Do the tests physically damage the product? If not, then the only potential downside is if they are implemented defectively -- hence the above.

> Software is always a risk. It might be low risk, but it is always a risk. 
> > >> Do you think *all* MCU-device failures are simply attributable to software >> bugs? Why test anything? ASSUME the power supply and power conditioning >> circuitry will never fail. Assume the various I/Os will never fail. >> Blame every failure on "it must be a bug". Never scrap returned product >> cuz all it needs -- along with every unit coming off the line, TODAY -- is >> a reflash! > > Another wild idea all of your own. > >> Are all of your products short-lived and in inconsequential applications? > > I've made systems that are buried in concrete in oil installations, > working for decades. Do I do that by relying on POSTs, memory tests and > perhaps a watchdog? No.

Instead, you rely on expensive staff being available in the event that a problem occurs. That's not the case with most products or customers. I design differently for environments where I can reasonably expect to have "capable" staff on hand. I expose more details about what I've "noticed" in my product(s) so they can use that to determine how to further test, repair or replace the items. This is no different than "test equipment" manufacturers making diagnostic and calibration procedures available to end users. In some cases, downtime is paramount so I design the entire product with ease of replacement in mind -- swap out the questionable unit, install the spare, forward the old one to us for analysis (or do your own testing, "offline"). This is more than just thinking about making it replaceable; you also have to consider the activities that will be involved in making that replacement! 
In consumer applications, the typical remedy is to have the consumer get annoyed -- dealing with online "chat", or phone support -- as even the simplest problems (operator error) take hours or more to resolve ("The current hold time is 27 minutes."). This has a direct cost to the manufacturer (support staff, repairs, returns) as well as an indirect cost (pissed off customer who typically is more willing to badmouth a disappointing product than praise a delightfully performant one!). The dollars involved "per incident" vary -- as do the quantities. But, I can survive a "bad experience" (in THEIR minds) with an industrial user more readily; they might make me squirm a bit or may extract other concessions from me going forward... but, chances are, they aren't going to pull all of my products and move on to a competitor. It's a more rational "business decision" instead of an EMOTIONAL reaction (for a consumer). OTOH, if I misdiagnose or mistreat a patient and some litigation (and possibly loss) ensues, I can likely write off that business for the foreseeable future (even if I don't directly incur those losses)!

>> Do some reading. You'll learn something. > > Try it yourself. You could start by reading what I wrote. Then, when > you have learned a bit about this stuff, you can start applying a bit of > /thought/ to the process. And when you look at my posts here, you'll > see that what I have been advocating is that people /think/ about what > they are doing with tests - what are they actually trying to achieve, > what use it is, what the risks are. And stop making pointless code just > because you can.

You've not JUST said that. You've said testing SRAM is pointless because (effectively) it never fails.
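Don Y's "running checksum processes that continually scan the program store", mentioned earlier in this post, amount to an incremental patrol read. One possible sketch: check a small chunk per call from an idle loop or low-priority task, and compare each completed pass against a reference taken on the first pass. Fletcher-16 is an arbitrary choice here (a CRC works equally well), and all names are illustrative rather than from any particular system:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

enum scrub_status { SCRUB_BUSY, SCRUB_PASS_OK, SCRUB_CORRUPT };

struct scrubber {
    const uint8_t *base;    /* nominally constant region to patrol */
    size_t len, pos;
    uint16_t f1, f2;        /* running Fletcher-16 state */
    uint16_t reference;     /* sum captured on the first full pass */
    bool have_reference;
};

static void scrub_init(struct scrubber *s, const uint8_t *base, size_t len)
{
    s->base = base; s->len = len; s->pos = 0;
    s->f1 = s->f2 = 0;
    s->have_reference = false;
}

/* Advance the scan by up to 'chunk' bytes; cheap enough to call from
 * an idle loop without hurting latency. */
static enum scrub_status scrub_step(struct scrubber *s, size_t chunk)
{
    while (chunk-- && s->pos < s->len) {
        s->f1 = (uint16_t)((s->f1 + s->base[s->pos++]) % 255);
        s->f2 = (uint16_t)((s->f2 + s->f1) % 255);
    }
    if (s->pos < s->len)
        return SCRUB_BUSY;

    uint16_t sum = (uint16_t)((s->f2 << 8) | s->f1);
    s->pos = 0; s->f1 = s->f2 = 0;      /* rewind for the next pass */
    if (!s->have_reference) {
        s->reference = sum;
        s->have_reference = true;
        return SCRUB_PASS_OK;
    }
    return (sum == s->reference) ? SCRUB_PASS_OK : SCRUB_CORRUPT;
}
```

As Don notes, a SCRUB_CORRUPT result doesn't locate the fault or timestamp it; it only tells you the region no longer holds what it should, which is the point.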
Reply by ● September 24, 2020
On 24/09/2020 15:01, Richard Damon wrote:> On 9/24/20 5:26 AM, David Brown wrote: >> On 24/09/2020 03:55, Richard Damon wrote: >>> On 9/23/20 10:03 AM, David Brown wrote: >> >>>> Have you ever seen microcontroller RAM that failed? It's a possibility >>>> for dynamic ram on PC's that is pushed to its limits for power and >>>> speed, and made as cheaply as possible. But for static RAM in a >>>> microcontroller, the risk of failures is pretty much negligible. The >>>> exception is if a bit is hit by a cosmic ray (or other serious >>>> radiation), which can flip a bit, but that won't be detected by any RAM >>>> test of this kind. >>> >>> I think I have seen it once, the part had gotten electrically stressed >>> in debugging and one of the banks of internal ram failed. We had only >>> put the test in because the unit was going to be in critical >>> infrastructure where certain types of malfunctions could present dangers >>> to people in the area. >> >> Presumably you are careful about keeping the systems that developers >> have potentially broken separate from the systems that get delivered to >> customers. (Another possible cause of this kind of failure is ESD >> damage. Production departments are usually a lot more meticulous about >> ESD than developers.) >> >> If you have a system that is safety critical, you have to do an analysis >> of the risks of things going wrong, the consequences of those failures, >> and how these (risks and consequences) can be reduced or mitigated. If >> you figure out that static failure of the memory is a risk, then testing >> can be worth doing. You might also decide that ECC ram, or redundant >> devices, or external monitors are a better solution. There's no fixed >> answer. 
> > I wasn't saying that such a test does make sense, but that such a test > CAN be done reasonably, if for some legal/political reason it is > introduced as a requirement.

Agreed.

> I brought up the example to show that this > type of error CAN occur.

Fair enough (and I specifically asked for such examples).

> Yes, unless some externally imposed requirement > says to test internal ram, I am unlikely to add such a test for a > production system (I have at times done it in development, mostly to > confirm that I understand the limitations and operation of the device). >

And that is fine, of course. I've also had such tests in production, especially for external memories - it confirms there are no (obvious) soldering defects.

>> >>> >>> It is also possible that many failures might trip a watchdog that forces >>> a reset, and the unit then finds the fault and locks itself 'safe'. >>> >> >> That is definitely possible. But again, be very careful with watchdogs >> - watchdog handling code is rarely properly tested, because it is >> handling situations that don't occur. (Usually it /can/ be tested, but >> that does not mean it /is/ tested.) >> > > Yes, testing watchdogs is tricky. >
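On "testing watchdogs is tricky": one common way to get real coverage of the watchdog path is to let the dog actually fire once, on the first cold boot - arm it, deliberately stop kicking it, and let the reset that follows prove the hardware works. A magic word in RAM that is excluded from startup zeroing survives the reset and tells the next boot the trip was the test. The decision logic below is kept pure so it can be unit-tested off-target; the magic values and the "noinit" mechanism are illustrative assumptions, and arming the watchdog and querying the reset cause are left to the (hardware-specific) caller:

```c
#include <stdint.h>
#include <stdbool.h>

#define WDT_TEST_ARMED  0x57445431u   /* "WDT1": expecting a test reset */
#define WDT_TEST_PASSED 0x57445432u   /* "WDT2": self-test already done */

enum boot_action { BOOT_RUN_WDT_TEST, BOOT_NORMAL, BOOT_WDT_FAULT };

/* flag: the noinit word as found at reset entry (garbage on cold boot).
 * wdt_reset: did the reset-cause register report a watchdog reset? */
static enum boot_action wdt_selftest_decide(uint32_t *flag, bool wdt_reset)
{
    if (*flag == WDT_TEST_ARMED) {
        if (wdt_reset) {              /* the induced trip arrived */
            *flag = WDT_TEST_PASSED;
            return BOOT_NORMAL;
        }
        /* Armed but reset came from something else (could also be a
         * power glitch mid-test - a real system may want a retry). */
        return BOOT_WDT_FAULT;
    }
    if (*flag != WDT_TEST_PASSED) {   /* cold boot: run the test once */
        *flag = WDT_TEST_ARMED;
        return BOOT_RUN_WDT_TEST;     /* caller arms the WDT and spins */
    }
    return BOOT_NORMAL;
}
```

The first boot after power-up thus takes one extra watchdog period; whether that delay is acceptable is exactly the kind of trade-off the thread is arguing about.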