
Power On Self Test

Started by pozz September 23, 2020
I'd like to implement a Power On Self Test to be sure all (or many) 
parts of the electronics are functioning well.

The tests to be done to check external hardware depend on the actual
hardware that is present.

What about the MCU that features an internal Flash (where the code
resides) and RAM and some peripherals? Are there any strategies to test
if the internal RAM or Flash are good? Do you think these kinds of tests
could be useful?

What about a test of the clock based on an external crystal?
On 9/23/2020 3:13 AM, pozz wrote:
> I'd like to implement a Power On Self Test to be sure all (or many)
> parts of the electronics are functioning well.
>
> The tests to be done to check external hardware depend on the actual
> hardware that is present.
>
> What about the MCU that features an internal Flash (where the code
> resides) and RAM and some peripherals? Are there any strategies to test
> if the internal RAM or Flash are good? Do you think these kinds of tests
> could be useful?
>
> What about a test of the clock based on an external crystal?
You can test whatever you think you need confidence in prior to declaring
the system healthy enough to boot.

Historically, a small area of ROM was relied upon to contain enough code
to verify its own integrity along with the rest of the POST/BIST code's
integrity. This was done without referencing RAM (which may be defective).
Some folks would include "CPU tests" to verify the basic integrity of the
processor. I think these are dubious as it likely either works or it
doesn't ("Gee, it can ADD but can't JMP!"). With more advanced CPUs, you'd
likely want to verify that the cache, ECC and VMM hardware behave as
expected (sometimes this requires adding hooks in order to be able to
synthesize faults).

RAM was then tested using strategies appropriate to the RAM technology
used. E.g., DRAM wanted to be tested with long delays between the write
and the read-verify to ensure any failures in the hardware refresh
mechanisms were given an opportunity to manifest. (In certain hardware
configurations, even nonexistent memory can appear to be present and
functional if the test is poorly designed.)

From there, different peripherals could be tested while relying on the now
assumed *functional* ROM & RAM to conduct those tests. I.e., the test
application can start to look like a more full-featured application
instead of tight little bits of code.

You're at the mercy of the hardware designer to incorporate appropriate
hooks to test many aspects of the circuitry. E.g., can you generate a
serial data stream to test that a UART is receiving correctly?
Transmitting? Does your MAC let you push octets onto the wire and see
them, or is the loopback interface purely inside the NIC?

In the past, I've taken unused outputs and used them as termination
voltages for high-impedance pullups/pulldowns that I could use to
determine if an external bit of kit was plugged into the system. I.e.,
drive the termination up, then down -- possibly multiple times, depending
on what those "inputs" feed -- and see if anything is detected. If not, it
is hopefully because the external device is driving those inputs with
lower impedance signals. So, test the external device!

You can test for stuck keys/buttons -- if you can ensure the user (or
mechanism) can be relied upon NOT to activate them during the POST.

You can test for a functional XTAL -- but only if some other timebase
(which may be crude/inaccurate) is operational.

[I once diagnosed a pinball machine as having a defective crystal simply
by observing the refresh of the displays with my unaided eyes -- PGDs
appear to vibrate when lit. Had the POST for the machine been able to
detect -- and flag -- that, it could have diagnosed itself!]

You also have to decide what role the test will play in the user's device
experience: will you flash an indicator telling the user that a fault has
been detected (if so, what will the user do)? Or will you attempt to work
around any faults? How reliable will your "indicator" be? Will you want to
convey anything more informative than "check engine"?

I've added circuitry to my designs to allow me to dynamically (POST as
well as BIST) verify the operational status of the hardware. E.g., every
speaker is accompanied by a microphone -- so I can "listen" to the sounds
that I'm generating to verify the speaker is operational. And, likewise,
so I can generate sounds of particular characteristics to know that my
microphone is working!
Of course, having those "devices" on hand means I can also find uses for
them that I might not have originally included in the design!

In my application, I can move the device to a "testing" state at any time.
In this state, I can load diagnostics (once the device itself has verified
that it is capable of executing those diagnostics!) to do whatever testing
I deem necessary. E.g., if I encounter lots of ECC errors in the onboard
RAM, I can take the device offline and run a comprehensive memory
diagnostic. Depending on the results of that test, I can recertify the
device for normal operation, some subset of normal *or* flag it as
faulted. But my environment expects the devices to operate "unattended"
for very long periods of time, 24/7/365, so I can't rely on the activation
of a POST at power-up.

Think hard about the types of failures you EXPECT to see (i.e., many are
USER errors!) and don't invest too much time detecting things that will
likely never fail OR whose failure you won't be able to do much about.
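A rough sketch of the staged checking described above -- run each
confidence test in a fixed order, report the results, and only then
declare the system healthy -- might look like the following C. The test
names and the stub functions (rom_crc_ok, ram_pattern_ok,
uart_loopback_ok) are hypothetical placeholders for board-specific checks,
not code from any particular product:

#include <stdbool.h>
#include <stdio.h>

/* One entry per POST check; each returns true on pass. */
typedef struct {
    const char *name;
    bool (*run)(void);
} post_test_t;

/* Stubs standing in for the real, hardware-specific tests. */
static bool rom_crc_ok(void)       { return true; }
static bool ram_pattern_ok(void)   { return true; }
static bool uart_loopback_ok(void) { return true; }

static const post_test_t post_tests[] = {
    { "ROM CRC",       rom_crc_ok },
    { "RAM pattern",   ram_pattern_ok },
    { "UART loopback", uart_loopback_ok },
};

/* Run every test and return the number of failures, so the caller
 * can decide whether to boot normally, degrade, or flag a fault. */
int run_post(void)
{
    int failures = 0;
    for (unsigned i = 0; i < sizeof post_tests / sizeof post_tests[0]; i++) {
        bool ok = post_tests[i].run();
        printf("POST %-14s %s\n", post_tests[i].name, ok ? "PASS" : "FAIL");
        if (!ok)
            failures++;
    }
    return failures;
}

Whether that report goes to a console, a log, or a blink code is exactly
the "what will the user do with it?" question raised above.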
On 23/09/2020 12:57, Don Y wrote:
> On 9/23/2020 3:13 AM, pozz wrote:
>> I'd like to implement a Power On Self Test to be sure all (or many)
>> parts of the electronics are functioning well.
>>
>> The tests to be done to check external hardware depend on the actual
>> hardware that is present.
>>
>> What about the MCU that features an internal Flash (where the code
>> resides) and RAM and some peripherals? Are there any strategies to test
>> if the internal RAM or Flash are good? Do you think these kinds of
>> tests could be useful?
>>
>> What about a test of the clock based on an external crystal?
>
<snip>
> Think hard about the types of failures you EXPECT to see (i.e., many are
> USER errors!) and don't invest too much time detecting things that will
> likely never fail OR whose failure you won't be able to do much about.
This last bit is crucial.

A lot of testing "requirements" that are specified are completely
pointless - or far worse than useless, as they introduce real points of
failure in their attempts to cover everything.

First, figure out what you should /not/ test.

Don't bother testing something unless you can usefully handle the failure.
If the way you communicate errors is through a UART, there is no point in
trying to check that the UART is working. If you have a single
microcontroller in the system, there is no point in trying to check the
cpu or the on-chip ram. There is no point in checking that you can write
to flash or on-chip eeprom - all you do is reduce its lifetime and make it
more likely to fail.

Don't write any test code which cannot itself be tested. If you cannot
induce a failure, or at least simulate it reasonably, do not write code to
check or handle that failure. The reality is that the untested code will
have a higher risk of problems than the thing you are testing.

Don't check the ram or the flash of the microcontroller - there's nothing
you can do if there is a failure. (You can check that you have
successfully loaded a new software update, or that there hasn't been a
reset during an update - a CRC for that kind of thing is a good idea.) If
you have a system that is important enough that ram or flash failures need
to be checked and handled, use a safety-qualified microcontroller with ECC
ram, flash, cache, etc., and perhaps even redundant cores (you get these
with PowerPC and Cortex-R cores).

And think about what can reasonably go wrong, how it can go wrong, and
what can be done about it. Other than for devices susceptible to current
surges (like filament light bulbs), most hardware failures are in usage,
not while power is off - checking on power-up (rather than while the
system is in use) usually only makes sense if it is likely for a user to
see there is a problem and try to "fix" it by turning power off and on
again.
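The "CRC for that kind of thing" check can be as small as the sketch
below. It assumes, purely for illustration, that the build process appends
the expected CRC-32 as the last four bytes (little-endian) of the image;
the polynomial is the common reflected 0xEDB88320 form:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *buf++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Verify an image whose last 4 bytes hold its expected CRC-32
 * (hypothetical layout -- adjust to however your build stores it). */
bool app_image_ok(const uint8_t *image, size_t len_with_crc)
{
    if (len_with_crc < 4)
        return false;

    size_t len = len_with_crc - 4;
    uint32_t stored = (uint32_t)image[len]
                    | ((uint32_t)image[len + 1] << 8)
                    | ((uint32_t)image[len + 2] << 16)
                    | ((uint32_t)image[len + 3] << 24);

    return crc32_calc(image, len) == stored;
}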
On 9/23/20 8:13 AM, David Brown wrote:
> On 23/09/2020 12:57, Don Y wrote:
>> On 9/23/2020 3:13 AM, pozz wrote:
>>> I'd like to implement a Power On Self Test to be sure all (or many)
>>> parts of the electronics are functioning well.
<snip>
>
> Don't check the ram or the flash of the microcontroller - there's nothing
> you can do if there is a failure. (You can check that you have
> successfully loaded a new software update, or that there hasn't been a
> reset during an update - a CRC for that kind of thing is a good idea.) If
> you have a system that is important enough that ram or flash failures need
> to be checked and handled, use a safety-qualified microcontroller with ECC
> ram, flash, cache, etc., and perhaps even redundant cores (you get these
> with PowerPC and Cortex-R cores).
>
> And think about what can reasonably go wrong, how it can go wrong, and
> what can be done about it. Other than for devices susceptible to current
> surges (like filament light bulbs), most hardware failures are in usage,
> not while power is off - checking on power-up (rather than while the
> system is in use) usually only makes sense if it is likely for a user to
> see there is a problem and try to "fix" it by turning power off and on
> again.
>
Testing RAM can be useful, letting the system fail gracefully rather than
acting flaky - perhaps just locking up in a tight loop flashing an LED as
a fault indicator.

Similarly, you could CRC-check the program flash and fail on an error,
preferably falling into a minimal system that allows the user to reflash,
though it might mean just bricking.

Note that, as you say, most faults will happen while powered up, but many
faults will cause a system crash, which the user is likely to power cycle
to try to clear. So power-up is a good time to check, since many things
are a lot harder to check while the system is running in operation.
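The "tight loop flashing an LED" fallback is about the simplest failure
handler there is; something along these lines, where led_toggle() and
delay_ms() stand in for whatever board-specific routines are available:

#include <stdint.h>

/* Hypothetical board support routines. */
extern void led_toggle(void);
extern void delay_ms(uint32_t ms);

/* Never returns: blink 'error_code' pulses, pause, repeat, so a
 * technician can count the flashes and identify which check failed. */
void post_fail(uint8_t error_code)
{
    for (;;) {
        for (uint8_t i = 0; i < error_code; i++) {
            led_toggle();            /* LED on  */
            delay_ms(200);
            led_toggle();            /* LED off */
            delay_ms(200);
        }
        delay_ms(1000);
    }
}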
On 23/09/2020 14:51, Richard Damon wrote:
> On 9/23/20 8:13 AM, David Brown wrote:
>> On 23/09/2020 12:57, Don Y wrote:
>>> On 9/23/2020 3:13 AM, pozz wrote:
>>>> I'd like to implement a Power On Self Test to be sure all (or many)
>>>> parts of the electronics are functioning well.
<snip>
>
> Testing RAM can be useful, letting the system fail gracefully rather than
> acting flaky - perhaps just locking up in a tight loop flashing an LED as
> a fault indicator.
Have you ever seen microcontroller RAM that failed? It's a possibility for
dynamic ram on PC's that is pushed to its limits for power and speed, and
made as cheaply as possible. But for static RAM in a microcontroller, the
risk of failures is pretty much negligible. The exception is if a bit is
hit by a cosmic ray (or other serious radiation), which can flip a bit,
but that won't be detected by any RAM test of this kind.

Testing RAM is useful /if/ it can fail, and /if/ you can do something
useful when it fails. (I agree that "a tight loop flashing an LED" might
count as something useful, depending on the situation.)

I've seen "safety standards requirements" that included regular ram tests.
Such requirements generally originate decades ago, and are not
appropriately nuanced for real-life systems. I've seen resulting code used
to implement such tests, added solely to fulfil such requirements. And
I've seen such code written in a way that is untested and untestable, and
in a way that has risks that /hugely/ outweigh those of a fault occurring
in the on-board RAM.

If the OP is in the situation where there are customer requirements for
fulfilling certain safety requirements that include ram tests, and where
"mindlessly obeying these rules no matter how pointless they are in
reality" is the right choice to please arse-covering lawyers, then go for
it. If not, then think long and hard about the realism of such a failure
and such a test, and whether it is truly a positive contribution to the
project as a whole.
> Similarly, you could CRC-check the program flash and fail on an error,
> preferably falling into a minimal system that allows the user to reflash,
> though it might mean just bricking.
>
The possibility of a flash failure is a great deal higher than that of a
RAM failure. Flash writes are analogue - a bit can be written in such a
way that it reads back correctly at programming time, but goes outside the
margins over time or at different temperatures or voltages. So yes,
sometimes a CRC of the flash is worth doing.

But remember that the program doing the check is just as much at risk of
such failures (perhaps even more so, if you have a "boot" program that
does the check of the "main" program, as the boot program is less likely
to be updated and thus its bits will have decayed over a longer time). If
flash failures are a real risk, and the system is important enough, it's
better to pick a microcontroller with ECC flash.
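For the "boot program that checks the main program" arrangement, the boot
stage might verify the application region and refuse to jump if the check
fails. This is only a sketch: APP_BASE and APP_SIZE are made-up values,
app_image_ok() is the hypothetical checker from the earlier sketch, and
the jump sequence is the usual Cortex-M pattern (word 0 of the image is
the initial stack pointer, word 1 the reset handler), written with GCC
inline assembly. A real bootloader would also disable interrupts and
relocate the vector table first:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APP_BASE 0x08008000u   /* hypothetical application start   */
#define APP_SIZE 0x00038000u   /* hypothetical application length  */

extern bool app_image_ok(const uint8_t *image, size_t len_with_crc);

void boot_application(void)
{
    if (!app_image_ok((const uint8_t *)APP_BASE, APP_SIZE)) {
        /* Stay in the bootloader (e.g. service a recovery protocol)
         * rather than run a corrupted application. */
        for (;;) { }
    }

    uint32_t sp    = ((const uint32_t *)APP_BASE)[0];
    uint32_t entry = ((const uint32_t *)APP_BASE)[1];

    __asm volatile ("msr msp, %0" :: "r" (sp));  /* set main stack pointer */
    ((void (*)(void))entry)();                   /* jump to reset handler  */
}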
> Note that, as you say, most faults will happen while powered up, but many
> faults will cause a system crash, which the user is likely to power cycle
> to try to clear. So power-up is a good time to check, since many things
> are a lot harder to check while the system is running in operation.
>
Yes, I mentioned that. (It assumes the embedded system has a user that can do such a power-cycle.)
On 9/23/2020 7:03 AM, David Brown wrote:
> On 23/09/2020 14:51, Richard Damon wrote:
>> Testing RAM can be useful, letting the system fail gracefully rather
>> than acting flaky - perhaps just locking up in a tight loop flashing an
>> LED as a fault indicator.
Exactly. If memory is expected to work -- and NEVER expected to fail --
then it's a small cost to make some attempt to prove that this is actually
the case. Otherwise, when that "Can't Happen" actually does, you're left
clueless.

[In the 70's, a common system failure I encountered was an address
decoding error which would effectively disable all memory (think
misprogrammed PLA). It was readily apparent as the processor would be
found halted at ~0x0076 (IIRC) -- 0x76 being the opcode for HALT, which
would be the low byte of the address still "floating" on the multiplexed
address/data bus. Nowadays, one can imagine similar failures -- including
grown defects -- deleteriously affecting deployed product.]
> Have you ever seen microcontroller RAM that failed? It's a possibility
> for dynamic ram on PC's that is pushed to its limits for power and speed,
> and made as cheaply as possible. But for static RAM in a microcontroller,
> the risk of failures is pretty much negligible. The exception is if a bit
> is hit by a cosmic ray (or other serious radiation), which can flip a
> bit, but that won't be detected by any RAM test of this kind.
You're assuming that there is only one, predetermined way to get into the
self-test routine. And that nothing in the machine has failed that would
render that assumption false.

At each point in your code, you should know which assumptions are safe and
which are yet to be proven/made safe. If you're in the self-test routine,
you shouldn't have to wonder if memory works, is configured as you expect
it to be, etc. ("Hey, I'm running code, so why bother to TEST the code
image??")

Assuming that the memory is operational NOW (while I am executing this
piece of self-test code) is a hazard waiting to happen. For example, an
errant RETURN could land the program IN the self-test code WITHOUT the
benefit of having been through the controlled, repeatable start-up
sequence. (I.e., the RAM -- or other resource -- may NOW be mapped to a
different location in the address space such that code written under the
assumption that it resides in its "power on reset" configuration no longer
works properly.)

I'd rather have that code FAIL and report the error to me -- because it
tried to verify some assumption(s) and failed -- than have the code
continue to operate FROM THERE on the assumption that it is actually
(later) talking to functional RAM that has yet to be reconfigured.
Otherwise, you get a "fluke" that you can never resolve (and, because you
can't easily sort out what might have happened in order to reproduce and
repair it, you shrug it off due to time pressures -- even though YOU SAW
IT FAIL!).

[I have an entry point in all of my products called RESET. It manually and
deliberately works to restore the hardware to the same condition that it
was in just after the application of power. So, any code that executes
after passing through that entry point -- to "RESETTED" -- SHOULD behave
the same regardless of whether power was just applied, or not.]

Note that there's a difference between the sort of "confidence testing"
that occurs at POST (how many devices perform exhaustive tests at POST?
How many users would tolerate that sort of delay?) and "diagnostic
testing", which truly provides an assessment of the health of the device
and can often be used to assist in determining the need for replacement
(or, for self-healing).

In most cases, you can test RAM with a single write pass followed by a
verification read pass and be reasonably sure that you've caught stuck-at
failures as well as decode failures -- no need for a whole barrage of
different tests when you're typically looking for a simple Go-NoGo.

[I run three passes on a 512MB block and use that as a crude assessment as
to whether or not the memory will LIKELY accept a program image.
Installing -- and verifying -- that image acts as a further test of the
memory's crude functionality. Thereafter, I swap pages of memory out and
exercise them to verify that I'm not seeing an increase in ECC activity in
a particular region -- which I will remap if need be.]

You also need to know how the device is fabricated; a memory module will
experience different errors than memory that is soldered down (and, in the
latter case, you have to be prepared for the memory to NOT be what you
THOUGHT it was going to be, at design time). And soldered-down memory will
behave differently than chip-on-chip. Folks write ONE memory test and then
assume all memory behaves (fails!) the same.

If you don't understand your hardware and how it can fail, you shouldn't
be the one who is designing the test suite!
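The single write pass / read-verify pass mentioned above can be a few
lines of C. Filling each word with a value derived from its own address
catches stuck bits and gross address-decode faults in one sweep. This is a
sketch only, under the assumption that the caller supplies a region which
does not contain the stack, heap, or any other live data while the test
runs:

#include <stdbool.h>
#include <stdint.h>

/* Quick go/no-go RAM check: one write pass, one verify pass. */
bool ram_quick_test(volatile uint32_t *base, uint32_t words)
{
    /* Write an address-dependent pattern into every word. */
    for (uint32_t i = 0; i < words; i++)
        base[i] = i ^ 0xA5A5A5A5u;

    /* Read it back; any stuck-at bit or decode alias shows up as a
     * mismatch between the word and its expected pattern. */
    for (uint32_t i = 0; i < words; i++)
        if (base[i] != (i ^ 0xA5A5A5A5u))
            return false;

    return true;
}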
On 23/09/2020 18:36, Don Y wrote:

> If you don't understand your hardware and how it can fail, you shouldn't
> be the one who is designing the test suite!
That bit is correct. The rest - well, I don't want to get into a long and
protracted argument.

Any system is made up of layers. Higher level layers assume that lower
level layers work according to specification (which may include indicating
an error for some kinds of detectable fault). If you think the higher
level part can fully verify the lower level parts - "prove" that the
assumptions hold - you are fooling yourself.

When you design a system based on a microcontroller, you pick a device
that is as reliable as you need it to be - so that you /can/ assume the
core parts (cpu, ram, flash, interrupts, etc.) work well enough for your
needs. If you are not sure it is reliable enough, pick a different device
or make a redundant system.

No amount of testing can /ever/ prove that something works - it can only
prove that something does /not/ work.
On 9/23/2020 12:02 PM, David Brown wrote:
> On 23/09/2020 18:36, Don Y wrote:
>
>> If you don't understand your hardware and how it can fail, you shouldn't
>> be the one who is designing the test suite!
>
> That bit is correct. The rest - well, I don't want to get into a long and
> protracted argument.
>
> Any system is made up of layers. Higher level layers assume that lower
> level layers work according to specification (which may include
> indicating an error for some kinds of detectable fault). If you think the
> higher level part can fully verify the lower level parts - "prove" that
> the assumptions hold - you are fooling yourself.
>
> When you design a system based on a microcontroller, you pick a device
> that is as reliable as you need it to be - so that you /can/ assume the
> core parts (cpu, ram, flash, interrupts, etc.) work well enough for your
> needs. If you are not sure it is reliable enough, pick a different device
> or make a redundant system.
>
> No amount of testing can /ever/ prove that something works - it can only
> prove that something does /not/ work.
A system is not a static entity. It changes over time (even if the design
is frozen). So, while your RAM (or any other component) may not be LIKELY
to fail, the rest of the system that enables the RAM to function as
intended can change in ways that manifest as RAM (or other resource)
failures.

[Picking the "world's most reliable MCU" won't guarantee that it won't
throw RAM errors in a deployed product.]

Simply assuming it "can't fail" is naive. And, identifying faulty "can't
happen" behavior EARLY (e.g., POST) rather than late gives you a better
idea of what to report to the user/customer, because you are closer to the
problem's manifestation. You don't end up misbehaving and wondering
"why?"

[And all of this assumes "bug-free software", so any errors are entirely a
result of hardware faults.]
David Brown <david.brown@hesbynett.no> writes:
> When you design a system based on a microcontroller, you pick a device
> that is as reliable as you need it to be
That might not exist. E.g. it's common for security processors and software to continuously self-test while running, since the user might be trying to tamper with them. "Differential fault analysis" is a relevant search string. The attacker does stuff like intentionally overclock the processor in the hope of introducing errors, so they can observe the difference between the error result and the normal result, and infer stuff about the supposedly-secured info inside the processor. There is no magic way to defeat these attacks, but the cpu designers do what they can.
On 23/09/2020 11:13:43, pozz wrote:
> I'd like to implement a Power On Self Test to be sure all (or many)
> parts of the electronics are functioning well.
>
> The tests to be done to check external hardware depend on the actual
> hardware that is present.
>
> What about the MCU that features an internal Flash (where the code
> resides) and RAM and some peripherals? Are there any strategies to test
> if the internal RAM or Flash are good? Do you think these kinds of tests
> could be useful?
>
> What about a test of the clock based on an external crystal?
I've done this with the STM32 variety of MCUs. The device itself has a
Flash checksum and, if this fails, it won't start.

ST also provide some example code and libraries for POST. These are more
comprehensive than just checking RAM. Might be worth having a look.

-- 
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
