
Power On Self Test

Started by pozz September 23, 2020
I'd like to implement a Power On Self Test to be sure all (or many) 
parts of the electronics are functioning well.

The tests to be done to check external hardware depend on the actual
hardware that is present.

What about the MCU that features an internal Flash (where the code
resides) and RAM and some peripherals? Are there any strategies to test
if the internal RAM or Flash are good? Do you think these kinds of tests
could be useful?

What about a test of the clock based on an external crystal?
On 9/23/2020 3:13 AM, pozz wrote:
> I'd like to implement a Power On Self Test to be sure all (or many)
> parts of the electronics are functioning well.
>
> The tests to be done to check external hardware depend on the actual
> hardware that is present.
>
> What about the MCU that features an internal Flash (where the code
> resides) and RAM and some peripherals? Are there any strategies to test
> if the internal RAM or Flash are good? Do you think these kinds of tests
> could be useful?
>
> What about a test of the clock based on an external crystal?
You can test whatever you think you need confidence in prior to declaring
the system healthy enough to boot.

Historically, a small area of ROM was relied upon to contain enough code
to verify its own integrity along with the rest of the POST/BIST code's
integrity. This was done without referencing RAM (which may be defective).
Some folks would include "CPU tests" to verify the basic integrity of the
processor. I think these are dubious as it likely either works or it
doesn't ("Gee, it can ADD but can't JMP!"). With more advanced CPUs, you'd
likely want to verify that the cache, ECC and VMM hardware behave as
expected (sometimes this requires adding hooks in order to be able to
synthesize faults).

RAM was then tested using strategies appropriate to the RAM technology
used. E.g., DRAM wanted to be tested with long delays between the write
and the read-verify to ensure any failures in the hardware refresh
mechanisms were given an opportunity to manifest. (In certain hardware
configurations, even nonexistent memory can appear to be present and
functional if the test is poorly designed.)

From there, different peripherals could be tested while relying on the now
assumed *functional* ROM & RAM to conduct those tests. I.e., the test
application can start to look like a more full-featured application
instead of tight little bits of code.

You're at the mercy of the hardware designer to incorporate appropriate
hooks to test many aspects of the circuitry. E.g., can you generate a
serial data stream to test that a UART is receiving correctly?
Transmitting? Does your MAC let you push octets onto the wire and see
them, or is the loopback interface purely inside the NIC?

In the past, I've taken unused outputs and used them as termination
voltages for high-impedance pullups/pulldowns that I could use to
determine if an external bit of kit was plugged into the system. I.e.,
drive the termination up, then down -- possibly multiple times, depending
on what those "inputs" feed -- and see if anything is detected. If not, it
is hopefully because the external device is driving those inputs with
lower impedance signals. So, test the external device!

You can test for stuck keys/buttons -- if you can ensure the user (or
mechanism) can be relied upon NOT to activate them during the POST.

You can test for a functional XTAL -- but only if some other timebase
(which may be crude/inaccurate) is operational.

[I once diagnosed a pinball machine as having a defective crystal simply
by observing the refresh of the displays with my unaided eyes -- PGDs
appear to vibrate when lit. Had the POST for the machine been able to
detect -- and flag -- that, it could have diagnosed itself!]

You also have to decide what role the test will play in the user's device
experience: will you flash an indicator telling the user that a fault has
been detected (if so, what will the user do)? Or will you attempt to work
around any faults? How reliable will your "indicator" be? Will you want to
convey anything more informative than "check engine"?

I've added circuitry to my designs to allow me to dynamically (POST as
well as BIST) verify the operational status of the hardware. E.g., every
speaker is accompanied by a microphone -- so I can "listen" to the sounds
that I'm generating to verify the speaker is operational. And, likewise,
so I can generate sounds of particular characteristics to know that my
microphone is working!
Of course, having those "devices" on hand means I can also find uses for
them that I might not have originally included in the design!

In my application, I can move the device to a "testing" state at any time.
In this state, I can load diagnostics (once the device itself has verified
that it is capable of executing those diagnostics!) to do whatever testing
I deem necessary. E.g., if I encounter lots of ECC errors in the onboard
RAM, I can take the device offline and run a comprehensive memory
diagnostic. Depending on the results of that test, I can recertify the
device for normal operation, some subset of normal *or* flag it as
faulted. But my environment expects the devices to operate "unattended"
for very long periods of time, 24/7/365, so I can't rely on the activation
of a POST at power-up.

Think hard about the types of failures you EXPECT to see (i.e., many are
USER errors!) and don't invest too much time detecting things that will
likely never fail OR whose failure you won't be able to do much about.
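A rough sketch of the staged checking described above -- run each
confidence test in a fixed order, report the results, and only then
declare the system healthy -- might look like the following C. The test
names and the stub functions (rom_crc_ok, ram_pattern_ok,
uart_loopback_ok) are hypothetical placeholders for board-specific checks,
not code from any particular product:

#include <stdbool.h>
#include <stdio.h>

/* One entry per POST check; each returns true on pass. */
typedef struct {
    const char *name;
    bool (*run)(void);
} post_test_t;

/* Stubs standing in for the real, hardware-specific tests. */
static bool rom_crc_ok(void)       { return true; }
static bool ram_pattern_ok(void)   { return true; }
static bool uart_loopback_ok(void) { return true; }

static const post_test_t post_tests[] = {
    { "ROM CRC",       rom_crc_ok },
    { "RAM pattern",   ram_pattern_ok },
    { "UART loopback", uart_loopback_ok },
};

/* Run every test and return the number of failures, so the caller
 * can decide whether to boot normally, degrade, or flag a fault. */
int run_post(void)
{
    int failures = 0;
    for (unsigned i = 0; i < sizeof post_tests / sizeof post_tests[0]; i++) {
        bool ok = post_tests[i].run();
        printf("POST %-14s %s\n", post_tests[i].name, ok ? "PASS" : "FAIL");
        if (!ok)
            failures++;
    }
    return failures;
}

Whether that report goes to a console, a log, or a blink code is exactly
the "what will the user do with it?" question raised above.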
On 23/09/2020 12:57, Don Y wrote:
> On 9/23/2020 3:13 AM, pozz wrote:
>> I'd like to implement a Power On Self Test to be sure all (or many)
>> parts of the electronics are functioning well.
>>
>> The tests to be done to check external hardware depend on the actual
>> hardware that is present.
>>
>> What about the MCU that features an internal Flash (where the code
>> resides) and RAM and some peripherals? Are there any strategies to test
>> if the internal RAM or Flash are good? Do you think these kinds of
>> tests could be useful?
>>
>> What about a test of the clock based on an external crystal?
>
<snip>
> Think hard about the types of failures you EXPECT to see (i.e., many are
> USER errors!) and don't invest too much time detecting things that will
> likely never fail OR whose failure you won't be able to do much about.
This last bit is crucial.

A lot of testing "requirements" that are specified are completely
pointless - or far worse than useless, as they introduce real points of
failure in their attempts to cover everything.

First, figure out what you should /not/ test.

Don't bother testing something unless you can usefully handle the failure.
If the way you communicate errors is through a UART, there is no point in
trying to check that the UART is working. If you have a single
microcontroller in the system, there is no point in trying to check the
cpu or the on-chip ram. There is no point in checking that you can write
to flash or on-chip eeprom - all you do is reduce its lifetime and make it
more likely to fail.

Don't write any test code which cannot itself be tested. If you cannot
induce a failure, or at least simulate it reasonably, do not write code to
check or handle that failure. The reality is that the untested code will
have a higher risk of problems than the thing you are testing.

Don't check the ram or the flash of the microcontroller - there's nothing
you can do if there is a failure. (You can check that you have
successfully loaded a new software update, or that there hasn't been a
reset during an update - a CRC for that kind of thing is a good idea.) If
you have a system that is important enough that ram or flash failures need
to be checked and handled, use a safety-qualified microcontroller with ECC
ram, flash, cache, etc., and perhaps even redundant cores (you get these
with PowerPC and Cortex-R cores).

And think about what can reasonably go wrong, how it can go wrong, and
what can be done about it. Other than for devices susceptible to current
surges (like filament light bulbs), most hardware failures are in usage,
not while power is off - checking on power-up (rather than while the
system is in use) usually only makes sense if it is likely for a user to
see there is a problem and try to "fix" it by turning power off and on
again.
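The "CRC for that kind of thing" check can be as small as the sketch
below. It assumes, purely for illustration, that the build process appends
the expected CRC-32 as the last four bytes (little-endian) of the image;
the polynomial is the common reflected 0xEDB88320 form:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *buf++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Verify an image whose last 4 bytes hold its expected CRC-32
 * (hypothetical layout -- adjust to however your build stores it). */
bool app_image_ok(const uint8_t *image, size_t len_with_crc)
{
    if (len_with_crc < 4)
        return false;

    size_t len = len_with_crc - 4;
    uint32_t stored = (uint32_t)image[len]
                    | ((uint32_t)image[len + 1] << 8)
                    | ((uint32_t)image[len + 2] << 16)
                    | ((uint32_t)image[len + 3] << 24);

    return crc32_calc(image, len) == stored;
}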
On 9/23/20 8:13 AM, David Brown wrote:
> On 23/09/2020 12:57, Don Y wrote:
>> On 9/23/2020 3:13 AM, pozz wrote:
>>> I'd like to implement a Power On Self Test to be sure all (or many)
>>> parts of the electronics are functioning well.
<snip>
>
> Don't check the ram or the flash of the microcontroller - there's nothing
> you can do if there is a failure. (You can check that you have
> successfully loaded a new software update, or that there hasn't been a
> reset during an update - a CRC for that kind of thing is a good idea.) If
> you have a system that is important enough that ram or flash failures need
> to be checked and handled, use a safety-qualified microcontroller with ECC
> ram, flash, cache, etc., and perhaps even redundant cores (you get these
> with PowerPC and Cortex-R cores).
>
> And think about what can reasonably go wrong, how it can go wrong, and
> what can be done about it. Other than for devices susceptible to current
> surges (like filament light bulbs), most hardware failures are in usage,
> not while power is off - checking on power-up (rather than while the
> system is in use) usually only makes sense if it is likely for a user to
> see there is a problem and try to "fix" it by turning power off and on
> again.
>
Testing RAM can be useful, letting the system fail gracefully rather than
acting flaky - perhaps just locking up in a tight loop flashing an LED as
a fault indicator.

Similarly, you could CRC-check the program flash and fail on an error,
preferably falling into a minimal system that allows the user to reflash,
though it might mean just bricking.

Note that, as you say, most faults will happen while powered up, but many
faults will cause a system crash, which the user is likely to power cycle
to try to clear. So power-up is a good time to check, since many things
are a lot harder to check while the system is running in operation.
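The "tight loop flashing an LED" fallback is about the simplest failure
handler there is; something along these lines, where led_toggle() and
delay_ms() stand in for whatever board-specific routines are available:

#include <stdint.h>

/* Hypothetical board support routines. */
extern void led_toggle(void);
extern void delay_ms(uint32_t ms);

/* Never returns: blink 'error_code' pulses, pause, repeat, so a
 * technician can count the flashes and identify which check failed. */
void post_fail(uint8_t error_code)
{
    for (;;) {
        for (uint8_t i = 0; i < error_code; i++) {
            led_toggle();            /* LED on  */
            delay_ms(200);
            led_toggle();            /* LED off */
            delay_ms(200);
        }
        delay_ms(1000);
    }
}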
On 23/09/2020 14:51, Richard Damon wrote:
> On 9/23/20 8:13 AM, David Brown wrote:
>> On 23/09/2020 12:57, Don Y wrote:
>>> On 9/23/2020 3:13 AM, pozz wrote:
>>>> I'd like to implement a Power On Self Test to be sure all (or many)
>>>> parts of the electronics are functioning well.
<snip>
>
> Testing RAM can be useful, letting the system fail gracefully rather than
> acting flaky - perhaps just locking up in a tight loop flashing an LED as
> a fault indicator.
Have you ever seen microcontroller RAM that failed? It's a possibility for
dynamic ram on PC's that is pushed to its limits for power and speed, and
made as cheaply as possible. But for static RAM in a microcontroller, the
risk of failures is pretty much negligible. The exception is if a bit is
hit by a cosmic ray (or other serious radiation), which can flip a bit,
but that won't be detected by any RAM test of this kind.

Testing RAM is useful /if/ it can fail, and /if/ you can do something
useful when it fails. (I agree that "a tight loop flashing an LED" might
count as something useful, depending on the situation.)

I've seen "safety standards requirements" that included regular ram tests.
Such requirements generally originate decades ago, and are not
appropriately nuanced for real-life systems. I've seen resulting code used
to implement such tests, added solely to fulfil such requirements. And
I've seen such code written in a way that is untested and untestable, and
in a way that has risks that /hugely/ outweigh those of a fault occurring
in the on-board RAM.

If the OP is in the situation where there are customer requirements for
fulfilling certain safety requirements that include ram tests, and where
"mindlessly obeying these rules no matter how pointless they are in
reality" is the right choice to please arse-covering lawyers, then go for
it. If not, then think long and hard about the realism of such a failure
and such a test, and whether it is truly a positive contribution to the
project as a whole.
> Similarly, you could CRC-check the program flash and fail on an error,
> preferably falling into a minimal system that allows the user to reflash,
> though it might mean just bricking.
>
The possibility of a flash failure is a great deal higher than that of a
RAM failure. Flash writes are analogue - a bit can be written in such a
way that it reads back correctly at programming time, but goes outside the
margins over time or at different temperatures or voltages. So yes,
sometimes a CRC of the flash is worth doing.

But remember that the program doing the check is just as much at risk of
such failures (perhaps even more so, if you have a "boot" program that
does the check of the "main" program, as the boot program is less likely
to be updated and thus its bits will have decayed over a longer time). If
flash failures are a real risk, and the system is important enough, it's
better to pick a microcontroller with ECC flash.
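For the "boot program that checks the main program" arrangement, the boot
stage might verify the application region and refuse to jump if the check
fails. This is only a sketch: APP_BASE and APP_SIZE are made-up values,
app_image_ok() is the hypothetical checker from the earlier sketch, and
the jump sequence is the usual Cortex-M pattern (word 0 of the image is
the initial stack pointer, word 1 the reset handler), written with GCC
inline assembly. A real bootloader would also disable interrupts and
relocate the vector table first:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APP_BASE 0x08008000u   /* hypothetical application start   */
#define APP_SIZE 0x00038000u   /* hypothetical application length  */

extern bool app_image_ok(const uint8_t *image, size_t len_with_crc);

void boot_application(void)
{
    if (!app_image_ok((const uint8_t *)APP_BASE, APP_SIZE)) {
        /* Stay in the bootloader (e.g. service a recovery protocol)
         * rather than run a corrupted application. */
        for (;;) { }
    }

    uint32_t sp    = ((const uint32_t *)APP_BASE)[0];
    uint32_t entry = ((const uint32_t *)APP_BASE)[1];

    __asm volatile ("msr msp, %0" :: "r" (sp));  /* set main stack pointer */
    ((void (*)(void))entry)();                   /* jump to reset handler  */
}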
> Note that, as you say, most faults will happen while powered up, but many
> faults will cause a system crash, which the user is likely to power cycle
> to try to clear. So power-up is a good time to check, since many things
> are a lot harder to check while the system is running in operation.
>
Yes, I mentioned that. (It assumes the embedded system has a user that can do such a power-cycle.)
On 9/23/2020 7:03 AM, David Brown wrote:
> On 23/09/2020 14:51, Richard Damon wrote:
>> Testing RAM can be useful, letting the system fail gracefully rather
>> than acting flaky - perhaps just locking up in a tight loop flashing an
>> LED as a fault indicator.
Exactly. If memory is expected to work -- and NEVER expected to fail --
then it's a small cost to make some attempt to prove that this is actually
the case. Otherwise, when that "Can't Happen" actually does, you're left
clueless.

[In the 70's, a common system failure I encountered was an address
decoding error which would effectively disable all memory (think
misprogrammed PLA). It was readily apparent as the processor would be
found halted at ~0x0076 (IIRC) -- 0x76 being the opcode for HALT, which
would be the low byte of the address still "floating" on the multiplexed
address/data bus. Nowadays, one can imagine similar failures -- including
grown defects -- deleteriously affecting deployed product.]
> Have you ever seen microcontroller RAM that failed? It's a possibility
> for dynamic ram on PC's that is pushed to its limits for power and speed,
> and made as cheaply as possible. But for static RAM in a microcontroller,
> the risk of failures is pretty much negligible. The exception is if a bit
> is hit by a cosmic ray (or other serious radiation), which can flip a
> bit, but that won't be detected by any RAM test of this kind.
You're assuming that there is only one, predetermined way to get into the
self-test routine. And that nothing in the machine has failed that would
render that assumption false.

At each point in your code, you should know which assumptions are safe and
which are yet to be proven/made safe. If you're in the self-test routine,
you shouldn't have to wonder if memory works, is configured as you expect
it to be, etc. ("Hey, I'm running code, so why bother to TEST the code
image??")

Assuming that the memory is operational NOW (while I am executing this
piece of self-test code) is a hazard waiting to happen. For example, an
errant RETURN could land the program IN the self-test code WITHOUT the
benefit of having been through the controlled, repeatable start-up
sequence. (I.e., the RAM -- or other resource -- may NOW be mapped to a
different location in the address space such that code written under the
assumption that it resides in its "power on reset" configuration no longer
works properly.)

I'd rather have that code FAIL and report the error to me -- because it
tried to verify some assumption(s) and failed -- than have the code
continue to operate FROM THERE on the assumption that it is actually
(later) talking to functional RAM that has yet to be reconfigured.
Otherwise, you get a "fluke" that you can never resolve (and, because you
can't easily sort out what might have happened in order to reproduce and
repair it, you shrug it off due to time pressures -- even though YOU SAW
IT FAIL!).

[I have an entry point in all of my products called RESET. It manually and
deliberately works to restore the hardware to the same condition that it
was in just after the application of power. So, any code that executes
after passing through that entry point -- to "RESETTED" -- SHOULD behave
the same regardless of whether power was just applied, or not.]

Note that there's a difference between the sort of "confidence testing"
that occurs at POST (how many devices perform exhaustive tests at POST?
How many users would tolerate that sort of delay?) and "diagnostic
testing", which truly provides an assessment of the health of the device
and can often be used to assist in determining the need for replacement
(or, for self-healing).

In most cases, you can test RAM with a single write pass followed by a
verification read pass and be reasonably sure that you've caught stuck-at
failures as well as decode failures -- no need for a whole barrage of
different tests when you're typically looking for a simple Go-NoGo.

[I run three passes on a 512MB block and use that as a crude assessment as
to whether or not the memory will LIKELY accept a program image.
Installing -- and verifying -- that image acts as a further test of the
memory's crude functionality. Thereafter, I swap pages of memory out and
exercise them to verify that I'm not seeing an increase in ECC activity in
a particular region -- which I will remap if need be.]

You also need to know how the device is fabricated; a memory module will
experience different errors than memory that is soldered down (and, in the
latter case, you have to be prepared for the memory to NOT be what you
THOUGHT it was going to be, at design time). And soldered-down memory will
behave differently than chip-on-chip. Folks write ONE memory test and then
assume all memory behaves (fails!) the same.

If you don't understand your hardware and how it can fail, you shouldn't
be the one who is designing the test suite!
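The single write pass / read-verify pass mentioned above can be a few
lines of C. Filling each word with a value derived from its own address
catches stuck bits and gross address-decode faults in one sweep. This is a
sketch only, under the assumption that the caller supplies a region which
does not contain the stack, heap, or any other live data while the test
runs:

#include <stdbool.h>
#include <stdint.h>

/* Quick go/no-go RAM check: one write pass, one verify pass. */
bool ram_quick_test(volatile uint32_t *base, uint32_t words)
{
    /* Write an address-dependent pattern into every word. */
    for (uint32_t i = 0; i < words; i++)
        base[i] = i ^ 0xA5A5A5A5u;

    /* Read it back; any stuck-at bit or decode alias shows up as a
     * mismatch between the word and its expected pattern. */
    for (uint32_t i = 0; i < words; i++)
        if (base[i] != (i ^ 0xA5A5A5A5u))
            return false;

    return true;
}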
On 23/09/2020 18:36, Don Y wrote:

> If you don't understand your hardware and how it can fail, you shouldn't
> be the one who is designing the test suite!
That bit is correct. The rest - well, I don't want to get into a long and
protracted argument.

Any system is made up of layers. Higher level layers assume that lower
level layers work according to specification (which may include indicating
an error for some kinds of detectable fault). If you think the higher
level part can fully verify the lower level parts - "prove" that the
assumptions hold - you are fooling yourself.

When you design a system based on a microcontroller, you pick a device
that is as reliable as you need it to be - so that you /can/ assume the
core parts (cpu, ram, flash, interrupts, etc.) work well enough for your
needs. If you are not sure it is reliable enough, pick a different device
or make a redundant system.

No amount of testing can /ever/ prove that something works - it can only
prove that something does /not/ work.
On 9/23/2020 12:02 PM, David Brown wrote:
> On 23/09/2020 18:36, Don Y wrote:
>
>> If you don't understand your hardware and how it can fail, you shouldn't
>> be the one who is designing the test suite!
>
> That bit is correct. The rest - well, I don't want to get into a long and
> protracted argument.
>
> Any system is made up of layers. Higher level layers assume that lower
> level layers work according to specification (which may include
> indicating an error for some kinds of detectable fault). If you think the
> higher level part can fully verify the lower level parts - "prove" that
> the assumptions hold - you are fooling yourself.
>
> When you design a system based on a microcontroller, you pick a device
> that is as reliable as you need it to be - so that you /can/ assume the
> core parts (cpu, ram, flash, interrupts, etc.) work well enough for your
> needs. If you are not sure it is reliable enough, pick a different device
> or make a redundant system.
>
> No amount of testing can /ever/ prove that something works - it can only
> prove that something does /not/ work.
A system is not a static entity. It changes over time (even if the design
is frozen). So, while your RAM (or any other component) may not be LIKELY
to fail, the rest of the system that enables the RAM to function as
intended can change in ways that manifest as RAM (or other resource)
failures.

[Picking the "world's most reliable MCU" won't guarantee that it won't
throw RAM errors in a deployed product.]

Simply assuming it "can't fail" is naive. And, identifying faulty "can't
happen" behavior EARLY (e.g., POST) rather than late gives you a better
idea of what to report to the user/customer, because you are closer to the
problem's manifestation. You don't end up misbehaving and wondering
"why?"

[And all of this assumes "bug-free software", so any errors are entirely a
result of hardware faults.]
David Brown <david.brown@hesbynett.no> writes:
> When you design a system based on a microcontroller, you pick a device
> that is as reliable as you need it to be
That might not exist. E.g. it's common for security processors and software to continuously self-test while running, since the user might be trying to tamper with them. "Differential fault analysis" is a relevant search string. The attacker does stuff like intentionally overclock the processor in the hope of introducing errors, so they can observe the difference between the error result and the normal result, and infer stuff about the supposedly-secured info inside the processor. There is no magic way to defeat these attacks, but the cpu designers do what they can.
On 23/09/2020 11:13:43, pozz wrote:
> I'd like to implement a Power On Self Test to be sure all (or many)
> parts of the electronics are functioning well.
>
> The tests to be done to check external hardware depend on the actual
> hardware that is present.
>
> What about the MCU that features an internal Flash (where the code
> resides) and RAM and some peripherals? Are there any strategies to test
> if the internal RAM or Flash are good? Do you think these kinds of tests
> could be useful?
>
> What about a test of the clock based on an external crystal?
I've done this with the STM32 variety of MCUs. The device itself has a
Flash checksum and, if this fails, it won't start.

ST also provide some example code and libraries for POST. These are more
comprehensive than just checking RAM. Might be worth having a look.

-- 
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
