EmbeddedRelated.com
Forums

micro self-check of checksum

Started by Thomas Magma September 23, 2005
Richard wrote:
> "Lanarcam" <lanarcam1@yahoo.fr> wrote in message
> news:1127582372.465810.205740@o13g2000cwo.googlegroups.com...
> >
> > Richard wrote:
> > > > I am not familiar with PICs, and somebody already pointed out that at
> > > > least some PICs can't read their code space as data. But the general
> > > > idea of performing a checksum on a binary image is a pretty simple
> > > > one, especially if the image is in a single contiguous chunk of
> > > > memory.
> > >
> > > Now, assuming you find an error...what do you do? You have just proven
> > > the code is not trustworthy, so you cannot rely on the code to make the
> > > system safe in any way. Or in fact do anything predictably. So is the
> > > test worthwhile? (playing devil's advocate).
> >
> > The principle is to put all devices in a safe state; this can be
> > accomplished by forcing a reset and by not executing the normal code
> > after that but instead disabling all hardware and busy looping.
> > This is based on the assumption that there exists a safe state for
> > the system upon failure of both hardware and software, for instance
> > in the case of a power failure.
>
> So you are relying on code you know is corrupt to put all devices into a
> safe state? In fact you don't even know the code is corrupt, as if it is
> corrupt you don't know anything for sure, etc. You cannot even rely on it
> to force a reset - maybe it is the decision to reset that has the
> corruption.
>
> As Spehro Pefhany says, it's a statistics game. You can only improve the
> probability of safe behaviour.
A probability of failure of 10E-9 is required in the most severe cases. This is not a zero probability of failure. You certainly can't leave the code memory unchecked.

Now you can imagine measures that improve the probability of correct behaviour in case of code corruption. For instance, you can have multiple sections of code, each with its own checksum. You can also duplicate the critical portions of code, and you could use different memory devices for each.

In some systems you must output a square watchdog signal. You can have one portion of code that writes a 1 and another that writes a 0. If you busy-loop following the detection of an error, the correct waveform will not be output, and this will trigger a reset by an external watchdog. Of course, you would then hope that the reset code is correct.

Every safety device has to perform code memory checking. For high levels of safety, several processors are used, each with its own copy of the code. Even in these cases each processor performs a code check and stops if an error is found; the others detect this state and put the device in a safe state.

Regards
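The multiple-sections-with-checksums scheme described above can be sketched in a few lines of C. This is only an illustration of the idea, not code from any of the systems discussed; the section table layout, the 16-bit additive checksum, and the function names are all assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section table: start and length of each code region, plus
 * the reference checksum recorded for it at build time. */
struct code_section {
    const uint8_t *start;
    size_t         len;
    uint16_t       expected;
};

/* Simple 16-bit additive checksum over one region. */
static uint16_t checksum16(const uint8_t *p, size_t len)
{
    uint16_t sum = 0;
    while (len--)
        sum += *p++;
    return sum;
}

/* Verify every section. Returns 0 if all match, or the 1-based index of
 * the first bad section. */
int verify_code_sections(const struct code_section *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (checksum16(tab[i].start, tab[i].len) != tab[i].expected)
            return (int)(i + 1);
    return 0;
}
```

A real system would burn each `expected` value into the image with a post-link tool, and on a mismatch would jump to a minimal safe-state busy loop rather than return to the normal code.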
Lanarcam wrote:

[snip]

>> As Spehro Pefhany says, it's a statistics game. You can only improve the
>> probability of safe behaviour.
>
> A probability of failure of 10E-9 is required in the most severe
> cases. This is not a zero probability of failure.
Is that really a realistic expectation of system integrity? In the, increasingly, evidence-based high-integrity development world there is a movement towards being able to quote the confidence with which the integrity level figure is quoted (the ACARP** principle - see "Dependability evaluation: a question of confidence" by Bev Littlewood - Safety Systems newsletter published by the Safety-Critical Systems Club).

In Bev's article he expresses the opinion that claims of 10E-4 failures per demand are difficult to support with a high degree of confidence. The confidence level for claims of 10E-9 failures per demand must therefore be quite low. This is part of the reason why inherent safety must first be built in and utilised as a first resort.
> You certainly can't leave the code memory unchecked. Now you can
> imagine measures that enhance the probability of a correct
> behaviour in case of code corruption. For instance you can have
> multiple sections of code each with a checksum.
There is also the question of, having checked programme memory integrity at the power-up stage, what efforts you are going to make to continue checking the integrity of the operational code. Reducing the re-test period will help improve the apparent system integrity, especially if you have a definite plan of action, one that does not rely on the errant code, for putting the system in a safe state should you detect a problem.

--
********************************************************************
Paul E. Bennett ....................<email://peb@amleth.demon.co.uk>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 Tel: +44 (0)1235-811095
Going Forth Safely ....EBA. http://www.electric-boat-association.org.uk/
********************************************************************
Paul E. Bennett wrote:
[snip]

> > A probability of failure of 10E-9 is required in the most severe
> > cases. This is not a zero probability of failure.
>
> Is that really a realistic expectation of system integrity?
This is the figure required for civil avionics: 10E-9 failures per hour in working conditions.
[snip]

> There is also the question of, having checked programme memory integrity at
> the power-up stage, what efforts are you going to make to continue checking
> the integrity of the operational code. Reducing the re-test period will
> help improve the apparent system integrity especially if you have a
> definite plan of action for putting the system in a safe state should you
> detect a problem that does not rely on the errant code.
In the systems we built, the code memory was not tested periodically but continuously, by the lowest-priority task. This does not affect the performance of the system, since that task is active only when no other task is running.
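A lowest-priority-task check like the one described above might look something like the following C sketch. The chunk size, names, and init function are my own assumptions for illustration, not details from the systems described:

```c
#include <stddef.h>
#include <stdint.h>

/* State of the background scan; persists between idle-task invocations. */
static const uint8_t *img;       /* code image being checked            */
static size_t         img_len;
static uint16_t       img_sum;   /* reference checksum from build time  */
static size_t         pos;       /* current scan position               */
static uint16_t       running;   /* partial sum accumulated so far      */

#define CHUNK 64u                /* bytes checked per idle-task call    */

void idle_check_init(const uint8_t *image, size_t len, uint16_t sum)
{
    img = image;
    img_len = len;
    img_sum = sum;
    pos = 0;
    running = 0;
}

/* Call from the idle task. Returns 0 while a pass is in progress,
 * 1 when a pass just completed cleanly, -1 on a checksum mismatch. */
int idle_check_step(void)
{
    size_t n = img_len - pos;
    if (n > CHUNK)
        n = CHUNK;
    while (n--)
        running += img[pos++];

    if (pos < img_len)
        return 0;

    int ok = (running == img_sum);
    pos = 0;                      /* immediately begin the next pass */
    running = 0;
    return ok ? 1 : -1;
}
```

Note that, as discussed later in the thread, the effective test interval of such a scheme is the time for one complete pass, i.e. roughly `img_len / CHUNK` idle-task invocations.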
Lanarcam wrote:

>> >> As Spehro Pefhany says, it's a statistics game. You can only improve
>> >> the probability of safe behaviour.
>> >
>> > A probability of failure of 10E-9 is required in the most severe
>> > cases. This is not a zero probability of failure.
>>
>> Is that really a realistic expectation of system integrity?
>
> This is the figure required for civil avionics: 10E-9 failures per
> hour in working conditions.
Bev Littlewood and I are both aware of the figure as a requirement in avionics. The question below that, though, was what level of confidence you have that you actually achieve that level of integrity.
>> There is also the question of, having checked programme memory integrity
>> at the power-up stage, what efforts are you going to make to continue
>> checking the integrity of the operational code. Reducing the re-test
>> period will help improve the apparent system integrity especially if you
>> have a definite plan of action for putting the system in a safe state
>> should you detect a problem that does not rely on the errant code.
>
> In the systems we made the memory code was not tested periodically but
> continuously by the lowest priority task. This does not affect the
> performance of the system since this task is active only when no
> other task is running.
This is still periodic. You are, I expect, only performing part of the test each time the idle task runs, so the full test runs over the course of a period of time and begins again immediately following completion. The test interval, then, is the time taken to complete one full test pass.
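The point about the test interval can be made concrete with some hypothetical numbers; the image size, chunk size, and idle-task rate below are all invented for illustration:

```c
#include <stdint.h>

/* Worst-case time for one full background scan, in milliseconds: the
 * number of idle-task calls needed to cover the image, divided by the
 * guaranteed minimum idle-task call rate. */
uint32_t full_scan_ms(uint32_t image_bytes, uint32_t chunk_bytes,
                      uint32_t idle_calls_per_s)
{
    uint32_t calls = (image_bytes + chunk_bytes - 1) / chunk_bytes;
    return (calls * 1000u) / idle_calls_per_s;
}
```

For a 256 KiB image checked 64 bytes at a time, with the idle task guaranteed at least 1000 invocations per second, one pass takes 4096 calls, about 4 seconds; that is the re-test period in the sense used above.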
In article <dh6848$ipm$1$830fa7a5@news.demon.co.uk>, 
peb@amleth.demon.co.uk says...
[snip]

> I, and Bev Littlewood, are both aware of the figure as a requirement in
> avionics. The question below that, though, was what level of confidence do
> you have that you actually achieve that level of integrity.
This sort of leads to the question:

How often have you (anyone using code checksums) seen these catch field failures?

And as a supplement, how many of the field failures that have been caught have been the result of (failed) field code updates?

Robert
R Adsett wrote:

[snip]

> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
That would require observing a field failure. I know of only one instance of a field failure in any of the systems I have designed over the past 36 years, and that one was a hardware component failure way back in the 70's. Most of my observed failures were on the prototype test bench.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
As to catching errant code: I have never seen it occur, despite the environments that some of my equipment runs in. There is still time, though.


R Adsett wrote:

> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
I measured exactly one in a run of 200,000 systems. These were 6800 uPs with UV-erase EPROMs.

Keep in mind that the above is only the number that passed the tests in burn-in and production test and then started failing later. The POST caught a lot of bad units in production test, but I don't have a breakdown of how many were ROM checksum failures.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
None. This was before field code updates were common. I am very careful about the right voltages and algorithms for burning EPROMs; someone doing a poor erase or a marginal burn might have had a lot more trouble.
On Sun, 25 Sep 2005 20:43:04 +0000, the renowned Guy Macon
<http://www.guymacon.com/> wrote:

[snip]

>> And as a supplement, how many of these field failures that have been
>> caught have been the result of (failed) field code updates?
>
> None. This was before field code updates were common. I am very
> careful about the right voltages and algorithms for burning EPROMs;
> someone doing a poor erase or a marginal burn might have had a lot
> more trouble.
Yes, probably a lot of systems currently are in-circuit programmed without verification at Vdd limits.

Best regards,
Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com
"Richard" <nospam@thanks.com> wrote in message 
news:lAfZe.114898$G8.40537@text.news.blueyonder.co.uk...
>> I am not familiar with PICs, and somebody already pointed out that at
>> least some PICs can't read their code space as data. But the general
>> idea of performing a checksum on a binary image is a pretty simple
>> one, especially if the image is in a single contiguous chunk of
>> memory.
>
> Now, assuming you find an error...what do you do? You have just proven the
> code is not trustworthy, so you cannot rely on the code to make the system
> safe in any way. Or in fact do anything predictably. So is the test
> worthwhile? (playing devil's advocate).
>
> Regards,
> Richard.
>
> http://www.FreeRTOS.org
This has to do with a self-test feature. The device often bakes in the sun year after year and/or sits next to high-wattage transmitters. If the self-test fails... time for repair.

Thomas
R Adsett <radsett@junk.aeolusdevelopment.cm> wrote:
[snip]
> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
I've had quite a few. Equipment installed in cellular towers gets its fair share of lightning surges. Even with pretty hefty protection schemes, some voltage spikes will get through, and these can cause partial Flash PROM erasure, sometimes only a single-bit error.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
In the cases mentioned above, none. Remote upgrade of software is a different can of worms; it takes pretty careful design to avoid all the possible pitfalls. Ending up with a dead lump that you have to change on site will in many cases cause enormous costs, especially if it is outdoor equipment in areas that, due to the climate, are more or less impossible to reach during several months of the year.

/Henrik
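The single-bit Flash errors described above are exactly the class of corruption a CRC handles well: a CRC detects all single-bit errors and all error bursts shorter than the register width, and far more multi-bit patterns than a simple additive sum. A minimal bitwise CRC-16-CCITT (polynomial 0x1021, initial value 0xFFFF) for illustration; nothing here is taken from the actual equipment discussed:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16-CCITT: poly 0x1021, init 0xFFFF, no reflection, no
 * final XOR. Slow but tiny - suitable for a power-up or background
 * check where a 512-byte lookup table is too expensive. */
uint16_t crc16_ccitt(const uint8_t *p, size_t len)
{
    uint16_t crc = 0xFFFF;
    while (len--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

The standard check value for the nine ASCII digits "123456789" under this variant is 0x29B1, which makes a handy power-up self-test of the CRC routine itself before it is trusted to judge the code image.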
