EmbeddedRelated.com
Forums

micro self-check of checksum

Started by Thomas Magma September 23, 2005
Richard wrote:
> "Lanarcam" <lanarcam1@yahoo.fr> wrote in message
> news:1127582372.465810.205740@o13g2000cwo.googlegroups.com...
> >
> > Richard wrote:
> > > > I am not familiar with PICs, and somebody already pointed out that at
> > > > least some PICs can't read their code space as data. But the general
> > > > idea of performing a checksum on a binary image is a pretty simple
> > > > one, especially if the image is in a single contiguous chunk of
> > > > memory.
> > >
> > > Now, assuming you find an error...what do you do? You have just proven
> > > the code is not trustworthy, so you cannot rely on the code to make the
> > > system safe in any way. Or in fact do anything predictably. So is the
> > > test worthwhile? (playing devil's advocate).
> >
> > The principle is to put all devices in a safe state; this can be
> > accomplished by forcing a reset and by not executing the normal code
> > after that but instead disabling all hardware and busy looping.
> > This is based on the assumption that there exists a safe state for
> > the system upon failure of both hardware and software, for instance
> > in the case of a power failure.
>
> So you are relying on code you know is corrupt to put all devices into a
> safe state? In fact you don't even know the code is corrupt, as if it is
> corrupt you don't know anything for sure, etc. You cannot even rely on it
> to force a reset - maybe it is the decision to reset that has the
> corruption.
>
> As Spehro Pefhany says, it's a statistics game. You can only improve the
> probability of safe behaviour.
A probability of failure of 10E-9 is required in the most severe cases. This is not a zero probability of failure. You certainly can't leave the code memory unchecked.

Now you can imagine measures that improve the probability of correct behaviour in case of code corruption. For instance, you can have multiple sections of code, each with its own checksum. You can also duplicate the critical portions of code, and you could use different memory devices for each.

In some systems you must output a square watchdog signal. You can have one portion of code that writes a 1 and another that writes a 0. If you busy-loop following the detection of an error, the correct waveform will not be output, and this will trigger a reset by an external watchdog. Of course, you would then hope that the reset code is correct.

Every safety device has to perform code memory checking. For high levels of safety, several processors are used, each with its own copy of the code. Even in these cases each processor performs a code check and stops if an error is found; the others detect this state and put the device in a safe state.

Regards
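The multiple-sections-with-checksums scheme described above can be sketched in a few lines of C. This is only an illustration of the idea, not code from any of the systems discussed; the section table layout, the 16-bit additive checksum, and the function names are all assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section table: start and length of each code region, plus
 * the reference checksum recorded for it at build time. */
struct code_section {
    const uint8_t *start;
    size_t         len;
    uint16_t       expected;
};

/* Simple 16-bit additive checksum over one region. */
static uint16_t checksum16(const uint8_t *p, size_t len)
{
    uint16_t sum = 0;
    while (len--)
        sum += *p++;
    return sum;
}

/* Verify every section. Returns 0 if all match, or the 1-based index of
 * the first bad section. */
int verify_code_sections(const struct code_section *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (checksum16(tab[i].start, tab[i].len) != tab[i].expected)
            return (int)(i + 1);
    return 0;
}
```

A real system would burn each `expected` value into the image with a post-link tool, and on a mismatch would jump to a minimal safe-state busy loop rather than return to the normal code.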
Lanarcam wrote:

[snip]

>> As Spehro Pefhany says, it's a statistics game. You can only improve the
>> probability of safe behaviour.
>
> A probability of failure of 10E-9 is required in the most severe
> cases. This is not a zero probability of failure.
Is that really a realistic expectation of system integrity? In the, increasingly, evidence-based high-integrity development world there is a movement towards being able to quote the confidence with which the integrity level figure is quoted (the ACARP** principle - see "Dependability evaluation: a question of confidence" by Bev Littlewood - Safety Systems newsletter published by the Safety-Critical Systems Club).

In Bev's article he expresses the opinion that claims of 10E-4 failures per demand are difficult to support with a high degree of confidence. The confidence level for claims of 10E-9 failures per demand must therefore be quite low. This is part of the reason why inherent safety must first be built in and utilised as a first resort.
> You certainly can't leave the code memory unchecked. Now you can
> imagine measures that enhance the probability of a correct
> behaviour in case of code corruption. For instance you can have
> multiple sections of code each with a checksum.
There is also the question of, having checked programme memory integrity at the power-up stage, what efforts you are going to make to continue checking the integrity of the operational code. Reducing the re-test period will help improve the apparent system integrity, especially if you have a definite plan of action, one that does not rely on the errant code, for putting the system in a safe state should you detect a problem.

--
********************************************************************
Paul E. Bennett ....................<email://peb@amleth.demon.co.uk>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 Tel: +44 (0)1235-811095
Going Forth Safely ....EBA. http://www.electric-boat-association.org.uk/
********************************************************************
Paul E. Bennett wrote:
[snip]

> > A probability of failure of 10E-9 is required in the most severe
> > cases. This is not a zero probability of failure.
>
> Is that really a realistic expectation of system integrity?
This is the figure required for civil avionics: 10E-9 failures per hour in working conditions.
[snip]

> There is also the question of, having checked programme memory integrity at
> the power-up stage, what efforts are you going to make to continue checking
> the integrity of the operational code. Reducing the re-test period will
> help improve the apparent system integrity especially if you have a
> definite plan of action for putting the system in a safe state should you
> detect a problem that does not rely on the errant code.
In the systems we built, the code memory was not tested periodically but continuously, by the lowest-priority task. This does not affect the performance of the system, since that task is active only when no other task is running.
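A lowest-priority-task check like the one described above might look something like the following C sketch. The chunk size, names, and init function are my own assumptions for illustration, not details from the systems described:

```c
#include <stddef.h>
#include <stdint.h>

/* State of the background scan; persists between idle-task invocations. */
static const uint8_t *img;       /* code image being checked            */
static size_t         img_len;
static uint16_t       img_sum;   /* reference checksum from build time  */
static size_t         pos;       /* current scan position               */
static uint16_t       running;   /* partial sum accumulated so far      */

#define CHUNK 64u                /* bytes checked per idle-task call    */

void idle_check_init(const uint8_t *image, size_t len, uint16_t sum)
{
    img = image;
    img_len = len;
    img_sum = sum;
    pos = 0;
    running = 0;
}

/* Call from the idle task. Returns 0 while a pass is in progress,
 * 1 when a pass just completed cleanly, -1 on a checksum mismatch. */
int idle_check_step(void)
{
    size_t n = img_len - pos;
    if (n > CHUNK)
        n = CHUNK;
    while (n--)
        running += img[pos++];

    if (pos < img_len)
        return 0;

    int ok = (running == img_sum);
    pos = 0;                      /* immediately begin the next pass */
    running = 0;
    return ok ? 1 : -1;
}
```

Note that, as discussed later in the thread, the effective test interval of such a scheme is the time for one complete pass, i.e. roughly `img_len / CHUNK` idle-task invocations.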
Lanarcam wrote:

>> >> As Spehro Pefhany says, it's a statistics game. You can only improve
>> >> the probability of safe behaviour.
>> >
>> > A probability of failure of 10E-9 is required in the most severe
>> > cases. This is not a zero probability of failure.
>>
>> Is that really a realistic expectation of system integrity?
>
> This is the figure required for civil avionics: 10E-9 failures per
> hour in working conditions.
Bev Littlewood and I are both aware of the figure as a requirement in avionics. The question below that, though, was what level of confidence you have that you actually achieve that level of integrity.
>> There is also the question of, having checked programme memory integrity
>> at the power-up stage, what efforts are you going to make to continue
>> checking the integrity of the operational code. Reducing the re-test
>> period will help improve the apparent system integrity especially if you
>> have a definite plan of action for putting the system in a safe state
>> should you detect a problem that does not rely on the errant code.
>
> In the systems we made the memory code was not tested periodically but
> continuously by the lowest priority task. This does not affect the
> performance of the system since this task is active only when no
> other task is running.
This is still periodic. You are, I expect, only performing part of the test each time the idle task runs, so the full test runs over the course of a period of time and begins again immediately following completion. The test interval, then, is the time taken to complete one full test pass.
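The point about the test interval can be made concrete with some hypothetical numbers; the image size, chunk size, and idle-task rate below are all invented for illustration:

```c
#include <stdint.h>

/* Worst-case time for one full background scan, in milliseconds: the
 * number of idle-task calls needed to cover the image, divided by the
 * guaranteed minimum idle-task call rate. */
uint32_t full_scan_ms(uint32_t image_bytes, uint32_t chunk_bytes,
                      uint32_t idle_calls_per_s)
{
    uint32_t calls = (image_bytes + chunk_bytes - 1) / chunk_bytes;
    return (calls * 1000u) / idle_calls_per_s;
}
```

For a 256 KiB image checked 64 bytes at a time, with the idle task guaranteed at least 1000 invocations per second, one pass takes 4096 calls, about 4 seconds; that is the re-test period in the sense used above.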
In article <dh6848$ipm$1$830fa7a5@news.demon.co.uk>, 
peb@amleth.demon.co.uk says...
[snip]

> I, and Bev Littlewood, are both aware of the figure as a requirement in
> avionics. The question below that, though, was what level of confidence do
> you have that you actually achieve that level of integrity.
This sort of leads to the question:

How often have you (anyone using code checksums) seen these catch field failures?

And as a supplement, how many of the field failures that have been caught have been the result of (failed) field code updates?

Robert
R Adsett wrote:

[snip]

> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
That would require observing a field failure. I know of only one instance of a field failure in any of the systems I have designed over the past 36 years, and that one was a hardware component failure way back in the 70's. Most of my observed failures were on the prototype test bench.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
As to catching errant code: I have never seen it occur, despite the environments that some of my equipment runs in. There is still time, though.


R Adsett wrote:

> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
I measured exactly one in a run of 200,000 systems. These were 6800 uPs with UV-erase EPROMs.

Keep in mind that the above is only the number that passed the tests in burn-in and production test and then started failing later. The POST caught a lot of bad units in production test, but I don't have a breakdown of how many were ROM checksum failures.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
None. This was before field code updates were common. I am very careful about the right voltages and algorithms for burning EPROMs; someone doing a poor erase or a marginal burn might have had a lot more trouble.
On Sun, 25 Sep 2005 20:43:04 +0000, the renowned Guy Macon
<http://www.guymacon.com/> wrote:

[snip]

>> And as a supplement, how many of these field failures that have been
>> caught have been the result of (failed) field code updates?
>
> None. This was before field code updates were common. I am very
> careful about the right voltages and algorithms for burning EPROMs;
> someone doing a poor erase or a marginal burn might have had a lot
> more trouble.
Yes, probably a lot of systems currently are in-circuit programmed without verification at Vdd limits.

Best regards,
Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com
"Richard" <nospam@thanks.com> wrote in message 
news:lAfZe.114898$G8.40537@text.news.blueyonder.co.uk...
>> I am not familiar with PICs, and somebody already pointed out that at
>> least some PICs can't read their code space as data. But the general
>> idea of performing a checksum on a binary image is a pretty simple
>> one, especially if the image is in a single contiguous chunk of
>> memory.
>
> Now, assuming you find an error...what do you do? You have just proven the
> code is not trustworthy, so you cannot rely on the code to make the system
> safe in any way. Or in fact do anything predictably. So is the test
> worthwhile? (playing devil's advocate).
>
> Regards,
> Richard.
>
> http://www.FreeRTOS.org
This has to do with a self-test feature. The device often bakes in the sun year after year and/or sits next to high-wattage transmitters. If the self-test fails... time for repair.

Thomas
R Adsett <radsett@junk.aeolusdevelopment.cm> wrote:
[snip]
> This sort of leads to the question:
>
> How often have you (anyone using code checksums) seen these catch field
> failures?
I've had quite a few. Equipment installed in cellular towers gets its fair share of lightning surges. Even with pretty hefty protection schemes, some voltage spikes will get through, and these can cause partial Flash PROM erasure, sometimes only a single-bit error.
> And as a supplement, how many of these field failures that have been
> caught have been the result of (failed) field code updates?
In the cases mentioned above, none. Remote upgrade of software is a different can of worms; it takes pretty careful design to avoid all the possible pitfalls. Ending up with a dead lump that you have to change on site will in many cases cause enormous costs, especially if it is outdoor equipment in areas that, due to the climate, are more or less impossible to reach during several months of the year.

/Henrik
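The single-bit Flash errors described above are exactly the class of corruption a CRC handles well: a CRC detects all single-bit errors and all error bursts shorter than the register width, and far more multi-bit patterns than a simple additive sum. A minimal bitwise CRC-16-CCITT (polynomial 0x1021, initial value 0xFFFF) for illustration; nothing here is taken from the actual equipment discussed:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16-CCITT: poly 0x1021, init 0xFFFF, no reflection, no
 * final XOR. Slow but tiny - suitable for a power-up or background
 * check where a 512-byte lookup table is too expensive. */
uint16_t crc16_ccitt(const uint8_t *p, size_t len)
{
    uint16_t crc = 0xFFFF;
    while (len--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

The standard check value for the nine ASCII digits "123456789" under this variant is 0x29B1, which makes a handy power-up self-test of the CRC routine itself before it is trusted to judge the code image.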
