Ignacio G.T. wrote:

[%X]

>>When you say that some systems have "no safe state" I am taking it that
>>you are speaking of individual sub-system modules that are one of a
>>redundant set so that failure of an individual sub-system module does not
>>have an impact on the overall safety of the whole system.
>
> I think the OP wouldn't agree on this definition (at least, I do not). One
> example of a system with a safe state is a railway interlocking, where the
> safe state is "all signals red, all points motionless": if a catastrophic
> error is diagnosed by a properly designed interlocking, you can always go
> to that state, where minimum harm is guaranteed for trains and
> passengers.
>
> By contrast, an avionics system has no evident safe state. Just
> imagine stopping the jets in case of panic...

I think you may have missed the point of my paragraph above. I am quite
aware that no-one should tolerate things like an avionics system failing,
which is why I expect to see redundant sub-systems and voting mechanisms in
such overall system structures. I do not consider avionics as one amorphous
system but as a collection of autonomous sub-systems with back-up measures,
redundant sub-systems and compliance voting in a mesh that supports the
full and continuing functioning of the air/space craft.

As I stated, I have never worked in avionics but I try to stay abreast of
techniques used there just to be aware of methods that may prove useful to
me in my own domains (energy, transport and medical).

-- 
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

On Fri, 16 Jul 2004 00:54:44 -0700, Guy Macon
<http://www.guymacon.com> wrote:

>Paul Keinanen <keinanen@sci.fi> says...
>
>>These circuits are exercised each time any program executes, not just
>>when the checker routine is executed
>
>You think that an EEPROM checksum task exercises the same circuits
>(registers, RAM, instruction decoders, EEPROM reading amplifiers...)
>that a do-nothing task exercises?

Any system using interrupts will use quite a lot of the CPU resources.
In an RTOS you may have to run the scheduler after each interrupt to
see if any high-priority task became runnable due to the interrupt.

Thus, the job done by interrupts and scheduler is similar to that of
the EEPROM checker, even if the high priority tasks do nothing for a
long time.

I agree that the null task could be as trivial as a single
WaitForInterrupt instruction or a single branch to itself instruction,
which will exercise only a small part of the CPU, but this is not the
point.

Paul

On Thu, 15 Jul 2004 18:48:52 +0100, "Paul E. Bennett" <peb@amleth.demon.co.uk>
wrote:

>Spehro Pefhany wrote:
>
>[%X]
>
>> It means I incorporate a lot (more than just a mirror) of redundancy
>> on important information, because a non-recoverable failure is very
>> expensive. Data integrity is more important than saving a few cents on
>> memory. The other options you mention are open if they are acceptable
>> in the application, of course. Some systems have no "safe" state (few
>> I work on), or there is an unpleasant choice such as a) test limit
>> controls, b) cause $10,000 damage (100% certain).
>
>As the definition of system is quite wide I am just asking to clarify 
>matters (although I think I know what you mean).
>
>When you say that some systems have "no safe state" I am taking it that you 
>are speaking of individual sub-system modules that are one of a redundant 
>set so that failure of an individual sub-system module does not have an 
>impact on the overall safety of the whole system.
>

I think the OP wouldn't agree on this definition (at least, I do not). One
example of a system with a safe state is a railway interlocking, where the safe
state is "all signals red, all points motionless": if a catastrophic error is
diagnosed by a properly designed interlocking, you can always go to that state,
where minimum harm is guaranteed for trains and passengers.

By contrast, an avionics system has no evident safe state. Just imagine
stopping the jets in case of panic...

>I have not come across many of this type of system but then I have never 
>worked in any of the aerospace industries (where I expect such 
>considerations to exist in plenitude).

--
Ignacio G.T.

> It should also be noted that in systems that may run for years without
> reboot, a check executed at startup is not very useful.
  As for EPROMs/EEPROMs: with a leaking oxide that loses charge over
time one would expect not a flipped bit but a noisy bit. The
read amplifier/transistor has no hysteresis to prevent that, because
in normal operation hysteresis is not needed. The state read back would
also depend on supply voltage and temperature.
  Therefore the memory could test OK on startup (cold chip),
and it could test OK again after a running checksum has detected an
error.
  "Repairing" a noisy EEPROM would be possible if one has segmented
it into small blocks, each with a checksum. After an error in a block
has been detected, the block would have to be reread several times until
one has established the true data, i.e. a pattern that is stable and
consistent with the checksum. After that one would rewrite the data to
the EEPROM. Obviously a Hamming code would be a more direct/faster
approach for repair.
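
A minimal sketch of the reread-and-rewrite repair described above, written in Python for readability rather than as target firmware. Everything named here is an assumption made for illustration: the `read_block`/`write_block` callbacks, the block layout (payload followed by one checksum byte), the 8-bit additive checksum, and the "three identical consistent reads in a row" rule for deciding that the pattern is stable.

```python
from typing import Callable, Optional


def checksum8(data: bytes) -> int:
    """8-bit additive checksum (an assumption; a CRC would catch more errors)."""
    return sum(data) & 0xFF


def repair_block(read_block: Callable[[], bytes],
                 write_block: Callable[[bytes], None],
                 stable_reads: int = 3,
                 max_reads: int = 50) -> Optional[bytes]:
    """Reread a noisy block until a checksum-consistent pattern is stable.

    Each raw block is the payload followed by one checksum byte.
    Returns the payload after rewriting it, or None if no stable,
    consistent pattern was seen within max_reads attempts.
    """
    candidate = None  # last checksum-consistent raw block seen
    streak = 0        # how many times in a row it has been seen
    for _ in range(max_reads):
        raw = read_block()
        payload, stored = raw[:-1], raw[-1]
        if checksum8(payload) != stored:
            # Inconsistent read: a noisy bit showed up, start over.
            candidate, streak = None, 0
            continue
        if raw == candidate:
            streak += 1
        else:
            candidate, streak = raw, 1
        if streak >= stable_reads:
            write_block(raw)  # refresh the weak cells with the agreed data
            return payload
    return None
```

Whether a rewrite actually restores a cell with leaking oxide is a device question the sketch does not answer; it only shows the control flow.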

A good book is:
Sharma "Semiconductor Memories. Technology, Testing, Reliability"
IEEE Press 1997
But it has no simple answers either.

MfG  JRD

"Guy Macon" <http://www.guymacon.com> schreef in bericht
news:10ff07ljvbeecf8@corp.supernews.com...
>
> Paul E. Bennett <peb@amleth.demon.co.uk> says...
> >
> >Frank Bemelman wrote:
> >
> >> Oh, if someone insists on it, even after pointing out it isn't
> >> very useful, why not. But most requirement flaws I ignore without
> >> informing the person that wrote them (or simply copied them
> >> from another project).
> >
> >Does that mean you deliver projects that are not to the client's spec?
> >
> >The early part of my projects usually involves rewriting the specification
> >to make it fully coherent. It takes quite a bit of negotiation but then can
> >end up costing the client less (once you rid the spec of the useless
> >dross).
>
> I would expect nothing less from a professional embedded systems engineer.

You should expect more, if you want to see more than early stages alone.

-- 
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)

Paul Keinanen <keinanen@sci.fi> says...

>These circuits are exercised each time any program executes, not just
>when the checker routine is executed

You think that an EEPROM checksum task exercises the same circuits
(registers, RAM, instruction decoders, EEPROM reading amplifiers...)
that a do-nothing task exercises?

Paul Keinanen <keinanen@sci.fi> says...
>
>Guy Macon <http://www.guymacon.com> wrote:
>
>>Again, I have seen no evidence that the sum-checker is more reliable
>>than the EEPROM being checked. Everyone seems to be accepting that
>>it is based on nothing more than blind faith.
>
>Assuming that a continuous check is done in the null task, which would
>otherwise just burn idle CPU cycles, do you have examples in which
>adding the continuous checking would have decreased the total _system_
>reliability?

Certainly.

Assume that the application is rarely run (making the null task the one 
that is orders of magnitude most likely to be running). Let's assume 
the main task runs once a second and the null task runs a million times 
a second while doing nothing and 100,000 times a second while checking
the EEPROM.

Further assume that there is a register, ALU, or other part of the uC 
that the main task uses once, that the EEPROM check uses 10 times, and 
that the do nothing task never uses.

Assume that this register gives a wrong answer one time out of a million,
and that the EEPROM is far less likely than this to have an error.

With continuous EEPROM checksum: one error per second on average.

Without continuous EEPROM checksum: one error per million seconds
on average.

(Paul goes on to discuss running the sum checker less often, which 
would, of course, reduce the million to one ratio above.  The million 
to one ratio was just a made-up example, of course; it could be 1:1
or 10:1 or 1:10 or any of a number of different ratios.  In real life
you could wait years for the first failure of the EEPROM or of the 
EEPROM checker.)
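
The arithmetic in this made-up example can be checked with a few lines of Python. This is only a sketch of the stated assumptions (one main-task run per second, 100,000 checker passes per second, 10 uses of the suspect register per checker pass, one wrong answer per million uses). The function name `errors_per_second` is introduced here for illustration, and the EEPROM's own, much smaller, error rate is ignored.

```python
def errors_per_second(runs_per_second: float,
                      uses_per_run: float,
                      p_wrong: float) -> float:
    """Expected wrong answers per second from one faulty CPU resource."""
    return runs_per_second * uses_per_run * p_wrong


P_WRONG = 1e-6  # the register gives a wrong answer one time in a million

# Main task: runs once a second and uses the register once per run.
main_task = errors_per_second(1, 1, P_WRONG)

# EEPROM checker in the null task: 100,000 passes a second, 10 uses each.
checker = errors_per_second(100_000, 10, P_WRONG)

with_check = main_task + checker  # about one error per second
without_check = main_task         # about one error per million seconds

print(f"with checker:    one error every {1 / with_check:.3g} s")
print(f"without checker: one error every {1 / without_check:.3g} s")
```

Changing the three inputs shows how quickly the ratio moves toward 10:1, 1:1 or 1:10 when the checker runs less often or the suspect resource is used less.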

-- 
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/

Paul E. Bennett <peb@amleth.demon.co.uk> says...
>
>Frank Bemelman wrote:
>
>> Oh, if someone insists on it, even after pointing out it isn't
>> very useful, why not. But most requirement flaws I ignore without
>> informing the person that wrote them (or simply copied them
>> from another project).
>
>Does that mean you deliver projects that are not to the client's spec?
>
>The early part of my projects usually involves rewriting the specification
>to make it fully coherent. It takes quite a bit of negotiation but then can 
>end up costing the client less (once you rid the spec of the useless 
>dross).

I would expect nothing less from a professional embedded systems engineer.

-- 
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/

On Thu, 15 Jul 2004 01:45:42 -0700, Guy Macon
<http://www.guymacon.com> wrote:

>Again, I have seen no evidence that the sum-checker is more reliable
>than the EEPROM being checked. Everyone seems to be accepting that
>it is based on nothing more than blind faith.

Assuming that a continuous check is done in the null task, which would
otherwise just burn idle CPU cycles, do you have examples in which
adding the continuous checking would have decreased the total _system_
reliability?

The only mechanism I can think of is that the checker routine
instructions consume more power than idle instructions, so the CPU
temperature will slightly increase and thus slightly decrease the MTBF
of some components. In battery powered systems, the battery will fail
slightly earlier.

On the other hand, a "continuous" specification does not have to mean
that you burn 100 % of the (idle) cycles for the check routine; a scan
could take a long time if you sleep for a millisecond after each
kilobyte checked :-).

Put this kilobyte checker into a task just above the idle task
priority and each time the system has nothing else to do, it first
drops to the kilobyte checker to check the next memory segment and
then falls down to the idle task. Thus, only 1-10 % of the idle cycles
would be consumed and the temperature increase would be insignificant.

Of course you would have to consider the most likely EEPROM failure
rate when deciding how long the scan can take.
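
A minimal sketch of the kilobyte-at-a-time checker described above, in Python so the logic is easy to follow. The class name `IncrementalChecker`, the `read(offset, length)` callback and the 16-bit additive checksum are assumptions for illustration. In a real RTOS, `step()` would be the body of a task just above idle priority that sleeps between chunks.

```python
from typing import Callable, Optional


class IncrementalChecker:
    """Checksums a memory image one chunk per call to step()."""

    def __init__(self, read: Callable[[int, int], bytes], size: int,
                 expected: int, chunk: int = 1024) -> None:
        self.read = read          # read(offset, length) -> bytes
        self.size = size          # total bytes to scan
        self.expected = expected  # reference checksum for the whole image
        self.chunk = chunk        # bytes checked per step
        self.offset = 0
        self.running = 0

    def step(self) -> Optional[bool]:
        """Check the next chunk.

        Returns None while a scan is still in progress, and True or
        False (checksum matched or not) when a full scan completes.
        """
        length = min(self.chunk, self.size - self.offset)
        data = self.read(self.offset, length)
        self.running = (self.running + sum(data)) & 0xFFFF
        self.offset += length
        if self.offset < self.size:
            return None
        ok = self.running == self.expected
        self.offset, self.running = 0, 0  # start the next scan from the top
        return ok
```

With a 1 ms sleep after each call, a 64 KB image would take at least 64 ms per full scan; a longer sleep trades detection latency for idle cycles.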

Paul

"Guy Macon" <http://www.guymacon.com> schreef in bericht
news:10fdiclh8d5fv35@corp.supernews.com...
>
> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...

[snip]

> >Implementing a continuous check may cure that just enough to let
> >the systems pass the testers,
>
> Again you assume that continuous EEPROM checksum makes the system
> more reliable rather than less reliable.  How do you know this?
> What method did you use to arrive at this conclusion?

If the checking system were less reliable, it would make the
entire system useless for the more obvious tasks it has to do.
In that case, I couldn't care less about EEPROMs flipping a bit.
Checking the EEPROM isn't the main goal of the system. So a good
system is first priority, no matter what (flawed) specs lull me
into believing.

Continuous checking (with auto-correcting) is sweeping the dust under
the carpet, out of sight. Something you should add very late in the
development, at the time you are wondering why you are bothering.

-- 
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)