Reply by Paul E. Bennett July 16, 2004
Ignacio G.T. wrote:

[%X]

>>When you say that some systems have "no safe state" I am taking it that
>>you are speaking of individual sub-system modules that are one of a
>>redundant set so that failure of an individual sub-system module does not
>>have an impact on the overall safety of the whole system.
>
> I think the OP wouldn't agree on this definition (at least, I do not). One
> example of a system with a safe state is a railway interlocking, where the
> safe state is "all signals red, all points motionless": if a catastrophic
> error is diagnosed by a properly designed interlocking, you can always go
> to that state, where a minimum harm is guaranteed for trains and
> passengers.
>
> On the contrary, an avionic system has not an evident safe state. Just
> imagine stopping the jets in case of panic...

I think you may have missed the point of my paragraph above. I am quite aware that no-one should tolerate things like an avionics system failing, which is why I expect to see redundant sub-systems and voting mechanisms in such overall system structures. I do not consider avionics as one amorphous system but as a collection of autonomous sub-systems with back-up measures, redundant sub-systems and compliance voting in a mesh that supports the full and continuing functioning of the air/space craft.

As I stated, I have never worked in avionics but I try to stay abreast of techniques used there just to be aware of methods that may prove useful to me in my own domains (energy, transport and medical).

--
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Reply by Paul Keinanen July 16, 2004
On Fri, 16 Jul 2004 00:54:44 -0700, Guy Macon
<http://www.guymacon.com> wrote:


>Paul Keinanen <keinanen@sci.fi> says...
>
>>These circuits are exercised each time any program executes, not just
>>when the checker routine is executed
>
>You think that a eeprom checksum task exercises the same circuits
>(registers, RAM, instruction decoders, EEPROM reading amplifiers...)
>that a do nothing task exercises?

Any system using interrupts will use quite a lot of the CPU resources. In an RTOS you may have to run the scheduler after each interrupt to see whether any high priority task became runnable due to the interrupt. Thus, the job done by the interrupts and the scheduler is similar to that of the EEPROM checker, even if the high priority tasks do nothing for a long time.

I agree that the null task could be as trivial as a single WaitForInterrupt instruction or a single branch-to-itself instruction, which will exercise only a small part of the CPU, but this is not the point.

Paul
Reply by Ignacio G.T. July 16, 2004
On Thu, 15 Jul 2004 18:48:52 +0100, "Paul E. Bennett" <peb@amleth.demon.co.uk>
wrote:

>Spehro Pefhany wrote:
>
>[%X]
>
>> It means I incorporate a lot (more than just a mirror) of redundancy
>> on important information, because a non-recoverable failure is very
>> expensive. Data integrity is more important than saving a few cents on
>> memory. The other options you mention are open if they are acceptable
>> in the application, of course. Some systems have no "safe" state (few
>> I work on), or there is an unpleasant choice such a) test limit
>> controls, b) cause $10,000 damage (100% certain).
>
>As the definition of system is quite wide I am just asking to clarify
>matters (although I think I know what you mean).
>
>When you say that some systems have "no safe state" I am taking it that you
>are speaking of individual sub-system modules that are one of a redundant
>set so that failure of an individual sub-system module does not have an
>impact on the overall safety of the whole system.
>

I think the OP wouldn't agree on this definition (at least, I do not). One example of a system with a safe state is a railway interlocking, where the safe state is "all signals red, all points motionless": if a catastrophic error is diagnosed by a properly designed interlocking, you can always go to that state, where minimum harm is guaranteed for trains and passengers.

By contrast, an avionic system has no evident safe state. Just imagine stopping the jets in case of panic...

>I have not come across many of this type of system but then I have never
>worked in any of the aerospace industries (where I expect such
>considerations to exist in plenitude).
>
>--
>********************************************************************
>Paul E. Bennett ....................<email://peb@a...>
>Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
>Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
>Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
>Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
>********************************************************************
-- Ignacio G.T.
Reply by Rafael Deliano July 16, 2004
> It should also be noted that in systems that may run for years without
> reboot, a check executed at startup is not very useful.
As for EPROMs/EEPROMs: with a leaking oxide that loses charge over time, one would expect not a flipped bit but a noisy bit. The read amplifier/transistor has no hysteresis to prevent that, because in normal operation hysteresis is not needed. The state would depend on supply voltage and temperature too.

Therefore memory could test OK on startup (cold chip). And it could test OK again after a running checksum has detected an error.

"Repairing" a noisy EEPROM would be possible if one has segmented it into small blocks, each with a checksum. After an error in a block has been detected, the block would have to be reread several times until one has established the true data, because the pattern is stable and consistent with the checksum. After that one would rewrite the data to the EEPROM.

Obviously a Hamming code would be a more direct/faster approach for repair.

A good book is:
Sharma "Semiconductor Memories. Technology, Testing, Reliability" IEEE Press 1997
But it has no simple answers either.

Regards,
JRD
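The reread-and-rewrite repair described in the post above can be sketched roughly as follows. This is a toy simulation in Python, not firmware: the `NoisyEeprom` class, the 8-bit additive checksum, the `needed=3` agreement threshold, and the idea that a rewrite restores the cell are all illustrative assumptions, not details from the post.

```python
from collections import Counter
from itertools import cycle


def checksum(data: bytes) -> int:
    """8-bit additive checksum (an illustrative choice, not from the post)."""
    return sum(data) & 0xFF


class NoisyEeprom:
    """Toy model of one block with a single noisy bit (hypothetical)."""

    def __init__(self, data, byte_index, bit_mask, flip_pattern):
        self.cells = bytearray(data)
        self.byte_index = byte_index
        self.bit_mask = bit_mask
        self._flips = cycle(flip_pattern)  # True = this read sees the bit wrong
        self.leaky = True

    def read(self) -> bytes:
        data = bytearray(self.cells)
        if self.leaky and next(self._flips):
            data[self.byte_index] ^= self.bit_mask
        return bytes(data)

    def write(self, data: bytes) -> None:
        self.cells = bytearray(data)
        self.leaky = False  # model assumption: a rewrite refreshes the charge


def recover_block(read_block, stored_sum, max_reads=16, needed=3):
    """Reread until one checksum-consistent pattern has been seen `needed` times.

    Returns the recovered bytes, or None if no stable pattern emerged.
    """
    seen = Counter()
    for _ in range(max_reads):
        data = read_block()
        if checksum(data) == stored_sum:
            seen[data] += 1
            if seen[data] >= needed:
                return data
    return None


payload = b"CALIBRATION-0042"
stored_sum = checksum(payload)
eeprom = NoisyEeprom(payload, byte_index=3, bit_mask=0x10,
                     flip_pattern=[True, False, True, True, False, False])

recovered = recover_block(eeprom.read, stored_sum)
if recovered is not None:
    eeprom.write(recovered)  # refresh the block with the established data

# A bit that reads wrong every time never yields a consistent pattern.
stuck = NoisyEeprom(payload, byte_index=3, bit_mask=0x10, flip_pattern=[True])
unrecoverable = recover_block(stuck.read, stored_sum)
```

Requiring several matching reads before rewriting is a design choice made here for the sketch; a checksum this weak can be fooled by multi-bit errors, which is one reason the Hamming code mentioned above is the more direct route.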
Reply by Frank Bemelman July 16, 2004
"Guy Macon" <http://www.guymacon.com> wrote in message
news:10ff07ljvbeecf8@corp.supernews.com...
> Paul E. Bennett <peb@amleth.demon.co.uk> says...
> >
> >Frank Bemelman wrote:
> >
> >> Oh, if someone insist on it, even after pointing out it isn't
> >> very useful, why not. But most requirement flaws I ignore without
> >> informing the person that wrote them (or simply copied them
> >> from another project).
> >
> >Does that mean you deliver projects that are not to the clients spec?
> >
> >The early part of my projects usually involve rewriting the specification
> >to make it fully coherent. It takes quite a bit of negotiation but then can
> >end up costing the client less (once you rid the spec of the useless
> >dross).
>
> I would expect nothing less from a professional embedded systems engineer.

You should expect more, if you want to see more than early stages alone.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)
Reply by Guy Macon July 16, 2004
Paul Keinanen <keinanen@sci.fi> says...

>These circuits are exercised each time any program executes, not just
>when the checker routine is executed

You think that an EEPROM checksum task exercises the same circuits (registers, RAM, instruction decoders, EEPROM reading amplifiers...) that a do-nothing task exercises?
Reply by Guy Macon July 16, 2004
Paul Keinanen <keinanen@sci.fi> says...
>Guy Macon <http://www.guymacon.com> wrote:
>
>>Again, I have seen no evidence that the sum-checker is more reliable
>>than the EEPROM being checked. Everyone seems to be accepting that
>>it is based on nothing more than blind faith.
>
>Assuming that a continuous check is done in the null task, which would
>otherwise just burn idle CPU cycles, do you have examples in which
>adding the continuous checking would have decreased the total _system_
>reliability ?

Certainly. Assume that the application is rarely run (making the null task the one that is orders of magnitude most likely to be running). Let's assume the main task runs once a second and the null task runs a million times a second while doing nothing and 100,000 times a second while checking the EEPROM.

Further assume that there is a register, ALU, or other part of the uC that the main task uses once, that the EEPROM check uses 10 times, and that the do nothing task never uses. Assume that this register gives a wrong answer one time out of a million, and that the EEPROM is far less likely than this to have an error.

With continuous EEPROM checksum: one error per second on average.
Without continuous EEPROM checksum: one error per million seconds on average.

(Paul goes on to discuss running the sum checker less often, which would, of course, reduce the million to one ratio above. The million to one ratio was just a made-up example, of course; it could be 1:1 or 10:1 or 1:10 or any of a number of different ratios. In real life you could wait years for the first failure of the EEPROM or of the EEPROM checker.)

--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
have an "impossible" engineering project that only someone like
Doc Brown can solve? My resume is at http://www.guymacon.com/
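The arithmetic behind the two figures in the post above can be checked with a few lines of Python. The constants are the made-up numbers from the post; the function name and the assumption that faults are independent per use are mine.

```python
# Figures from the made-up example in the post above.
P_FAULT = 1e-6                # chance the shared register gives a wrong answer per use
MAIN_USES_PER_SEC = 1         # main task runs once a second and uses the register once
CHECK_RUNS_PER_SEC = 100_000  # null-task iterations per second while checking the EEPROM
CHECK_USES_PER_RUN = 10       # register uses per EEPROM check iteration


def errors_per_second(uses_per_second, p_fault=P_FAULT):
    """Expected wrong answers per second, assuming independent faults per use."""
    return uses_per_second * p_fault


without_check = errors_per_second(MAIN_USES_PER_SEC)
with_check = errors_per_second(
    MAIN_USES_PER_SEC + CHECK_RUNS_PER_SEC * CHECK_USES_PER_RUN
)

print(f"without checker: about one error per {1 / without_check:,.0f} s")
print(f"with checker:    about one error per {1 / with_check:.3f} s")
print(f"ratio:           about {with_check / without_check:,.0f} to 1")
```

Changing `CHECK_RUNS_PER_SEC` shows how running the checker less often shrinks the ratio, which is the point conceded in the parenthetical remark.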
Reply by Guy Macon July 16, 2004
Paul E. Bennett <peb@amleth.demon.co.uk> says...
>Frank Bemelman wrote:
>
>> Oh, if someone insist on it, even after pointing out it isn't
>> very useful, why not. But most requirement flaws I ignore without
>> informing the person that wrote them (or simply copied them
>> from another project).
>
>Does that mean you deliver projects that are not to the clients spec?
>
>The early part of my projects usually involve rewriting the specification
>to make it fully coherent. It takes quite a bit of negotiation but then can
>end up costing the client less (once you rid the spec of the useless
>dross).

I would expect nothing less from a professional embedded systems engineer.

--
Guy Macon, Electronics Engineer & Project Manager for hire.
Remember Doc Brown from the _Back to the Future_ movies? Do you
have an "impossible" engineering project that only someone like
Doc Brown can solve? My resume is at http://www.guymacon.com/
Reply by Paul Keinanen July 16, 2004
On Thu, 15 Jul 2004 01:45:42 -0700, Guy Macon
<http://www.guymacon.com> wrote:

>Again, I have seen no evidence that the sum-checker is more reliable
>than the EEPROM being checked. Everyone seems to be accepting that
>it is based on nothing more than blind faith.

Assuming that a continuous check is done in the null task, which would otherwise just burn idle CPU cycles, do you have examples in which adding the continuous checking would have decreased the total _system_ reliability?

The only mechanism I can think of is that the checker routine instructions consume more power than idle instructions, so the CPU temperature will slightly increase and thus slightly decrease the MTBF of some components. In battery powered systems, the battery will fail slightly earlier.

On the other hand, a "continuous" specification does not have to mean that you burn 100 % of the (idle) cycles for the check routine; a scan could take a long time if you sleep for a millisecond after each kilobyte checked :-).

Put this kilobyte checker into a task just above the idle task priority, and each time the system has nothing else to do, it first drops to the kilobyte checker to check the next memory segment and then falls down to the idle task. Thus, only 1-10 % of the idle cycles would be consumed and the temperature increase would be insignificant.

Of course you would have to consider the most likely EEPROM failure rate when deciding how long the scan can take.

Paul
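The "one kilobyte per wake-up" checker described in the post above might look something like this sketch. It is a Python model of the control flow only: `make_scanner`, the 16-bit additive checksum, and the 4 KiB test image are assumptions for illustration, and the millisecond sleep between steps is left to the caller (for example the low-priority task's own delay call).

```python
def make_scanner(memory, chunk_size=1024):
    """Generator that checks one chunk per step.

    Yields None while a pass is in progress and the 16-bit additive
    checksum of the whole image when a pass completes, then starts over.
    """
    if len(memory) == 0:
        raise ValueError("nothing to scan")
    while True:
        total = 0
        for start in range(0, len(memory), chunk_size):
            chunk = memory[start:start + chunk_size]
            total = (total + sum(chunk)) & 0xFFFF
            finished = start + chunk_size >= len(memory)
            yield total if finished else None


# Hypothetical 4 KiB image standing in for the EEPROM contents.
image = bytearray(range(256)) * 16
expected = sum(image) & 0xFFFF

scanner = make_scanner(image)

# Each next() models one wake-up of the low-priority checker task;
# a real task would sleep (about a millisecond in the post) between steps.
first_pass = [next(scanner) for _ in range(4)]

image[1500] ^= 0x01  # simulate a bit error appearing between passes
second_pass = [next(scanner) for _ in range(4)]
error_detected = second_pass[-1] != expected
```

With one kilobyte per step and a one millisecond pause, a full pass over N kilobytes takes at least about N milliseconds, which is the figure to weigh against the expected EEPROM failure rate.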
Reply by Frank Bemelman July 15, 2004
"Guy Macon" <http://www.guymacon.com> wrote in message
news:10fdiclh8d5fv35@corp.supernews.com...
> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
[snip]
> >Implementing a continuous check may cure that just enough to let
> >the systems pass the testers,
>
> Again you assume that continuous EEPROM checksum makes the system
> more reliable rather than less reliable. How do you know this?
> What method did you use to arrive at this conclusion?

If the checking system were less reliable, it would make the entire system useless for the more obvious tasks it has to do. In that case, I couldn't care less about EEPROMs flipping a bit. Checking the EEPROM isn't the main goal of the system. So a good system is the first priority, no matter what (flawed) specs lull me into believing.

Continuous checking (with auto-correcting) is sweeping the dust under the carpet, out of sight. It is something you should add very late in the development, at the time you are wondering why you are bothering.

--
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)