EmbeddedRelated.com
Forums

Continuous EEPROM checksum microcontroller

Started by Vishal July 3, 2004
Spehro Pefhany <speffSNIP@interlogDOTyou.knowwhat> says...

>EEPROM is fundamentally different from RAM etc. because any errors
>that arise (because of issues beyond the control of the engineer) will
>persist indefinitely. They also wear out, and are fundamentally less
>reliable than RAM due to the high dielectric stresses involved in
>Fowler-Nordheim tunneling etc. (especially from re-writing).

This is the kind of discussion that I expect from an embedded systems programmer. Instead of blindly assuming that the sum-checker is more reliable than the EEPROM being checked, Spehro is giving reasons why the assumption might be true.

>On frequency- as you say above, I don't think it's necessary to do it
>more often than the information is accessed. ;-) More seriously, the
>upper time limit is typically set by how long it takes the system to
>get into trouble, worst-case. If it's a slow thermal system, then a
>minute or ten minutes with worst-case outputs may be no big deal.

This also is the kind of discussion that I expect from an embedded systems programmer. Spehro is addressing the fundamental issues of "what should be done if there is an error", and he doesn't fall into the common error of not analysing the "do nothing" option.

-- Guy Macon, Electronics Engineer & Project Manager for hire. Remember Doc Brown from the _Back to the Future_ movies? Do you have an "impossible" engineering project that only someone like Doc Brown can solve? My resume is at http://www.guymacon.com/

Guy Macon wrote:
...
> But the sum-checker is far more than lust the place where the
> sum-checking code is stored. It is also the electronics that
> reads the code, the ALU that executes the code, the registers
> and RAM that the code uses, and so forth. One would have to
> estimate the error rate of all of those parts of the uC and
> compare them to the error rate of the EEPROM. Unless you do
> that, you have no idea whether your continuous sum-checker
> increases or decreases system reliability compared to an
> on-demand sum-checker or no sum-checker at all.

Very true, but still subject to further refinement: consider which parts (ALU, RAM) are *on the same chip*, because failures due to threshold changes are more liable to occur from one chip to another than between components on the same silicon.

- RM

Guy Macon <http://www.guymacon.com> says...

>far more than lust the place where
&*$!@*! spellchecker! JUST the place...
"Guy Macon" <http://www.guymacon.com> wrote in message
news:10fd9jnb6nnpc0b@corp.supernews.com...
> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>
> >But most requirement flaws I ignore without
> >informing the person that wrote them (or simply copied them
> >from another project).
>
> You would have a problem if I was your project manager. You would
> be instructed to evaluate the requirements and to agree or disagree
> with each requirement, and the definition of your code being "done"
> would include the independent testers verifying that your code complies
> with all requirements. On my projects requirement errors are serious,
> and they are to be corrected, not ignored.

I am a programmer. 'Requirement flaws' are treated differently from 'Requirements' because they are flagged as flaws.

> Then again, I wouldn't be handing you requirements that are male bovine
> excrement. Before you got a requirement to implement a continuous checksum,
> you would have hard numbers for EEPROM errors, sense-amp errors, ALU errors,
> register errors. etc.. both under normal conditions and under conditions
> of radiation, ESD, etc, and an analysis of the reliability impact, cost,
> etc. of the sum-checker.

Rubbish. Hard numbers don't have any value in this context, and reliability/cost of the sum checker is the least interesting bit. If I had hired you as a manager you would have a problem, wasting time on instructing other staff to waste even more time.

What matters is whether a system failure is something you can afford or not. Assuming, for the sake of this discussion, the software has to work with occasionally corrupted EEPROM data, you have to decide what you can do to avoid that, and at what cost. Piles of analysis tend to be highly unreliable, and cost calculations in such areas never make sense; better to trust a bit of common sense.

BTW, in whatever system, I would not be worried by the EEPROM itself. I would worry more about software making accidental writes and, most important, a healthy hardware design with nice power up/down behaviour. Implementing a continuous check may cure that just enough to let the systems pass the testers, but whether that is desirable... For the same reasons I don't like the widespread practice of restoring important hardware registers on a regular basis. Or watchdogs. I use both, but I don't like it a bit.

-- Thanks, Frank. (remove 'x' and 'invalid' when replying by email)

Spehro Pefhany wrote:

[%X]

> It means I incorporate a lot (more than just a mirror) of redundancy > on important information, because a non-recoverable failure is very > expensive. Data integrity is more important than saving a few cents on > memory. The other options you mention are open if they are acceptable > in the application, of course. Some systems have no "safe" state (few > I work on), or there is an unpleasant choice such a) test limit > controls, b) cause $10,000 damage (100% certain).
As the definition of system is quite wide I am just asking to clarify matters (although I think I know what you mean). When you say that some systems have "no safe state" I am taking it that you are speaking of individual sub-system modules that are one of a redundant set so that failure of an individual sub-system module does not have an impact on the overall safety of the whole system. I have not come across many of this type of system but then I have never worked in any of the aerospace industries (where I expect such considerations to exist in plenitude). -- ******************************************************************** Paul E. Bennett ....................<email://peb@a...> Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/> Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE...... Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details. Going Forth Safely ..... EBA. www.electric-boat-association.org.uk.. ********************************************************************
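
Spehro does not say what form his "more than just a mirror" redundancy takes. Purely as an illustration, one common way to go beyond a two-copy mirror is to keep three copies and take a bitwise majority vote on read. The sketch below is a host-side Python model of that idea, not anyone's production scheme; the names `vote3` and `recover` are made up for the example.

```python
def vote3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three byte values."""
    return (a & b) | (a & c) | (b & c)


def recover(copy_a: bytes, copy_b: bytes, copy_c: bytes):
    """Return (voted_data, copies_disagreed) for three equal-length copies."""
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("copies must have equal length")
    # Vote byte by byte; a single corrupted copy is outvoted by the other two.
    voted = bytes(vote3(a, b, c) for a, b, c in zip(copy_a, copy_b, copy_c))
    # Flag any disagreement so the caller can log it or rewrite the bad copy.
    disagreed = not (copy_a == copy_b == copy_c)
    return voted, disagreed
```

A vote of this kind only helps when at most one copy is wrong in any given bit position; if two copies are corrupted in the same bit, the vote returns the wrong value without complaint.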
Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>"Guy Macon" <http://www.guymacon.com> wrote...
>>
>> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>>
>> >But most requirement flaws I ignore without
>> >informing the person that wrote them (or simply copied them
>> >from another project).
>>
>> You would have a problem if I was your project manager. You would
>> be instructed to evaluate the requirements and to agree or disagree
>> with each requirement, and the definition of your code being "done"
>> would include the independent testers verifying that your code complies
>> with all requirements. On my projects requirement errors are serious,
>> and they are to be corrected, not ignored.
>
>I am a programmer. 'Requirement flaws' are treated differently from
>'Requirements' because they are flagged as flaws.

Which is it? Do you flag them as flaws or ignore them without informing the person that wrote them? The former I like. The latter I consider to be grounds for termination on the third offense.

>>Then again, I wouldn't be handing you requirements that are male
>>bovine excrement. Before you got a requirement to implement a
>>continuous checksum, you would have hard numbers for EEPROM
>>errors, sense-amp errors, ALU errors, register errors. etc..
>>both under normal conditions and under conditions of radiation,
>>ESD, etc, and an analysis of the reliability impact, cost, etc.
>>of the sum-checker.
>
>Rubbish. Hard numbers don't have any value in this context,

They do if you do them right.

>and reliability/cost of the sum checker is the least
>interesting bit.

Reliability is the most *important* bit, whether you find it to be interesting or not.

>If I had hired you as a manager you would have a problem, wasting
>time on instructing other staff to waste even more time.

I hope that you are referring to the analysis of whether the continuous EEPROM checksum makes the system more or less reliable. I would not instruct anyone to do that analysis - I would simply refuse to add a continuous EEPROM checksum to the requirements without it. If I allowed requirements to be added without any apparent benefit, *that* would be wasting time.

>What matters is if a system failure is something you can
>afford or not. Assuming, for the sake of this discussion, the software has
>to work with occasionally corrupted eeprom data, you have to decide what
>you can do to avoid that,

Once again you are pretending that you know that the hardware that does the continuous EEPROM checksum is more reliable than the EEPROM. If it happens to be a lot less reliable, you are making a system failure more likely.
>Piles of analysis tend to be highly unreliable,
Not if you do them right.
>cost calculations in such areas never make sense,
They make sense if you do them right.
>better to trust a bit of common sense.
And you think that doing a continuous EEPROM checksum when you don't know (because you don't like analysis) whether the EEPROM is orders of magnitude more likely or less likely to have an error than the system that does the checksumming makes common sense? I will stick with the "piles of analysis" as being more reliable than "common sense."

>BTW, in whatever system, I would not be
>worried by eeprom itself, would worry more about software making accidental
>writes and, most important, a healthy hardware design with nice power up/down
>behaviour.

We agree here.

>Implementing a continuous check may cure that just enough to let
>the systems pass the testers,

Again you assume that continuous EEPROM checksum makes the system more reliable rather than less reliable. How do you know this? What method did you use to arrive at this conclusion?
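
One way to make the question above concrete is a toy probability model. The sketch below is an illustration only: the model and every number fed into it are assumptions, not measured figures for any part. It assumes the EEPROM data is corrupted with probability `p_eeprom` over a mission, that the checker catches a fraction `coverage` of those corruptions, and that each run of the checker has an independent probability `p_checker_fault` of a fault in the checking path (sense amps, ALU, registers, RAM) that takes the system down.

```python
def p_fail_no_checker(p_eeprom: float) -> float:
    """Without a checker, every EEPROM corruption is assumed to cause a failure."""
    return p_eeprom


def p_fail_with_checker(p_eeprom: float, coverage: float,
                        p_checker_fault: float, n_checks: int) -> float:
    """Failure probability when the checker runs n_checks times per mission.

    Assumptions of this toy model: detected corruptions are recovered,
    checker-path faults are independent per run, and any such fault
    counts as a system failure.
    """
    p_undetected = p_eeprom * (1.0 - coverage)
    p_checker_never_faults = (1.0 - p_checker_fault) ** n_checks
    return 1.0 - (1.0 - p_undetected) * p_checker_never_faults


# Made-up inputs, chosen only to show that the answer can go either way.
baseline = p_fail_no_checker(1e-6)
on_demand = p_fail_with_checker(1e-6, 0.99, 1e-9, n_checks=10)
continuous = p_fail_with_checker(1e-6, 0.99, 1e-9, n_checks=1_000_000)
```

With these invented inputs the on-demand checker comes out better than no checker, while the continuous checker comes out worse, because the exposure of the checking path grows with the number of runs. Different inputs reverse the result, which is why the comparison needs real numbers.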
Guy Macon <http://www.guymacon.com> wrote:

> Paul Keinanen <keinanen@sci.fi> says...
>>
>>Guy Macon <http://www.guymacon.com> wrote:
>>
>>>Again, I have seen no evidence that the sum-checker is more reliable
>>>than the EEPROM being checked. Everyone seems to be accepting that
>>>it is based on nothing more than blind faith.
>>
>>Even if the checksum algorithm is executed directly out of the EEPROM
>>(which is not always the case), the surface area occupied by the
>>checker is very small compared to the total area of the EEPROM in most
>>cases. If there is a single (hard or soft) error in the EEPROM, the
>>likelihood is much greater that the error is in the other part of
>>the EEPROM than in the checker code itself due to the area ratio.
>>The worst case is that there are error(s) in the EEPROM, but a bit
>>flip in the actual checker code will modify the program so that it
>>will return EEPROM OK, but the likelihood is still smaller.
>
> But the sum-checker is far more than lust the place where the
> sum-checking code is stored. It is also the electronics that
> reads the code, the ALU that executes the code, the registers
> and RAM that the code uses, and so forth. One would have to
> estimate the error rate of all of those parts of the uC and
> compare them to the error rate of the EEPROM. Unless you do
> that, you have no idea whether your continuous sum-checker
> increases or decreases system reliability compared to an
> on-demand sum-checker or no sum-checker at all.

Assuming that you have demonstrated a need to be certain of the validity of data in the EEPROM (or any other area of fixed memory), then you should also have a figure that indicates the maximum time between full checking reports (remember, integrity is a time and probability-of-failure measure). Also assuming that the system you are developing has, as mentioned in another post, no safe state, then you may need to know how much of the time the individual parts of the system are available to you. Not only would you run the checksum but you would also run other hardware integrity checking on a continuous (piecemeal) basis, leaving markers as to the success or otherwise such that a reporting programme can report the results of the error analysis.

Note that we are now in the realm of MUST NOT FAIL systems. The question of what you do when a part of your system fails must be answered fairly early on in the design phase. Every engineer should ask himself that question as a matter of routine deliberation for new designs.

Fortunately for me, I need not care too much about losing one module of a system so long as it indicates that it has failed (and why). I have several techniques that I use to check that the system is really behaving itself and ensure that outputs are disabled (a safe state in 99% of my systems).

As I often state, let the risk assessments guide you to what you need to check and then work out the scheme that gives you the best chance of meeting the integrity targets (not all parts of the system need to work to the same level).

-- Paul E. Bennett
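
The piecemeal checking with markers described in the post above can be sketched roughly as follows. This is a host-side Python model under stated assumptions, not Paul's implementation: the class name, the `read_byte` accessor, the 16-bit additive sum and the chunk size are all invented for the illustration (a real design would more likely use a CRC, and would run on the target).

```python
import math


class PiecemealChecker:
    """Sums a memory image a few bytes per call and leaves a status marker."""

    def __init__(self, read_byte, size, expected, chunk=16):
        self.read_byte = read_byte  # hypothetical accessor: address -> byte value
        self.size = size            # number of bytes covered by the check
        self.expected = expected    # reference sum stored when the data was written
        self.chunk = chunk          # bytes summed per call to step()
        self.pos = 0
        self.acc = 0
        self.last_ok = None         # marker: None until the first full pass completes
        self.passes = 0

    def step(self):
        """Sum one chunk; return True when a full pass has just completed."""
        end = min(self.pos + self.chunk, self.size)
        for addr in range(self.pos, end):
            self.acc = (self.acc + self.read_byte(addr)) & 0xFFFF
        self.pos = end
        if self.pos < self.size:
            return False
        # End of pass: record the result for a reporting routine, then restart.
        self.last_ok = (self.acc == self.expected)
        self.passes += 1
        self.pos = 0
        self.acc = 0
        return True


def full_pass_time_ms(size, chunk, tick_ms):
    """Duration of one full pass if step() is called once every tick_ms."""
    return math.ceil(size / chunk) * tick_ms
```

The pass time gives a handle on the "maximum time between full checking reports" figure. Note that an error arising just after its byte has been summed is only reported at the end of the following pass, so the detection latency can approach two pass times.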
Frank Bemelman wrote:

> Oh, if someone insist on it, even after pointing out it isn't
> very useful, why not. But most requirement flaws I ignore without
> informing the person that wrote them (or simply copied them
> from another project).

Does that mean you deliver projects that are not to the client's spec? The early part of my projects usually involves rewriting the specification to make it fully coherent. It takes quite a bit of negotiation but then can end up costing the client less (once you rid the spec of the useless dross). Remember that you have to engineer the customer as well as the system.

-- Paul E. Bennett
Guy Macon <http://www.guymacon.com> wrote:

> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>
>>But most requirement flaws I ignore without
>>informing the person that wrote them (or simply copied them
>>from another project).
>
> You would have a problem if I was your project manager. You would
> be instructed to evaluate the requirements and to agree or disagree
> with each requirement, and the definition of your code being "done"
> would include the independent testers verifying that your code complies
> with all requirements. On my projects requirement errors are serious,
> and they are to be corrected, not ignored.
>
> Then again, I wouldn't be handing you requirements that are male bovine
> excrement. Before you got a requirement to implement a continuous
> checksum, you would have hard numbers for EEPROM errors, sense-amp errors,
> ALU errors, register errors. etc.. both under normal conditions and under
> conditions of radiation, ESD, etc, and an analysis of the reliability
> impact, cost, etc. of the sum-checker.

Way to go Guy!

-- Paul E. Bennett
On Thu, 15 Jul 2004 11:14:55 -0700, Guy Macon
<http://www.guymacon.com> wrote:

[...]
>Again you assume that continuous EEPROM checksum makes the system
>more reliable rather than less reliable. How do you know this?
>What method did you use to arrive at this conclusion?

Interesting discussion. Reminds me of my first job out of college, part of a team modifying the software of the fuel gauge for a commercial airliner. I thought I'd posted on this before, but Google isn't finding it for me...

We were working on a project known as "dispatch enhancement," which was a complete misnomer. We were actually tightening up some diagnostics, adding some others, and adding the ability to send diagnostic messages to the aircraft's Engine Indicating and Crew Alerting System (EICAS). In summary, we were adding the ability to detect more problems and providing better error messages. (Prior to this enhancement, the only "error message" we provided to the crew was blank displays.) Nothing we were doing would "enhance" the "dispatch" of aircraft on their flights.

While we were working on this, an aircraft using the existing fuel gauge ran out of fuel in mid-air. Look up the "Gimli Glider" if you want more information. We suddenly came under much greater pressure to complete our modifications ahead of schedule. Which made no sense whatsoever:

1) The fuel gauge on the subject aircraft was blank, indicating internal diagnostics had found a problem. We were not going to prevent that from happening -- indeed, after our modifications, it would potentially occur more often, because we could find additional problems.

2) The FAA regulations said that when this aircraft's fuel gauges are blank, the aircraft doesn't fly. This aircraft was flying because it wasn't subject to the FAA (i.e., not an American flight).

3) The flight regs to which the aircraft was subject allowed flight when the fuel in the tanks was measured manually. This was done more than once, correctly each time. The ground crew reported to the pilots the number of pounds of fuel in the tanks. The pilots thought the reported value was in kg.

Back to the subject: for some reason someone got it in their head that our changes would make the fuel gauge "more reliable," and therefore we _had_ to complete our changes ASAP. Probably because of the bogus project name. In one sense we were: our changes would make it less likely the fuel gauge would cause the airplane to malfunction. But by their definition (aircraft flies more often), we would probably make the fuel gauge *less* reliable.

And the question of what to do in a failure. We would still blank the displays. We would also notify the crew of the nature of the problem through EICAS. No change there. The only change that could have prevented this incident was external to our group (and was made IIRC: the subject flight regs were changed to prevent the aircraft from flying with blank fuel gauges).

Regards,

-=Dave

-- Change is inevitable, progress is not.