Continous eeprom checksum microcontroller

Reply by Guy Macon ● July 15, 2004

Spehro Pefhany <speffSNIP@interlogDOTyou.knowwhat> says...

>EEPROM is fundamentally different from RAM etc. because any errors
>that arise (because of issues beyond the control of the engineer) will
>persist indefinitely. They also wear out, and are fundamentally less
>reliable than RAM due to the high dielectric stresses involved in
>Fowler-Nordheim tunneling etc. (especially from re-writing). 

This is the kind of discussion that I expect from an embedded systems 
programmer.  Instead of blindly assuming that the sum-checker is more
reliable than the EEPROM being checked, Spehro is giving reasons why
the assumption might be true.  

>On frequency- as you say above, I don't think it's necessary to do it
>more often than the information is accessed. ;-) More seriously, the
>upper time limit is typically set by how long it takes the system to
>get into trouble, worst-case. If it's a slow thermal system, then a
>minute or ten minutes with worst-case outputs may be no big deal. 

This also is the kind of discussion that I expect from an embedded 
systems programmer.  Spehro is addressing the fundamental issues of
"what should be done if there is an error", and he doesn't fall into 
the common error of not analysing the "do nothing" option. 


-- 
Guy Macon, Electronics Engineer & Project Manager for hire. 
Remember Doc Brown from the _Back to the Future_ movies? Do you 
have an "impossible" engineering project that only someone like 
Doc Brown can solve?  My resume is at http://www.guymacon.com/

Reply by Rick Merrill ● July 15, 2004

Guy Macon wrote:
...
> But the sum-checker is far more than lust the place where the 
> sum-checking code is stored.  It is also the electronics that 
> reads the code, the ALU that executes the code, the registers
> and RAM that the code uses, and so forth.  One would have to 
> estimate the error rate of all of those parts of the uC and 
> compare them to the error rate of the EEPROM.  Unless you do 
> that, you have no idea whether your continuous sum-checker 
> increases or decreases system reliability compared to an 
> on-demand sum-checker or no sum-checker at all. 

Very true, but still subject to further refinement: consider
which parts (ALU, RAM) are * on the same chip * because failures
due to threshold changes are more liable to occur from one chip to 
another than between components on the same silicon. - RM
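
The comparison Guy asks for can be put into a back-of-envelope model. The sketch below is illustrative only: the function names and every number in it are hypothetical, and it assumes that a detected EEPROM error is handled safely, so that only undetected EEPROM errors and faults in the checker path itself (code store, sense amps, ALU, registers, RAM) count as system failures.

```python
def failure_rate_without_checker(eeprom_rate):
    """System failures per hour when corrupt EEPROM data is used unchecked."""
    return eeprom_rate


def failure_rate_with_checker(eeprom_rate, coverage,
                              checks_per_hour, checker_fault_per_check):
    """System failures per hour with a continuous sum-checker.

    Assumes a detected error is handled safely, so only undetected EEPROM
    errors and faults in the checker path itself count as failures.
    """
    undetected = eeprom_rate * (1.0 - coverage)
    induced = checks_per_hour * checker_fault_per_check
    return undetected + induced


def checker_helps(eeprom_rate, coverage, checks_per_hour,
                  checker_fault_per_check):
    """True if adding the checker lowers the modelled failure rate."""
    with_checker = failure_rate_with_checker(
        eeprom_rate, coverage, checks_per_hour, checker_fault_per_check)
    return with_checker < failure_rate_without_checker(eeprom_rate)


# Hypothetical numbers, for illustration only.
EEPROM_RATE = 1e-6        # EEPROM corruptions per hour
COVERAGE = 0.99           # fraction of corruptions the checksum catches
CHECKS_PER_HOUR = 3600    # one full check per second

# A checker path that faults once per 1e12 checks is a net win in this model,
print(checker_helps(EEPROM_RATE, COVERAGE, CHECKS_PER_HOUR, 1e-12))
# while one that faults once per 1e9 checks is a net loss.
print(checker_helps(EEPROM_RATE, COVERAGE, CHECKS_PER_HOUR, 1e-9))
```

In this model the checker pays off only while checks_per_hour * checker_fault_per_check stays below coverage * eeprom_rate, which is why checking more often than necessary can hurt rather than help.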

Reply by Guy Macon ● July 15, 2004

Guy Macon <http://www.guymacon.com> says...

>far more than lust the place where

&*$!@*! spellchecker! JUST the place...

Reply by Frank Bemelman ● July 15, 2004

"Guy Macon" <http://www.guymacon.com> schreef in bericht
news:10fd9jnb6nnpc0b@corp.supernews.com...
>
> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>
> >But most requirement flaws I ignore without
> >informing the person that wrote them (or simply copied them
> >from another project).
>
> You would have a problem if I was your project manager.  You would
> be instructed to evaluate the requirements and to agree or disagree
> with each requirement, and the definition of your code being "done"
> would include the independent testers verifying that your code complies
> with all requirements. On my projects requirement errors are serious,
> and they are to be corrected, not ignored.

I am a programmer. 'Requirement flaws' are treated differently from
'Requirements' because they are flagged as flaws.

> Then again, I wouldn't be handing you requirements that are male bovine
> excrement. Before you got a requirement to implement a continuous
> checksum, you would have hard numbers for EEPROM errors, sense-amp errors,
> ALU errors, register errors. etc.. both under normal conditions and under
> conditions of radiation, ESD, etc, and an analysis of the reliability
> impact, cost, etc. of the sum-checker.

Rubbish. Hard numbers don't have any value in this context, and
reliability/cost of the sum checker is the least interesting bit. If I had
hired you as a manager you would have a problem, wasting time on instructing
other staff to waste even more time. What matters is if a system failure is
something you can afford or not. Assuming, for the sake of this discussion,
the software has to work with occasionally corrupted eeprom data, you have
to decide what you can do to avoid that, and at what cost. Piles of analysis
tend to be highly unreliable, cost calculations in such areas never make
sense, better to trust a bit of common sense. BTW, in whatever system, I
would not be worried by eeprom itself, would worry more about software
making accidental writes and, most important, a healthy hardware design with
nice power up/down behaviour. Implementing a continuous check may cure that
just enough to let the systems pass the testers, but if that is desirable...
For the same reasons I don't like the widespread practice of restoring
important hardware registers on a regular basis. Or watchdogs. I use both,
but I don't like it a bit.

-- 
Thanks, Frank.
(remove 'x' and 'invalid' when replying by email)



Reply by Paul E. Bennett ● July 15, 2004

Spehro Pefhany wrote:

[%X]

> It means I incorporate a lot (more than just a mirror) of redundancy
> on important information, because a non-recoverable failure is very
> expensive. Data integrity is more important than saving a few cents on
> memory. The other options you mention are open if they are acceptable
> in the application, of course. Some systems have no "safe" state (few
> I work on), or there is an unpleasant choice such as a) test limit
> controls, b) cause $10,000 damage (100% certain).

As the definition of system is quite wide I am just asking to clarify 
matters (although I think I know what you mean).

When you say that some systems have "no safe state" I am taking it that you 
are speaking of individual sub-system modules that are one of a redundant 
set so that failure of an individual sub-system module does not have an 
impact on the overall safety of the whole system.

I have not come across many of this type of system but then I have never 
worked in any of the aerospace industries (where I expect such 
considerations to exist in plenitude).

-- 
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
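
Redundancy that is "more than just a mirror" might, as one possibility, look like several copies of each record with a CRC on every copy. The sketch below is an assumption about what such a scheme could be, not a description of Spehro's actual design. The names pack, unpack and read_redundant are hypothetical, and Python's zlib.crc32 stands in for whatever check a real part would use.

```python
import zlib


def pack(payload: bytes) -> bytes:
    """Append a CRC-32 so each stored copy can be validated on its own."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def unpack(record: bytes):
    """Return the payload if its CRC matches, else None."""
    if len(record) < 4:
        return None
    payload = record[:-4]
    stored_crc = int.from_bytes(record[-4:], "little")
    return payload if zlib.crc32(payload) == stored_crc else None


def read_redundant(copies):
    """Read N stored copies of one record.

    Returns (payload, bad_indices). payload is the value most common among
    the copies whose CRC is valid, or None if no copy is valid. bad_indices
    lists the copies that should be rewritten from the recovered payload.
    """
    decoded = [unpack(c) for c in copies]
    good = [p for p in decoded if p is not None]
    if not good:
        return None, list(range(len(copies)))
    payload = max(set(good), key=good.count)
    bad = [i for i, p in enumerate(decoded) if p != payload]
    return payload, bad


# Usage: three copies, one of them with a single flipped bit.
record = pack(b"gain=42")
damaged = bytearray(record)
damaged[0] ^= 0x01
payload, bad = read_redundant([record, bytes(damaged), record])
print(payload, bad)
```

A plain mirror can only say that two copies differ. The per-copy CRC is what tells the reader which copy to believe, and the list of bad indices gives the caller the option of repairing them or, if writes are considered risky, just reporting them.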

Reply by Guy Macon ● July 15, 2004

Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>
>"Guy Macon" <http://www.guymacon.com> schreef...
>>
>> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
>>
>> >But most requirement flaws I ignore without
>> >informing the person that wrote them (or simply copied them
>> >from another project).
>>
>> You would have a problem if I was your project manager.  You would
>> be instructed to evaluate the requirements and to agree or disagree
>> with each requirement, and the definition of your code being "done"
>> would include the independent testers verifying that your code complies
>> with all requirements. On my projects requirement errors are serious,
>> and they are to be corrected, not ignored.
>
>I am a programmer. 'Requirement flaws' are treated differently from
>'Requirements' because they are flagged as flaws.

Which is it?  Do you flag them as flaws or ignore them without
informing the person that wrote them?  The former I like.  The
latter I consider to be grounds for termination on the third 
offense.  

>>Then again, I wouldn't be handing you requirements that are male 
>>bovine excrement. Before you got a requirement to implement a 
>>continuous checksum, you would have hard numbers for EEPROM 
>>errors, sense-amp errors, ALU errors, register errors. etc.. 
>>both under normal conditions and under conditions of radiation, 
>>ESD, etc, and an analysis of the reliability impact, cost, etc.
>>of the sum-checker.
>
>Rubbish. Hard numbers don't have any value in this context,

They do if you do them right.

>and reliability/cost of the sum checker is the least 
>interesting bit.

Reliability is the most *important* bit, whether you find it to
be interesting or not.

>If I had hired you as a manager you would have a problem, wasting 
>time on instructing other staff to waste even more time.

I hope that you are referring to the analysis of whether the continuous 
eeprom checksum makes the system more or less reliable.  I would not
instruct anyone to do that analysis - I would simply refuse to add
a continuous eeprom checksum to the requirements without it.  If I allowed
requirements to be added without any apparent benefit, *that* would be
wasting time.  
 
>What matters is if a system failure is something you can 
>afford or not. Assuming, for the sake of this discussion, the software has
>to work with occasionally corrupted eeprom data, you have to decide what
>you can do to avoid that,

Once again you are pretending that you know that the hardware that does 
the continuous EEPROM checksum is more reliable than the EEPROM. If it
happens to be a lot less reliable, you are making a system failure more
likely. 

>Piles of analysis tend to be highly unreliable,

Not if you do them right.

>cost calculations in such areas never make sense,

They make sense if you do them right.

>better to trust a bit of common sense.

And you think that doing a continuous EEPROM checksum when you don't 
know (because you don't like analysis) whether the EEPROM is orders
of magnitude more likely or less likely to have an error than the 
system that does the checksumming makes common sense?  I will stick 
with the "piles of analysis" as being more reliable than "common 
sense."
 
>BTW, in whatever system, I would not be
>worried by eeprom itself, would worry more about software making accidental
>writes and, most important, a healthy hardware design with nice power up/down
>behaviour.

We agree here.

>Implementing a continuous check may cure that just enough to let
>the systems pass the testers,

Again you assume that continuous EEPROM checksum makes the system
more reliable rather than less reliable.  How do you know this?
What method did you use to arrive at this conclusion?

Reply by Paul E. Bennett ● July 15, 2004

Guy Macon <http://www.guymacon.com> wrote:

> 
> Paul Keinanen <keinanen@sci.fi> says...
>>
>>Guy Macon <http://www.guymacon.com> wrote:
>>
>>>Again, I have seen no evidence that the sum-checker is more reliable
>>>than the EEPROM being checked. Everyone seems to be accepting that
>>>it is based on nothing more than blind faith.
>>
>>Even if the checksum algorithm is executed directly out of the EEPROM
>>(which is not always the case), the surface area occupied by the
>>checker is very small compared to the total area of the EEPROM in most
>>cases. If there is a single (hard or soft) error in the EEPROM, the
>>likelihood is much greater that the error is in the other part of
>>the EEPROM than in the checker code itself due to the area ratio.
>>The worst case is that there are error(s) in the EEPROM, but a bit
>>flip in the actual checker code will modify the program so that it
>>will return EEPROM OK, but the likelihood is still smaller.
> 
> But the sum-checker is far more than lust the place where the
> sum-checking code is stored.  It is also the electronics that
> reads the code, the ALU that executes the code, the registers
> and RAM that the code uses, and so forth.  One would have to
> estimate the error rate of all of those parts of the uC and
> compare them to the error rate of the EEPROM.  Unless you do
> that, you have no idea whether your continuous sum-checker
> increases or decreases system reliability compared to an
> on-demand sum-checker or no sum-checker at all.
> 

Assuming that you have demonstrated a need to be certain of the validity of 
data in the EEPROM (or any other area of fixed memory) then you should also 
have a figure that indicates the maximum time between full checking reports 
(remember, integrity is a time and probability of failure measure).

Also assuming that the system you are developing has, as mentioned in 
another post, no safe state then you may need to know how much of the time 
the individual parts of the system are available to you. Not only would 
you run the checksum but you would also run other hardware integrity 
checking on a continuous (piecemeal) basis, leaving markers as to the 
success or otherwise such that a reporting programme can report the results 
of the error analysis. Note that we are now in the realm of MUST NOT FAIL 
systems.

The question of what you do when a part of your system fails must be 
answered fairly early on in the design phase. Every engineer should ask 
himself that question as a matter of routine deliberation for new designs.

Fortunately for me, I need not care too much about losing one module of a 
system so long as it indicates that it has failed (and why). I have several 
techniques that I use to check that the system is really behaving itself 
and ensure that outputs are disabled (a safe state in 99% of my systems).

As I often state, let the risk assessments guide you to what you need to 
check and then work out the scheme that gives you the best chance of 
meeting the integrity targets (not all parts of the system need to work to 
the same level).

-- 
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
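
The piecemeal checking Paul describes, with a marker left for a reporting routine and a bound on the time between full check reports, could be sketched roughly as follows. This is a model under stated assumptions rather than anyone's production code: PiecemealChecker, ticks_per_full_pass and worst_case_latency_ms are hypothetical names, the memory is accessed through a caller-supplied read function, CRC-32 from Python's zlib stands in for the real check, and the data is assumed not to be legitimately rewritten in the middle of a pass.

```python
import zlib


def ticks_per_full_pass(size, chunk):
    """Number of ticks needed to cover `size` bytes, `chunk` bytes per tick."""
    return -(-size // chunk)  # ceiling division


def worst_case_latency_ms(size, chunk, tick_period_ms):
    """Upper bound on the time from a corruption to a failed pass report.

    A byte corrupted just after it was read in one pass is only caught by
    the next pass, so the bound is two full passes.
    """
    return 2 * ticks_per_full_pass(size, chunk) * tick_period_ms


class PiecemealChecker:
    """Checks a memory region a few bytes per tick instead of all at once."""

    def __init__(self, read, size, expected_crc, chunk):
        self.read = read              # read(start, end) -> bytes
        self.size = size
        self.expected_crc = expected_crc
        self.chunk = chunk
        self.pos = 0
        self.crc = 0
        self.last_ok = None           # marker for a reporting routine
        self.passes = 0

    def tick(self):
        """Check one chunk. Returns None mid-pass, True/False at pass end."""
        end = min(self.pos + self.chunk, self.size)
        self.crc = zlib.crc32(self.read(self.pos, end), self.crc)
        self.pos = end
        if self.pos < self.size:
            return None
        self.last_ok = (self.crc == self.expected_crc)
        self.passes += 1
        self.pos = 0
        self.crc = 0
        return self.last_ok


# Usage: a 1024-byte image checked 100 bytes per 10 ms tick.
image = bytearray(bytes(range(256)) * 4)
checker = PiecemealChecker(lambda a, b: bytes(image[a:b]),
                           len(image), zlib.crc32(bytes(image)), chunk=100)
for _ in range(ticks_per_full_pass(len(image), 100)):
    result = checker.tick()
print(result, worst_case_latency_ms(len(image), 100, 10))
```

The value from worst_case_latency_ms is the number to compare against the maximum time between full checking reports that the risk assessment allows. If it is too long, the chunk size or tick rate has to go up, at the cost of more load on everything the checker shares with the rest of the system.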

Reply by Paul E. Bennett ● July 15, 2004

Frank Bemelman wrote:

> Oh, if someone insist on it, even after pointing out it isn't
> very useful, why not. But most requirement flaws I ignore without
> informing the person that wrote them (or simply copied them
> from another project).

Does that mean you deliver projects that are not to the client's spec?

The early part of my projects usually involves rewriting the specification 
to make it fully coherent. It takes quite a bit of negotiation but then can 
end up costing the client less (once you rid the spec of the useless 
dross). Remember that you have to engineer the customer as well as the 
system.

-- 
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

Reply by Paul E. Bennett ● July 15, 2004

Guy Macon <http://www.guymacon.com> wrote:

> 
> Frank Bemelman <f.bemelmanx@xs4all.invalid.nl> says...
> 
>>But most requirement flaws I ignore without
>>informing the person that wrote them (or simply copied them
>>from another project).
> 
> You would have a problem if I was your project manager.  You would
> be instructed to evaluate the requirements and to agree or disagree
> with each requirement, and the definition of your code being "done"
> would include the independent testers verifying that your code complies
> with all requirements. On my projects requirement errors are serious,
> and they are to be corrected, not ignored.
> 
> Then again, I wouldn't be handing you requirements that are male bovine
> excrement. Before you got a requirement to implement a continuous
> checksum, you would have hard numbers for EEPROM errors, sense-amp errors,
> ALU errors, register errors. etc.. both under normal conditions and under
> conditions of radiation, ESD, etc, and an analysis of the reliability
> impact, cost, etc. of the sum-checker.

Way to go Guy!

-- 
********************************************************************
Paul E. Bennett ....................<email://peb@a...>
Forth based HIDECS Consultancy .....<http://www.amleth.demon.co.uk/>
Mob: +44 (0)7811-639972 .........NOW AVAILABLE:- HIDECS COURSE......
Tel: +44 (0)1235-811095 .... see http://www.feabhas.com for details.
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

Reply by Dave Hansen ● July 15, 2004

On Thu, 15 Jul 2004 11:14:55 -0700, Guy Macon
<http://www.guymacon.com> wrote:

[...]
>
>Again you assume that continuous EEPROM checksum makes the system
>more reliable rather than less reliable.  How do you know this?
>What method did you use to arrive at this conclusion?

Interesting discussion.  Reminds me of my first job out of college,
part of a team modifying the software of the fuel gauge for a
commercial airliner. I thought I'd posted on this before, but google
isn't finding it for me...

We were working on a project known as "dispatch enhancement," which
was a complete misnomer.  We were actually tightening up some
diagnostics, adding some others, and adding the ability to send
diagnostic messages to the aircraft's Engine Indicating and Crew
Alerting System (EICAS).  In summary, we were adding the ability to
detect more problems and providing better error messages.  (Prior to
this enhancement, the only "error message" we provided to the crew was
blank displays).  Nothing we were doing would "enhance" the "dispatch"
of aircraft on their flights.

While we were working on this, an aircraft using the existing fuel
gauge ran out of fuel in mid-air.  Look up the "Gimli Glider" if you
want more information.

We suddenly came under much greater pressure to complete our
modifications ahead of schedule.  Which made no sense whatsoever:

   1) The fuel gauge on the subject aircraft was blank, indicating 
      internal diagnostics had found a problem.  We were not going
      to prevent that from happening -- indeed, after our
      modifications, it would potentially occur more often, because
      we could find additional problems.

   2) The FAA regulations said that when this aircraft's fuel gauges
      are blank, the aircraft doesn't fly.  This aircraft was flying
      because it wasn't subject to the FAA (i.e., not an American
      flight).

   3) The flight regs to which the aircraft was subject allowed flight
      when the fuel in the tanks was measured manually.  This was 
      done more than once, correctly each time.  The ground crew 
      reported to the pilots the number of pounds of fuel in the
      tanks.  The pilots thought the reported value was in kg.

Back to the subject: for some reason someone got it in their head that
our changes would make the fuel gauge "more reliable," and therefore
we _had_ to complete our changes ASAP.  Probably because of the bogus
project name.  In one sense we were: our changes would make it less
likely the fuel gauge would cause the airplane to malfunction.  But by
their definition (aircraft flies more often), we would probably make
the fuel gauge *less* reliable.

And the question of what to do in a failure.  We would still blank the
displays.  We would also notify the crew of the nature of the problem
through EICAS.  No change there.  The only change that could have
prevented this incident was external to our group (and was made IIRC:
the subject flight regs were changed to prevent the aircraft from
flying with blank fuel gauges).

Regards,

                               -=Dave
-- 
Change is inevitable, progress is not.
