Hi Simon,

On 2/24/2015 6:35 AM, Simon Clubley wrote:
> On 2015-02-23, Don Y <this@is.not.me.com> wrote:
>> On 2/22/2015 6:13 PM, Simon Clubley wrote:
>>
>>> I was addressing Don's interesting and specific comment about how do
>>> you detect, in general, a faulty flash image caused by a malfunctioning
>>> reflasher ? I wasn't offering a general suggestion for the OP.
>>>
>>> The beauty of a build time hash is that even if a faulty reflasher
>>> corrupts the in-memory image _before_ burning it, the hash will detect
>>> that but comparing the burnt image against the corrupt in-memory image
>>> will not.
>>
>> Good point!  OTOH, do you build a re-reflasher to verify the hash stored
>> in the reflasher hasn't been corrupted?  I.e., reflasher's hash gets
>> mangled.  It CORRECTLY reflashes the device in question.  Then, computes
>> the hash of that image (from/in the device) and notices that it is
>> not in agreement with the stored hash -- so, it (erroneously) decides
>> the reflash didn't "take" and repeats the process...  :-/
>>
>> (which, of course, will *still* fail -- because the *hash* is corrupt!)
>
> :-)
>
> I learnt a long time ago that not every problem can be solved by
> technical means; sometimes a technical solution becomes a management
> solution instead.
>
> In this hypothetical case, the build time hash has allowed it to be
> established that either the image or the hash itself is getting
> corrupted by the reflasher. In either case, the end result is the
> same - the reflasher is faulty and cannot be trusted.

Yes -- but notice how we're now talking about a problem with
a REFLASHER!  The *original* problem is hiding (unsolved)
behind a (potentially) newly created one!  :-/

(OP) Understand the problem first.  It *may* be that the most
practical solution ends up being a reflasher (ick!).  E.g., Hubble's
defective mirror was best solved as it was -- instead of *replacing*
the entire mirror (which would have been the "ideal" solution).

But, know *why* this solution is the best instead of just throwing
it up as a quick fix!

[I'm off to one of my pro bono gigs...]

On 2015-02-23, Don Y <this@is.not.me.com> wrote:
> Hi Simon,
>
> On 2/22/2015 6:13 PM, Simon Clubley wrote:
>
>> I was addressing Don's interesting and specific comment about how do
>> you detect, in general, a faulty flash image caused by a malfunctioning
>> reflasher ? I wasn't offering a general suggestion for the OP.
>>
>> The beauty of a build time hash is that even if a faulty reflasher
>> corrupts the in-memory image _before_ burning it, the hash will detect
>> that but comparing the burnt image against the corrupt in-memory image
>> will not.
>
> Good point!  OTOH, do you build a re-reflasher to verify the hash stored
> in the reflasher hasn't been corrupted?  I.e., reflasher's hash gets
> mangled.  It CORRECTLY reflashes the device in question.  Then, computes
> the hash of that image (from/in the device) and notices that it is
> not in agreement with the stored hash -- so, it (erroneously) decides
> the reflash didn't "take" and repeats the process...  :-/
>
> (which, of course, will *still* fail -- because the *hash* is corrupt!)
>

:-)

I learnt a long time ago that not every problem can be solved by
technical means; sometimes a technical solution becomes a management
solution instead.

In this hypothetical case, the build time hash has allowed it to be
established that either the image or the hash itself is getting
corrupted by the reflasher. In either case, the end result is the
same - the reflasher is faulty and cannot be trusted.

At this point, the reflasher should be pulled out of service and dumped
on the bench of whoever created it. This person should be told "this
reflasher is faulty and this hash is the proof. Fix it."

If they still can't do that then that's when you either go to their
manager with your hash proof and/or put a quote for your design
services on their desk. :-)

>> However, based on the thread so far, I agree the OP has a more basic
>> problem which is the real cause and is the one which needs solving.
>
> I think the OP hasn't even (clearly) identified the *symptoms*,
> let alone the *problem*!  (i.e., *is* the image intact or not?
> if it *is*, then why are you reflashing it??)

Indeed. And just to repeat this; I am not suggesting the OP go down the
reflasher route. I am just thinking about how to detect/solve the
specific question Don posed.

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world

Hi Simon,

On 2/22/2015 6:13 PM, Simon Clubley wrote:

> I was addressing Don's interesting and specific comment about how do
> you detect, in general, a faulty flash image caused by a malfunctioning
> reflasher ? I wasn't offering a general suggestion for the OP.
>
> The beauty of a build time hash is that even if a faulty reflasher
> corrupts the in-memory image _before_ burning it, the hash will detect
> that but comparing the burnt image against the corrupt in-memory image
> will not.

Good point!  OTOH, do you build a re-reflasher to verify the hash stored
in the reflasher hasn't been corrupted?  I.e., reflasher's hash gets
mangled.  It CORRECTLY reflashes the device in question.  Then, computes
the hash of that image (from/in the device) and notices that it is
not in agreement with the stored hash -- so, it (erroneously) decides
the reflash didn't "take" and repeats the process...  :-/

(which, of course, will *still* fail -- because the *hash* is corrupt!)
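
One way to at least stop that loop is to cap the number of attempts and report the reflasher itself as suspect once the cap is hit. A rough sketch in Python follows; the attempt limit, the FakeDevice class and the function names are all invented for illustration, and it deliberately does not try to decide *which* of the image or the hash is bad.

```python
import hashlib

MAX_ATTEMPTS = 3  # assumed limit, not from any real product


class FakeDevice:
    """Hypothetical stand-in for the target: stores whatever is flashed."""

    def __init__(self):
        self.contents = b""
        self.flash_count = 0

    def flash(self, image: bytes) -> None:
        self.contents = bytes(image)
        self.flash_count += 1

    def read_back(self) -> bytes:
        return self.contents


def reflash_with_limit(image: bytes, stored_hash: str, device: FakeDevice) -> str:
    """Reflash until the readback hash matches stored_hash, giving up after MAX_ATTEMPTS."""
    for _ in range(MAX_ATTEMPTS):
        device.flash(image)
        if hashlib.md5(device.read_back()).hexdigest() == stored_hash:
            return "ok"
    # Either the image or the stored hash is bad; more retries will not help.
    return "reflasher-suspect"


image = b"firmware" * 32
good_hash = hashlib.md5(image).hexdigest()
# Mangle the stored hash by changing its first hex digit.
mangled_hash = ("0" if good_hash[0] != "0" else "1") + good_hash[1:]

print(reflash_with_limit(image, good_hash, FakeDevice()))
print(reflash_with_limit(image, mangled_hash, FakeDevice()))
```

With the mangled hash every attempt "fails" even though the device contents are correct, which is exactly the scenario above; the cap just turns an endless loop into a report.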

> However, based on the thread so far, I agree the OP has a more basic
> problem which is the real cause and is the one which needs solving.

I think the OP hasn't even (clearly) identified the *symptoms*,
let alone the *problem*!  (i.e., *is* the image intact or not?
if it *is*, then why are you reflashing it??)

Tim Wescott <seemywebsite@myfooter.really> wrote:

> Can you set protect bits on the flash, either permanently or (assuming
> that you have to re-program from time to time) unlockable?
> 
> It sounds like you're allowing the processor to write to program memory,
> which is just wrong.  If you have valid flash writes (i.e., if you have
> program and non-volatile data in flash), consider hard-coding the flash
> write routines to fail if they're told to write someplace they're not
> supposed to.

and also do some check on the non-volatile data in flash in case it
becomes corrupt...
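
For example, store a CRC alongside each data record and refuse to use the record if the CRC does not match. A small sketch in Python; the record layout (payload followed by a little-endian CRC-32), the function names and the sample payload are my own assumptions, not anything from the OP's system.

```python
import struct
import zlib


def pack_record(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 (little-endian, 4 bytes)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_record(record: bytes):
    """Return the payload if its stored CRC-32 matches, otherwise None."""
    if len(record) < 4:
        return None
    payload = record[:-4]
    (stored_crc,) = struct.unpack("<I", record[-4:])
    return payload if zlib.crc32(payload) == stored_crc else None


# Hypothetical calibration record as it would be written to flash.
record = pack_record(b"cal=1234;gain=7")

# Simulate a single flipped bit in the stored payload.
corrupted = bytes([record[0] ^ 0x04]) + record[1:]

print(unpack_record(record))
print(unpack_record(corrupted))
```

What to do when the check fails (fall back to defaults, keep a second copy, refuse to run) depends on the application.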

Bye Jack
-- 
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?

On 2015-02-22, Don Y <this@is.not.me.com> wrote:
> Hi Simon,
>
> On 2/22/2015 6:26 AM, Simon Clubley wrote:
>>
>> If you are concerned about that, have the build procedures which generate
>> the image to be flashed in the first place also generate an MD5 or
>> similar hash of the generated image at the same time.
>>
>> As part of your post-flash verify pass, you can then download the image
>> which was actually flashed and generate its MD5. Comparing the two
>> hashes will tell you if the image was flashed correctly (unless you
>> manage to generate a hash collision :-)).
>
> I'm not sure that would give a conclusive result.
>
> First, the OP hasn't confirmed that the image even *appears* to have
> been corrupted (i.e., altered).  All he's said is that reflashing FIXES
> the "problem".  I.e., he is (apparently) assuming that the flash has
> been corrupted -- as that is what reflashing *purports* to "fix".
>

Hello Don (and Paul),

I was addressing Don's interesting and specific comment about how do
you detect, in general, a faulty flash image caused by a malfunctioning
reflasher ? I wasn't offering a general suggestion for the OP.

The beauty of a build time hash is that even if a faulty reflasher
corrupts the in-memory image _before_ burning it, the hash will detect
that but comparing the burnt image against the corrupt in-memory image
will not.
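
To make the difference concrete, here is a small simulation in Python. Everything in it is invented for illustration (the image bytes, the pretend flash_and_read_back function, the single flipped bit); it only shows which of the two checks notices the fault.

```python
import hashlib


def flash_and_read_back(in_memory_image: bytes) -> bytes:
    """Pretend flash: burns exactly what it is given and reads it back intact."""
    return bytes(in_memory_image)


# Hypothetical image as produced by the build, plus its build-time hash.
built_image = b"\x12\x34\x56\x78" * 64
build_hash = hashlib.md5(built_image).hexdigest()

# A faulty reflasher corrupts its in-memory copy *before* burning it.
damaged = bytearray(built_image)
damaged[10] ^= 0x01
in_memory = bytes(damaged)

burnt = flash_and_read_back(in_memory)

# Check 1: compare the burnt image against the (corrupt) in-memory copy.
readback_matches_memory = burnt == in_memory
# Check 2: compare the hash of the burnt image against the build-time hash.
readback_matches_build_hash = hashlib.md5(burnt).hexdigest() == build_hash

print(readback_matches_memory)
print(readback_matches_build_hash)
```

The first check passes because both copies carry the same damage; the second fails because the build-time hash was computed before the reflasher ever touched the image.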

However, based on the thread so far, I agree the OP has a more basic
problem which is the real cause and is the one which needs solving.

Simon.


Hi Paul,

On 2/22/2015 7:10 AM, Paul E Bennett wrote:
> As Don Y, Tim Wescott and myself have suggested, there is something that is
> fundamentally wrong with the installed systems. Rather than designing,
> building and installing thousands of re-flashers they should explore the
> root cause of the problem more thoroughly.
>
> It is obvious that the Flash is being trashed somehow. Finding out what and
> why would be the best use of their time. If they have to change the design
> perhaps they can build in the protection measures to prevent such
> recurrences.

As I said to Simon (upthread), I am not convinced that the flash HAS
been trashed!  The only "evidence" may be entirely coincidental.
That's why I'd like to hear (from the OP) what he did to verify
the flash's integrity (or, lack thereof).

"Reinstall Windows" is just too simplistic an approach to a problem
(and, as in most of those cases where "Windows was reinstalled", it
often doesn't prevent the problem from recurring, because the
PROBLEM hasn't been identified and solved).

Scientific method:  construct a hypothesis; then construct an experiment
(test) to validate or invalidate that hypothesis.  *THEN*, come to conclusions
(or, a refined hypothesis).  OP seems to have just found something that
APPEARS to work (?  no idea how WELL!) and settled on that.

Sunday lunch:  Finestkind!

Hi Simon,

On 2/22/2015 6:26 AM, Simon Clubley wrote:
> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>> Hi Paul,
>>
>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>
>>> In addition to Don's points, you might ask yourself what happens if the
>>> problem you are experiencing happens also to your re-flashing device (as
>>> that is likely to have FLASH also). You may end up loading a corrupt image
>>> in the wrong locations.
>>
>> Ha!  I hadn't considered that!  (though if the same soul designed both
>> devices, it only stands to reason!)  Rather, I was more concerned (above)
>> with the reflasher failing to flash the (original) device due to a problem
>> *in* the original device.  E.g., perhaps when "staff" reflash the device,
>> they have it powered from a more stable power source, implement more
>> robust "tests" that the flash "took", etc.  A "dumb box" could easily
>> fail to achieve any of these "differences" leading to a less reliable
>> reflash... followed by another crash (perhaps for some *other* reason
>> than the original problem!) and another reflash followed by...
>
> If you are concerned about that, have the build procedures which generate
> the image to be flashed in the first place also generate an MD5 or
> similar hash of the generated image at the same time.
>
> As part of your post-flash verify pass, you can then download the image
> which was actually flashed and generate its MD5. Comparing the two
> hashes will tell you if the image was flashed correctly (unless you
> manage to generate a hash collision :-)).

I'm not sure that would give a conclusive result.

First, the OP hasn't confirmed that the image even *appears* to have
been corrupted (i.e., altered).  All he's said is that reflashing FIXES
the "problem".  I.e., he is (apparently) assuming that the flash has
been corrupted -- as that is what reflashing *purports* to "fix".

There may, indeed, be something (?) that has happened to the system
that his reflashing ACTIVITY/procedure is "fixing" OTHER THAN "CORRECTING"
THE CONTENTS OF THE FLASH.

E.g., imagine a device that is powered *on* 24/7/365 and only has
power cycled as a side-effect of the reflashing process.  The contents
of the flash may, in fact, be intact and it is the cycling of power
that is "fixing" the ACTUAL problem.

[I am not claiming this is the case.  Rather, indicating that the OP's
"diagnosis" is unsubstantiated:  is the firmware image ACTUALLY corrupt?
*How*/where?  Do all afflicted devices exhibit the same problem in the
same *way*/place?  etc.]

Second, how you obtain that checksum/hash -- even a literal byte-by-byte
comparison -- may not reflect the operating conditions of the device
in its failed state.  E.g., using JTAG to pull the bytes from the
device will obviously *not* occur at "opcode-fetch speed".  Nor will
the memory access patterns mimic those that occur in normal operation.

Etc.

The OP first needs to prove to himself that reflashing *could* be a
remedy -- by showing that the contents HAVE, in fact, been altered
between the time the device was manufactured and the time the
"crash" (and proposed reflash) occurred.

E.g., imagine examining the flash's contents and finding it *intact*!
Yet, still noting that the reflash "fixes" the problem!  This poses
a different problem than finding the contents have been *altered*...
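
As a rough illustration of the kind of examination I mean: dump the flash from a "failed" device and list *where* it differs from the reference image, rather than just noting pass/fail. The sketch below is Python; the function name and the sample data are invented, and it says nothing about whether the device sees the same bytes when fetching at full speed (my second point above).

```python
def differing_ranges(reference: bytes, dumped: bytes):
    """Return (start, end) offsets, end exclusive, where the two images differ."""
    ranges = []
    start = None
    length = min(len(reference), len(dumped))
    for offset in range(length):
        if reference[offset] != dumped[offset]:
            if start is None:
                start = offset  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, offset))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, length))
    if len(reference) != len(dumped):
        # Report any size mismatch as one extra differing range.
        ranges.append((length, max(len(reference), len(dumped))))
    return ranges


# Hypothetical reference image and two dumps from field units.
reference = bytes(64)
intact_dump = bytes(reference)
altered = bytearray(reference)
altered[8:12] = b"\xff" * 4
altered[40] = 0x01
altered_dump = bytes(altered)

print(differing_ranges(reference, intact_dump))
print(differing_ranges(reference, altered_dump))
```

An empty list from a unit that still "needs" reflashing points away from flash corruption; the same ranges turning up on every afflicted unit points at something systematic.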

While the OP may, in fact, have done these things, I'm just asking
for confirmation and an elaboration as to *how* he came to the
conclusion that a reflasher "makes sense" (even as a PTF).  It's
sort of like someone who "debugs" code by making "arbitrary" changes
and waiting to DISCOVER which of them (appears to) yield the correct
results.  While you *may* find a change that appears to work, unless
you can PROVE that it *should* work (by understanding the real
problem), you may have just CHANGED the problem...

On 22.2.15 16:10, Paul E Bennett wrote:
> Simon Clubley wrote:
>
>> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>>> Hi Paul,
>>>
>>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>>
>>>> In addition to Don's points, you might ask yourself what happens if the
>>>> problem you are experiencing happens also to your re-flashing device (as
>>>> that is likely to have FLASH also). You may end up loading a corrupt
>>>> image in the wrong locations.
>>>
>>> Ha!  I hadn't considered that!  (though if the same soul designed both
>>> devices, it only stands to reason!)  Rather, I was more concerned (above)
>>> with the reflasher failing to flash the (original) device due to a
>>> problem
>>> *in* the original device.  E.g., perhaps when "staff" reflash the device,
>>> they have it powered from a more stable power source, implement more
>>> robust "tests" that the flash "took", etc.  A "dumb box" could easily
>>> fail to achieve any of these "differences" leading to a less reliable
>>> reflash... followed by another crash (perhaps for some *other* reason
>>> than the original problem!) and another reflash followed by...
>>>
>>
>> If you are concerned about that, have the build procedures which generate
>> the image to be flashed in the first place also generate an MD5 or
>> similar hash of the generated image at the same time.
>>
>> As part of your post-flash verify pass, you can then download the image
>> which was actually flashed and generate its MD5. Comparing the two
>> hashes will tell you if the image was flashed correctly (unless you
>> manage to generate a hash collision :-)).
>>
>> Simon.
>>
>
> As Don Y, Tim Wescott and myself have suggested, there is something that is
> fundamentally wrong with the installed systems. Rather than designing,
> building and installing thousands of re-flashers they should explore the
> root cause of the problem more thoroughly.
>
> It is obvious that the Flash is being trashed somehow. Finding out what and
> why would be the best use of their time. If they have to change the design
> perhaps they can build in the protection measures to prevent such
> recurrences.


A brownout detector reset chip could be a good investment.

-- 

-TV

Simon Clubley wrote:

> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>> Hi Paul,
>>
>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>
>>> In addition to Don's points, you might ask yourself what happens if the
>>> problem you are experiencing happens also to your re-flashing device (as
>>> that is likely to have FLASH also). You may end up loading a corrupt
>>> image in the wrong locations.
>>
>> Ha!  I hadn't considered that!  (though if the same soul designed both
>> devices, it only stands to reason!)  Rather, I was more concerned (above)
>> with the reflasher failing to flash the (original) device due to a
>> problem
>> *in* the original device.  E.g., perhaps when "staff" reflash the device,
>> they have it powered from a more stable power source, implement more
>> robust "tests" that the flash "took", etc.  A "dumb box" could easily
>> fail to achieve any of these "differences" leading to a less reliable
>> reflash... followed by another crash (perhaps for some *other* reason
>> than the original problem!) and another reflash followed by...
>>
> 
> If you are concerned about that, have the build procedures which generate
> the image to be flashed in the first place also generate an MD5 or
> similar hash of the generated image at the same time.
> 
> As part of your post-flash verify pass, you can then download the image
> which was actually flashed and generate its MD5. Comparing the two
> hashes will tell you if the image was flashed correctly (unless you
> manage to generate a hash collision :-)).
> 
> Simon.
> 

As Don Y, Tim Wescott and myself have suggested, there is something that is 
fundamentally wrong with the installed systems. Rather than designing, 
building and installing thousands of re-flashers they should explore the 
root cause of the problem more thoroughly.

It is obvious that the Flash is being trashed somehow. Finding out what and 
why would be the best use of their time. If they have to change the design 
perhaps they can build in the protection measures to prevent such 
recurrences.

-- 
********************************************************************
Paul E. Bennett IEng MIET.....<email://Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy.............<http://www.hidecs.co.uk>
Mob: +44 (0)7811-639972
Tel: +44 TBA (due to  re-location)
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

On 2015-02-21, Don Y <this@is.not.me.com> wrote:
> Hi Paul,
>
> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>
>> In addition to Don's points, you might ask yourself what happens if the
>> problem you are experiencing happens also to your re-flashing device (as
>> that is likely to have FLASH also). You may end up loading a corrupt image
>> in the wrong locations.
>
> Ha!  I hadn't considered that!  (though if the same soul designed both
> devices, it only stands to reason!)  Rather, I was more concerned (above)
> with the reflasher failing to flash the (original) device due to a problem
> *in* the original device.  E.g., perhaps when "staff" reflash the device,
> they have it powered from a more stable power source, implement more
> robust "tests" that the flash "took", etc.  A "dumb box" could easily
> fail to achieve any of these "differences" leading to a less reliable
> reflash... followed by another crash (perhaps for some *other* reason
> than the original problem!) and another reflash followed by...
>

If you are concerned about that, have the build procedures which generate
the image to be flashed in the first place also generate an MD5 or
similar hash of the generated image at the same time.

As part of your post-flash verify pass, you can then download the image
which was actually flashed and generate its MD5. Comparing the two
hashes will tell you if the image was flashed correctly (unless you
manage to generate a hash collision :-)).
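
A minimal sketch of the idea in Python. The function names and the sample image bytes are made up for illustration; the "readback" values stand in for whatever your programmer or debug interface returns when it downloads the image from the device.

```python
import hashlib


def build_time_hash(image: bytes) -> str:
    """Hash the image at the point where the build produces it."""
    return hashlib.md5(image).hexdigest()


def verify_flashed_image(read_back: bytes, expected_hash: str) -> bool:
    """Hash what was read back from the device and compare with the build-time hash."""
    return hashlib.md5(read_back).hexdigest() == expected_hash


# Hypothetical stand-in for a firmware image produced by the build.
image = bytes(range(256)) * 16
expected = build_time_hash(image)

# One readback that matches the build and one with a single byte altered.
good_readback = image
bad_readback = image[:100] + b"\xff" + image[101:]

print(verify_flashed_image(good_readback, expected))
print(verify_flashed_image(bad_readback, expected))
```

The important part is that the expected hash comes from the build, not from anything the flashing tool computed for itself.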

Simon.
