Reply by Don Y February 24, 2015
Hi Simon,

On 2/24/2015 6:35 AM, Simon Clubley wrote:
> On 2015-02-23, Don Y <this@is.not.me.com> wrote:
>> On 2/22/2015 6:13 PM, Simon Clubley wrote:
>>
>>> I was addressing Don's interesting and specific comment about how do
>>> you detect, in general, a faulty flash image caused by a malfunctioning
>>> reflasher ? I wasn't offering a general suggestion for the OP.
>>>
>>> The beauty of a build time hash is that even if a faulty reflasher
>>> corrupts the in-memory image _before_ burning it, the hash will detect
>>> that but comparing the burnt image against the corrupt in-memory image
>>> will not.
>>
>> Good point! OTOH, do you build a re-reflasher to verify the hash stored
>> in the reflasher hasn't been corrupted? I.e., reflasher's hash gets
>> mangled. It CORRECTLY reflashes the device in question. Then, computes
>> the hash of that image (from/in the device) and notices that it is
>> not in agreement with the stored hash -- so, it (erroneously) decides
>> the reflash didn't "take" and repeats the process... :-/
>>
>> (which, of course, will *still* fail -- because the *hash* is corrupt!)
>
> :-)
>
> I learnt a long time ago that not every problem can be solved by
> technical means; sometimes a technical solution becomes a management
> solution instead.
>
> In this hypothetical case, the build time hash has allowed it to be
> established that either the image or the hash itself is getting
> corrupted by the reflasher. In either case, the end result is the
> same - the reflasher is faulty and cannot be trusted.
Yes -- but notice how we're now talking about a problem with a REFLASHER! The *original* problem is hiding (unsolved) behind a (potentially) newly created one! :-/

(OP) Understand the problem first. It *may* be that the most practical solution ends up being a reflasher (ick!). E.g., Hubble's defective mirror was best solved as it was -- instead of *replacing* the entire mirror (which would have been the "ideal" solution). But, know *why* this solution is the best instead of just throwing it up as a quick fix!

[I'm off to one of my pro bono gigs...]
Reply by Simon Clubley February 24, 2015
On 2015-02-23, Don Y <this@is.not.me.com> wrote:
> Hi Simon,
>
> On 2/22/2015 6:13 PM, Simon Clubley wrote:
>
>> I was addressing Don's interesting and specific comment about how do
>> you detect, in general, a faulty flash image caused by a malfunctioning
>> reflasher ? I wasn't offering a general suggestion for the OP.
>>
>> The beauty of a build time hash is that even if a faulty reflasher
>> corrupts the in-memory image _before_ burning it, the hash will detect
>> that but comparing the burnt image against the corrupt in-memory image
>> will not.
>
> Good point! OTOH, do you build a re-reflasher to verify the hash stored
> in the reflasher hasn't been corrupted? I.e., reflasher's hash gets
> mangled. It CORRECTLY reflashes the device in question. Then, computes
> the hash of that image (from/in the device) and notices that it is
> not in agreement with the stored hash -- so, it (erroneously) decides
> the reflash didn't "take" and repeats the process... :-/
>
> (which, of course, will *still* fail -- because the *hash* is corrupt!)
>
:-)

I learnt a long time ago that not every problem can be solved by technical means; sometimes a technical solution becomes a management solution instead.

In this hypothetical case, the build time hash has allowed it to be established that either the image or the hash itself is getting corrupted by the reflasher. In either case, the end result is the same - the reflasher is faulty and cannot be trusted.

At this point, the reflasher should be pulled out of service and dumped on the bench of whoever created it. This person should be told "this reflasher is faulty and this hash is the proof. Fix it."

If they still can't do that then that's when you either go to their manager with your hash proof and/or put a quote for your design services on their desk. :-)
>> However, based on the thread so far, I agree the OP has a more basic
>> problem which is the real cause and is the one which needs solving.
>
> I think the OP hasn't even (clearly) identified the *symptoms*,
> let alone the *problem*! (i.e., *is* the image intact or not?
> if it *is*, then why are you reflashing it??)
Indeed.

And just to repeat this; I am not suggesting the OP go down the reflasher route. I am just thinking about how to detect/solve the specific question Don posed.

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
Reply by Don Y February 23, 2015
Hi Simon,

On 2/22/2015 6:13 PM, Simon Clubley wrote:

> I was addressing Don's interesting and specific comment about how do
> you detect, in general, a faulty flash image caused by a malfunctioning
> reflasher ? I wasn't offering a general suggestion for the OP.
>
> The beauty of a build time hash is that even if a faulty reflasher
> corrupts the in-memory image _before_ burning it, the hash will detect
> that but comparing the burnt image against the corrupt in-memory image
> will not.
Good point! OTOH, do you build a re-reflasher to verify the hash stored in the reflasher hasn't been corrupted? I.e., reflasher's hash gets mangled. It CORRECTLY reflashes the device in question. Then, computes the hash of that image (from/in the device) and notices that it is not in agreement with the stored hash -- so, it (erroneously) decides the reflash didn't "take" and repeats the process... :-/

(which, of course, will *still* fail -- because the *hash* is corrupt!)
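To make the failure mode above concrete, here is a minimal sketch of a flash-and-verify loop that at least stops retrying after a fixed number of mismatches. Everything in it is an assumption for illustration: `FakeDevice`, `reflash`, the `MAX_ATTEMPTS` limit and the use of MD5 are hypothetical and not taken from anyone's actual reflasher. It does not solve the corrupt-hash problem; it only turns an endless retry into a report that the reflasher itself is suspect.

```python
import hashlib

MAX_ATTEMPTS = 3  # assumed limit; pick whatever suits the product


class FakeDevice:
    """Hypothetical target whose flash always takes the write correctly."""

    def __init__(self):
        self.contents = b""
        self.writes = 0

    def flash(self, image: bytes) -> None:
        self.contents = bytes(image)
        self.writes += 1

    def read_back(self) -> bytes:
        return self.contents


def reflash(device: FakeDevice, image: bytes, stored_hash: str) -> str:
    """Flash and verify, giving up after MAX_ATTEMPTS mismatches."""
    for _ in range(MAX_ATTEMPTS):
        device.flash(image)
        if hashlib.md5(device.read_back()).hexdigest() == stored_hash:
            return "ok"
    # Repeated mismatches cannot tell a bad image from a bad stored hash;
    # either way the reflasher is the suspect.
    return "reflasher-suspect"


image = b"firmware image stand-in"
good_hash = hashlib.md5(image).hexdigest()
mangled_hash = "0" * 32   # stored hash corrupted inside the reflasher

print(reflash(FakeDevice(), image, good_hash))
print(reflash(FakeDevice(), image, mangled_hash))
```

With the mangled hash the device is written correctly every time, yet verification never passes, which is exactly the scenario described.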
> However, based on the thread so far, I agree the OP has a more basic
> problem which is the real cause and is the one which needs solving.
I think the OP hasn't even (clearly) identified the *symptoms*, let alone the *problem*! (i.e., *is* the image intact or not? if it *is*, then why are you reflashing it??)
Reply by Jack February 23, 2015
Tim Wescott <seemywebsite@myfooter.really> wrote:

> Can you set protect bits on the flash, either permanently or (assuming
> that you have to re-program from time to time) unlockable?
>
> It sounds like you're allowing the processor to write to program memory,
> which is just wrong. If you have valid flash writes (i.e., if you have
> program and non-volatile data in flash), consider hard-coding the flash
> write routines to fail if they're told to write someplace they're not
> supposed to.
and also do some check on the non-volatile data in flash in case it becomes corrupt...

Bye Jack

-- 
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?
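As a rough model of the two suggestions above (write routines that refuse out-of-range addresses, plus a check on the stored data), here is a short Python sketch. It is not firmware: the flash is a `bytearray`, and the region bounds, `guarded_write`, `store_record` and `load_record` are all hypothetical names and values chosen for illustration. CRC-32 is used as one possible check; nothing in the thread says which check was meant.

```python
import zlib

FLASH_SIZE = 0x1000
DATA_START, DATA_END = 0x0C00, 0x1000   # assumed region reserved for non-volatile data

flash = bytearray(b"\xff" * FLASH_SIZE)  # erased flash reads as 0xFF


def guarded_write(addr: int, data: bytes) -> bool:
    """Refuse any write that is not entirely inside the data region."""
    if addr < DATA_START or addr + len(data) > DATA_END:
        return False
    flash[addr:addr + len(data)] = data
    return True


def store_record(addr: int, payload: bytes) -> bool:
    """Write payload followed by its CRC-32 (4 bytes, little-endian)."""
    crc = zlib.crc32(payload).to_bytes(4, "little")
    return guarded_write(addr, payload + crc)


def load_record(addr: int, length: int):
    """Return the payload, or None if the stored CRC does not match."""
    payload = bytes(flash[addr:addr + length])
    crc = int.from_bytes(flash[addr + length:addr + length + 4], "little")
    return payload if zlib.crc32(payload) == crc else None
```

A write aimed at the program area (for example `guarded_write(0x0000, b"\x00")`) is rejected and leaves the array untouched, while a record whose bytes are later altered comes back from `load_record` as `None` instead of as silently bad data.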
Reply by Simon Clubley February 22, 2015
On 2015-02-22, Don Y <this@is.not.me.com> wrote:
> Hi Simon,
>
> On 2/22/2015 6:26 AM, Simon Clubley wrote:
>>
>> If you are concerned about that, have the build procedures which generate
>> the image to be flashed in the first place also generate a MD5 or
>> similar hash of the generated image at the same time.
>>
>> As part of your post-flash verify pass, you can then download the image
>> which was actually flashed and generate it's MD5. Comparing the two
>> hashes will tell you if the image was flashed correctly (unless you
>> manage to generate a hash collision :-)).
>
> I'm not sure that would give a conclusive result.
>
> First, the OP hasn't confirmed that the image even *appears* to have
> been corrupted (i.e., altered). All he's said is that reflashing FIXES
> the "problem". I.e., he is (apparently) assuming that the flash has
> been corrupted -- as that is what reflashing *purports* to "fix".
>
Hello Don (and Paul),

I was addressing Don's interesting and specific comment about how do you detect, in general, a faulty flash image caused by a malfunctioning reflasher? I wasn't offering a general suggestion for the OP.

The beauty of a build time hash is that even if a faulty reflasher corrupts the in-memory image _before_ burning it, the hash will detect that but comparing the burnt image against the corrupt in-memory image will not.

However, based on the thread so far, I agree the OP has a more basic problem which is the real cause and is the one which needs solving.

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
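The difference between the two checks can be shown with a small simulation. This is only a sketch under assumptions: `faulty_load` is a made-up bug that flips one bit while the image is loaded into RAM, and `burn` stands in for a flash write that itself works correctly.

```python
import hashlib


def faulty_load(image: bytes) -> bytes:
    """Hypothetical reflasher bug: flips one bit while loading the image into RAM."""
    buf = bytearray(image)
    buf[10] ^= 0x01
    return bytes(buf)


def burn(in_memory: bytes) -> bytes:
    """Stand-in for the flash write; returns what would later be read back."""
    return bytes(in_memory)


image = bytes(range(64))                      # stand-in for the built image
build_hash = hashlib.md5(image).hexdigest()   # produced by the build, not the reflasher

in_memory = faulty_load(image)
readback = burn(in_memory)

compare_ok = readback == in_memory                           # burnt vs. in-memory copy
hash_ok = hashlib.md5(readback).hexdigest() == build_hash    # burnt vs. build-time hash
print(compare_ok, hash_ok)
```

The read-back comparison passes because both sides carry the same flipped bit; only the hash computed at build time, before the reflasher ever touched the image, exposes the corruption.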
Reply by Don Y February 22, 2015
Hi Paul,

On 2/22/2015 7:10 AM, Paul E Bennett wrote:
> As Don Y, Tim Wescott and myself have suggested, there is something that is
> fundamentally wrong with the installed systems. Rather than designing,
> building and installing thousands of re-flashers they should explore the
> root cause of the problem more thoroughly.
>
> It is obvious that the Flash is being trashed somehow. Finding out what and
> why would be the best use of their time. If they have to change the design
> perhaps they can build in the protection measures to prevent such
> recurrences.
As I said to Simon (upthread), I am not convinced that the flash HAS been trashed! The only "evidence" may be entirely coincidental. That's why I'd like to hear (from the OP) what he did to verify the flash's integrity (or, lack thereof).

"Reinstall Windows" is just too simplistic an approach to a problem (and, like most of those cases where "windows was reinstalled", it often doesn't prevent the problem from re-occurring! Because the PROBLEM hasn't been identified and solved).

Scientific method: construct a hypothesis; then construct an experiment (test) to validate or invalidate that hypothesis. *THEN*, come to conclusions (or, a refined hypothesis).

OP seems to have just found something that APPEARS to work (? no idea how WELL!) and settled on that.

Sunday lunch: Finestkind!
Reply by Don Y February 22, 2015
Hi Simon,

On 2/22/2015 6:26 AM, Simon Clubley wrote:
> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>> Hi Paul,
>>
>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>
>>> In addition to Don's points, you might ask yourself what happens if the
>>> problem you are experiencing happens also to your re-flashing device (as
>>> that is likely to have FLASH also). You may end up loading a corrupt image
>>> in the wrong locations.
>>
>> Ha! I hadn't considered that! (though if the same soul designed both
>> devices, it only stands to reason!) Rather, I was more concerned (above)
>> with the reflasher failing to flash the (original) device due to a problem
>> *in* the original device. E.g., perhaps when "staff" reflash the device,
>> they have it powered from a more stable power source, implement more
>> robust "tests" that the flash "took", etc. A "dumb box" could easily
>> fail to achieve any of these "differences" leading to a less reliable
>> reflash... followed by another crash (perhaps for some *other* reason
>> than the original problem!) and another reflash followed by...
>
> If you are concerned about that, have the build procedures which generate
> the image to be flashed in the first place also generate a MD5 or
> similar hash of the generated image at the same time.
>
> As part of your post-flash verify pass, you can then download the image
> which was actually flashed and generate it's MD5. Comparing the two
> hashes will tell you if the image was flashed correctly (unless you
> manage to generate a hash collision :-)).
I'm not sure that would give a conclusive result.

First, the OP hasn't confirmed that the image even *appears* to have been corrupted (i.e., altered). All he's said is that reflashing FIXES the "problem". I.e., he is (apparently) assuming that the flash has been corrupted -- as that is what reflashing *purports* to "fix".

There may, indeed, be something (?) that has happened to the system that his reflashing ACTIVITY/procedure is "fixing" OTHER THAN "CORRECTING" THE CONTENTS OF THE FLASH.

E.g., imagine a device that is powered *on* 24/7/365 and only has power cycled as a side-effect of the reflashing process. The contents of the flash may, in fact, be intact and it is the cycling of power that is "fixing" the ACTUAL problem.

[I am not claiming this is the case. Rather, indicating that the OP's "diagnosis" is unsubstantiated: is the firmware image ACTUALLY corrupt? *How*/where? Do all afflicted devices exhibit the same problem in the same *way*/place? etc.]

Second, how you obtain that checksum/hash -- even a literal byte-by-byte comparison -- may not reflect the operating conditions of the device in its failed state. E.g., using JTAG to pull the bytes from the device will obviously *not* occur at "opcode-fetch speed". Nor will the memory access patterns mimic those that occur in normal operation. Etc.

The OP first needs to prove to himself that reflashing *could* be a remedy -- by indicating that the contents HAVE, in fact, been altered between the time the device was manufactured and the time the "crash" (and proposed reflash) occurred.

E.g., imagine examining the flash's contents and finding it *intact*! Yet, still noting that the reflash "fixes" the problem! This poses a different problem than finding the contents have been *altered*...

While the OP may, in fact, have done these things, I'm just asking for confirmation and an elaboration as to *how* he came to the conclusion that a reflasher "makes sense" (even as a PTF).

It's sort of like someone who "debugs" code by making "arbitrary" changes and waiting to DISCOVER which of them (appears to) yield the correct results. While you *may* find a change that appears to work, unless you can PROVE that it *should* work (by understanding the real problem), you may have just CHANGED the problem...
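One way to answer the "is the image actually corrupt, and where?" question is to dump the flash from a failed unit and compare it byte for byte with the image that was built. The sketch below assumes both are already available as byte strings; `diff_offsets` and the sample data are hypothetical. As noted above, a dump pulled over JTAG is read under different conditions than normal opcode fetches, so a clean result here shows the stored bits are intact, not that the device reads them correctly at speed.

```python
def diff_offsets(reference: bytes, dump: bytes) -> list:
    """Offsets where the dump differs from the reference build image."""
    if len(reference) != len(dump):
        raise ValueError("images differ in length")
    return [i for i, (a, b) in enumerate(zip(reference, dump)) if a != b]


reference = bytes(range(32))          # stand-in for the image produced by the build

intact_dump = bytes(reference)        # failed unit whose flash is unchanged
altered = bytearray(reference)
altered[5] ^= 0xFF                    # two bytes altered in the field
altered[17] ^= 0x01
altered_dump = bytes(altered)

print(diff_offsets(reference, intact_dump))
print(diff_offsets(reference, altered_dump))
```

An empty list from a unit that still "needed" a reflash points away from flash corruption; a list of offsets that repeats across afflicted units says something about how and where the contents are being altered.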
Reply by Tauno Voipio February 22, 2015
On 22.2.15 16:10, Paul E Bennett wrote:
> Simon Clubley wrote:
>
>> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>>> Hi Paul,
>>>
>>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>>
>>>> In addition to Don's points, you might ask yourself what happens if the
>>>> problem you are experiencing happens also to your re-flashing device (as
>>>> that is likely to have FLASH also). You may end up loading a corrupt
>>>> image in the wrong locations.
>>>
>>> Ha! I hadn't considered that! (though if the same soul designed both
>>> devices, it only stands to reason!) Rather, I was more concerned (above)
>>> with the reflasher failing to flash the (original) device due to a
>>> problem
>>> *in* the original device. E.g., perhaps when "staff" reflash the device,
>>> they have it powered from a more stable power source, implement more
>>> robust "tests" that the flash "took", etc. A "dumb box" could easily
>>> fail to achieve any of these "differences" leading to a less reliable
>>> reflash... followed by another crash (perhaps for some *other* reason
>>> than the original problem!) and another reflash followed by...
>>>
>>
>> If you are concerned about that, have the build procedures which generate
>> the image to be flashed in the first place also generate a MD5 or
>> similar hash of the generated image at the same time.
>>
>> As part of your post-flash verify pass, you can then download the image
>> which was actually flashed and generate it's MD5. Comparing the two
>> hashes will tell you if the image was flashed correctly (unless you
>> manage to generate a hash collision :-)).
>>
>> Simon.
>>
>
> As Don Y, Tim Wescott and myself have suggested, there is something that is
> fundamentally wrong with the installed systems. Rather than designing,
> building and installing thousands of re-flashers they should explore the
> root cause of the problem more thoroughly.
>
> It is obvious that the Flash is being trashed somehow. Finding out what and
> why would be the best use of their time. If they have to change the design
> perhaps they can build in the protection measures to prevent such
> recurrences.
A brownout detector reset chip could be a good investment.

-- 
-TV
Reply by Paul E Bennett February 22, 2015
Simon Clubley wrote:

> On 2015-02-21, Don Y <this@is.not.me.com> wrote:
>> Hi Paul,
>>
>> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>>
>>> In addition to Don's points, you might ask yourself what happens if the
>>> problem you are experiencing happens also to your re-flashing device (as
>>> that is likely to have FLASH also). You may end up loading a corrupt
>>> image in the wrong locations.
>>
>> Ha! I hadn't considered that! (though if the same soul designed both
>> devices, it only stands to reason!) Rather, I was more concerned (above)
>> with the reflasher failing to flash the (original) device due to a
>> problem
>> *in* the original device. E.g., perhaps when "staff" reflash the device,
>> they have it powered from a more stable power source, implement more
>> robust "tests" that the flash "took", etc. A "dumb box" could easily
>> fail to achieve any of these "differences" leading to a less reliable
>> reflash... followed by another crash (perhaps for some *other* reason
>> than the original problem!) and another reflash followed by...
>>
>
> If you are concerned about that, have the build procedures which generate
> the image to be flashed in the first place also generate a MD5 or
> similar hash of the generated image at the same time.
>
> As part of your post-flash verify pass, you can then download the image
> which was actually flashed and generate it's MD5. Comparing the two
> hashes will tell you if the image was flashed correctly (unless you
> manage to generate a hash collision :-)).
>
> Simon.
>
As Don Y, Tim Wescott and myself have suggested, there is something that is fundamentally wrong with the installed systems. Rather than designing, building and installing thousands of re-flashers they should explore the root cause of the problem more thoroughly.

It is obvious that the Flash is being trashed somehow. Finding out what and why would be the best use of their time. If they have to change the design perhaps they can build in the protection measures to prevent such recurrences.

-- 
********************************************************************
Paul E. Bennett IEng MIET.....<email://Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy.............<http://www.hidecs.co.uk>
Mob: +44 (0)7811-639972
Tel: +44 TBA (due to re-location)
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Reply by Simon Clubley February 22, 2015
On 2015-02-21, Don Y <this@is.not.me.com> wrote:
> Hi Paul,
>
> On 2/21/2015 1:23 AM, Paul E Bennett wrote:
>>
>> In addition to Don's points, you might ask yourself what happens if the
>> problem you are experiencing happens also to your re-flashing device (as
>> that is likely to have FLASH also). You may end up loading a corrupt image
>> in the wrong locations.
>
> Ha! I hadn't considered that! (though if the same soul designed both
> devices, it only stands to reason!) Rather, I was more concerned (above)
> with the reflasher failing to flash the (original) device due to a problem
> *in* the original device. E.g., perhaps when "staff" reflash the device,
> they have it powered from a more stable power source, implement more
> robust "tests" that the flash "took", etc. A "dumb box" could easily
> fail to achieve any of these "differences" leading to a less reliable
> reflash... followed by another crash (perhaps for some *other* reason
> than the original problem!) and another reflash followed by...
>
If you are concerned about that, have the build procedures which generate the image to be flashed in the first place also generate an MD5 or similar hash of the generated image at the same time.

As part of your post-flash verify pass, you can then download the image which was actually flashed and generate its MD5. Comparing the two hashes will tell you if the image was flashed correctly (unless you manage to generate a hash collision :-)).

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
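A minimal sketch of that two-step check, using Python's standard `hashlib`. The helper names (`image_digest`, `verify_readback`) and the sample image are assumptions for illustration; how the image is actually read back from the device depends on the hardware and is not shown.

```python
import hashlib


def image_digest(image: bytes) -> str:
    """Return the MD5 hex digest of a firmware image."""
    return hashlib.md5(image).hexdigest()


def verify_readback(build_digest: str, readback: bytes) -> bool:
    """True if the image read back from the device matches the build-time digest."""
    return image_digest(readback) == build_digest


# Build step: hash the image once, when it is generated.
firmware = bytes(range(256)) * 4          # stand-in for the real image
build_digest = image_digest(firmware)

# Post-flash verify: hash whatever was read back from the device.
good_readback = firmware
bad_readback = firmware[:100] + b"\x00" + firmware[101:]   # one altered byte

print(verify_readback(build_digest, good_readback))
print(verify_readback(build_digest, bad_readback))
```

MD5 is used here only because it is the hash named above; for detecting accidental corruption any reasonable digest or CRC would serve, and the important part is that the reference value comes from the build rather than from the reflasher.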