Thoughts on developing an auto re-flasher

Started by Geato February 20, 2015
Hi,

New member here so hopefully I am in the correct group.

We have many controller boards in the field running the NXP LPC1769 processor.
Randomly, maybe a year or even two years down the road, the processor crashes
and re-flashing the firmware brings it back to life.  We know it is a
transient doing this, but while we are chasing that problem, a stopgap fix
would be to come up with an auto re-flasher of sorts.

I am thinking of a small PCB that plugs onto the existing JTAG connector
and has a firmware image stored on a uSD card.  Something (non-processor,
hopefully), perhaps a CPLD, powers up the uSD card and transfers the image
to the NXP.  The hardware watchdog will initiate the transfer.

Does this sound like it is possible to do at a high level?  I am trying to
minimize the reliance on additional firmware like bootloaders or standalone
JTAG programmers.  The latter is physically too big and pricey as I would
need about a thousand of these.

Cheers....
On Friday, February 20, 2015 at 9:18:51 PM UTC+1, Geato wrote:
> [original post quoted in full; snipped]
Can't you get to the UART and boot pin so you can use the built-in bootloader?

Many years ago I booted and flashed an ARM7 via JTAG for a test system; it was basically some parallel-port JTAG code from a PC app, ported to an MCU. Talking JTAG isn't complicated; figuring out what to tell the chip can be. But if you can figure that out, any old MCU with enough flash should be able to do what you want.

-Lasse
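[A rough idea of what Lasse's suggestion looks like in practice, as a sketch only: a small companion MCU wired to the LPC1769's reset line, ISP entry pin (P2.10) and ISP UART (UART0) could force the ROM bootloader and drive the serial ISP protocol described in NXP's UM10360. The GPIO/UART helper functions and pin numbers below are hypothetical placeholders, and the actual erase/write commands (and their payload encoding) are omitted.]

/*
 * Sketch only: forcing an LPC1769 into its ROM UART ISP bootloader from a
 * small companion MCU, then starting the ISP handshake.  The helpers
 * gpio_drive(), gpio_release(), uart_write_line(), uart_expect() and
 * delay_ms() are hypothetical -- substitute whatever your companion MCU
 * provides.  See UM10360 for the full ISP command set.
 */
#include <stdbool.h>

extern void gpio_drive(int pin, int level);      /* drive a pin high/low   */
extern void gpio_release(int pin);               /* tri-state a pin        */
extern void uart_write_line(const char *s);      /* send s + CR/LF         */
extern bool uart_expect(const char *s, int ms);  /* wait for a reply line  */
extern void delay_ms(int ms);

#define PIN_ISP    1   /* wired to LPC1769 P2.10 (ISP entry pin)  */
#define PIN_RESET  2   /* wired to LPC1769 nRESET                 */

/* Hold P2.10 low through a reset pulse so the boot ROM enters ISP mode. */
static void enter_isp_mode(void)
{
    gpio_drive(PIN_ISP, 0);
    gpio_drive(PIN_RESET, 0);
    delay_ms(10);
    gpio_release(PIN_RESET);   /* boot ROM samples P2.10 low -> ISP */
    delay_ms(10);
    gpio_release(PIN_ISP);
}

/* Auto-baud + unlock; after this the P/E/W/C commands can erase and
 * program flash (omitted here, along with the payload encoding the ROM
 * expects for the write command). */
static bool isp_sync(void)
{
    enter_isp_mode();
    uart_write_line("?");                       /* auto-baud character        */
    if (!uart_expect("Synchronized", 500)) return false;
    uart_write_line("Synchronized");
    if (!uart_expect("OK", 500)) return false;
    uart_write_line("12000");                   /* crystal in kHz (12 MHz assumed) */
    if (!uart_expect("OK", 500)) return false;
    uart_write_line("U 23130");                 /* unlock flash commands      */
    return uart_expect("0", 500);               /* "0" = CMD_SUCCESS          */
}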
On 2/20/2015 1:18 PM, Geato wrote:

> We have many controller boards in the field running the NXP LPC1769 processor. Randomly, maybe a year or even two years down the road, the processor crashes and re-flashing the firmware brings it back to life.
Have you verified that the old/existing firmware image *is* corrupted at this point? I.e., that the device won't restart "normally" after such a crash (cycle power)? As your existing system (apparently) can't self-flash, you must be dispatching "staff" to perform this reflash? Do they do anything besides blindly reflashing the device? Is it cheaper/easier to just *replace* defective devices (which gives you a chance to do a post-mortem on the device(s) that have failed)? (I.e., this sounds like the "reinstall Windows" solution-to-all-problems)
> We know it is a transient doing this, but while we are chasing that problem, a stopgap fix would be to come up with an auto re-flasher of sorts.
<frown>
> I am thinking of a small PCB that plugs onto the existing JTAG connector and has a firmware image stored on a uSD card. Something (non-processor, hopefully), perhaps a CPLD, powers up the uSD card and transfers the image to the NXP. The hardware watchdog will initiate the transfer.
So, you assume the ONLY time the watchdog kicks in is when this "crash" happens? I.e., there are NEVER cases where the watchdog kicks in, resets the processor and execution resumes CORRECTLY (without needing a reflash)? Do you track "reset" events anywhere so you can determine *if* this is the case?

Does your run-time *ever* attempt to write to that flash in normal operation?

Is power cycled often/frequently in your environment?

Is anyone tracking the frequency of these crashes in your deployed population so you can begin to identify if there is a common pattern (power-on hours, power cycles, manufacturing date code, etc.)? I.e., can you *predict* when this event is likely to happen (or NOT happen)? Does reflashing cause the device to be "reliable" for "another 2 years"? Or, once reflashed, do crashes occur with greater frequency?

What are the consequences to the user/application when this crash occurs?
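[As an aside on the reset-tracking question: the LPC17xx latches the cause of the last reset in its RSID register, so even the existing firmware can record whether a watchdog, brown-out or power cycle preceded a "crash". A minimal sketch, assuming the CMSIS LPC17xx.h register definitions and a hypothetical log_reset_cause() hook into whatever logging the application already has.]

/*
 * Sketch only: recording why the LPC1769 last reset, so watchdog resets can
 * be told apart from power-on/brown-out resets.
 */
#include "LPC17xx.h"

extern void log_reset_cause(const char *cause);   /* hypothetical logging hook */

void record_reset_source(void)
{
    uint32_t rsid = LPC_SC->RSID;      /* Reset Source Identification Register */

    if (rsid & (1u << 2)) log_reset_cause("watchdog");
    if (rsid & (1u << 3)) log_reset_cause("brown-out");
    if (rsid & (1u << 1)) log_reset_cause("external reset pin");
    if (rsid & (1u << 0)) log_reset_cause("power-on");

    LPC_SC->RSID = rsid;               /* write ones back to clear the flags */
}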
> Does this sound like it is possible to do at a high level? I am trying to minimize the reliance on additional firmware like bootloaders or standalone JTAG programmers. The latter is physically too big and pricey as I would need about a thousand of these.
The "right" solution is to figure out what the actual cause is. If "can't happen" *is* happening, then some assumption has been violated (which can result in *other* problems that haven't yet been visible). What will you do if (when?) a device just sits in a tight crash-reflash loop, indefinitely? Will the user be able to determine that this is actually happening (big red light)? Will *you* be able to determine how often any particular "reflasher" has been triggered? I.e., are you sure your fix won't just *change* the problem's manifestation?
Don Y wrote:

> [Don Y's reply quoted in full; snipped]
In addition to Don's points, you might ask yourself what happens if the problem you are experiencing also happens to your re-flashing device (as that is likely to have flash as well). You may end up loading a corrupt image into the wrong locations.

You really need to understand the problem in much better detail. Does the unit design have vulnerabilities to electrical noise, brown-outs, RF interference, high-energy transients, or higher-frequency interrupts than it can deal with?

I am not sure how much protection Don puts in his circuitry, but I expend quite some effort to make sure that the processors in my products are well protected from a whole raft of transient interference. I also have checking in place to know when I am facing problems and need to report the fact. My systems are usually expected to run a couple of decades with little or no maintenance effort in high-dependability applications.

So, back to Don's point: have you done an analysis of the failures that lead to the perceived need for re-flashing? Have you traced the impetus for those failures? You might want to discuss the problem with NXP as well.

--
Paul E. Bennett IEng MIET <Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy, http://www.hidecs.co.uk
Hi Paul,

On 2/21/2015 1:23 AM, Paul E Bennett wrote:
> Don Y wrote:
>> What will you do if (when?) a device just sits in a tight crash-reflash >> loop, indefinitely? Will the user be able to determine that this is >> actually happening (big red light)? Will *you* be able to determine how >> often any particular "reflasher" has been triggered? I.e., are you >> sure your fix won't just *change* the problem's manifestation? > > In addition to Don's points, you might ask yourself what happens if the > problem you are experiencing happens also to your re-flashing device (as > that is likely to have FLASH also). You may end up loading a corrupt image > in the wrong locations.
Ha! I hadn't considered that! (Though if the same soul designed both devices, it only stands to reason!)

Rather, I was more concerned (above) with the reflasher failing to flash the (original) device due to a problem *in* the original device. E.g., perhaps when "staff" reflash the device, they have it powered from a more stable power source, implement more robust "tests" that the flash "took", etc. A "dumb box" could easily fail to achieve any of these "differences", leading to a less reliable reflash... followed by another crash (perhaps for some *other* reason than the original problem!) and another reflash, followed by...
> You really need to understand the problem in much better detail. Does the unit design have vulnerabilities to electrical noise, brown-outs, RF interference, high-energy transients, or higher-frequency interrupts than it can deal with?
When "can't happen" *does*, you really need to step back and figure out what's wrong with your assumptions. Have you overlooked something? Has something *changed* unexpectedly?? Do you even *know* what your assumptions *are*? Dismissing these sorts of events as "flukes" is a sign of poor engineeering (when do you begin to consider a "fluke" a "genuine bug" to be acted upon??)
> I am not sure how much protection Don puts in his circuitry, but I expend quite some effort to make sure that the processors in my products are well protected from a whole raft of transient interference. I also have checking in place to know when I am facing problems and need to report the fact. My systems are usually expected to run a couple of decades with little or no maintenance effort in high-dependability applications.
It puzzles me that ALL devices don't have black boxes /de rigueur/. Even *volatile* implementations are very feasible and invaluable (IMO) for these sorts of situations! It's not like it's an "expensive" mechanism (in development, time *or* space).
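[A minimal sketch of such a volatile "black box": a small event log kept in a RAM section the C runtime does not zero, so it survives watchdog/soft resets though not power loss. Assumes GCC and a ".noinit" section defined in the linker script; the event codes are whatever the application chooses to record.]

/*
 * Sketch only: a tiny "black box" ring buffer in non-initialised RAM.
 */
#include <stdint.h>

#define BB_MAGIC   0xB1ACB0C5u
#define BB_ENTRIES 32

struct blackbox {
    uint32_t magic;                 /* detects never-initialised RAM     */
    uint32_t next;                  /* ring-buffer write index           */
    uint32_t event[BB_ENTRIES];     /* event codes (reset cause, etc.)   */
};

/* Placed in a section the startup code does not zero or initialise. */
static struct blackbox bb __attribute__((section(".noinit")));

void blackbox_init(void)
{
    if (bb.magic != BB_MAGIC) {     /* first power-up: RAM is garbage */
        bb.magic = BB_MAGIC;
        bb.next  = 0;
        for (int i = 0; i < BB_ENTRIES; i++)
            bb.event[i] = 0;
    }
}

void blackbox_log(uint32_t event_code)
{
    bb.event[bb.next % BB_ENTRIES] = event_code;
    bb.next++;
}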
> So, back to Don's point: have you done an analysis of the failures that lead to the perceived need for re-flashing? Have you traced the impetus for those failures? You might want to discuss the problem with NXP as well.
The OP seems to have decided a Band-Aid is the quickest way to "solve" this problem. That seems unlikely (though we've not seen all the particulars re: his design/application/environment).

Ask oneself: what *should* I do differently to ensure the NEXT design doesn't suffer from the same problem? I suspect the "right" answer is NOT "design a reflasher in with the INITIAL design!" And, as you've said, "what do I do when the reflasher fails?"
On Fri, 20 Feb 2015 14:18:47 -0600, Geato wrote:

> Hi,
>
> New member here so hopefully I am in the correct group.
>
> We have many controller boards in the field running the NXP LPC1769 processor. Randomly, maybe a year or even two years down the road, the processor crashes and re-flashing the firmware brings it back to life. We know it is a transient doing this, but while we are chasing that problem, a stopgap fix would be to come up with an auto re-flasher of sorts.
<snip>

Can you set protect bits on the flash, either permanently or (assuming that you have to re-program from time to time) unlockable?

It sounds like you're allowing the processor to write to program memory, which is just wrong. If you have valid flash writes (i.e., if you have program and non-volatile data in flash), consider hard-coding the flash write routines to fail if they're told to write someplace they're not supposed to.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
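[On the LPC1769, application code programs its own flash through the ROM IAP calls, so Tim's "fail if told to write someplace they're not supposed to" can be enforced in a single wrapper. A sketch only: the IAP entry address and command codes are as documented in UM10360, while the region boundaries, sector handling and clock value are illustrative assumptions that must match the real memory map.]

/*
 * Sketch only: a guard wrapper around the LPC17xx ROM IAP "copy RAM to
 * flash" call that refuses to touch the application's own code region.
 * Real code must also run with interrupts disabled and must not execute
 * from the flash bank being written.
 */
#include <stdint.h>

#define IAP_LOCATION        0x1FFF1FF1u      /* ROM IAP entry point (UM10360) */
#define IAP_PREPARE_SECTORS 50u
#define IAP_COPY_RAM_FLASH  51u
#define IAP_CMD_SUCCESS     0u

#define APP_CODE_START      0x00000000u      /* application image ...          */
#define APP_CODE_END        0x00070000u      /* ... assumed to end here        */
#define CCLK_KHZ            100000u          /* assumed 100 MHz core clock     */

typedef void (*iap_fn)(uint32_t cmd[5], uint32_t result[5]);

/* Returns the IAP status code, or a private code if the address is off-limits. */
uint32_t guarded_flash_write(uint32_t dst, const void *src, uint32_t len,
                             uint32_t sector)
{
    uint32_t cmd[5], result[5] = {0};
    iap_fn iap = (iap_fn)IAP_LOCATION;

    /* Refuse anything that overlaps the running application's code. */
    if (dst < APP_CODE_END && dst + len > APP_CODE_START)
        return 0xDEADu;                      /* private "forbidden" code */

    cmd[0] = IAP_PREPARE_SECTORS;
    cmd[1] = sector;
    cmd[2] = sector;
    iap(cmd, result);
    if (result[0] != IAP_CMD_SUCCESS)
        return result[0];

    cmd[0] = IAP_COPY_RAM_FLASH;
    cmd[1] = dst;
    cmd[2] = (uint32_t)src;                  /* word-aligned RAM source       */
    cmd[3] = len;                            /* 256/512/1024/4096 bytes       */
    cmd[4] = CCLK_KHZ;
    iap(cmd, result);
    return result[0];
}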
On 2015-02-21, Don Y <this@is.not.me.com> wrote:
> [earlier posts quoted in full; snipped]
If you are concerned about that, have the build procedure which generates the image to be flashed also generate an MD5 or similar hash of the image at the same time.

As part of your post-flash verify pass, you can then download the image which was actually flashed and generate its MD5. Comparing the two hashes will tell you if the image was flashed correctly (unless you manage to generate a hash collision :-)).

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
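[A sketch of the verify step Simon describes, with CRC-32 standing in for MD5 so the example stays self-contained. "golden" is the image the build produced and "readback" is what was read out of the target after flashing; how those buffers are obtained is left to the tooling.]

/*
 * Sketch only: compare a post-flash readback against the build's golden image.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), bit-by-bit for brevity. */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* True if the flashed contents match what the build produced. */
bool image_verified(const uint8_t *golden, const uint8_t *readback, size_t len)
{
    return crc32(golden, len) == crc32(readback, len);
}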
Simon Clubley wrote:

> [earlier posts quoted in full; snipped]
As Don Y, Tim Wescott and I have suggested, there is something fundamentally wrong with the installed systems. Rather than designing, building and installing thousands of re-flashers, they should explore the root cause of the problem more thoroughly.

It is obvious that the flash is being trashed somehow. Finding out what and why would be the best use of their time. If they have to change the design, perhaps they can build in protection measures to prevent such recurrences.

--
Paul E. Bennett IEng MIET <Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy, http://www.hidecs.co.uk
On 22.2.15 16:10, Paul E Bennett wrote:
> [earlier posts quoted in full; snipped]
A brownout detector reset chip could be a good investment.

--
-TV
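[Worth noting that the LPC1769 also has an on-chip brown-out detector whose reset function is enabled unless the disable bits in the PCON register are set; a tiny sketch follows, with bit positions taken from UM10360 and worth double-checking against the manual. An external supervisor/reset chip is still a reasonable belt-and-braces addition.]

/*
 * Sketch only: make sure the LPC1769's internal brown-out detector and its
 * reset function are not disabled.  Bit positions assumed from UM10360
 * (PCON: BOGD = bit 3, BORD = bit 4) -- verify before relying on this.
 */
#include "LPC17xx.h"

#define PCON_BOGD  (1u << 3)   /* Brown-Out Global Disable */
#define PCON_BORD  (1u << 4)   /* Brown-Out Reset Disable  */

void ensure_bod_reset_enabled(void)
{
    LPC_SC->PCON &= ~(PCON_BOGD | PCON_BORD);
}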
Hi Simon,

On 2/22/2015 6:26 AM, Simon Clubley wrote:
> [earlier posts quoted in full; snipped]
I'm not sure that would give a conclusive result.

First, the OP hasn't confirmed that the image even *appears* to have been corrupted (i.e., altered). All he's said is that reflashing FIXES the "problem". I.e., he is (apparently) assuming that the flash has been corrupted -- as that is what reflashing *purports* to "fix". There may, indeed, be something (?) that has happened to the system that his reflashing ACTIVITY/procedure is "fixing" OTHER THAN "CORRECTING" THE CONTENTS OF THE FLASH. E.g., imagine a device that is powered *on* 24/7/365 and only has power cycled as a side-effect of the reflashing process. The contents of the flash may, in fact, be intact and it is the cycling of power that is "fixing" the ACTUAL problem.

[I am not claiming this is the case. Rather, indicating that the OP's "diagnosis" is unsubstantiated: is the firmware image ACTUALLY corrupt? *How*/where? Do all afflicted devices exhibit the same problem in the same *way*/place? etc.]

Second, how you obtain that checksum/hash -- even a literal byte-by-byte comparison -- may not reflect the operating conditions of the device in its failed state. E.g., using JTAG to pull the bytes from the device will obviously *not* occur at "opcode-fetch speed". Nor will the memory access patterns mimic those that occur in normal operation. Etc.

The OP first needs to prove to himself that reflashing *could* be a remedy -- by indicating that the contents HAVE, in fact, been altered between the time the device was manufactured and the time the "crash" (and proposed reflash) occurred. E.g., imagine examining the flash's contents and finding it *intact*! Yet, still noting that the reflash "fixes" the problem! This poses a different problem than finding the contents have been *altered*...

While the OP may, in fact, have done these things, I'm just asking for confirmation and an elaboration as to *how* he came to the conclusion that a reflasher "makes sense" (even as a PTF).

It's sort of like someone who "debugs" code by making "arbitrary" changes and waiting to DISCOVER which of them (appears to) yield the correct results. While you *may* find a change that appears to work, unless you can PROVE that it *should* work (by understanding the real problem), you may have just CHANGED the problem...
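[A sketch of the kind of evidence Don is asking for: diff a post-crash readback against the golden build image and report where, if anywhere, the two differ. How the two buffers are obtained (JTAG, ISP read, etc.) is left to the tooling.]

/*
 * Sketch only: locate and report mismatches between the golden build image
 * and a readback taken from a "crashed" unit, to establish whether the
 * flash is actually corrupt and, if so, where.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_REPORTED 16   /* print at most this many individual mismatches */

size_t report_image_diffs(const uint8_t *golden, const uint8_t *readback,
                          size_t len)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < len; i++) {
        if (golden[i] != readback[i]) {
            if (mismatches < MAX_REPORTED)
                printf("offset 0x%08zx: expected 0x%02X, read 0x%02X\n",
                       i, golden[i], readback[i]);
            mismatches++;
        }
    }
    printf("%zu byte(s) differ out of %zu\n", mismatches, len);
    return mismatches;
}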
