Thoughts on developing an auto re-flasher

Started by Geato February 20, 2015
Hi,

New member here so hopefully I am in the correct group.

We have many controller boards in the field running the NXP LPC1769 processor.
Randomly, maybe a year or even two years down the road, the processor crashes
and re-flashing the firmware brings it back to life.  We know it is a
transient doing this, but while we are chasing that problem, a stopgap fix
would be to come up with an auto re-flasher of sorts.

I am thinking of a small PCB that plugs onto the existing JTAG connector
and has a firmware image stored on a uSD card.  Something (non-processor,
hopefully), perhaps a CPLD, powers up the uSD card and transfers the image
to the NXP.  The hardware watchdog will initiate the transfer.

Does this sound like it is possible to do at a high level?  I am trying to
minimize the reliance on additional firmware like bootloaders or standalone
JTAG programmers.  The latter is physically too big and pricey as I would
need about a thousand of these.

Cheers....
On Friday, February 20, 2015 at 9:18:51 PM UTC+1, Geato wrote:
> [original post quoted in full; snipped]
Can't you get to the UART and boot pin so you can use the built-in bootloader?

Many years ago I booted and flashed an ARM7 via JTAG for a test system; it was basically some parallel-port JTAG code from a PC app, ported to an MCU. Talking JTAG isn't complicated; figuring out what to tell the chip can be. But if you can figure that out, any old MCU with enough flash should be able to do what you want.

-Lasse
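[A rough idea of what Lasse's suggestion looks like in practice, as a sketch only: a small companion MCU wired to the LPC1769's reset line, ISP entry pin (P2.10) and ISP UART (UART0) could force the ROM bootloader and drive the serial ISP protocol described in NXP's UM10360. The GPIO/UART helper functions and pin numbers below are hypothetical placeholders, and the actual erase/write commands (and their payload encoding) are omitted.]

/*
 * Sketch only: forcing an LPC1769 into its ROM UART ISP bootloader from a
 * small companion MCU, then starting the ISP handshake.  The helpers
 * gpio_drive(), gpio_release(), uart_write_line(), uart_expect() and
 * delay_ms() are hypothetical -- substitute whatever your companion MCU
 * provides.  See UM10360 for the full ISP command set.
 */
#include <stdbool.h>

extern void gpio_drive(int pin, int level);      /* drive a pin high/low   */
extern void gpio_release(int pin);               /* tri-state a pin        */
extern void uart_write_line(const char *s);      /* send s + CR/LF         */
extern bool uart_expect(const char *s, int ms);  /* wait for a reply line  */
extern void delay_ms(int ms);

#define PIN_ISP    1   /* wired to LPC1769 P2.10 (ISP entry pin)  */
#define PIN_RESET  2   /* wired to LPC1769 nRESET                 */

/* Hold P2.10 low through a reset pulse so the boot ROM enters ISP mode. */
static void enter_isp_mode(void)
{
    gpio_drive(PIN_ISP, 0);
    gpio_drive(PIN_RESET, 0);
    delay_ms(10);
    gpio_release(PIN_RESET);   /* boot ROM samples P2.10 low -> ISP */
    delay_ms(10);
    gpio_release(PIN_ISP);
}

/* Auto-baud + unlock; after this the P/E/W/C commands can erase and
 * program flash (omitted here, along with the payload encoding the ROM
 * expects for the write command). */
static bool isp_sync(void)
{
    enter_isp_mode();
    uart_write_line("?");                       /* auto-baud character        */
    if (!uart_expect("Synchronized", 500)) return false;
    uart_write_line("Synchronized");
    if (!uart_expect("OK", 500)) return false;
    uart_write_line("12000");                   /* crystal in kHz (12 MHz assumed) */
    if (!uart_expect("OK", 500)) return false;
    uart_write_line("U 23130");                 /* unlock flash commands      */
    return uart_expect("0", 500);               /* "0" = CMD_SUCCESS          */
}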
On 2/20/2015 1:18 PM, Geato wrote:

> We have many controller boards in the field running the NXP LPC1769 processor. Randomly, maybe a year or even two years down the road, the processor crashes and re-flashing the firmware brings it back to life.
Have you verified that the old/existing firmware image *is* corrupted at this point? I.e., that the device won't restart "normally" after such a crash (cycle power)? As your existing system (apparently) can't self-flash, you must be dispatching "staff" to perform this reflash? Do they do anything besides blindly reflashing the device? Is it cheaper/easier to just *replace* defective devices (which gives you a chance to do a post-mortem on the device(s) that have failed)? (I.e., this sounds like the "reinstall Windows" solution-to-all-problems)
> We know it is a transient doing this, but while we are chasing that problem, a stopgap fix would be to come up with an auto re-flasher of sorts.
<frown>
> I am thinking of a small PCB that plugs onto the existing JTAG connector and has a firmware image stored on a uSD card. Something (non-processor, hopefully), perhaps a CPLD, powers up the uSD card and transfers the image to the NXP. The hardware watchdog will initiate the transfer.
So, you assume the ONLY time the watchdog kicks in is when this "crash" happens? I.e., there are NEVER cases where the watchdog kicks in, resets the processor and execution resumes CORRECTLY (without needing a reflash)? Do you track "reset" events anywhere so you can determine *if* this is the case?

Does your run-time *ever* attempt to write to that flash in normal operation?

Is power cycled often/frequently in your environment?

Is anyone tracking the frequency of these crashes in your deployed population so you can begin to identify if there is a common pattern (power-on hours, power cycles, manufacturing date code, etc.)? I.e., can you *predict* when this event is likely to happen (or NOT happen)? Does reflashing cause the device to be "reliable" for "another 2 years"? Or, once reflashed, do crashes occur with greater frequency?

What are the consequences to the user/application when this crash occurs?
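[As an aside on the reset-tracking question: the LPC17xx latches the cause of the last reset in its RSID register, so even the existing firmware can record whether a watchdog, brown-out or power cycle preceded a "crash". A minimal sketch, assuming the CMSIS LPC17xx.h register definitions and a hypothetical log_reset_cause() hook into whatever logging the application already has.]

/*
 * Sketch only: recording why the LPC1769 last reset, so watchdog resets can
 * be told apart from power-on/brown-out resets.
 */
#include "LPC17xx.h"

extern void log_reset_cause(const char *cause);   /* hypothetical logging hook */

void record_reset_source(void)
{
    uint32_t rsid = LPC_SC->RSID;      /* Reset Source Identification Register */

    if (rsid & (1u << 2)) log_reset_cause("watchdog");
    if (rsid & (1u << 3)) log_reset_cause("brown-out");
    if (rsid & (1u << 1)) log_reset_cause("external reset pin");
    if (rsid & (1u << 0)) log_reset_cause("power-on");

    LPC_SC->RSID = rsid;               /* write ones back to clear the flags */
}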
> Does this sound like it is possible to do at a high level? I am trying to minimize the reliance on additional firmware like bootloaders or standalone JTAG programmers. The latter is physically too big and pricey as I would need about a thousand of these.
The "right" solution is to figure out what the actual cause is. If "can't happen" *is* happening, then some assumption has been violated (which can result in *other* problems that haven't yet been visible). What will you do if (when?) a device just sits in a tight crash-reflash loop, indefinitely? Will the user be able to determine that this is actually happening (big red light)? Will *you* be able to determine how often any particular "reflasher" has been triggered? I.e., are you sure your fix won't just *change* the problem's manifestation?
Don Y wrote:

> [Don Y's reply quoted in full; snipped]
In addition to Don's points, you might ask yourself what happens if the problem you are experiencing also happens to your re-flashing device (as that is likely to have flash as well). You may end up loading a corrupt image into the wrong locations.

You really need to understand the problem in much better detail. Does the unit design have vulnerabilities to electrical noise, brown-outs, RF interference, high-energy transients, or higher-frequency interrupts than it can deal with?

I am not sure how much protection Don puts in his circuitry, but I expend quite some effort to make sure that the processors in my products are well protected from a whole raft of transient interference. I also have checking in place to know when I am facing problems and need to report the fact. My systems are usually expected to run a couple of decades with little or no maintenance effort in high-dependability applications.

So, back to Don's point: have you done an analysis of the failures that lead to the perceived need for re-flashing? Have you traced the impetus for those failures? You might want to discuss the problem with NXP as well.

--
Paul E. Bennett IEng MIET <Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy, http://www.hidecs.co.uk
Hi Paul,

On 2/21/2015 1:23 AM, Paul E Bennett wrote:
> Don Y wrote:
>> What will you do if (when?) a device just sits in a tight crash-reflash >> loop, indefinitely? Will the user be able to determine that this is >> actually happening (big red light)? Will *you* be able to determine how >> often any particular "reflasher" has been triggered? I.e., are you >> sure your fix won't just *change* the problem's manifestation? > > In addition to Don's points, you might ask yourself what happens if the > problem you are experiencing happens also to your re-flashing device (as > that is likely to have FLASH also). You may end up loading a corrupt image > in the wrong locations.
Ha! I hadn't considered that! (Though if the same soul designed both devices, it only stands to reason!)

Rather, I was more concerned (above) with the reflasher failing to flash the (original) device due to a problem *in* the original device. E.g., perhaps when "staff" reflash the device, they have it powered from a more stable power source, implement more robust "tests" that the flash "took", etc. A "dumb box" could easily fail to achieve any of these "differences", leading to a less reliable reflash... followed by another crash (perhaps for some *other* reason than the original problem!) and another reflash, followed by...
> You really need to understand the problem in much better detail. Does the unit design have vulnerabilities to electrical noise, brown-outs, RF interference, high-energy transients, or higher-frequency interrupts than it can deal with?
When "can't happen" *does*, you really need to step back and figure out what's wrong with your assumptions. Have you overlooked something? Has something *changed* unexpectedly?? Do you even *know* what your assumptions *are*? Dismissing these sorts of events as "flukes" is a sign of poor engineeering (when do you begin to consider a "fluke" a "genuine bug" to be acted upon??)
> I am not sure how much protection Don puts in his circuitry, but I expend quite some effort to make sure that the processors in my products are well protected from a whole raft of transient interference. I also have checking in place to know when I am facing problems and need to report the fact. My systems are usually expected to run a couple of decades with little or no maintenance effort in high-dependability applications.
It puzzles me that ALL devices don't have black boxes /de rigueur/. Even *volatile* implementations are very feasible and invaluable (IMO) for these sorts of situations! It's not like it's an "expensive" mechanism (in development, time *or* space).
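[A minimal sketch of such a volatile "black box": a small event log kept in a RAM section the C runtime does not zero, so it survives watchdog/soft resets though not power loss. Assumes GCC and a ".noinit" section defined in the linker script; the event codes are whatever the application chooses to record.]

/*
 * Sketch only: a tiny "black box" ring buffer in non-initialised RAM.
 */
#include <stdint.h>

#define BB_MAGIC   0xB1ACB0C5u
#define BB_ENTRIES 32

struct blackbox {
    uint32_t magic;                 /* detects never-initialised RAM     */
    uint32_t next;                  /* ring-buffer write index           */
    uint32_t event[BB_ENTRIES];     /* event codes (reset cause, etc.)   */
};

/* Placed in a section the startup code does not zero or initialise. */
static struct blackbox bb __attribute__((section(".noinit")));

void blackbox_init(void)
{
    if (bb.magic != BB_MAGIC) {     /* first power-up: RAM is garbage */
        bb.magic = BB_MAGIC;
        bb.next  = 0;
        for (int i = 0; i < BB_ENTRIES; i++)
            bb.event[i] = 0;
    }
}

void blackbox_log(uint32_t event_code)
{
    bb.event[bb.next % BB_ENTRIES] = event_code;
    bb.next++;
}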
> So, back to Don's point: have you done an analysis of the failures that lead to the perceived need for re-flashing? Have you traced the impetus for those failures? You might want to discuss the problem with NXP as well.
The OP seems to have decided a Band-Aid is the quickest way to "solve" this problem. That seems unlikely (though we've not seen all the particulars re: his design/application/environment).

Ask oneself: what *should* I do differently to ensure the NEXT design doesn't suffer from the same problem? I suspect the "right" answer is NOT "design a reflasher in with the INITIAL design!" And, as you've said, "what do I do when the reflasher fails?"
On Fri, 20 Feb 2015 14:18:47 -0600, Geato wrote:

> Hi,
>
> New member here so hopefully I am in the correct group.
>
> We have many controller boards in the field running the NXP LPC1769 processor. Randomly, maybe a year or even two years down the road, the processor crashes and re-flashing the firmware brings it back to life. We know it is a transient doing this, but while we are chasing that problem, a stopgap fix would be to come up with an auto re-flasher of sorts.
<snip>

Can you set protect bits on the flash, either permanently or (assuming that you have to re-program from time to time) unlockable?

It sounds like you're allowing the processor to write to program memory, which is just wrong. If you have valid flash writes (i.e., if you have program and non-volatile data in flash), consider hard-coding the flash write routines to fail if they're told to write someplace they're not supposed to.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
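[On the LPC1769, application code programs its own flash through the ROM IAP calls, so Tim's "fail if told to write someplace they're not supposed to" can be enforced in a single wrapper. A sketch only: the IAP entry address and command codes are as documented in UM10360, while the region boundaries, sector handling and clock value are illustrative assumptions that must match the real memory map.]

/*
 * Sketch only: a guard wrapper around the LPC17xx ROM IAP "copy RAM to
 * flash" call that refuses to touch the application's own code region.
 * Real code must also run with interrupts disabled and must not execute
 * from the flash bank being written.
 */
#include <stdint.h>

#define IAP_LOCATION        0x1FFF1FF1u      /* ROM IAP entry point (UM10360) */
#define IAP_PREPARE_SECTORS 50u
#define IAP_COPY_RAM_FLASH  51u
#define IAP_CMD_SUCCESS     0u

#define APP_CODE_START      0x00000000u      /* application image ...          */
#define APP_CODE_END        0x00070000u      /* ... assumed to end here        */
#define CCLK_KHZ            100000u          /* assumed 100 MHz core clock     */

typedef void (*iap_fn)(uint32_t cmd[5], uint32_t result[5]);

/* Returns the IAP status code, or a private code if the address is off-limits. */
uint32_t guarded_flash_write(uint32_t dst, const void *src, uint32_t len,
                             uint32_t sector)
{
    uint32_t cmd[5], result[5] = {0};
    iap_fn iap = (iap_fn)IAP_LOCATION;

    /* Refuse anything that overlaps the running application's code. */
    if (dst < APP_CODE_END && dst + len > APP_CODE_START)
        return 0xDEADu;                      /* private "forbidden" code */

    cmd[0] = IAP_PREPARE_SECTORS;
    cmd[1] = sector;
    cmd[2] = sector;
    iap(cmd, result);
    if (result[0] != IAP_CMD_SUCCESS)
        return result[0];

    cmd[0] = IAP_COPY_RAM_FLASH;
    cmd[1] = dst;
    cmd[2] = (uint32_t)src;                  /* word-aligned RAM source       */
    cmd[3] = len;                            /* 256/512/1024/4096 bytes       */
    cmd[4] = CCLK_KHZ;
    iap(cmd, result);
    return result[0];
}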
On 2015-02-21, Don Y <this@is.not.me.com> wrote:
> [earlier posts quoted in full; snipped]
If you are concerned about that, have the build procedure which generates the image to be flashed also generate an MD5 or similar hash of the image at the same time.

As part of your post-flash verify pass, you can then download the image which was actually flashed and generate its MD5. Comparing the two hashes will tell you if the image was flashed correctly (unless you manage to generate a hash collision :-)).

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
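[A sketch of the verify step Simon describes, with CRC-32 standing in for MD5 so the example stays self-contained. "golden" is the image the build produced and "readback" is what was read out of the target after flashing; how those buffers are obtained is left to the tooling.]

/*
 * Sketch only: compare a post-flash readback against the build's golden image.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), bit-by-bit for brevity. */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* True if the flashed contents match what the build produced. */
bool image_verified(const uint8_t *golden, const uint8_t *readback, size_t len)
{
    return crc32(golden, len) == crc32(readback, len);
}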
Simon Clubley wrote:

> [earlier posts quoted in full; snipped]
As Don Y, Tim Wescott and I have suggested, there is something fundamentally wrong with the installed systems. Rather than designing, building and installing thousands of re-flashers, they should explore the root cause of the problem more thoroughly.

It is obvious that the flash is being trashed somehow. Finding out what and why would be the best use of their time. If they have to change the design, perhaps they can build in protection measures to prevent such recurrences.

--
Paul E. Bennett IEng MIET <Paul_E.Bennett@topmail.co.uk>
Forth based HIDECS Consultancy, http://www.hidecs.co.uk
On 22.2.15 16:10, Paul E Bennett wrote:
> [earlier posts quoted in full; snipped]
A brownout detector reset chip could be a good investment.

--
-TV
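[Worth noting that the LPC1769 also has an on-chip brown-out detector whose reset function is enabled unless the disable bits in the PCON register are set; a tiny sketch follows, with bit positions taken from UM10360 and worth double-checking against the manual. An external supervisor/reset chip is still a reasonable belt-and-braces addition.]

/*
 * Sketch only: make sure the LPC1769's internal brown-out detector and its
 * reset function are not disabled.  Bit positions assumed from UM10360
 * (PCON: BOGD = bit 3, BORD = bit 4) -- verify before relying on this.
 */
#include "LPC17xx.h"

#define PCON_BOGD  (1u << 3)   /* Brown-Out Global Disable */
#define PCON_BORD  (1u << 4)   /* Brown-Out Reset Disable  */

void ensure_bod_reset_enabled(void)
{
    LPC_SC->PCON &= ~(PCON_BOGD | PCON_BORD);
}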
Hi Simon,

On 2/22/2015 6:26 AM, Simon Clubley wrote:
> [earlier posts quoted in full; snipped]
I'm not sure that would give a conclusive result.

First, the OP hasn't confirmed that the image even *appears* to have been corrupted (i.e., altered). All he's said is that reflashing FIXES the "problem". I.e., he is (apparently) assuming that the flash has been corrupted -- as that is what reflashing *purports* to "fix". There may, indeed, be something (?) that has happened to the system that his reflashing ACTIVITY/procedure is "fixing" OTHER THAN "CORRECTING" THE CONTENTS OF THE FLASH. E.g., imagine a device that is powered *on* 24/7/365 and only has power cycled as a side-effect of the reflashing process. The contents of the flash may, in fact, be intact and it is the cycling of power that is "fixing" the ACTUAL problem.

[I am not claiming this is the case. Rather, indicating that the OP's "diagnosis" is unsubstantiated: is the firmware image ACTUALLY corrupt? *How*/where? Do all afflicted devices exhibit the same problem in the same *way*/place? etc.]

Second, how you obtain that checksum/hash -- even a literal byte-by-byte comparison -- may not reflect the operating conditions of the device in its failed state. E.g., using JTAG to pull the bytes from the device will obviously *not* occur at "opcode-fetch speed". Nor will the memory access patterns mimic those that occur in normal operation. Etc.

The OP first needs to prove to himself that reflashing *could* be a remedy -- by indicating that the contents HAVE, in fact, been altered between the time the device was manufactured and the time the "crash" (and proposed reflash) occurred. E.g., imagine examining the flash's contents and finding it *intact*! Yet, still noting that the reflash "fixes" the problem! This poses a different problem than finding the contents have been *altered*...

While the OP may, in fact, have done these things, I'm just asking for confirmation and an elaboration as to *how* he came to the conclusion that a reflasher "makes sense" (even as a PTF).

It's sort of like someone who "debugs" code by making "arbitrary" changes and waiting to DISCOVER which of them (appears to) yield the correct results. While you *may* find a change that appears to work, unless you can PROVE that it *should* work (by understanding the real problem), you may have just CHANGED the problem...
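[A sketch of the kind of evidence Don is asking for: diff a post-crash readback against the golden build image and report where, if anywhere, the two differ. How the two buffers are obtained (JTAG, ISP read, etc.) is left to the tooling.]

/*
 * Sketch only: locate and report mismatches between the golden build image
 * and a readback taken from a "crashed" unit, to establish whether the
 * flash is actually corrupt and, if so, where.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_REPORTED 16   /* print at most this many individual mismatches */

size_t report_image_diffs(const uint8_t *golden, const uint8_t *readback,
                          size_t len)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < len; i++) {
        if (golden[i] != readback[i]) {
            if (mismatches < MAX_REPORTED)
                printf("offset 0x%08zx: expected 0x%02X, read 0x%02X\n",
                       i, golden[i], readback[i]);
            mismatches++;
        }
    }
    printf("%zu byte(s) differ out of %zu\n", mismatches, len);
    return mismatches;
}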
