EmbeddedRelated.com
Forums

Detecting corrupted or unreadable Flash memory

Started by one00100100 August 2, 2011
Hey Guys and Girls,
I would like to see what you guys are doing in your projects to detect flash memory errors, say, due to corruption or wear. Not that I'm having a problem, it's just that we are seeking a European certification, "MID", that refers to a WELMEC 7.2 document (www.welmec.org). Their official has said, "there must be also some protection of software itself against some changes that could arise for example from corruption of flash memory during use." What is the best way to handle this?

I had considered checksumming the largest chunks of memory programmed by the bootstrap loader at each power-on. I know I would have to exclude those memory locations that hold logs, parameters, editable tables, etc., as these will be changing. Fortunately, I already place these structures at specific memory locations. I already checksum the tables and parameter data structures and check for errors when these are read at startup. If I take this approach, would you recommend one checksum, or a separate checksum for each memory block?

I would appreciate some of you please sharing your top-level ideas on this.
Thank you,
Mike Raines


Checksum logic is not very robust.

There can be cancellation issues: two errors that offset each other leave the sum unchanged.

A CRC is a more robust test.
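To illustrate the cancellation issue, here is a minimal sketch in C. It is not taken from anyone's project in this thread. It assumes a plain 16-bit additive checksum and a CRC-16 with polynomial 0x1021 and initial value 0xFFFF (often called CRC-16/CCITT-FALSE); the function names `sum16` and `crc16_ccitt` are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple additive checksum: the 16-bit sum of all bytes. */
uint16_t sum16(const uint8_t *data, size_t len)
{
    uint16_t sum = 0;
    while (len--) {
        sum = (uint16_t)(sum + *data++);
    }
    return sum;
}

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF,
   no bit reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

If one byte reads 1 too high and another reads 1 too low, `sum16` returns the same value as for the good data, while `crc16_ccitt` changes. For the ASCII string "123456789" this CRC variant gives 0x29B1, its usual check value.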

Hugo,
Thanks for the reply. Of course, CRC. It has become so prevalent, I just say checksum when I mean CRC.
Please replace all my references to checksum with CRC. So what about the rest of the questions?
Thanks,
Mike Raines

________________________________
From: m... [mailto:m...] On Behalf Of Hugo Brunert
Sent: Tuesday, August 02, 2011 1:38 PM
To: m...
Subject: RE: [msp430] Detecting corrupted or unreadable Flash memory

Checksum logic is not very robust.

There can be cancellation issues.

A CRC is a more robust test.


On Tue, 02 Aug 2011 17:33:41 -0000, Mike R. wrote:

> I would like to see what you guys are doing in your projects
> to detect flash memory errors, say, due to corruption or
> wear.

Nothing much in anything I've done. There's extensive work
under Linux systems for flash-based file systems and wear
leveling. I'd look there for some ideas (I did, already,
which is why I know there is a lot to be found... but I don't
remember the details right now.) It's not necessarily
something small enough that you'd care, but it might give a
thought or two for you. It certainly pointed out a few
techniques which I could have come up with on my own but
where I now know they are actually useful... which goes a
long way in deciding whether or not to spend time doing it.

> Not that I'm having a problem, it's just that we are
> seeking a European certification, "MID", that refers to a
> WELMEC 7.2 document (www.welmec.org). Their official has
> said, "there must be also some protection of software itself
> against some changes that could arise for example from
> corruption of flash memory during use." What is the best
> way to handle this?

Would be nice to know what "some protection" means. Does it
mean that you can tolerate and recover from random jumps into
memory? Soft errors? What kind of "corruption?" Operating
as a robot at Fukushima under 10 Sieverts/hr radiation?

Or can you just note the error and lock up? And if so, why
not just do nothing at all and just die, anyway?

What's the goal?

> I had considered checksumming the largest chunks of memory
> programmed by the bootstrap loader at each power-on. I know
> I would have to exclude those memory locations that hold
> logs, parameters, editable tables, etc as these will be
> changing. Fortunately, I already place these structures at
> specific memory locations. I already checksum the tables
> and parameter data structures and check for errors when
> these are read at startup. If I take this approach, would
> you recommend one checksum, or a separate checksum for each
> memory block?

Not a checksum. Not good enough unless your threshold of
"some protection" is VERY LOW. You might do some kind of CRC
or else, if you are more anal, perhaps some kind of ECC so
that you can even recover from single-bit (and if you are rabid
about it, maybe even two-bit) soft errors, rewrite the flash
block correctly and move on after logging the fact of the
error. If enough occur, mark the block out and stop using
it.
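As a rough sketch of the ECC idea, the C code below implements a Hamming(7,4) code, which stores 3 parity bits per 4 data bits and can correct any single flipped bit in the 7-bit codeword. This is an illustration only: the function names are hypothetical, a real design would more likely use a wider code over whole words or blocks, and detecting two-bit errors would need an extra overall parity bit, which is not shown here.

```c
#include <stdint.h>

/* Encode a 4-bit value into a 7-bit Hamming(7,4) codeword.
   Bit i of the result holds codeword position i+1.
   Parity bits sit at positions 1, 2 and 4; data at 3, 5, 6 and 7. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d0 = nibble & 1u;
    uint8_t d1 = (nibble >> 1) & 1u;
    uint8_t d2 = (nibble >> 2) & 1u;
    uint8_t d3 = (nibble >> 3) & 1u;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a codeword, correcting at most one flipped bit.
   *corrected_pos receives 0 if the codeword was clean, otherwise
   the position (1..7) of the bit that was flipped back. */
uint8_t hamming74_decode(uint8_t code, uint8_t *corrected_pos)
{
    uint8_t syndrome = 0;
    for (uint8_t pos = 1; pos <= 7; pos++) {
        if (code & (1u << (pos - 1))) {
            syndrome ^= pos;     /* XOR of the positions of all set bits */
        }
    }
    if (syndrome != 0) {
        code ^= (uint8_t)(1u << (syndrome - 1));
    }
    if (corrected_pos) {
        *corrected_pos = syndrome;
    }
    return (uint8_t)(((code >> 2) & 1u) |
                     (((code >> 4) & 1u) << 1) |
                     (((code >> 5) & 1u) << 2) |
                     (((code >> 6) & 1u) << 3));
}
```

When `corrected_pos` comes back non-zero, the caller would rewrite the affected flash block and log the event, along the lines described above.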

> I would appreciate some of you please sharing your top-level
> ideas on this.

Top level? What do you need to really achieve here?

Jon

It would be wiser to do a CRC on each type of block.

That way, if you have a failure, you at least know whereabouts it is. If
you only do an overall CRC, you don't know which section it is in.
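One way to arrange a CRC per block is a small table of regions, each with its own expected value. The sketch below is only an assumption about how this might look: the `flash_region_t` layout, the function names and the choice of CRC-16 (polynomial 0x1021, initial value 0xFFFF) are all made up for the example, and on a real part the start addresses and expected CRCs would come from the linker map and the build tools.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* One entry per block that must not change after programming.
   Logs, parameters and editable tables are simply left out. */
typedef struct {
    const uint8_t *start;         /* first byte of the block */
    size_t         length;        /* number of bytes to check */
    uint16_t       expected_crc;  /* value computed at build time */
} flash_region_t;

/* Returns the index of the first block whose CRC does not match,
   or -1 if every block checks out. */
int first_bad_region(const flash_region_t *regions, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (crc16_ccitt(regions[i].start, regions[i].length)
                != regions[i].expected_crc) {
            return (int)i;
        }
    }
    return -1;
}
```

The returned index says which section failed, which is the advantage over a single overall CRC.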

________________________________

From: m... [mailto:m...] On Behalf
Of Mike Raines
Sent: Tuesday, August 02, 2011 1:49 PM
To: m...
Subject: RE: RE: [msp430] Detecting corrupted or unreadable Flash memory

Hugo,
Thanks for the reply. Of course, CRC. It has become so prevalent, I just
say checksum when I mean CRC.
Please replace all my references to checksum with CRC. So what about the
rest of the questions?
Thanks,
Mike Raines

I would like to add that any software running from the same medium whose
integrity it is verifying will not work reliably.
It is like Jerry pulling his own tail to float and escape from Tom's
paw... If an error can happen in the FLASH then it can happen in the
part holding the checking software.
The test makes sense for checking data logging structures, where wear
can happen due to continuous use.
If the program must also be checked then you are using the wrong
processor, because the MSP430 has no ROM of your own to hold such a
supervisor system. If it is just a single pass after programming the
device then a CRC over the whole FLASH area could do the job, provided
that no answer at all from the software is also taken to mean it is
already broken.

Memory reliability always reminds me of those windowed EPROMs... the
logic surrounding the EPROM cells was sensitive to photons coming
through the window. An open window and a fluorescent lamp in the room
would result in data reading errors and hours of dazzled engineers
trying to figure out why the system would work only when someone would
look close to the board...
-Augusto

On 02/08/2011 15:02, Jon Kirwan wrote:
> [...]


We have a similar challenge on our devices which are implanted medical devices. We use IAR's built-in CRC generator to calculate the CRC each time the program is linked, and use their example CRC code to check the entire code at startup.
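A generic sketch of such a startup check in C follows. It is not IAR's example code. It assumes that the build step appends a CRC-16 (polynomial 0x1021, initial value 0xFFFF, no reflection, no final XOR) to the end of the image, high byte first. Under those assumptions, running the same CRC over the image including the stored CRC yields zero when nothing has changed. The names `crc16_ccitt` and `image_is_intact` are made up for the example, and the CRC settings in the real linker would have to match whatever the check code uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF,
   no bit reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* The image is assumed to end with its own CRC, high byte first.
   Feeding the data plus that stored CRC through the same CRC
   leaves a remainder of zero if no bit has changed. */
int image_is_intact(const uint8_t *image, size_t len_including_crc)
{
    return len_including_crc >= 2 &&
           crc16_ccitt(image, len_including_crc) == 0;
}
```

At power-on, the firmware would call `image_is_intact` on the code region and refuse to run the application if it returns 0.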

Further, we have several blocks of flash which are not code - calibration tables, parameters, logs, etc., each with their own CRC. Those are checked each time they are used. We created some PC utilities to generate the CRCs for those tables and the process works pretty well logistically.
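A table that carries its own CRC and is checked on every use might look like the sketch below. The `cal_table_t` fields and the function names are hypothetical, and the CRC-16 variant (polynomial 0x1021, initial value 0xFFFF) is an assumption. A PC utility that generates the CRC would need to use the same byte order and structure layout as the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Hypothetical calibration table; the CRC covers every field before it. */
typedef struct {
    int16_t  gain;
    int16_t  offset;
    uint16_t crc;
} cal_table_t;

/* What the table generator would do: compute and store the CRC. */
void cal_table_seal(cal_table_t *t)
{
    t->crc = crc16_ccitt((const uint8_t *)t, offsetof(cal_table_t, crc));
}

/* Check the CRC each time the table is used.
   Returns 1 and fills the outputs if the table is intact, else 0. */
int cal_table_read(const cal_table_t *t, int16_t *gain, int16_t *offset)
{
    if (crc16_ccitt((const uint8_t *)t, offsetof(cal_table_t, crc)) != t->crc) {
        return 0;
    }
    *gain = t->gain;
    *offset = t->offset;
    return 1;
}
```

Because the check sits inside the read function, a table that goes bad after startup is caught the next time it is used.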

One more piece of info, not that this matters much from a regulatory point-of-view, but we tested the heck out of the flash - millions of erases, writes, reads, at high temperature and NEVER had a failure! Our experience is that if you operate the flash anywhere near its specs, it is quite reliable.

Stuart
