EmbeddedRelated.com Forums

[cross-post] nand flash bad blocks management

Started by alb January 12, 2015
Hi everyone,

We have ~128 Mbit of configuration data to be stored in a Flash device,
and for reasons related to qualification (HiRel application) we are more
inclined to use NAND technology instead of NOR. Unfortunately
NAND flash suffers from bad blocks, which may also develop during the
lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there 
are two essential strategies to cope with the issue of bad blocks:

1. skip block
2. reserved block

The first one skips a block whenever it is bad and writes to the first
free one, also updating the logical block addressing (LBA). The second
strategy reserves a dedicated area into which bad blocks are remapped;
in this case the LBA must be kept updated as well.
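To make the difference concrete, a reserved-block remap might look
roughly like the C sketch below. This is a minimal sketch only: the
block counts, the NO_REMAP sentinel and all names are illustrative, not
taken from any particular part's datasheet.

/* Hypothetical reserved-block scheme: the last RESERVED_BLOCKS blocks
 * of the device are held back, and a small table maps each bad user
 * block to one of them.  Sizes are illustrative only. */
#include <stdint.h>

#define TOTAL_BLOCKS    1024u
#define RESERVED_BLOCKS   20u
#define USER_BLOCKS     (TOTAL_BLOCKS - RESERVED_BLOCKS)
#define NO_REMAP        0xFFFFu

static uint16_t remap[USER_BLOCKS];   /* filled with NO_REMAP at mount */
static uint16_t next_spare = USER_BLOCKS;

/* Translate a logical block to the physical block actually used. */
static uint16_t phys_block(uint16_t lba)
{
    return (remap[lba] == NO_REMAP) ? lba : remap[lba];
}

/* Called when a block goes bad: park it behind a spare, if any left. */
static int retire_block(uint16_t lba)
{
    if (next_spare >= TOTAL_BLOCKS)
        return -1;                    /* spares exhausted */
    remap[lba] = next_spare++;
    return 0;
}

Skip-block, by contrast, needs no spare pool at this level: the
translation simply walks past blocks marked bad, so the usable logical
capacity shrinks as blocks die.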

I do not see much of a difference between the two strategies except
that in case 1 I need to 'search' for the first available free block,
while in case 2 I have reserved a special area for it. Am I missing
any other major difference?

The second question I have is about 'management'. I do not have a 
software stack to perform the management of these bad blocks and I'm 
obliged to do it with my FPGA. Does anyone here see any potential risk 
in doing so? Would I be better off dedicating a small footprint 
controller in the FPGA to handle the Flash Translation Layer with wear 
leveling and bad block management? Can anyone here point me to some 
IP cores readily available for doing this?

There's a high chance I will need to implement some sort of 'scrubbing' 
to avoid accumulation of errors. All these 'functions' for handling the
Flash seem to me well suited to software but not to hardware. Does
anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers,

Al

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
> Hi everyone,
>
> We have ~128 Mbit of configuration data to be stored in a Flash device,
> and for reasons related to qualification (HiRel application) we are
> more inclined to use NAND technology instead of NOR. Unfortunately
> NAND flash suffers from bad blocks, which may also develop during the
> lifetime of the component and have to be handled.
>
> I've read something about bad block management and it looks like there
> are two essential strategies to cope with the issue of bad blocks:
>
> 1. skip block
> 2. reserved block
>
> The first one skips a block whenever it is bad and writes to the first
> free one, also updating the logical block addressing (LBA). The second
> strategy reserves a dedicated area into which bad blocks are remapped;
> in this case the LBA must be kept updated as well.
>
> I do not see much of a difference between the two strategies except
> that in case 1 I need to 'search' for the first available free block,
> while in case 2 I have reserved a special area for it. Am I missing
> any other major difference?
The second strategy is required when the total logical storage capacity must be constant. I can imagine the existence of 'bad sectors' degrading performance on some filesystems.
> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so? Would I be better off dedicating a small footprint
> controller in the FPGA to handle the Flash Translation Layer with wear
> leveling and bad block management? Can anyone here point me to some
> IP cores readily available for doing this?
Sounds like you're re-inventing eMMC.
> There's a high chance I will need to implement some sort of 'scrubbing'
> to avoid accumulation of errors.
Indeed regular reading (and IIRC also writing) can increase the longevity of the device. But it is up to you whether that is needed at all.
> All these 'functions' for handling the Flash seem to me well suited to
> software but not to hardware. Does anyone here have a different opinion?
AFAIK, (e)MMC devices all have a small microcontroller inside.

-- 
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
Hi Boudewijn,

In comp.arch.embedded Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
[]
>> I've read something about bad block management and it looks like there
>> are two essential strategies to cope with the issue of bad blocks:
>>
>> 1. skip block
>> 2. reserved block
>>
>> The first one skips a block whenever it is bad and writes to the first
>> free one, also updating the logical block addressing (LBA). The second
>> strategy reserves a dedicated area into which bad blocks are remapped;
>> in this case the LBA must be kept updated as well.
>>
>> I do not see much of a difference between the two strategies except
>> that in case 1 I need to 'search' for the first available free block,
>> while in case 2 I have reserved a special area for it. Am I missing
>> any other major difference?
> The second strategy is required when the total logical storage capacity
> must be constant. I can imagine the existence of 'bad sectors' degrading
> performance on some filesystems.
OK, that's a valid point: since I declare as user space only the total
minus the reserved area, the user may rely on that figure staying
constant. But the total number of bad blocks for the quoted endurance
will be exactly the same either way; neither strategy wears the device
less.
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.
I didn't know there was a name for that. Well, if that's so, then yes,
but it's not for storing your birthday pictures; it's for a space
application. Even if there are several 'experiments' running in low
orbit with NAND flash components, I do not know of any operational
satellite (e.g. for meteorology or similar) that flies anything like
this.
>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.
I'm not aiming to increase longevity. I'm aiming to guarantee that the
system will cope with the expected bit flips and still meet mission
objectives throughout the intended lifecycle (7.5 years on orbit).
Scrubbing is not so complicated: you read, correct and write back. But
hitting a bad block during the rewrite, while you have tons of other
things to do in the meanwhile, may have some side effects... to be
evaluated and handled.
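A minimal sketch of that read-correct-write-back pass, in C; all the
nand_* names, the ECC status codes and the retry bound are placeholders
for whatever the real controller provides, not an existing API:

#include <stdint.h>
#include <stddef.h>

enum ecc_status { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Placeholders for whatever the real controller/driver provides. */
extern enum ecc_status nand_read_block(uint16_t lba, uint8_t *buf, size_t n);
extern int  relocate_block(uint16_t lba, const uint8_t *buf, size_t n);
extern void log_data_loss(uint16_t lba);

#define MAX_RELOCATE_TRIES 4

void scrub_block(uint16_t lba, uint8_t *buf, size_t blk_size)
{
    enum ecc_status st = nand_read_block(lba, buf, blk_size);

    if (st == ECC_CLEAN)
        return;                        /* nothing has accumulated yet */

    if (st == ECC_UNCORRECTABLE) {
        log_data_loss(lba);            /* beyond ECC: flag it upstream */
        return;
    }

    /* ECC_CORRECTED: 'buf' holds good data but the array copy is
     * decaying.  NAND can't be reprogrammed in place without an erase,
     * so move the data to a fresh block.  If the rewrite itself hits a
     * newly bad block (the "side effect" case above), retry elsewhere;
     * relocate_block() is assumed to mark a failed destination bad. */
    for (int tries = 0; tries < MAX_RELOCATE_TRIES; tries++)
        if (relocate_block(lba, buf, blk_size) == 0)
            return;

    log_data_loss(lba);                /* ran out of usable blocks */
}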
>> All these 'functions' for handling the Flash seem to me well suited to
>> software but not to hardware. Does anyone here have a different opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.
It does not surprise me. But I have the requirement not to include
*any* software onboard! I may manage to let an embedded microcontroller
with a hardcoded instruction list slip through, but I'm not so sure.

Al
Hi Boudewijn,

On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.
>
>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.
Um, *reading* also causes fatigue in the array -- just not as quickly as
*writing*/erase. In most implementations, this isn't a problem because
you're reading the block *into* RAM and then accessing it from RAM.
But, if you just keep reading blocks repeatedly, you'll discover your
ECC becoming increasingly more active/aggressive in "fixing" the
degrading NAND cells.

So, either KNOW that your access patterns (read and write) *won't*
disturb the array. *Or*, actively manage it by "refreshing" content
after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
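As a rough illustration of that bookkeeping (a sketch only: the
per-block counter, the threshold and schedule_refresh() are assumptions,
and 100K is the ballpark above, not a datasheet figure):

#include <stdint.h>

#define NUM_BLOCKS      1024u
#define READ_THRESHOLD  100000ul

/* One read counter per block, cleared whenever the block is rewritten. */
static uint32_t reads_since_erase[NUM_BLOCKS];

extern void schedule_refresh(uint16_t block);  /* move + erase, elsewhere */

void note_read(uint16_t block)
{
    if (++reads_since_erase[block] >= READ_THRESHOLD) {
        reads_since_erase[block] = 0;
        schedule_refresh(block);   /* rewrite before disturb beats ECC */
    }
}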
>> All these 'functions' for handling the Flash seem to me well suited to
>> software but not to hardware. Does anyone here have a different opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.
I can't see an *economical* way of doing this (in anything less than huge volumes) with dedicated hardware (e.g., FPGA).
On Tue, 13 Jan 2015 01:03:45 +0100, Don Y <this@is.not.me.com> wrote:
> On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
>> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>>
>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>> to avoid accumulation of errors.
>>
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly as
> *writing*/erase.
Indeed; my apologies. Performing many reads before an erase will indeed
cause bit errors that can be repaired by reprogramming. What I wanted to
say, but misremembered, is that *not* reading over extended periods may
also cause bit errors, due to charge leakage. This can also be repaired
by reprogramming. (ref: Micron TN2917)
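A sketch of the time-based counterpart to a read counter (all names and
the interval are made up; the real retention figure would come from the
device datasheet and the mission's temperature/radiation profile):

#include <stdint.h>

#define NUM_BLOCKS        1024u
#define RETENTION_LIMIT_S (180ul * 24 * 3600)   /* placeholder: ~6 months */

/* Updated by the write path whenever a block is (re)programmed. */
static uint32_t last_program_time[NUM_BLOCKS];  /* seconds, from an RTC */

extern uint32_t rtc_seconds(void);
extern void     schedule_refresh(uint16_t block);

void retention_tick(void)
{
    uint32_t now = rtc_seconds();
    for (uint16_t b = 0; b < NUM_BLOCKS; b++)
        if (now - last_program_time[b] > RETENTION_LIMIT_S)
            schedule_refresh(b);    /* reprogramming restores the charge */
}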
>>> All these 'functions' for handling the Flash seem to me well suited to
>>> software but not to hardware. Does anyone here have a different opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).
Space exploration is not economical (yet). ;)

-- 
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
Hi Don,

In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
[]
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed
>> at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly
> as *writing*/erase. In most implementations, this isn't a problem
> because you're reading the block *into* RAM and then accessing it from
> RAM. But, if you just keep reading blocks repeatedly, you'll discover
> your ECC becoming increasingly more active/aggressive in "fixing" the
> degrading NAND cells.
Reading does not cause *fatigue* in the sense that it does not wear the
device. The effect is referred to as 'read disturb', and it may cause
errors in pages other than the one being read. With multiple readings of
the same page you may end up inducing so many errors that your ECC
cannot cope when you try to access the *other* pages.

These sorts of problems, though, only show up when we talk about read
cycle counts in the hundreds of thousands if not millions (google: The
Inconvenient Truths of NAND Flash Memory).
> So, either KNOW that your access patterns (read and write) *won't*
> disturb the array. *Or*, actively manage it by "refreshing" content
> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
We have to cope with bit flips anyway (low Earth orbit), so we are
obliged to scrub the memory. To avoid error accumulation we move the
entire block, update the LBA and erase the affected block, so it becomes
available again.
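The move/update/erase step might look roughly like this in C (a sketch
only: the nand_* helpers, lba_table and free-pool functions are
assumptions, consistent with the relocate_block placeholder sketched
earlier in the thread, not an existing API):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical relocation matching the description above: copy the
 * (ECC-corrected) data to a free block, point the logical block at it,
 * then erase the old one so it returns to the free pool. */
extern int  nand_write_block(uint16_t phys, const uint8_t *buf, size_t n);
extern int  nand_erase_block(uint16_t phys);
extern uint16_t alloc_free_block(void);
extern void free_block(uint16_t phys);
extern void mark_bad(uint16_t phys);

extern uint16_t lba_table[];   /* logical -> physical map */

int relocate_block(uint16_t lba, const uint8_t *buf, size_t n)
{
    uint16_t old = lba_table[lba];
    uint16_t dst = alloc_free_block();

    if (nand_write_block(dst, buf, n) != 0) {
        mark_bad(dst);          /* destination turned out to be bad */
        return -1;              /* caller retries with another block */
    }
    lba_table[lba] = dst;       /* update the LBA mapping */

    if (nand_erase_block(old) == 0)
        free_block(old);        /* back into rotation */
    else
        mark_bad(old);          /* erase failure: retire it for good */
    return 0;
}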
>>> All these 'functions' for handling the Flash seem to me well suited to
>>> software but not to hardware. Does anyone here have a different opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).
Well, according to our latest estimates we are at about 30% cell usage
on an AX2000 (2 Mgates), without including any scrubbing (yet), but
including the bad block management.

Al
Hi Al,

On 1/13/2015 11:51 AM, alb wrote:
> In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
> []
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed
>>> at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly
>> as *writing*/erase. In most implementations, this isn't a problem
>> because you're reading the block *into* RAM and then accessing it from
>> RAM. But, if you just keep reading blocks repeatedly, you'll discover
>> your ECC becoming increasingly more active/aggressive in "fixing" the
>> degrading NAND cells.
>
> Reading does not cause *fatigue* in the sense that it does not wear the
Yes, sorry -- I was being imprecise. My point was that it alters the data in the device in a manner that will eventually cause data LOSS. Of course, the effects are even more pronounced on MLC where the number of electrons is smaller for any given 'state'.
> device. The effect is referred to as 'read disturb', and it may cause
> errors in pages other than the one being read. With multiple readings
> of the same page you may end up inducing so many errors that your ECC
> cannot cope when you try to access the *other* pages.
>
> These sorts of problems, though, only show up when we talk about read
> cycle counts in the hundreds of thousands if not millions (google: The
> Inconvenient Truths of NAND Flash Memory).
The numbers are only half the story. I can use a device for YEARS that exhibits problems after just *hundreds* of cycles -- if I don't burn those hundreds of cycles in those "years"! OTOH, something that will ONLY manifest after a million cycles can plague a design in *minutes* if the application hammers away at it. That's why:
>> So, either KNOW that your access patterns (read and write) *won't*
>> disturb the array. *Or*, actively manage it by "refreshing" content
>> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
>
> We have to cope with bit flips anyway (low Earth orbit), so we are
> obliged to scrub the memory. To avoid error accumulation we move the
> entire block, update the LBA and erase the affected block, so it
> becomes available again.
Bit flips can be handled probabilistically -- you can model how often you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH, complicate that with a *dubious* data image and the reliability and predictability falls markedly.
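As a back-of-envelope version of that model (the SEU rate, page size,
scrub interval and ECC strength below are invented numbers, purely to
show the Poisson-style arithmetic, not figures for any real device):

#include <math.h>
#include <stdio.h>

/* How likely is a page to accumulate more flips than the ECC can
 * correct between scrubs?  Poisson model: P(k flips) = e^-m m^k / k!. */
int main(void)
{
    double rate_per_bit_day = 1e-10;     /* assumed SEU rate in LEO */
    double page_bits        = 2048 * 8;  /* assumed 2 KB page */
    double scrub_days       = 7.0;       /* assumed scrub interval */
    int    ecc_correctable  = 4;         /* assumed ECC strength, bits */

    double m = rate_per_bit_day * page_bits * scrub_days;  /* mean flips */
    double p_ok = 0.0, term = exp(-m);   /* term starts at P(0) */
    for (int k = 0; k <= ecc_correctable; k++) {
        p_ok += term;                    /* accumulate P(0..k) */
        term *= m / (k + 1);             /* advance to P(k+1) */
    }
    printf("mean flips per page per interval:   %.3g\n", m);
    printf("P(uncorrectable before next scrub): %.3g\n", 1.0 - p_ok);
    return 0;
}

A dubious block, as noted, breaks this model: its error arrivals are no
longer the rare, independent events the arithmetic assumes.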
>>>> All these 'functions' for handling the Flash seem to me well suited
>>>> to software but not to hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Well, according to our latest estimates we are at about 30% cell usage
> on an AX2000 (2 Mgates), without including any scrubbing (yet), but
> including the bad block management.
Remember, if you are too naive in your implementation, you can increase
*wear*. To get a good algorithm, you probably want to track knowledge of
the entire *device* -- not just the RECENT history of this block/page.
(I.e., where did the page that *was* here go? And why? If it had an
unusually high error rate, you might not be so keen on bringing it back
into the rotation -- ever!) It seems like a lot of "state" to manage in
a dedicated piece of hardware (that you can't *service*!)
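The kind of per-block record implied here might be sketched like this
(fields, sizes and the retirement policy are illustrative only; a real
design also has to keep this table somewhere that survives power loss,
and scrub the table itself):

#include <stdint.h>

struct block_history {
    uint32_t erase_count;      /* wear-leveling input */
    uint32_t read_count;       /* read-disturb input */
    uint16_t corrected_errs;   /* how hard ECC has been working here */
    uint16_t relocated_to;     /* where this block's data went, if moved */
    uint8_t  retired;          /* 1 = never bring back into rotation */
};

static struct block_history history[1024];  /* one entry per block */

/* Example policy: a block whose corrected-error tally keeps climbing
 * gets retired permanently instead of being re-erased and reused. */
void note_corrected(uint16_t blk, uint16_t nerrs, uint16_t retire_at)
{
    history[blk].corrected_errs += nerrs;
    if (history[blk].corrected_errs >= retire_at)
        history[blk].retired = 1;
}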
Hi Boudewijn,

On 1/13/2015 2:17 AM, Boudewijn Dijkstra wrote:

>>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>>> to avoid accumulation of errors.
>>>
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly as
>> *writing*/erase.
>
> Indeed; my apologies. Performing many reads before an erase will indeed
> cause bit errors that can be repaired by reprogramming. What I wanted to
> say, but misremembered, is that *not* reading over extended periods may
> also cause bit errors, due to charge leakage. This can also be repaired
> by reprogramming. (ref: Micron TN2917)
Yes, it's amazing how many of the issues that were troublesome in OLD
technologies have modern-day equivalents! E.g., "print through" for
tape; write-restore-after-read for core; etc.
>>>> All these 'functions' for handling the Flash seem to me well suited
>>>> to software but not to hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Space exploration is not economical (yet). ;)
<frown> Wise ass! :> Yes, I meant "economical" in terms of device
complexity. The more complex the device required for a given
functionality, the less reliable it is (in an environment where you
don't get second chances).
alb wrote:

> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so?
How are you going to configure your FPGA - is that going to be
FLASH-based as well and, if so, could the configuration memory for the
FPGA suffer from corruption?

---------------------------------------
Posted through http://www.EmbeddedRelated.com
Hi,

srl100 <76083@embeddedrelated> wrote:
[]
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so?
>
> How are you going to configure your FPGA - is that going to be
> FLASH-based as well and, if so, could the configuration memory for the
> FPGA suffer from corruption?
In the current project we are using antifuse-based technology, so there
is no concern about configuring the FPGA. Even in the case of
flash-based technology (e.g. RT ProASIC), the flash cell is radically
different from the one used in high-density memories. First, there's no
need for high density, and the cell size is 0.25 um, not 16 nm!
Secondly, there's no need to 'read' a flash cell in a flash-based FPGA;
certainly there are limitations in the writing/erasure process, which
may cause wear due to tunneling of charge across the insulator.

Coming back to the point, NAND flash topologies have multiple nasty
radiation effects that may increase the handling complexity (SEFI, SEU,
SEL, to mention a few). Considering the criticality of the function (we
will store configuration that is critical to operating the mission
successfully), I'd say it would be much more reliable to dedicate a
software stack like a Flash Translation Layer (FTL) to this rather than
do it with a (rather complex) state machine... but that is only a gut
feeling, and making the call is not going to be an easy task.

Al