EmbeddedRelated.com Forums

[cross-post] nand flash bad blocks management

Started by alb January 12, 2015
Hi everyone,

We have ~128 Mbit of configuration data to be stored in a Flash device,
and for reasons related to qualification (HiRel application) we are more
inclined to use NAND technology instead of NOR. Unfortunately
NAND flash suffers from bad blocks, which may also develop during the
lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there 
are two essential strategies to cope with the issue of bad blocks:

1. skip block
2. reserved block

The first one skips a block whenever it is bad and writes to the first
free one, also updating the logical block addressing (LBA). The second
strategy reserves a dedicated area into which bad blocks are remapped;
in this case the LBA must be kept updated as well.
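To make the difference concrete, a reserved-block remap might look
roughly like the C sketch below. This is a minimal sketch only: the
block counts, the NO_REMAP sentinel and all names are illustrative, not
taken from any particular part's datasheet.

/* Hypothetical reserved-block scheme: the last RESERVED_BLOCKS blocks
 * of the device are held back, and a small table maps each bad user
 * block to one of them.  Sizes are illustrative only. */
#include <stdint.h>

#define TOTAL_BLOCKS    1024u
#define RESERVED_BLOCKS   20u
#define USER_BLOCKS     (TOTAL_BLOCKS - RESERVED_BLOCKS)
#define NO_REMAP        0xFFFFu

static uint16_t remap[USER_BLOCKS];   /* filled with NO_REMAP at mount */
static uint16_t next_spare = USER_BLOCKS;

/* Translate a logical block to the physical block actually used. */
static uint16_t phys_block(uint16_t lba)
{
    return (remap[lba] == NO_REMAP) ? lba : remap[lba];
}

/* Called when a block goes bad: park it behind a spare, if any left. */
static int retire_block(uint16_t lba)
{
    if (next_spare >= TOTAL_BLOCKS)
        return -1;                    /* spares exhausted */
    remap[lba] = next_spare++;
    return 0;
}

Skip-block, by contrast, needs no spare pool at this level: the
translation simply walks past blocks marked bad, so the usable logical
capacity shrinks as blocks die.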

I do not see much of a difference between the two strategies except
that in case 1 I need to 'search' for the first available free block,
while in case 2 I have reserved a special area for it. Am I missing
any other major difference?

The second question I have is about 'management'. I do not have a 
software stack to perform the management of these bad blocks and I'm 
obliged to do it with my FPGA. Does anyone here see any potential risk 
in doing so? Would I be better off dedicating a small footprint 
controller in the FPGA to handle the Flash Translation Layer with wear 
leveling and bad block management? Can anyone here point me to some 
IP cores readily available for doing this?

There's a high chance I will need to implement some sort of 'scrubbing' 
to avoid accumulation of errors. All these 'functions' for handling the
Flash seem to me well suited to software but not to hardware. Does
anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers,

Al

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
> Hi everyone,
>
> We have ~128 Mbit of configuration data to be stored in a Flash device,
> and for reasons related to qualification (HiRel application) we are
> more inclined to use NAND technology instead of NOR. Unfortunately
> NAND flash suffers from bad blocks, which may also develop during the
> lifetime of the component and have to be handled.
>
> I've read something about bad block management and it looks like there
> are two essential strategies to cope with the issue of bad blocks:
>
> 1. skip block
> 2. reserved block
>
> The first one skips a block whenever it is bad and writes to the first
> free one, also updating the logical block addressing (LBA). The second
> strategy reserves a dedicated area into which bad blocks are remapped;
> in this case the LBA must be kept updated as well.
>
> I do not see much of a difference between the two strategies except
> that in case 1 I need to 'search' for the first available free block,
> while in case 2 I have reserved a special area for it. Am I missing
> any other major difference?
The second strategy is required when the total logical storage capacity must be constant. I can imagine the existence of 'bad sectors' degrading performance on some filesystems.
> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so? Would I be better off dedicating a small footprint
> controller in the FPGA to handle the Flash Translation Layer with wear
> leveling and bad block management? Can anyone here point me to some
> IP cores readily available for doing this?
Sounds like you're re-inventing eMMC.
> There's a high chance I will need to implement some sort of 'scrubbing'
> to avoid accumulation of errors.
Indeed regular reading (and IIRC also writing) can increase the longevity of the device. But it is up to you whether that is needed at all.
> All these 'functions' for handling the Flash seem to me well suited to
> software but not to hardware. Does anyone here have a different opinion?
AFAIK, (e)MMC devices all have a small microcontroller inside.

-- 
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
Hi Boudewijn,

In comp.arch.embedded Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
[]
>> I've read something about bad block management and it looks like there
>> are two essential strategies to cope with the issue of bad blocks:
>>
>> 1. skip block
>> 2. reserved block
>>
>> The first one skips a block whenever it is bad and writes to the first
>> free one, also updating the logical block addressing (LBA). The second
>> strategy reserves a dedicated area into which bad blocks are remapped;
>> in this case the LBA must be kept updated as well.
>>
>> I do not see much of a difference between the two strategies except
>> that in case 1 I need to 'search' for the first available free block,
>> while in case 2 I have reserved a special area for it. Am I missing
>> any other major difference?
> The second strategy is required when the total logical storage capacity
> must be constant. I can imagine the existence of 'bad sectors' degrading
> performance on some filesystems.
OK, that's a valid point: since I declare as user space only the total
minus the reserved area, the user may rely on that figure staying
constant. But the total number of bad blocks for the quoted endurance
will be exactly the same either way; neither strategy wears the device
less.
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.
I didn't know there was a name for that. Well, if that's so, then yes,
but it's not for storing your birthday pictures; it's for a space
application. Even if there are several 'experiments' running in low
orbit with NAND flash components, I do not know of any operational
satellite (e.g. for meteorology or similar) that flies anything like
this.
>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.
I'm not aiming to increase longevity. I'm aiming to guarantee that the
system will cope with the expected bit flips and still meet mission
objectives throughout the intended lifecycle (7.5 years on orbit).
Scrubbing is not so complicated: you read, correct and write back. But
hitting a bad block during the rewrite, while you have tons of other
things to do in the meanwhile, may have some side effects... to be
evaluated and handled.
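A minimal sketch of that read-correct-write-back pass, in C; all the
nand_* names, the ECC status codes and the retry bound are placeholders
for whatever the real controller provides, not an existing API:

#include <stdint.h>
#include <stddef.h>

enum ecc_status { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Placeholders for whatever the real controller/driver provides. */
extern enum ecc_status nand_read_block(uint16_t lba, uint8_t *buf, size_t n);
extern int  relocate_block(uint16_t lba, const uint8_t *buf, size_t n);
extern void log_data_loss(uint16_t lba);

#define MAX_RELOCATE_TRIES 4

void scrub_block(uint16_t lba, uint8_t *buf, size_t blk_size)
{
    enum ecc_status st = nand_read_block(lba, buf, blk_size);

    if (st == ECC_CLEAN)
        return;                        /* nothing has accumulated yet */

    if (st == ECC_UNCORRECTABLE) {
        log_data_loss(lba);            /* beyond ECC: flag it upstream */
        return;
    }

    /* ECC_CORRECTED: 'buf' holds good data but the array copy is
     * decaying.  NAND can't be reprogrammed in place without an erase,
     * so move the data to a fresh block.  If the rewrite itself hits a
     * newly bad block (the "side effect" case above), retry elsewhere;
     * relocate_block() is assumed to mark a failed destination bad. */
    for (int tries = 0; tries < MAX_RELOCATE_TRIES; tries++)
        if (relocate_block(lba, buf, blk_size) == 0)
            return;

    log_data_loss(lba);                /* ran out of usable blocks */
}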
>> All these 'functions' for handling the Flash seem to me well suited to
>> software but not to hardware. Does anyone here have a different opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.
It does not surprise me. But I have the requirement not to include
*any* software onboard! I may manage to let an embedded microcontroller
with a hardcoded instruction list slip through, but I'm not so sure.

Al
Hi Boudewijn,

On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.
>
>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.
Um, *reading* also causes fatigue in the array -- just not as quickly as
*writing*/erase. In most implementations, this isn't a problem because
you're reading the block *into* RAM and then accessing it from RAM.
But, if you just keep reading blocks repeatedly, you'll discover your
ECC becoming increasingly more active/aggressive in "fixing" the
degrading NAND cells.

So, either KNOW that your access patterns (read and write) *won't*
disturb the array. *Or*, actively manage it by "refreshing" content
after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
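As a rough illustration of that bookkeeping (a sketch only: the
per-block counter, the threshold and schedule_refresh() are assumptions,
and 100K is the ballpark above, not a datasheet figure):

#include <stdint.h>

#define NUM_BLOCKS      1024u
#define READ_THRESHOLD  100000ul

/* One read counter per block, cleared whenever the block is rewritten. */
static uint32_t reads_since_erase[NUM_BLOCKS];

extern void schedule_refresh(uint16_t block);  /* move + erase, elsewhere */

void note_read(uint16_t block)
{
    if (++reads_since_erase[block] >= READ_THRESHOLD) {
        reads_since_erase[block] = 0;
        schedule_refresh(block);   /* rewrite before disturb beats ECC */
    }
}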
>> All these 'functions' for handling the Flash seem to me well suited to
>> software but not to hardware. Does anyone here have a different opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.
I can't see an *economical* way of doing this (in anything less than huge volumes) with dedicated hardware (e.g., FPGA).
On Tue, 13 Jan 2015 01:03:45 +0100, Don Y <this@is.not.me.com> wrote:
> On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
>> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>>
>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>> to avoid accumulation of errors.
>>
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly as
> *writing*/erase.
Indeed; my apologies. Performing many reads before an erase will indeed
cause bit errors that can be repaired by reprogramming. What I wanted to
say, but misremembered, is that *not* reading over extended periods may
also cause bit errors, due to charge leakage. This can also be repaired
by reprogramming. (ref: Micron TN2917)
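A sketch of the time-based counterpart to a read counter (all names and
the interval are made up; the real retention figure would come from the
device datasheet and the mission's temperature/radiation profile):

#include <stdint.h>

#define NUM_BLOCKS        1024u
#define RETENTION_LIMIT_S (180ul * 24 * 3600)   /* placeholder: ~6 months */

/* Updated by the write path whenever a block is (re)programmed. */
static uint32_t last_program_time[NUM_BLOCKS];  /* seconds, from an RTC */

extern uint32_t rtc_seconds(void);
extern void     schedule_refresh(uint16_t block);

void retention_tick(void)
{
    uint32_t now = rtc_seconds();
    for (uint16_t b = 0; b < NUM_BLOCKS; b++)
        if (now - last_program_time[b] > RETENTION_LIMIT_S)
            schedule_refresh(b);    /* reprogramming restores the charge */
}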
>>> All these 'functions' for handling the Flash seem to me well suited to
>>> software but not to hardware. Does anyone here have a different opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).
Space exploration is not economical (yet). ;)

-- 
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
Hi Don,

In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
[]
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed
>> at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly
> as *writing*/erase. In most implementations, this isn't a problem
> because you're reading the block *into* RAM and then accessing it from
> RAM. But, if you just keep reading blocks repeatedly, you'll discover
> your ECC becoming increasingly more active/aggressive in "fixing" the
> degrading NAND cells.
Reading does not cause *fatigue* in the sense that it does not wear the
device. The effect is referred to as 'read disturb', and it may cause
errors in pages other than the one being read. With multiple readings of
the same page you may end up inducing so many errors that your ECC
cannot cope when you try to access the *other* pages.

These sorts of problems, though, only show up when we talk about read
cycle counts in the hundreds of thousands if not millions (google: The
Inconvenient Truths of NAND Flash Memory).
> So, either KNOW that your access patterns (read and write) *won't*
> disturb the array. *Or*, actively manage it by "refreshing" content
> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
We have to cope with bit flips anyway (low Earth orbit), so we are
obliged to scrub the memory. To avoid error accumulation we move the
entire block, update the LBA and erase the affected block, so it becomes
available again.
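The move/update/erase step might look roughly like this in C (a sketch
only: the nand_* helpers, lba_table and free-pool functions are
assumptions, consistent with the relocate_block placeholder sketched
earlier in the thread, not an existing API):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical relocation matching the description above: copy the
 * (ECC-corrected) data to a free block, point the logical block at it,
 * then erase the old one so it returns to the free pool. */
extern int  nand_write_block(uint16_t phys, const uint8_t *buf, size_t n);
extern int  nand_erase_block(uint16_t phys);
extern uint16_t alloc_free_block(void);
extern void free_block(uint16_t phys);
extern void mark_bad(uint16_t phys);

extern uint16_t lba_table[];   /* logical -> physical map */

int relocate_block(uint16_t lba, const uint8_t *buf, size_t n)
{
    uint16_t old = lba_table[lba];
    uint16_t dst = alloc_free_block();

    if (nand_write_block(dst, buf, n) != 0) {
        mark_bad(dst);          /* destination turned out to be bad */
        return -1;              /* caller retries with another block */
    }
    lba_table[lba] = dst;       /* update the LBA mapping */

    if (nand_erase_block(old) == 0)
        free_block(old);        /* back into rotation */
    else
        mark_bad(old);          /* erase failure: retire it for good */
    return 0;
}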
>>> All these 'functions' for handling the Flash seem to me well suited to
>>> software but not to hardware. Does anyone here have a different opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).
Well, according to our latest estimates we are at about 30% cell usage
on an AX2000 (2 Mgates), without including any scrubbing (yet), but
including the bad block management.

Al
Hi Al,

On 1/13/2015 11:51 AM, alb wrote:
> In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
> []
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed
>>> at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly
>> as *writing*/erase. In most implementations, this isn't a problem
>> because you're reading the block *into* RAM and then accessing it from
>> RAM. But, if you just keep reading blocks repeatedly, you'll discover
>> your ECC becoming increasingly more active/aggressive in "fixing" the
>> degrading NAND cells.
>
> Reading does not cause *fatigue* in the sense that it does not wear the
Yes, sorry -- I was being imprecise. My point was that it alters the data in the device in a manner that will eventually cause data LOSS. Of course, the effects are even more pronounced on MLC where the number of electrons is smaller for any given 'state'.
> device. The effect is referred to as 'read disturb', and it may cause
> errors in pages other than the one being read. With multiple readings
> of the same page you may end up inducing so many errors that your ECC
> cannot cope when you try to access the *other* pages.
>
> These sorts of problems, though, only show up when we talk about read
> cycle counts in the hundreds of thousands if not millions (google: The
> Inconvenient Truths of NAND Flash Memory).
The numbers are only half the story. I can use a device for YEARS that exhibits problems after just *hundreds* of cycles -- if I don't burn those hundreds of cycles in those "years"! OTOH, something that will ONLY manifest after a million cycles can plague a design in *minutes* if the application hammers away at it. That's why:
>> So, either KNOW that your access patterns (read and write) *won't*
>> disturb the array. *Or*, actively manage it by "refreshing" content
>> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
>
> We have to cope with bit flips anyway (low Earth orbit), so we are
> obliged to scrub the memory. To avoid error accumulation we move the
> entire block, update the LBA and erase the affected block, so it
> becomes available again.
Bit flips can be handled probabilistically -- you can model how often you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH, complicate that with a *dubious* data image and the reliability and predictability falls markedly.
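As a back-of-envelope version of that model (the SEU rate, page size,
scrub interval and ECC strength below are invented numbers, purely to
show the Poisson-style arithmetic, not figures for any real device):

#include <math.h>
#include <stdio.h>

/* How likely is a page to accumulate more flips than the ECC can
 * correct between scrubs?  Poisson model: P(k flips) = e^-m m^k / k!. */
int main(void)
{
    double rate_per_bit_day = 1e-10;     /* assumed SEU rate in LEO */
    double page_bits        = 2048 * 8;  /* assumed 2 KB page */
    double scrub_days       = 7.0;       /* assumed scrub interval */
    int    ecc_correctable  = 4;         /* assumed ECC strength, bits */

    double m = rate_per_bit_day * page_bits * scrub_days;  /* mean flips */
    double p_ok = 0.0, term = exp(-m);   /* term starts at P(0) */
    for (int k = 0; k <= ecc_correctable; k++) {
        p_ok += term;                    /* accumulate P(0..k) */
        term *= m / (k + 1);             /* advance to P(k+1) */
    }
    printf("mean flips per page per interval:   %.3g\n", m);
    printf("P(uncorrectable before next scrub): %.3g\n", 1.0 - p_ok);
    return 0;
}

A dubious block, as noted, breaks this model: its error arrivals are no
longer the rare, independent events the arithmetic assumes.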
>>>> All these 'functions' for handling the Flash seem to me well suited
>>>> to software but not to hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Well, according to our latest estimates we are at about 30% cell usage
> on an AX2000 (2 Mgates), without including any scrubbing (yet), but
> including the bad block management.
Remember, if you are too naive in your implementation, you can increase
*wear*. To get a good algorithm, you probably want to track knowledge of
the entire *device* -- not just the RECENT history of this block/page.
(I.e., where did the page that *was* here go? And why? If it had an
unusually high error rate, you might not be so keen on bringing it back
into the rotation -- ever!) It seems like a lot of "state" to manage in
a dedicated piece of hardware (that you can't *service*!)
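The kind of per-block record implied here might be sketched like this
(fields, sizes and the retirement policy are illustrative only; a real
design also has to keep this table somewhere that survives power loss,
and scrub the table itself):

#include <stdint.h>

struct block_history {
    uint32_t erase_count;      /* wear-leveling input */
    uint32_t read_count;       /* read-disturb input */
    uint16_t corrected_errs;   /* how hard ECC has been working here */
    uint16_t relocated_to;     /* where this block's data went, if moved */
    uint8_t  retired;          /* 1 = never bring back into rotation */
};

static struct block_history history[1024];  /* one entry per block */

/* Example policy: a block whose corrected-error tally keeps climbing
 * gets retired permanently instead of being re-erased and reused. */
void note_corrected(uint16_t blk, uint16_t nerrs, uint16_t retire_at)
{
    history[blk].corrected_errs += nerrs;
    if (history[blk].corrected_errs >= retire_at)
        history[blk].retired = 1;
}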
Hi Boudewijn,

On 1/13/2015 2:17 AM, Boudewijn Dijkstra wrote:

>>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>>> to avoid accumulation of errors.
>>>
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly as
>> *writing*/erase.
>
> Indeed; my apologies. Performing many reads before an erase will indeed
> cause bit errors that can be repaired by reprogramming. What I wanted to
> say, but misremembered, is that *not* reading over extended periods may
> also cause bit errors, due to charge leakage. This can also be repaired
> by reprogramming. (ref: Micron TN2917)
Yes, it's amazing how many of the issues that were troublesome in OLD
technologies have modern-day equivalents! E.g., "print through" for
tape; write-restore-after-read for core; etc.
>>>> All these 'functions' for handling the Flash seem to me well suited
>>>> to software but not to hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Space exploration is not economical (yet). ;)
<frown> Wise ass! :> Yes, I meant "economical" in terms of device
complexity. The more complex the device required for a given
functionality, the less reliable it is (in an environment where you
don't get second chances).
alb wrote:

> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so?
How are you going to configure your FPGA - is that going to be
FLASH-based as well and, if so, could the configuration memory for the
FPGA suffer from corruption?

---------------------------------------
Posted through http://www.EmbeddedRelated.com
Hi,

srl100 <76083@embeddedrelated> wrote:
[]
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so?
>
> How are you going to configure your FPGA - is that going to be
> FLASH-based as well and, if so, could the configuration memory for the
> FPGA suffer from corruption?
In the current project we are using antifuse-based technology, so there
is no concern about configuring the FPGA. Even in the case of
flash-based technology (e.g. RT ProASIC), the flash cell is radically
different from the one used in high-density memories. First, there's no
need for high density, and the cell size is 0.25 um, not 16 nm!
Secondly, there's no need to 'read' a flash cell in a flash-based FPGA;
certainly there are limitations in the writing/erasure process, which
may cause wear due to tunneling of charge across the insulator.

Coming back to the point, NAND flash topologies have multiple nasty
radiation effects that may increase the handling complexity (SEFI, SEU,
SEL, to mention a few). Considering the criticality of the function (we
will store configuration that is critical to operating the mission
successfully), I'd say it would be much more reliable to dedicate a
software stack like a Flash Translation Layer (FTL) to this rather than
do it with a (rather complex) state machine... but that is only a gut
feeling, and making the call is not going to be an easy task.

Al