[cross-post] nand flash bad blocks management
Started by ●January 12, 2015

Hi everyone,

We have ~128Mbit of configuration to be stored in a Flash device, and for reasons related to qualification (HiRel application) we are more inclined towards NAND technology instead of NOR. Unfortunately NAND flash suffers from bad blocks, which may also develop during the lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there are two essential strategies to cope with the issue of bad blocks:

1. skip block
2. reserved block

The first one skips a block whenever it is bad and writes to the first free one, also updating the logical block addressing (LBA). The second strategy reserves a dedicated area into which bad blocks are remapped; in this case the LBA shall be kept updated as well.

I do not see much of a difference between the two strategies except that in case 1 I need to 'search' for the first available free block, while in case 2 I have reserved a special area for it. Am I missing any other major difference?

The second question I have is about 'management'. I do not have a software stack to perform the management of these bad blocks and I'm obliged to do it with my FPGA. Does anyone here see any potential risk in doing so? Would I be better off dedicating a small-footprint controller in the FPGA to handle the Flash Translation Layer with wear leveling and bad block management? Can anyone here point me to some IP cores readily available for doing this?

There's a high chance I will need to implement some sort of 'scrubbing' to avoid accumulation of errors. All these 'functions' to handle the Flash seem to me very well suited for software but not for hardware. Does anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers, Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Reply by ●January 12, 2015
On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:

> Hi everyone,
>
> We have ~128Mbit of configuration to be stored in a Flash device and for
> reasons related to qualification (HiRel application) we are more
> inclined to the use of NAND technology instead of NOR. Unfortunately
> NAND flash suffers from bad blocks, which may also develop during the
> lifetime of the component and have to be handled.
>
> I've read something about bad block management and it looks like there
> are two essential strategies to cope with the issue of bad blocks:
>
> 1. skip block
> 2. reserved block
>
> The first one will skip a block whenever it is bad and write on the first
> free one, updating also the logical block addressing (LBA). While the
> second strategy reserves a dedicated area to remap the bad blocks. In
> this second case the LBA shall be kept updated as well.
>
> I do not see much of a difference between the two strategies except the
> fact that in case 1 I need to 'search' for the first available free
> block, while in case 2 I reserved a special area for it. Am I
> missing any other major difference?

The second strategy is required when the total logical storage capacity must be constant. I can imagine the existence of 'bad sectors' degrading performance on some filesystems.

> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so? Would I be better off dedicating a small footprint
> controller in the FPGA to handle the Flash Translation Layer with wear
> leveling and bad block management? Can anyone here point me to some
> IP cores readily available for doing this?

Sounds like you're re-inventing eMMC.

> There's a high chance I will need to implement some sort of 'scrubbing'
> to avoid accumulation of errors.

Indeed, regular reading (and IIRC also writing) can increase the longevity of the device. But it is up to you whether that is needed at all.

> All these 'functions' to handle the Flash seem to me very suited for
> software but not for hardware. Does anyone here have a different
> opinion?

AFAIK, (e)MMC devices all have a small microcontroller inside.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
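The difference between the two strategies quoted above can be sketched in a few lines of C. This is a toy model: the sizes, the `lba_map` table and the function names are all made up for illustration, not taken from any vendor's scheme.

```c
#include <stdint.h>
#include <stdbool.h>

#define TOTAL_BLOCKS    1024
#define RESERVED_BLOCKS 24                    /* spare pool for strategy 2 */
#define USER_BLOCKS     (TOTAL_BLOCKS - RESERVED_BLOCKS)

static uint16_t lba_map[USER_BLOCKS];         /* logical -> physical block */
static bool     block_bad[TOTAL_BLOCKS];      /* factory/grown bad marks   */
static bool     spare_used[TOTAL_BLOCKS];     /* strategy 2 bookkeeping    */

/* Strategy 1 (skip block): assign each logical block to the next
 * physical block that is not marked bad -- the 'search' is a linear
 * walk over the device. */
static int build_skip_map(void)
{
    uint16_t phys = 0;
    for (uint16_t lb = 0; lb < USER_BLOCKS; lb++) {
        while (phys < TOTAL_BLOCKS && block_bad[phys])
            phys++;                           /* skip over bad blocks */
        if (phys >= TOTAL_BLOCKS)
            return -1;                        /* out of good blocks */
        lba_map[lb] = phys++;
    }
    return 0;
}

/* Strategy 2 (reserved block): logical blocks start out 1:1 in the
 * user area; a block that goes bad is remapped into the spare pool. */
static int remap_to_spare(uint16_t lb)
{
    for (uint16_t r = USER_BLOCKS; r < TOTAL_BLOCKS; r++) {
        if (!block_bad[r] && !spare_used[r]) {
            spare_used[r] = true;
            lba_map[lb]   = r;
            return 0;
        }
    }
    return -1;                                /* spare pool exhausted */
}
```

Either way, reads and writes go through the logical-to-physical map; the two strategies differ only in where the replacement physical block comes from.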
Reply by ●January 12, 2015
Hi Boudewijn,

In comp.arch.embedded Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
[]
>> I've read something about bad block management and it looks like there
>> are two essential strategies to cope with the issue of bad blocks:
>>
>> 1. skip block
>> 2. reserved block
>>
>> The first one will skip a block whenever it is bad and write on the first
>> free one, updating also the logical block addressing (LBA). While the
>> second strategy reserves a dedicated area to remap the bad blocks. In
>> this second case the LBA shall be kept updated as well.
>>
>> I do not see much of a difference between the two strategies except the
>> fact that in case 1 I need to 'search' for the first available free
>> block, while in case 2 I reserved a special area for it. Am I
>> missing any other major difference?
>
> The second strategy is required when the total logical storage capacity
> must be constant. I can imagine the existence of 'bad sectors' degrading
> performance on some filesystems.

Ok, that's a valid point: since I declare as user space only the total minus the reserved area, the user may rely on that information. But the total number of bad blocks for the quoted endurance will be exactly the same either way; neither strategy wears the device less.

>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.

I didn't know there was a name for that. Well, if that's so, then yes, but it's not for storing your birthday pictures; it's for a space application. Even if there are several 'experiments' running in low orbit with NAND flash components, I do not know of any operational satellite (for meteo or similar) that has anything like this.

>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.

I'm not aiming to increase longevity. I'm aiming to guarantee that the system will cope with the expected bit flips and still meet mission objectives throughout the intended lifecycle (7.5 years on orbit). Scrubbing is not so complicated: you read, correct and write back. But doing so when you hit a bad block during the rewrite, while you have tons of other things to do in the meanwhile, may have some side effects... to be evaluated and handled.

>> All these 'functions' to handle the Flash seem to me very suited for
>> software but not for hardware. Does anyone here have a different
>> opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.

It does not surprise me. I have the requirement not to include *any* software onboard! I may let an embedded microcontroller with a hardcoded list of instructions slip through, but I'm not so sure.

Al
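The read-correct-rewrite cycle mentioned above can be sketched as follows. This is a hypothetical toy in C: the 'NAND' is just a RAM array and the 'ECC' is a stand-in that knows the demo pages should hold a fixed pattern; a real design would run Hamming/BCH over the page data plus spare area.

```c
#include <stdint.h>
#include <string.h>

#define PAGES     4
#define PAGE_SIZE 64
#define FILL      0xA5   /* demo: every page nominally holds this pattern */

static uint8_t sim_nand[PAGES][PAGE_SIZE];   /* simulated NAND array */

static void nand_read_page(uint32_t page, uint8_t *buf)
{
    memcpy(buf, sim_nand[page], PAGE_SIZE);
}

static void nand_program_page(uint32_t page, const uint8_t *buf)
{
    memcpy(sim_nand[page], buf, PAGE_SIZE);
}

/* Toy stand-in for real ECC: for this demo we "know" every page should
 * contain FILL, so any deviating byte counts as a correctable error. */
static int ecc_correct(uint8_t *buf)
{
    int fixed = 0;
    for (int i = 0; i < PAGE_SIZE; i++) {
        if (buf[i] != FILL) {
            buf[i] = FILL;
            fixed++;
        }
    }
    return fixed;
}

/* Scrub one page: read, correct, and rewrite only when errors were
 * found, so a clean pass costs no program cycles. */
static int scrub_page(uint32_t page)
{
    uint8_t buf[PAGE_SIZE];

    nand_read_page(page, buf);
    int fixed = ecc_correct(buf);
    if (fixed > 0)
        nand_program_page(page, buf);
    return fixed;             /* number of corrected errors */
}
```

The complication Al describes is exactly what this sketch leaves out: what happens when the rewrite itself hits a bad block.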
Reply by ●January 12, 2015
Hi Boudewijn,

On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?
>
> Sounds like you're re-inventing eMMC.
>
>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.
>
> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.

Um, *reading* also causes fatigue in the array -- just not as quickly as *writing*/erase. In most implementations, this isn't a problem because you're reading the block *into* RAM and then accessing it from RAM. But if you just keep reading blocks repeatedly, you'll discover your ECC becoming increasingly more active/aggressive in "fixing" the degrading NAND cells.

So, either KNOW that your access patterns (read and write) *won't* disturb the array. *Or*, actively manage it by "refreshing" content after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.

>> All these 'functions' to handle the Flash seem to me very suited for
>> software but not for hardware. Does anyone here have a different
>> opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.

I can't see an *economical* way of doing this (in anything less than huge volumes) with dedicated hardware (e.g., FPGA).
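The "refresh after lots of accesses" bookkeeping suggested above might look like this. Illustrative sketch only: the block count and the 100K threshold are placeholders taken from the order-of-magnitude figure in the post, not from any datasheet.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCKS          256
#define READ_REFRESH_AT 100000UL   /* placeholder read-disturb budget */

static uint32_t read_count[BLOCKS];

/* Count reads into a block; report when the accumulated read-disturb
 * exposure suggests the block should be refreshed (rewritten or moved).
 * The counter restarts once a refresh is signalled, on the assumption
 * that the caller actually performs the refresh. */
static bool note_read(uint16_t block)
{
    if (++read_count[block] >= READ_REFRESH_AT) {
        read_count[block] = 0;
        return true;               /* caller should refresh now */
    }
    return false;
}
```

In an FPGA this is one counter per block (or per bank, to save state) plus a comparator; the hard part is scheduling the refresh it requests alongside everything else.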
Reply by ●January 13, 2015
On Tue, 13 Jan 2015 01:03:45 +0100, Don Y <this@is.not.me.com> wrote:
> On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
>> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>
>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>> to avoid accumulation of errors.
>>
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly as
> *writing*/erase.

Indeed; my apologies. Performing many reads before an erase will indeed cause bit errors that can be repaired by reprogramming. What I wanted to say, but misremembered, is that *not* reading over extended periods may also cause bit errors, due to charge leakage. This can also be repaired by reprogramming. (ref: Micron TN2917)

>>> All these 'functions' to handle the Flash seem to me very suited for
>>> software but not for hardware. Does anyone here have a different
>>> opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).

Space exploration is not economical (yet). ;)

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail client: http://www.opera.com/mail/
Reply by ●January 13, 2015
Hi Don,

In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
[]
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed
>> at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly
> as *writing*/erase. In most implementations, this isn't a problem
> because you're reading the block *into* RAM and then accessing it from
> RAM. But, if you just keep reading blocks repeatedly, you'll discover
> your ECC becoming increasingly more active/aggressive in "fixing" the
> degrading NAND cells.

Reading does not cause *fatigue* in the sense that it does not wear the device. The effect is referred to as 'read disturb', which may cause errors in pages other than the one read. With multiple readings of the same page you may end up inducing so many errors that your ECC will not be able to cope when you try to access the *other* pages.

These sorts of problems, though, show up only when we talk about read cycle counts in the hundreds of thousands if not millions (google: The Inconvenient Truths of NAND Flash Memory).

> So, either KNOW that your access patterns (read and write) *won't*
> disturb the array. *Or*, actively manage it by "refreshing" content
> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.

We have to cope with bit flips anyway (low earth orbit), so we are obliged to scrub the memory. In order to avoid error accumulation we move the entire block, update the LBA and erase the affected one, so it becomes available again.

>>> All these 'functions' to handle the Flash seem to me very suited for
>>> software but not for hardware. Does anyone here have a different
>>> opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).

Well, according to our latest estimates we are at about 30% of cell usage on an AX2000 (2MGates), without including any scrubbing (yet), but including the bad block management.

Al
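The move-update-erase scrub step described above can be sketched as below. Toy model in C: the simulated device, sizes and names are made up for illustration, and the data being copied is assumed to have already been ECC-corrected.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCKS     16
#define BLOCK_SIZE 128

static uint8_t  nand[BLOCKS][BLOCK_SIZE];   /* simulated device */
static bool     in_use[BLOCKS];
static uint16_t lba_map[BLOCKS];            /* logical -> physical */

static int find_free_block(void)
{
    for (int b = 0; b < BLOCKS; b++)
        if (!in_use[b])
            return b;
    return -1;
}

/* Relocate one logical block: copy the (corrected) contents to a free
 * block, retarget the LBA map, then erase the source so it returns to
 * the free pool. Returns the new physical block, or -1 on failure. */
static int relocate(uint16_t lb)
{
    int dst = find_free_block();
    int src = lba_map[lb];

    if (dst < 0)
        return -1;
    memcpy(nand[dst], nand[src], BLOCK_SIZE);
    in_use[dst] = true;
    lba_map[lb] = (uint16_t)dst;
    memset(nand[src], 0xFF, BLOCK_SIZE);    /* NAND erase = all ones */
    in_use[src] = false;                    /* back in the free pool */
    return dst;
}
```

If the program step into `dst` fails, a real implementation would mark `dst` bad and retry with the next free block before touching `src`; the ordering (copy, retarget, then erase) is what keeps the data safe across that failure.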
Reply by ●January 13, 2015
Hi Al,

On 1/13/2015 11:51 AM, alb wrote:
> In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
> []
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed
>>> at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly
>> as *writing*/erase. In most implementations, this isn't a problem
>> because you're reading the block *into* RAM and then accessing it from
>> RAM. But, if you just keep reading blocks repeatedly, you'll discover
>> your ECC becoming increasingly more active/aggressive in "fixing" the
>> degrading NAND cells.
>
> Reading does not cause *fatigue* in the sense that it does not wear the

Yes, sorry -- I was being imprecise. My point was that it alters the data in the device in a manner that will eventually cause data LOSS. Of course, the effects are even more pronounced on MLC, where the number of electrons is smaller for any given 'state'.

> device. The effect is referred to as 'read disturb', which may cause
> errors in pages other than the one read. With multiple readings of the
> same page you may end up inducing so many errors that your ECC will not
> be able to cope when you try to access the *other* pages.
>
> These sorts of problems, though, show up only when we talk about read
> cycle counts in the hundreds of thousands if not millions
> (google: The Inconvenient Truths of NAND Flash Memory).

The numbers are only half the story. I can use a device for YEARS that exhibits problems after just *hundreds* of cycles -- if I don't burn those hundreds of cycles in those "years"! OTOH, something that will ONLY manifest after a million cycles can plague a design in *minutes* if the application hammers away at it. That's why:

>> So, either KNOW that your access patterns (read and write) *won't*
>> disturb the array. *Or*, actively manage it by "refreshing" content
>> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
>
> We have to cope with bit flips anyway (low earth orbit), so we are
> obliged to scrub the memory. In order to avoid error accumulation we
> move the entire block, update the LBA and erase the affected one, so it
> becomes available again.

Bit flips can be handled probabilistically -- you can model how often you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH, complicate that with a *dubious* data image and the reliability and predictability fall markedly.

>>>> All these 'functions' to handle the Flash seem to me very suited for
>>>> software but not for hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Well, according to our latest estimates we are at about 30% of cell usage
> on an AX2000 (2MGates), without including any scrubbing (yet), but
> including the bad block management.

Remember, if you are too naive in your implementation, you can increase *wear*. I think to get a good algorithm, you probably want to track knowledge of the entire *device* -- not just the RECENT history of this block/page. (i.e., where did the page that *was* here go? and, why? if it had an unusually high error rate, you might not be so keen on bringing it back into the rotation -- ever!) I.e., it seems like a lot of "state" to manage in a dedicated piece of hardware (that you can't *service*!)
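The whole-device history suggested above could be as simple as a per-block record. Illustrative sketch only: the retirement threshold is an arbitrary placeholder, and a real policy would likely weigh error rate against erase count rather than use a fixed cutoff.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCKS        256
#define RETIRE_ERRORS 8     /* placeholder: lifetime corrected-error cap */

struct block_hist {
    uint32_t erase_cycles;  /* wear history */
    uint32_t ecc_events;    /* corrected errors over the block's lifetime */
    bool     retired;       /* once set, never cleared */
};

static struct block_hist hist[BLOCKS];

/* Record one corrected-error event against a block and decide whether
 * the block may stay in rotation. A block whose lifetime error record
 * crosses the threshold is retired permanently rather than recycled. */
static bool block_still_usable(uint16_t b)
{
    if (++hist[b].ecc_events >= RETIRE_ERRORS)
        hist[b].retired = true;
    return !hist[b].retired;
}
```

The point of keeping this per-block (rather than only per recent operation) is exactly Don's: a block that keeps coming back with high error counts should eventually stop being offered as a spare.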
Reply by ●January 13, 2015
Hi Boudewijn,

On 1/13/2015 2:17 AM, Boudewijn Dijkstra wrote:
>>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>>> to avoid accumulation of errors.
>>>
>>> Indeed regular reading (and IIRC also writing) can increase the
>>> longevity of the device. But it is up to you whether that is needed at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly as
>> *writing*/erase.
>
> Indeed; my apologies. Performing many reads before an erase will indeed
> cause bit errors that can be repaired by reprogramming. What I wanted to
> say, but misremembered, is that *not* reading over extended periods may
> also cause bit errors, due to charge leakage. This can also be repaired
> by reprogramming. (ref: Micron TN2917)

Yes, it's amazing how many of the issues that were troublesome in OLD technologies have modern-day equivalents! E.g., "print through" for tape; write-restore-after-read for core; etc.

>>>> All these 'functions' to handle the Flash seem to me very suited for
>>>> software but not for hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Space exploration is not economical (yet). ;)

<frown> Wise ass! :> Yes, I meant "economical" in terms of device complexity. The more complex the device required for a given functionality, the less reliable (in an environment where you don't get second chances).
Reply by ●January 14, 2015
alb wrote:
> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so?

How are you going to configure your FPGA - is that going to be FLASH-based as well and, if so, could the configuration memory for the FPGA suffer from corruption?

---------------------------------------
Posted through http://www.EmbeddedRelated.com
Reply by ●January 14, 2015
Hi,

srl100 <76083@embeddedrelated> wrote:
[]
>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so?
>
> How are you going to configure your FPGA - is that going to be FLASH-based
> as well and, if so, could the configuration memory for the FPGA suffer from
> corruption?

In the current project we are using antifuse-based technology, so there is no concern about configuring the FPGA. Even in the case of flash-based technology (e.g. RT ProASIC) the flash cell is radically different from the one used in high-density memories. First, there's no need for high density, and the cell size is 0.25 um, not 16 nm! Secondly, there's no need to 'read' a flash cell in a flash-based FPGA; certainly there are limitations in the write/erase process, which may cause wear due to tunneling of charge across the insulator.

Coming back to the point, NAND flash technologies have multiple nasty radiation effects that may increase the handling complexity (SEFI, SEU, SEL, to mention a few). Considering the criticality of the function (we will store configuration critical to operating the mission successfully), I'd say it would be much more reliable to dedicate a software stack like a Flash Translation Layer (FTL) rather than do it with a - rather complex - state machine... but that is only a gut feeling, and making the call is not going to be an easy task.

Al