EmbeddedRelated.com
Forums
The 2026 Embedded Online Conference

factory-marked bad blocks lost

Started by alb February 24, 2015
Hi there,

/thanks/ to buggy control software and a potentially wrong approach to 
error handling, we have wiped out the factory-marked bad block 
information on a NAND flash device (Micron).

Now, according to the manufacturer (AN2917), if those blocks are used, 
they "may appear to operate normally but may cause other good blocks to 
fail or create additional system errors."

Since the above sentence is not more than just a "hey, we cannot 
guarantee a damned thing, so don't blame us if you screw up!", are there 
any techniques that I can use to recover?

I believe that on a critical system I would not take any risk and 
replace the chip, but should I care otherwise?

Why is a factory-marked bad block any different from any other block 
that goes bad after shipping? After all, if I continue to use a bad 
block I will sooner or later realize that it is bad and eventually mark 
it, so not a big deal.

Any pointer/suggestion is appreciated.

Al

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On 24 Feb 2015 07:43:45 GMT, al.basili@gmail.com (alb) wrote:

>Hi there,
>
>/thanks/ to a buggy control software and potentially a wrong approach in
>error handling, we have wiped out the factory-marked bad block
>information on a NAND flash device (Micron).
>
>Now, according to the manufacturer (AN2917), if those blocks are used,
>they "may appear to operate normally but may cause other good blocks to
>fail or create additional system errors."
>
>Since the above sentence is not more than just a "hey, we cannot
>guarantee a damned thing, so don't blame us if you screw up!", are there
>any techniques that I can use to recover?
>
>I believe that on a critical system I would not take any risk and
>replace the chip, but should I care otherwise?
>
>Why a factory-marked bad block is of any difference from any other block
>that goes bad after shipping? After all, if I continue to use a bad
>block I will sooner or later realize that is bad and eventually mark it,
>so not a big deal.
>
>Any pointer/suggestion is appreciated.

At least theoretically, some blocks could be bad because of a group error - say a decoding problem in addressing the array. So the blocks might be good, you just aren't actually consistently writing to the blocks you think you are. While I know that's been done with DRAMs, I have no idea if it applies to flash devices.
Hi Robert,

Robert Wessel <robertwessel2@yahoo.com> wrote:
[]
> At least theoretically, some blocks could be bad because of a group
> error - say a decoding problem in addressing the array. So the blocks
> might be good, you just aren't actually consistently writing to the
> blocks you think you are. While I know that's been done with DRAMs, I
> have no idea if it applies to flash devices.

The key point here is 'consistently'. I haven't found much literature on this subject, but as I understand it, the addresses for a specific 'list' of blocks may not work properly (as if there were some sort of propagation delay error in the address decoder).

But I don't fully understand what you are saying, because if that is the case then the manufacturer could not mark the block as bad either (unless the address decoding mechanism for the spare area is /different/ from that of the block itself), being incapable of ensuring that the address was decoded correctly when he marked the block as bad.

Moreover, if addressing is not working consistently, how would I recover the bad-block markers? The data retrieved may belong to a different block address and be completely wrong.

I'm lost here...
On 2/24/2015 12:43 AM, alb wrote:
> Hi there,
>
> /thanks/ to a buggy control software and potentially a wrong approach in
> error handling, we have wiped out the factory-marked bad block
> information on a NAND flash device (Micron).
>
> Now, according to the manufacturer (AN2917), if those blocks are used,
> they "may appear to operate normally but may cause other good blocks to
> fail or create additional system errors."
>
> Since the above sentence is not more than just a "hey, we cannot
> guarantee a damned thing, so don't blame us if you screw up!", are there
> any techniques that I can use to recover?

Sure! *You* can undertake to qualify the device yourself! (good luck with that! ;-) Solution: discard the trashed device and use this as a learning experience to ensure you NEVER do it again!
> I believe that on a critical system I would not take any risk and
> replace the chip, but should I care otherwise?

Do you care about the reliability of the component as it pertains to your overall device reliability? If the answer is "no"... (then why are you even USING the device if it "doesn't have to work"?)
> Why a factory-marked bad block is of any difference from any other block
> that goes bad after shipping? After all, if I continue to use a bad
> block I will sooner or later realize that is bad and eventually mark it,
> so not a big deal.

Think about how you *use* a block: i.e., you program, read and erase it (in various combinations). Associated with *each* action are "disturb events" in which OTHER blocks are compromised by your actions on *this* block.

The manufacturer has effectively said: "Use of these blocks leads to stronger than expected disturb events". It's just easier (safer) for the manufacturer to overprovision the device and mark those blocks as "avoid these if you want to trust the component at the level stated in our specifications".

Unfortunately, the analogy of "bad blocks" on a disk doesn't hold up. There, you *will* eventually discover that a portion of the medium is defective and GROW the defect list to, effectively, recreate the PERMANENT defect list (if you managed to wipe it out).
> Any pointer/suggestion is appreciated.
Research "disturb events" so you see how actions on one block can alter the contents of another. Then, imagine the one block is "leakier than most" and consider how *using* it will affect that "other(s)".
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
>> Since the above sentence is not more than just a "hey, we cannot
>> guarantee a damned thing, so don't blame us if you screw up!", are there
>> any techniques that I can use to recover?
>
> Sure! *You* can undertake to qualify the device yourself! (good luck
> with that! ;-) Solution: discard the trashed device and use this as
> a learning experience to ensure you NEVER do it again!

This is exactly what I proposed. But this is flight hardware and 'discarding' the device is a ton of money!
>> I believe that on a critical system I would not take any risk and
>> replace the chip, but should I care otherwise?
>
> Do you care about the reliability of the component as it pertains to your
> overall device reliability? If the answer is "no"... (then why are you
> even USING the device if it "doesn't have to work"?)

Reliability issues are not simply related to a 'working device'. The failure rate calculation for a Flash component is rather a specialization of its own.

I'm not sure how QAs are litigating on this issue, but it all boils down to the acceptance level required for that specific mission.

OTOH you may accept the risk of having a *potentially* failed component if, for example, you do not have the time to change the component. After all, flying to a distant planet is not something you can schedule outside the target launch window. And even a flight to LEO can easily slip 6 months, causing huge losses.
>> Why a factory-marked bad block is of any difference from any other block
>> that goes bad after shipping? After all, if I continue to use a bad
>> block I will sooner or later realize that is bad and eventually mark it,
>> so not a big deal.
>
> Think about how you *use* a block: i.e., you program, read and erase it
> (in various combinations). Associated with *each* action are "disturb
> events" in which OTHER blocks are compromised by your actions on *this*
> block. The manufacturer has effectively said: "Use of these blocks
> leads to stronger than expected disturb events". It's just easier
> (safer) for the manufacturer to overprovision the device and mark those
> blocks as "avoid these if you want to trust the component at the level
> stated in our specifications".

Yes, indeed using those blocks will screw up all failure rate analysis. It may end up that we do not meet the spec anymore. It may happen that a device does not meet its own spec, but here it is a bit worse than that: we are knowingly using the device in a condition that goes beyond its 'abs max ratings'.
> Research "disturb events" so you see how actions on one block
> can alter the contents of another. Then, imagine the one block
> is "leakier than most" and consider how *using* it will affect
> that "other(s)".

Thanks for that. I was already aware of 'disturb' errors, but AFAIK those are all correctable, i.e. you fix the leak by 'scrubbing' the entire device at a sufficiently high rate. When you scrub you fix the errors that have leaked from 'once were bad' blocks.

This effect has to be taken into account in the specific use case. We are storing data for a relatively short time (a few hours) before it is refreshed/overwritten. Imagine a big FIFO to compensate for a continuous data stream in and an intermittent data stream out. When writing/reading/erasing a 'once were bad' block it may leak into (hence disturb) other blocks *faster* than anticipated, indeed to the point that whenever we use those blocks we screw up the whole memory content.

At a certain point I'd say, big deal! It means I need to intentionally add wear in order to recover the bad block list. Unfortunately nobody can tell me when I'm done...damn it!

Still thinking out loud...but not too loud :-)

Al
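The scrubbing idea in this post can be sketched as a single pass over the device. This is only a toy model under stated assumptions: `scrub_pass`, `ecc_strength` and `refresh_threshold` are hypothetical names, the per-page bit-error counts are simulated rather than read from real ECC hardware, and "refresh" is reduced to resetting the count.

```python
from typing import List, Tuple


def scrub_pass(bit_errors: List[int], ecc_strength: int,
               refresh_threshold: int = 1) -> Tuple[List[int], List[int]]:
    """Walk every page once and refresh those that show correctable errors.

    bit_errors[i] is the (simulated) number of flipped bits in page i.
    Pages with more errors than the ECC can correct are reported as lost.
    """
    refreshed, uncorrectable = [], []
    for page, errors in enumerate(bit_errors):
        if errors > ecc_strength:
            # ECC cannot recover this page; scrubbing came too late.
            uncorrectable.append(page)
        elif errors >= refresh_threshold:
            # Correctable: rewrite the corrected data, modelled as a reset.
            bit_errors[page] = 0
            refreshed.append(page)
    return refreshed, uncorrectable


errors = [0, 2, 0, 5, 1]          # hypothetical snapshot of five pages
refreshed, lost = scrub_pass(errors, ecc_strength=4)
print("refreshed:", refreshed, "lost:", lost)
```

The sketch also shows the limit of the approach: a page that accumulates errors faster than the pass interval allows ends up in the "lost" list, and no later scrub brings it back.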
On Tuesday, February 24, 2015 at 12:43:51 AM UTC-7, alb wrote:
> Hi there,
>
> /thanks/ to a buggy control software and potentially a wrong approach in
> error handling, we have wiped out the factory-marked bad block
> information on a NAND flash device (Micron).
>
> Now, according to the manufacturer (AN2917), if those blocks are used,
> they "may appear to operate normally but may cause other good blocks to
> fail or create additional system errors."
>
> Since the above sentence is not more than just a "hey, we cannot
> guarantee a damned thing, so don't blame us if you screw up!", are there
> any techniques that I can use to recover?
>
> I believe that on a critical system I would not take any risk and
> replace the chip, but should I care otherwise?
>
> Why a factory-marked bad block is of any difference from any other block
> that goes bad after shipping? After all, if I continue to use a bad
> block I will sooner or later realize that is bad and eventually mark it,
> so not a big deal.
>
> Any pointer/suggestion is appreciated.
>
> Al
>
> --
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing on usenet and in e-mail?

I have lots of experience with raw Micron NAND. Though I don't know how low level you are, you can regrow your bad block list by erasing the entire NAND and reading the first spare area byte of the first page of each block. Micron's bad blocks are permanently tagged and will continue reading as bad after an erase. Also, if you attempt to program a bad block, it should come back as a program failure.

If you are operating at a higher level than this, say above the interface (SATA, SAS, etc.), then a good controller's firmware should rebuild it upon secure erase.
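The scan described in this post can be sketched as follows. It is a minimal illustration, not Micron's documented procedure: `read_spare_byte` is a hypothetical driver callback, the device is simulated, and the assumption that a good block reads 0xFF in the marker byte should be checked against the datasheet for the actual part.

```python
from typing import Callable, List, Set

GOOD_MARKER = 0xFF  # assumption: a good (erased) block reads 0xFF here


def scan_factory_bad_blocks(
        num_blocks: int,
        read_spare_byte: Callable[[int, int, int], int]) -> List[int]:
    """Return the blocks whose first spare byte of page 0 is not 0xFF.

    read_spare_byte(block, page, offset) is a hypothetical low-level read.
    """
    bad = []
    for block in range(num_blocks):
        marker = read_spare_byte(block, 0, 0)
        if marker != GOOD_MARKER:
            bad.append(block)
    return bad


def make_fake_device(marked_bad: Set[int]) -> Callable[[int, int, int], int]:
    """Simulated NAND: marked blocks return 0x00, all others 0xFF."""
    def read_spare_byte(block: int, page: int, offset: int) -> int:
        return 0x00 if block in marked_bad else 0xFF
    return read_spare_byte


bad_blocks = scan_factory_bad_blocks(16, make_fake_device({3, 11}))
print(bad_blocks)
```

Whether the markers really survive a full erase on a given part is the poster's observation, so the result of such a scan would still need to be cross-checked, for example against a program attempt as suggested above.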
On Wednesday, February 25, 2015 at 9:17:12 AM UTC-7, jderr...@gmail.com wrote:
I should append that I don't actually know what Micron's BB process is. I don't know if they recalculate the bad blocks upon erase, or if they do it when binning at production. From my observations, we have only seen the same bad blocks come up over and over again during the factory bad block list generation following a full NAND erase. Grown bad blocks have to be logged separately using ECC information at runtime.
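Keeping the factory list and the grown list apart, as described here, might look like the sketch below. The class name, the `retire_threshold` policy and the ECC event fields are all assumptions made for illustration; a real flash layer would take them from the controller and the part's ECC requirements.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BadBlockTable:
    factory: Set[int] = field(default_factory=set)  # from the marker scan
    grown: Set[int] = field(default_factory=set)    # discovered at runtime

    def record_ecc_event(self, block: int, corrected_bits: int,
                         uncorrectable: bool,
                         retire_threshold: int = 3) -> bool:
        """Retire a block as 'grown bad' if ECC failed or came close to it.

        Returns True if this call retired the block.
        """
        if block in self.factory or block in self.grown:
            return False  # already out of service
        if uncorrectable or corrected_bits >= retire_threshold:
            self.grown.add(block)
            return True
        return False

    def is_usable(self, block: int) -> bool:
        return block not in self.factory and block not in self.grown


table = BadBlockTable(factory={3, 11})
table.record_ecc_event(block=5, corrected_bits=1, uncorrectable=False)  # stays
table.record_ecc_event(block=7, corrected_bits=3, uncorrectable=False)  # retired
table.record_ecc_event(block=9, corrected_bits=0, uncorrectable=True)   # retired
print(sorted(table.grown))
```

Storing the two sets separately means the grown list can be rebuilt or discarded without touching whatever was recovered of the factory list.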
On 25.02.2015 at 15:07, alb wrote:
> Don Y <this@is.not.me.com> wrote:
>> Solution: discard the trashed device and use this as
>> a learning experience to ensure you NEVER do it again!

> This is exactly what I proposed. But this is flight hardware and
> 'discarding' the device is a ton of money!

Too late to worry about that. That train has left the station.
> reliability issues are not simply related to a 'working device'. The
> failure rates calculation in the case of a Flash component is rather a
> specialization of its own.

Things being as they are right now, all previous failure rate calculations are invalidated, and attempting new ones would be futile. The only honest answer to the questions: "How do you model the expected failure rate of this element, and what's the model's result?" would currently be "We can't", and "By default, unacceptably high", in that order.
> I'm not sure how QAs are litigating on this issue, but it all boils down
> to the acceptance level required for that specific mission.

Right now you can match _no_ requirement worth mentioning.
> Yes, indeed using those block will screw up all failure rate analysis.
> It may end up that we do not meet the spec anymore.

Forget about the "may" in that statement. That's a certainty. If you can still meet it, it's not worthy of being called a specification.
On 2/25/2015 7:07 AM, alb wrote:
> Don Y <this@is.not.me.com> wrote:
> []
>>> Since the above sentence is not more than just a "hey, we cannot
>>> guarantee a damned thing, so don't blame us if you screw up!", are there
>>> any techniques that I can use to recover?
>>
>> Sure! *You* can undertake to qualify the device yourself! (good luck
>> with that! ;-) Solution: discard the trashed device and use this as
>> a learning experience to ensure you NEVER do it again!
>
> This is exactly what I proposed. But this is flight hardware and
> 'discarding' the device is a ton of money!

*Just* the trashed flash? If it's TRULY "a ton of money", talk to the manufacturer and see if they can requalify it for you (for something *less* than "a ton")
>>> I believe that on a critical system I would not take any risk and
>>> replace the chip, but should I care otherwise?
>>
>> Do you care about the reliability of the component as it pertains to your
>> overall device reliability? If the answer is "no"... (then why are you
>> even USING the device if it "doesn't have to work"?)
>
> reliability issues are not simply related to a 'working device'. The
> failure rates calculation in the case of a Flash component is rather a
> specialization of its own.

Sure. My point was that you *do* care. Thus, want a "reliable" number.
> I'm not sure how QAs are litigating on this issue, but it all boils down
> to the acceptance level required for that specific mission.
>
> OTOH you may accept the risk of having a *potentially* failed component
> if, for example, you do not have the time to change the component.

Or, if a replacement simply doesn't exist -- or, is too costly to install. (I've worked on numerous "one off" systems where the cost of replacing the *one* system was exceedingly high)
> After all flying to a distant planet is not something you can schedule
> out of the target launching window. And even flying to LEO can shift
> easily 6 months, causing huge losses.

Likewise, a *failed* mission has direct costs -- as well as indirect (loss of prestige, opportunity, etc.)
>>> Why a factory-marked bad block is of any difference from any other block
>>> that goes bad after shipping? After all, if I continue to use a bad
>>> block I will sooner or later realize that is bad and eventually mark it,
>>> so not a big deal.
>>
>> Think about how you *use* a block: i.e., you program, read and erase it
>> (in various combinations). Associated with *each* action are "disturb
>> events" in which OTHER blocks are compromised by your actions on *this*
>> block. The manufacturer has effectively said: "Use of these blocks
>> leads to stronger than expected disturb events". It's just easier
>> (safer) for the manufacturer to overprovision the device and mark those
>> blocks as "avoid these if you want to trust the component at the level
>> stated in our specifications".
>
> Yes, indeed using those block will screw up all failure rate analysis.
> It may end up that we do not meet the spec anymore. It may happen that a
> device does not meet its own spec, but here is a bit worse than that. We
> are knowingly using the device in a condition that goes beyond its 'abs
> max ratings'.

If it was *just* the block that was unreliable, then you can quickly return to the point at which those blocks are shuffled out of service (re: my previous discussion on this issue) effectively leaving you with the "good" blocks that you *should* have started with.

The problem is that bad blocks can have consequences that affect other data in the array -- in an unpredictable (without detailed knowledge of the implementation and mask) way.

To use the disk analogy: If a particular block is truly "bad" (anomalies in the oxide layer in that physical portion of the medium), then you can learn to avoid using it to store data.

OTOH, if using that disk block (which, by itself, *might* be able to retain data perfectly!) causes some *other* block on the medium to be corrupted (or, maybe just *compromised*/"disturbed"), then how will you *know* that this has happened? Examine the *entire* medium to see if any data has changed? What if the magnetic domain hasn't been altered enough for it to be seen as having "flipped"? (i.e., for the flash, what if the charge level has changed -- been compromised -- but not enough for it to appear as a "flipped bit" that your ECC could "notice").

How do you know *which* block operation to associate with each *future* data anomaly? (i.e., when the data in that "other" block degrades to a point of being noticeable)
>> Research "disturb events" so you see how actions on one block
>> can alter the contents of another. Then, imagine the one block
>> is "leakier than most" and consider how *using* it will affect
>> that "other(s)".
>
> thanks for that. I was already aware about 'disturb' errors, but AFAIK
> those ones are all correctable, i.e. you fix the leak by 'scrubbing'
> with a sufficiently high rate the entire device. When you scrub you fix
> the errors that have leaked from 'once were bad' blocks.

But, you have set that "scrub rate" based on the metrics related to a set of IN SPEC flash blocks! I.e., you assume the effect of the disturb events can be characterized for a KNOWN GOOD device (or, for a PORTION of a device that the manufacturer has told you is "well behaved" -- meets spec).

Now, suddenly, you are using parts of the device that are NOT well behaved! How do you adjust your scrub rate? Perhaps the particular failure causes *multiple* bits to be disturbed in a single block (something that the manufacturer would know would render the device unusable as the ECC would quickly become ineffective).

I.e., you are now using the device in a manner for which the manufacturer has not provided qualification data. E.g., running TTL off of 8V (I'm sure you can find *some* that won't break down at that level -- esp with care on output loading, etc.)
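The scrub-rate argument can be made concrete with a deliberately crude model: if bit errors accumulate in a codeword at a steady rate and the ECC corrects a fixed number of bits, the scrub interval has to stay below their ratio. Every number below is invented for illustration, and the linear accumulation and the multiplier `k` for an out-of-spec block are assumptions, not anything the manufacturer publishes.

```python
def max_scrub_interval_hours(ecc_bits: int, errors_per_hour: float) -> float:
    """Longest interval before accumulated errors exceed ECC capability.

    Assumes errors accumulate linearly, which is a simplification.
    """
    if errors_per_hour <= 0:
        raise ValueError("error rate must be positive")
    return ecc_bits / errors_per_hour


ECC_BITS = 4          # hypothetical ECC strength per codeword
IN_SPEC_RATE = 0.5    # hypothetical errors/hour for in-spec blocks

in_spec_interval = max_scrub_interval_hours(ECC_BITS, IN_SPEC_RATE)

# The disturb rate caused by a 'once was bad' block is k times worse,
# but k is exactly what nobody can tell you.
intervals = {k: max_scrub_interval_hours(ECC_BITS, IN_SPEC_RATE * k)
             for k in (1, 10, 100)}
print(in_spec_interval, intervals)
```

A scrub interval chosen for the in-spec rate is only safe for `k = 1`; since `k` is unknown for the formerly marked blocks, the calculation has no defensible input.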
> This effect has to be taken into account in the specific use case. We
> are storing data for a relatively short time (few hours) before being
> refreshed/overwritten. Imagine a big fifo to compensate a continuous
> data stream in and an intermittent data stream out. When
> writing/reading/erasing a 'once were bad' block it may leak (hence
> disturb) other blocks, *faster* than anticipated. Indeed to a level that
> whenever we use those blocks we screw up the whole memory content.

But, you don't know *how* much faster -- or even if the nature of the "leak" is the same as "normal". What happens if, at some particular voltage/temperature/rate those accesses have catastrophic consequences? I.e., wipe out big swaths of data in unpredictable ways?

[I have no idea how the ACTUAL failures would manifest. But, NEITHER DO YOU! The manufacturer is only telling you how the device will behave *if* you use it in the manner that they have prescribed!]
> At a certain point I'd say, big deal! It means I need to intentionally
> add wear in order to recover the bad block list. Unfortunately nobody
> can tell me when I'm done...damn it!
>
> Still thinking out loud...but not too loud :-)

If it's a "ton of money", then you should be involving the folks that know FOR SURE. Not *us*! :>