
[cross-post] nand flash bad blocks management

Started by alb January 12, 2015
Hi Don,

Don Y <this@is.not.me.com> wrote:
>> reading does not cause *fatigue* in the sense that does not wear the
>
> Yes, sorry -- I was being imprecise. My point was that it alters the
> data in the device in a manner that will eventually cause data LOSS.
> Of course, the effects are even more pronounced on MLC where the number
> of electrons is smaller for any given 'state'.
Correct. Since we will need to continuously scan for 'errors' (either induced by SEE or by 'Read Disturb' effects), it is unlikely that we will read 100K times before erasing. The main issue to handle is a bad block appearing while writing to a scrubbed area. We have lots of things to do and in the meantime we need to deal with a 'task' that involves a lot of push/pull activity on various busses.

[]
>> These sorts of problems though are showing up when we talk about a
>> number of reading cycles in the hundreds of thousands if not million
>> (google: The Inconvenient Truths of NAND Flash Memory).
>
> The numbers are only half the story. I can use a device for YEARS that
> exhibits problems after just *hundreds* of cycles -- if I don't burn
> those hundreds of cycles in those "years"! OTOH, something that will
> ONLY manifest after a million cycles can plague a design in *minutes*
> if the application hammers away at it.
>
> That's why:
>
>>> So, either KNOW that your access patterns (read and write) *won't*
>>> disturb the array. *Or*, actively manage it by "refreshing" content
>>> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
The advice is sound and we already assessed this in our feasibility analysis. It is not only the handling of the bad blocks that needs to be done, but also the continuous correction of errors popping up here and there due to specific read patterns. We foresee retrieving a few pages, or maybe a few blocks, every ~10 sec... that is already around 3M reads/year, and over a 7.5-year mission more than 20M: a big number. It is certainly impossible to avoid handling these effects.
>> We have to cope with bit flips anyway (low earth orbit), so we are
>> obliged to scrub the memory, in order to avoid errors' accumulation we
>> move the entire block, update the LBA and erase the one affected, so it
>> becomes again available.
>
> Bit flips can be handled probabilistically -- you can model how often
> you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH,
> complicate that with a *dubious* data image and the reliability and
> predictability falls markedly.
Bit flips are handled by the ECC. The probability of events will only tell me how often they will occur and how often I should scrub the cells. If it is too often for the system to cope with I would need to employ an ECC that gives me the possibility to correct *multiple* bit flips, so that I can allow the recovery to take *more* time and still be protected. []
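To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in C. All figures in it (codeword size, upset rate, correction capability) are made-up placeholders rather than our mission data; it just estimates how the probability of accumulating more flips than the ECC can correct grows with the scrub period:

#include <math.h>
#include <stdio.h>

/* P(exactly k flips) among n bits, per-bit flip probability p (binomial). */
static double binom_pmf(int n, int k, double p)
{
    double log_c = lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
    return exp(log_c + k * log(p) + (n - k) * log(1.0 - p));
}

/* P(more than t flips accumulate in one codeword over 'days' days). */
static double p_uncorrectable(int n_bits, int t, double upsets_per_bit_day,
                              double days)
{
    double p = upsets_per_bit_day * days;   /* per-bit flip probability */
    double correctable = 0.0;
    for (int k = 0; k <= t; k++)
        correctable += binom_pmf(n_bits, k, p);
    return 1.0 - correctable;
}

int main(void)
{
    /* Placeholders: 4224-bit codeword (512 B data + spare area), 1e-9
     * upsets/bit/day; compare single-error correction (t = 1) against a
     * 4-bit-correcting code (t = 4). */
    for (double days = 1.0; days <= 64.0; days *= 2.0)
        printf("%5.0f days: t=1 %.3e   t=4 %.3e\n", days,
               p_uncorrectable(4224, 1, 1e-9, days),
               p_uncorrectable(4224, 4, 1e-9, days));
    return 0;
}

The longer the allowed scrub period, the stronger the code has to be for the same uncorrectable-word probability; that is exactly the "take *more* time and still be protected" knob.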
>> Well according to our latest estimates we are about at 30% of cell usage
>> on an AX2000 (2MGates), without including any scrubbing (yet), but
>> including the bad block management.
>
> Remember, if you are too naive in your implementation, you can increase
> *wear*. I think to get a good algorithm, you probably want to track
> knowledge of the entire *device* -- not just the RECENT history of
> this block/page. (i.e., where did the page that *was* here go? and,
> why? if it had an unusually high error rate, you might not be so keen
> on bringing it back into the rotation -- ever!)
>
> I.e., it seems like a lot of "state" to manage in a dedicated piece of
> hardware (that you can't *service*!)
If I understand you correctly you're talking about wear leveling. This is another aspect that needs to be addressed and can only come out of our radiation analysis. If the error rate we need to cope with requires a mean number of erase/program cycles larger than the allowed one, we are obliged to have some sort of mechanism to spread usage throughout the whole pool of available blocks and not just adjacent ones.

That said, I'm starting to wonder whether it would be more reliable to include a small-footprint microcontroller (like an mb-lite) to handle the flash and provide a virtual device to the user(s).

Indeed the question could be made broader: what is more appropriate for the fpga to do and for the software to do in this specific use case? Considering the scenarios and the multiple decision points to implement, would a state machine be more appropriate than a software routine (procedure/function/algorithm/...)?

I could potentially delegate the fpga to routine tasks like moving/erasing/reading/writing blocks, while keeping the whole 'intelligence' (or stupidity!) in an easier-to-handle software stack.

I think the customer should be involved in this reasoning and we should share the risk of the technical choice. Including a software stack would make the budget skyrocket, but it should be put on the table somehow.

Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Hi Al,

[much elided; I've got a deadline this evening...]

On 1/14/2015 3:47 PM, alb wrote:

>>> We have to cope with bit flips anyway (low earth orbit), so we are
>>> obliged to scrub the memory, in order to avoid errors' accumulation we
>>> move the entire block, update the LBA and erase the one affected, so it
>>> becomes again available.
>>
>> Bit flips can be handled probabilistically -- you can model how often
>> you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH,
>> complicate that with a *dubious* data image and the reliability and
>> predictability falls markedly.
>
> Bit flips are handled by the ECC. The probability of events will only
> tell me how often they will occur and how often I should scrub the
> cells. If it is too often for the system to cope with I would need to
> employ an ECC that gives me the possibility to correct *multiple* bit
> flips, so that I can allow the recovery to take *more* time and still be
> protected.
My point was, you can design your ECC to deal with the sorts of error rates you expect from *functional* devices. If the integrity of the device is already suspect (because of usage patterns that lead to disturb and fatigue events), then the calculus gets more "interesting"... now you have to factor in the reduced reliability of the basic device *plus* any information you have as to how those additional errors are likely to manifest.

E.g., decades ago, we would design memory tests with the actual device geometry in mind -- so we could deliberately tickle the aspects of the memory (devices and array) that were most likely to give us trouble.
>>> Well according to our latest estimates we are about at 30% of cell usage
>>> on an AX2000 (2MGates), without including any scrubbing (yet), but
>>> including the bad block management.
>>
>> Remember, if you are too naive in your implementation, you can increase
>> *wear*. I think to get a good algorithm, you probably want to track
>> knowledge of the entire *device* -- not just the RECENT history of
>> this block/page. (i.e., where did the page that *was* here go? and,
>> why? if it had an unusually high error rate, you might not be so keen
>> on bringing it back into the rotation -- ever!)
>>
>> I.e., it seems like a lot of "state" to manage in a dedicated piece of
>> hardware (that you can't *service*!)
>
> If I understand you correctly you're talking about wear leveling. This
Wear leveling is a different issue. It *hopes* to "spread the pain" around the device. But, consider this scenario:

A block gives you an excessive number of read errors (deal with the "write/erase" issue later). So, you schedule it to be "refreshed" (erased and rewritten). Note that this might also involve *replacement* depending on implementation. But, regardless, you (as an omnipotent human) can remember that it was *that* (physical) block that gave you the "problem".

Eventually, you reuse that block (again, depends on your algorithm; it might get reused for *this* "refreshed" data -- or, some other purpose). When it "gives you a problem" (again, we're just looking at READ errors to simplify the discussion), do you repeat the same handling practice that you did the *first* time you encountered a problem with the block? E.g., what if the *second* read from the block trips lots of ECC activity in that block? Do you use your knowledge of the fact that this block had previously "caused problems" to *bias* your decision to replace it *earlier* than you would some "less error prone" block? Do you ever mark the block as "bad" and take it out of service (because experience has taught you that it's just going to end up being replaced *quickly*, AGAIN, when put back into service)?

I.e., how much state can you afford to track AND BASE YOUR CONTROL ALGORITHMS ON before it becomes too complex for a "dedicated bit of control logic" (FPGA)?

I don't know the answer to this. I'm sure a lot of similar thinking goes into the design of flash controllers in SSD's, etc. But, they have an "out" that you don't -- the customer can replace an SSD that isn't living up to expectations "for very little money". You, OTOH, have a potential *asset* that they don't -- you *know* how you will be using YOUR memory and also know "what's important" therein. So, if pushed to the wall, you can take steps to ensure what's important is preserved, regardless.
> is another aspect that needs to be addressed and can only come out of
> our radiation analysis. If the error rate we need to cope with requires
> a mean number of erase/program cycles larger than the allowed one, we
> are obliged to have some sort of mechanism to spread usage throughout
> the whole pool of available blocks and not just adjacent ones.
>
> That said, I'm starting to wonder whether it would be more reliable to
> include a small-footprint microcontroller (like an mb-lite) to handle
> the flash and provide a virtual device to the user(s).
Well, anything that you could do with a microcontroller can also be done with logic (just design one and an instruction store! :> ) You might, however, find it easier (and more economical) to implement a "sequential machine" to make it easier for folks to implement these sorts of things (people think much better serially than in parallel; and, there probably is no real need for parallelism, here!)
> Indeed the question could be made broader: what is more appropriate for
> the fpga to do and for the software to do in this specific use case?
> Considering the scenarios and the multiple decision points to implement,
> would a state machine be more appropriate than a software routine
> (procedure/function/algorithm/...)?
>
> I could potentially delegate the fpga to routine tasks like
> moving/erasing/reading/writing blocks, while keeping the whole
> 'intelligence' (or stupidity!) in an easier-to-handle software stack.
>
> I think the customer should be involved in this reasoning and we should
> share the risk of the technical choice. Including a software stack would
> make the budget skyrocket, but it should be put on the table somehow.
The *first* thing I would do is get some big (technical) muck-a-mucks from the memory supplier in and hammer them with what-if's. Depending on how much of your design you can freely "share", they may be able to offer you some simple guidelines that make a world of difference in the performance you can expect *and* the methods by which you manage (and measure) that performance.

They may even have some IP that they can share with you. Or, put you in touch with some *other* customer that might be able to assist you better (e.g., an SSD manufacturer).

The first part of the journey is KNOWING the problem...
Hi Don,

Don Y <this@is.not.me.com> wrote:
> [much elided; I've got a deadline this evening...]
your follow-up is much appreciated. []
>> Bit flips are handled by the ECC. The probability of events will only
>> tell me how often they will occur and how often I should scrub the
>> cells. If it is too often for the system to cope with I would need to
>> employ an ECC that gives me the possibility to correct *multiple* bit
>> flips, so that I can allow the recovery to take *more* time and still be
>> protected.
>
> My point was, you can design your ECC to deal with the sorts of
> error rates you expect from *functional* devices. If the integrity
> of the device is already suspect (because of usage patterns that
> lead to disturb and fatigue events), then the calculus gets more
> "interesting"... now you have to factor in the reduced reliability
> of the basic device *plus* any information you have as to how those
> additional errors are likely to manifest.
Correct. The device nominally shows some sorts of 'failures' that I need to deal with, plus some which are A. difficult to categorize, because radiation tests on a *black box* don't provide details on the mechanism (manufacturers have the /tendency/ to keep secrets on what's inside), and B. short on available data so far (therefore we might need to perform additional radiation testing and analysis).
> E.g., decades ago, we would design memory tests with the actual
> device geometry in mind -- so we could deliberately tickle the
> aspects of the memory (devices and array) that were most likely
> to give us trouble.
IMO there are two aspects: the memory test and the memory controller test. Testing the memory to see whether it can handle our usage patterns is a different matter and is out of the context of this thread. We want to test the controller, because the component itself is qualified by the manufacturer. Moreover, we want to test the controller while the memory is showing all possible scenarios we have anticipated (and designed to handle), and maybe more. While injecting errors in the memory content is achieved simply by bypassing the ECC, simulating a bad block is not trivial unless a 'crash test' is foreseen and behavior is recorded and analyzed. I'd be interested to read some articles/opinions on the matter if anyone here has some experience.

[]
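To make that concrete, this is the kind of fault-injection scaffolding I have in mind for exercising the controller against a *simulated* NAND (a C sketch, all names and sizes made up; a real bench would also model timing and the spare areas):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_BLOCKS          64
#define SIM_PAGES_PER_BLOCK 16
#define SIM_PAGE_BYTES      512

typedef struct {
    uint8_t data[SIM_PAGES_PER_BLOCK][SIM_PAGE_BYTES];
    bool fail_program;  /* injected: next program reports "bad block" */
    bool fail_erase;    /* injected: next erase reports "bad block"   */
} sim_block_t;

static sim_block_t nand[SIM_BLOCKS];

/* Flip one bit behind the controller's back (bypasses any ECC). */
void inject_bitflip(int blk, int page, int byte, int bit)
{
    nand[blk].data[page][byte] ^= (uint8_t)(1u << bit);
}

/* Make a block behave as a grown bad block on its next program. */
void inject_program_fail(int blk) { nand[blk].fail_program = true; }

/* Called by the controller under test; returning false emulates a program
 * failure, which the controller must answer by remapping and retiring. */
bool sim_program_page(int blk, int page, const uint8_t *buf)
{
    if (nand[blk].fail_program) {
        nand[blk].fail_program = false;
        return false;
    }
    memcpy(nand[blk].data[page], buf, SIM_PAGE_BYTES);
    return true;
}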
>> If I understand you correctly you're talking about wear leveling. This
>
> Wear leveling is a different issue. It *hopes* to "spread the pain"
> around the device.
Uhm, wear leveling algorithms do not *hope* anything; actually they spread the maximum allowed number of E/P cycles all over the device instead of a single block, since endurance is defined _per block_. You count the number of E/P cycles each block has accumulated and you choose to write data in the available blocks with the fewest accumulated E/P cycles (for example). The E/P cycles available become effectively multiplied by the number of available blocks. (A minimal sketch of this counting approach follows below.)

Another simple wear leveling approach would be to select randomly among the pool of empty blocks. This technique does not require any 'counting' and is simple enough to implement (seeding the pseudo-random generator with a monotonic parameter like time, in order to avoid repeating the same 'pseudo-random' sequence).

[]
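The counting approach, sketched in C, assuming the controller keeps a RAM table of per-block erase counts (all names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096

typedef struct {
    uint32_t erase_count;   /* accumulated E/P cycles */
    bool     free;          /* erased and available   */
    bool     bad;           /* retired block          */
} block_info_t;

static block_info_t blocks[NUM_BLOCKS];

/* Return the index of the least-worn free block, or -1 if none. */
int pick_block_least_worn(void)
{
    int best = -1;
    uint32_t best_count = UINT32_MAX;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].free && !blocks[i].bad &&
            blocks[i].erase_count < best_count) {
            best_count = blocks[i].erase_count;
            best = i;
        }
    }
    return best;
}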
> A block gives you an excessive number of read errors (deal with
> the "write/erase" issue later). So, you schedule it to be "refreshed"
> (erased and rewritten). Note that this might also involve *replacement*
> depending on implementation. But, regardless, you (as an omnipotent
> human) can remember that it was *that* (physical) block that gave you
> the "problem".
>
> Eventually, you reuse that block (again, depends on your algorithm;
> it might get reused for *this* "refreshed" data -- or, some other
> purpose). When it "gives you a problem" (again, we're just looking at
> READ errors to simplify the discussion), do you repeat the same
> handling practice that you did the *first* time you encountered
> a problem with the block? E.g., what if the *second* read from the
> block trips lots of ECC activity in that block? Do you use your
> knowledge of the fact that this block had previously "caused
> problems" to *bias* your decision to replace it *earlier* than
> you would some "less error prone" block? Do you ever mark the block
> as "bad" and take it out of service (because experience has taught
> you that it's just going to end up being replaced *quickly*, AGAIN,
> when put back into service)?
Depending on your ECC there might be a limited choice, i.e. assume your ECC recovers only one bit flip and reports multiple ones. Whenever I detect (and correct) a *single* error, I have no choice but to act; otherwise, if I wait too long, my ECC won't help me anymore.

What to do then? Well, here as well there are not so many possible actions:
1. move the content somewhere else (contributes to wear leveling)
2. refresh the data in place

OTOH if the ECC allows you to recover multiple bits, you may still end up with the same logic, but instead of acting on the first error detected *and* corrected, you would act on the 27th! (A sketch of this threshold policy follows below.)

There are attempts to show some sorts of error patterns that are more likely to show up and hence can be managed accordingly, but I'm fond of the K.I.S.S. principle and I'm not so much interested in building a read/write-pattern-aware wear leveling algorithm :-).

Remember also that commercial use cases are far different from HiRel ones. Even though your usb flash is rated for 30K cycles, it may go way beyond that limit and nothing happens until something really stops working. We have the part stress analysis, which verifies whether we are using the device beyond the manufacturer's maximum limits.
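For illustration, the decision itself can be as small as this (C sketch; the threshold and the names are hypothetical -- 1 for a single-error-correcting code, higher for a multi-bit one):

#include <stdint.h>

#define CORRECTED_BITS_THRESHOLD 1   /* 1 for SEC Hamming; e.g. 27 for a
                                        hypothetical 32-bit-correcting code */

typedef enum { BLOCK_OK, BLOCK_NEEDS_REFRESH } block_action_t;

/* Called after each page read with the ECC's corrected-bit count. */
block_action_t on_page_read(uint32_t corrected_bits)
{
    /* Once a read needs this many corrections, the data must be rewritten
     * (in place or elsewhere) before the next flip exceeds the ECC's
     * correction capability. */
    if (corrected_bits >= CORRECTED_BITS_THRESHOLD)
        return BLOCK_NEEDS_REFRESH;
    return BLOCK_OK;
}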
> I.e., how much state can you afford to track AND BASE YOUR CONTROL
> ALGORITHMS ON before it becomes too complex for a "dedicated bit
> of control logic" (FPGA)?
>
> I don't know the answer to this. I'm sure a lot of similar thinking
> goes into the design of flash controllers in SSD's, etc. But, they
> have an "out" that you don't -- the customer can replace an SSD
> that isn't living up to expectations "for very little money". You,
> OTOH, have a potential *asset* that they don't -- you *know* how
> you will be using YOUR memory and also know "what's important" therein.
> So, if pushed to the wall, you can take steps to ensure what's important
> is preserved, regardless.
That is exactly why we want to clearly define the scope and the capabilities that we are able to implement in an FPGA and let the customer mitigate the risk of failure at his level (FDIR at system level). []
>> That said, I'm starting to wonder whether it would be more reliable to
>> include a small-footprint microcontroller (like an mb-lite) to handle
>> the flash and provide a virtual device to the user(s).
>
> Well, anything that you could do with a microcontroller can also be
> done with logic (just design one and an instruction store! :> )
> You might, however, find it easier (and more economical) to implement
> a "sequential machine" to make it easier for folks to implement
> these sorts of things (people think much better serially than in
> parallel; and, there probably is no real need for parallelism, here!)
The advantage of the microcontroller is not the hardware it represents but the languages that can be used to instruct it. I can easily see how moving a buffer from one point to another, dealing with the low-level protocol of the flash interface, is a perfect task for an FPGA; but taking the decision about what to do on a failure is where the software has an added value and can be 'economical', and possibly more reliable, given that the complexity it can handle is certainly superior (I doubt I'd be able to write a journaled flash file system with bad block management, wear leveling, scrubbing and garbage collection within an FPGA!).

But it's easy to reason in terms of extremes, where it's obvious what to do. What I find difficult is to find the right level of complexity where it'd be better to split the task in two subsystems, the sw and the fpga.
>> Indeed the question could be made broader: what is more appropriate for
>> the fpga to do and for the software to do in this specific use case?
>> Considering the scenarios and the multiple decision points to implement,
>> would a state machine be more appropriate than a software routine
>> (procedure/function/algorithm/...)?
[]
> The *first* thing I would do is get some big (technical) muck-a-mucks
> from the memory supplier in and hammer them with what-if's. Depending
> on how much of your design you can freely "share", they may be able to
> offer you some simple guidelines that make a world of difference in the
> performance you can expect *and* the methods by which you manage (and
> measure) that performance.
>
> They may even have some IP that they can share with you. Or, put you in
> touch with some *other* customer that might be able to assist you
> better (e.g., an SSD manufacturer).
>
> The first part of the journey is KNOWING the problem...
That is indeed an extremely good idea. Not sure though how much *they* are willing to share... but definitely worth asking.

Al
Hi Al,

On 1/15/2015 4:15 PM, alb wrote:
>>> If I understand you correctly you're talking about wear leveling. This
>>
>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>> around the device.
>
> Uhm, wear leveling algorithms do not *hope* anything; actually they
> spread the maximum allowed number of E/P cycles all over the device
> instead of a single block, since endurance is defined _per block_.
In an ideal world (process), all blocks would be identical in terms of performance, durability, etc. In that case, you can trivially cycle blocks through the algorithm in any "fair" order and never see any difference in performance (i.e., durability/error resistance).

However, processes aren't ideal. And, data "patterns" only add to the uncertainty. You can't *know* that each block -- having ALL been "recycled" exactly N times -- will exhibit the same sorts of "read failures" (again, just talking about reads to simplify the discussion) when asked to store a particular data pattern (in the context of the *other* data patterns located "around" it).

You should be able to test this by deliberately exhausting some sample devices and/or partial writes (cut the write timer short, deliberately) to *some* blocks but not others. I.e., don't let entropy average all of that out across the device. *Then*, try to use the device with a "fair" block swapping algorithm and track how long each *actual* (physical) block "survives" (provides reliable data below ECC threshold that suggests a need to rescrub) before your controller decides that it needs to be "recycled" -- erased and rewritten (or, reused to replace another block).

If you can associate a metric with each physical block that represents its historical performance, you can "level" based on *performance* instead of *use*.

I.e., one leveling algorithm just moves recycled blocks to the end of the queue and always picks the first block in the queue for use when some existing block needs to be recycled out of service. This ensures each block gets *erased* the same number of times -- regardless of how long a particular physical block "survived" IN ITS MOST RECENT (or sum of all preceding) USE.

You could, instead, track the number of *reads*-before-error-limit for each block and arrange for those with the highest "durability" to be reused *first*. Over time, their performance will degrade (i.e., they will start to exhibit more errors, sooner) so that some other block that had previously been considered "less durable" is regarded as MORE durable than this "worn" block.

[Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]

I'm oversimplifying this as there are other issues related to geometry that play a part. My point is to show that *how* you choose to reuse blocks can be implemented in different ways with different criteria. The point being that you want to minimize the number of "recycle events" that you take on -- especially if the block you *pick* as a replacement block may not write properly... so, you end up with a "double fault" of sorts.
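A sketch of what I mean, in C (the smoothing constant and all names are invented for illustration, not a recommendation):

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096

typedef struct {
    uint32_t reads_since_refresh;  /* reads survived in current use      */
    uint32_t durability;           /* smoothed reads-before-error metric */
    bool     free;
    bool     bad;
} block_perf_t;

static block_perf_t blk[NUM_BLOCKS];

/* Called when block b finally trips the ECC threshold and is recycled. */
void on_recycle(int b)
{
    /* Exponential smoothing: 3/4 old history, 1/4 latest observation. */
    blk[b].durability = (3 * blk[b].durability +
                         blk[b].reads_since_refresh) / 4;
    blk[b].reads_since_refresh = 0;
    blk[b].free = true;            /* after the actual erase, of course */
}

/* Pick the free block with the best measured durability, or -1 if none. */
int pick_most_durable(void)
{
    int best = -1;
    uint32_t best_score = 0;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blk[i].free && !blk[i].bad && blk[i].durability >= best_score) {
            best_score = blk[i].durability;
            best = i;
        }
    }
    return best;
}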
> You count the number of E/P cycles each block has accumulated and you
> choose to write data in the available blocks with the fewest accumulated
> E/P cycles (for example). The E/P cycles available become effectively
> multiplied by the number of available blocks.
That's the "each block gets erased the same number of times" approach I described above. Wouldn't you rather find a set of blocks that *needs* to be erased the least number of times??
> Another simple wear leveling approach would be to select randomly among
> the pool of empty blocks. This technique does not require any 'counting'
> and is simple enough to implement (seeding the pseudo-random generator
> with a monotonic parameter like time, in order to avoid repeating the
> same 'pseudo-random' sequence).
>
> []
>
>> A block gives you an excessive number of read errors (deal with
>> the "write/erase" issue later). So, you schedule it to be "refreshed"
>> (erased and rewritten). Note that this might also involve *replacement*
>> depending on implementation. But, regardless, you (as an omnipotent
>> human) can remember that it was *that* (physical) block that gave you
>> the "problem".
>>
>> Eventually, you reuse that block (again, depends on your algorithm;
>> it might get reused for *this* "refreshed" data -- or, some other
>> purpose). When it "gives you a problem" (again, we're just looking at
>> READ errors to simplify the discussion), do you repeat the same
>> handling practice that you did the *first* time you encountered
>> a problem with the block? E.g., what if the *second* read from the
>> block trips lots of ECC activity in that block? Do you use your
>> knowledge of the fact that this block had previously "caused
>> problems" to *bias* your decision to replace it *earlier* than
>> you would some "less error prone" block? Do you ever mark the block
>> as "bad" and take it out of service (because experience has taught
>> you that it's just going to end up being replaced *quickly*, AGAIN,
>> when put back into service)?
>
> Depending on your ECC there might be a limited choice, i.e. assume your
> ECC recovers only one bit flip and reports multiple ones. Whenever I
> detect (and correct) a *single* error, I have no choice but to act;
> otherwise, if I wait too long, my ECC won't help me anymore.
This depends on the nature of the source of your errors. E.g., do you replace the RAM in your PC when it reports *any* CORRECTED errors? (I've asked this question before and it's fun to watch people try to rationalize how they use that "feature" -- do they even *know* when there are error events being corrected? when do they decide there are "too many"?)

I.e., if you are looking for *any* error to indicate a lack of trust in your data, then you might want an ECC that could correct one and *detect* ANY NUMBER (many polynomials will only reliably detect a certain number of errors; beyond that, the data "looks" OK -- even if it is rife with errors!)
> What to do then? Well, here as well there are not so many possible actions:
> 1. move the content somewhere else (contributes to wear leveling)
> 2. refresh the data in place
"In place" is essentially preselecting the *same* block for use with the "new" (which is actually "old"!) data.
> OTOH if the ECC allows you to recover multiple bits, you may still end
> up with the same logic, but instead of acting on the first error
> detected *and* corrected, you would act on the 27th!
>
> There are attempts to show some sorts of error patterns that are more
> likely to show up and hence can be managed accordingly, but I'm fond
> of the K.I.S.S. principle and I'm not so much interested in building a
> read/write-pattern-aware wear leveling algorithm :-).
The resource that you can't exhaust is write-erase cycles. So, you really want to come up with an algorithm that minimizes the number of times you "give up" on a written block -- in favor of refreshing it here or elsewhere. That's why I was saying (earlier) to rethink your algorithm to track "measured durability" (i.e., how *long* before errors detected) and use that to govern the selection of replacement blocks -- instead of just a FIFO (or even random) algorithm.
> Remember also that commercial use cases are far different from HiRel
> ones. Even though your usb flash is rated for 30K cycles, it may go way
> beyond that limit and nothing happens until something really stops
> working.
>
> We have the part stress analysis, which verifies whether we are using
> the device beyond the manufacturer's maximum limits.
>>> That said, I'm starting to wonder whether it would be more reliable to
>>> include a small-footprint microcontroller (like an mb-lite) to handle
>>> the flash and provide a virtual device to the user(s).
>>
>> Well, anything that you could do with a microcontroller can also be
>> done with logic (just design one and an instruction store! :> )
>> You might, however, find it easier (and more economical) to implement
>> a "sequential machine" to make it easier for folks to implement
>> these sorts of things (people think much better serially than in
>> parallel; and, there probably is no real need for parallelism, here!)
>
> The advantage of the microcontroller is not the hardware it represents
> but the languages that can be used to instruct it. I can easily see how
> moving a buffer from one point to another, dealing with the low-level
> protocol of the flash interface, is a perfect task for an FPGA; but
> taking the decision about what to do on a failure is where the software
> has an added value and can be 'economical', and possibly more reliable,
> given that the complexity it can handle is certainly superior (I doubt
> I'd be able to write a journaled flash file system with bad block
> management, wear leveling, scrubbing and garbage collection within an
> FPGA!).
Exactly. As I said, hardware is great for parallel operations. But, when the algorithm being implemented is thought of in a serial fashion, it is often easier to resort to a sequencer -- a microcontroller.

WHEN this happens, do that... UNLESS this other thing happens, in which case...
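In software, that sequencer is just an event loop -- trivially readable, which is the whole attraction (a C sketch; the event and helper names are hypothetical):

typedef enum {
    EV_READ_DONE, EV_ECC_CORRECTED, EV_PROGRAM_FAIL, EV_SCRUB_TICK
} event_t;

/* Provided by the FPGA datapath / platform layer (hypothetical). */
extern event_t wait_next_event(void);
extern int  current_block(void);
extern void deliver_data(void);
extern void schedule_refresh(int block);
extern void retire_block(int block);
extern void retry_elsewhere(void);
extern void scrub_next_region(void);

void ftl_policy_loop(void)
{
    for (;;) {
        switch (wait_next_event()) {
        case EV_READ_DONE:     deliver_data();                    break;
        case EV_ECC_CORRECTED: schedule_refresh(current_block()); break;
        case EV_PROGRAM_FAIL:  retire_block(current_block());     /* UNLESS... */
                               retry_elsewhere();                 break;
        case EV_SCRUB_TICK:    scrub_next_region();               break;
        }
    }
}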
> But it's easy to reason in terms of extremes, where it's obvious what to
> do. What I find difficult is to find the right level of complexity where
> it'd be better to split the task in two subsystems, the sw and the fpga.
Good luck! I've got to bring another server on-line tonight so I can dispose of its predecessor this weekend!
On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
> Don Y <this@is.not.me.com> wrote:
>
> []
>
> Another simple wear leveling approach would be to select randomly among
> the pool of empty blocks. This technique does not require any 'counting'
> and is simple enough to implement (seeding the pseudo-random generator
> with a monotonic parameter like time, in order to avoid repeating the
> same 'pseudo-random' sequence).
I'm curious: is random selection better than a simple increase (with wrap-around)?

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
Hi Boudewijn,

On 1/16/2015 1:43 AM, Boudewijn Dijkstra wrote:
> On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
>> Don Y <this@is.not.me.com> wrote:
>>
>> Another simple wear leveling approach would be to select randomly among
>> the pool of empty blocks. This technique does not require any 'counting'
>> and is simple enough to implement (seeding the pseudo-random generator
>> with a monotonic parameter like time, in order to avoid repeating the
>> same 'pseudo-random' sequence).
>
> I'm curious: is random selection better than a simple increase (with
> wrap-around)?
I would suspect *not*.

First, what constitutes "random"? How do you ensure that the results truly *are* "random"? Over what interval? And, what are the consequences of "random but unfortunate" picks? E.g., 587111111111111111110345... can be random digits. OTOH, if the selection indicated by each digit were applied to something that had a life expectancy of "10 invocations", that string of 17 ones could prove problematic!

I think Al is better served by a more deterministic algorithm. Esp in light of the fact that he doesn't really get a second chance at this!

[Randomness is an interesting concept. In gam[bl]ing devices, there's always the issue of "how random is random". If a video poker game dealt four successive royal flushes, you'd be hard pressed convincing ANYONE that this was a "random" event! OTOH, if the player was going to be seated there for another 1,000,000 hands (in which he NEVER sees another royal flush), this might be acceptable!]
On Thu, 15 Jan 2015 21:46:27 -0700, Don Y wrote:

> Hi Al,
>
> On 1/15/2015 4:15 PM, alb wrote:
>>>> If I understand you correctly you're talking about wear leveling.
>>>> This
>>>
>>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>>> around the device.
>>
>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>> spread the maximum allowed number of E/P cycles all over the device
>> instead of a single block, since endurance is defined _per block_.
>
> In an ideal world (process), all blocks would be identical in terms of
> performance, durability, etc. In that case, you can trivially cycle
> blocks through the algorithm in any "fair" order and never see any
> difference in performance (i.e., durability/error resistance).
>
> However, processes aren't ideal. And, data "patterns" only add to the
> uncertainty. You can't *know* that each block -- having ALL been
> "recycled" exactly N times -- will exhibit the same sorts of "read
> failures" (again, just talking about reads to simplify the discussion)
> when asked to store a particular data pattern (in the context of the
> *other* data patterns located "around" it).
>
> You should be able to test this by deliberately exhausting some sample
> devices and/or partial writes (cut the write timer short, deliberately)
> to *some* blocks but not others. I.e., don't let entropy average all of
> that out across the device. *Then*, try to use the device with a "fair"
> block swapping algorithm and track how long each *actual*
> (physical) block "survives" (provides reliable data below ECC threshold
> that suggests a need to rescrub) before your controller decides that it
> needs to be "recycled" -- erased and rewritten (or, reused to replace
> another block).
>
> If you can associate a metric with each physical block that represents
> its historical performance, you can "level" based on *performance*
> instead of *use*.
>
> I.e., one leveling algorithm just moves recycled blocks to the end of
> the queue and always picks the first block in the queue for use when
> some existing block needs to be recycled out of service. This ensures
> each block gets *erased* the same number of times -- regardless of how
> long a particular physical block "survived" IN ITS MOST RECENT (or sum
> of all preceding) USE.
>
> You could, instead, track the number of *reads*-before-error-limit for
> each block and arrange for those with the highest "durability" to be
> reused *first*. Over time, their performance will degrade (i.e., they
> will start to exhibit more errors, sooner) so that some other block that
> had previously been considered "less durable" is regarded as MORE
> durable than this "worn" block.
>
> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
Clear or not, the message is getting through. :)

If I get it: instead of assuming all the blocks are created equal, assume blocks are different and have the algorithm treat different blocks differently.

Perhaps, instead of tracking block history (for every block - with info stored where?), an algorithm can be made that uses a "descriptor" of sorts to inform its thinking about blocks. So, instead of the algorithm making notes about unreliable blocks, it gets them hardcoded from the programmer.

I'm absolutely not sure how you would generate that descriptor. Especially if you assume every device is different.
> <sniped a huge amount>
Hi Aleksandar,

On 1/17/2015 10:35 AM, Aleksandar Kuktin wrote:
> On Thu, 15 Jan 2015 21:46:27 -0700, Don Y wrote:
>> On 1/15/2015 4:15 PM, alb wrote:
>>>>> If I understand you correctly you're talking about wear leveling.
>>>>> This
>>>>
>>>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>>>> around the device.
>>>
>>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>>> spread the maximum allowed number of E/P cycles all over the device
>>> instead of a single block, since endurance is defined _per block_.
>>
>> In an ideal world (process), all blocks would be identical in terms of
>> performance, durability, etc. In that case, you can trivially cycle
>> blocks through the algorithm in any "fair" order and never see any
>> difference in performance (i.e., durability/error resistance).
>> If you can associate a metric with each physical block that represents
>> its historical performance, you can "level" based on *performance*
>> instead of *use*.
>> You could, instead, track the number of *reads*-before-error-limit for
>> each block and arrange for those with the highest "durability" to be
>> reused *first*. Over time, their performance will degrade (i.e., they
>> will start to exhibit more errors, sooner) so that some other block that
>> had previously been considered "less durable" is regarded as MORE
>> durable than this "worn" block.
>>
>> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
> If I get it: instead of assuming all the blocks are created equal, assume
> blocks are different and have the algorithm treat different blocks
> differently.
Well, all blocks are *never* completely "equal". There are always variations. My point is to monitor performance (with <some> metric) and use that to bias block selection to favor those that (have) "perform(ed) best" -- from among those available for use.
> Perhaps, instead of tracking block history (for every block - with info
> stored where?), an algorithm can be made that uses a "descriptor" of
> sorts to inform its thinking about blocks. So, instead of the algorithm
> making notes about unreliable blocks, it gets them hardcoded from the
> programmer.
I don't think it can be as simple as that. It's not like you know that "this block is BAD". Rather, this block is *better* (than those others). Furthermore, it isn't a static quality.

The data *in* the block will determine how likely it is to create a particular error (again, looking just at read operations). And, how "worn" the individual cells within that block are -- how well they can hold/isolate charge on their respective gates.

But, also, the access patterns of blocks "electrically adjacent" (poorly defined concept) will determine the likelihood for a read disturb event to corrupt the data in *this* block. As those read patterns may change over time, *one* (actually identical in terms of reliability!) block may appear worse than another *now*; yet, when the access pattern changes, the "other" may appear worse. Despite the fact that both are actually "equally reliable". Likewise, the number of write-erase cycles fatigues the cells in the block.

I.e., there are lots of factors that influence how well a given set of cells can hold data. You would need to know a *lot* about your application's access patterns *AND* couple that with knowledge of your block replacement algorithm to be able to determine "at design time" where the "most vulnerable" blocks were (physically) located.

Instead, I'm suggesting you simply measure how well this block is performing with *these* contents, some previous number of write-erase cycles as well as the "neighboring data access patterns", happening *now*. Assuming failures *slowly* creep into the data, then just *wait* until you get your first (Nth?) ECC event before recycling the block. But, recall how well it performed (how many read cycles before its contents were noticeably disturbed -- to the point where the ECC had to step in!). Use this information to adjust your "block selection" algorithm when you need to recycle/refresh a block -- later.

As conditions (local to each particular block) change over time, the "performance" of individual blocks will also change. They will rise and fall in the list of "best performing blocks" to reflect that performance.

I've no idea how well this would perform in any *particular* application. You can bet the folks who build SSD's try hard to refine these sorts of algorithms so they can keep their margins high, costs low and performance up! (and, hopefully, without the complexity of the housekeeping leading to lots of errors in the implementation!)

Another approach is simply to over-provision the hell out of things and hope for the best! :>
> I'm absolutely not sure how you would generate that descriptor.
> Especially if you assume every device is different.
Hi Boudewijn,

Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
> On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
>> Don Y <this@is.not.me.com> wrote:
>>
>> []
>>
>> Another simple wear leveling approach would be to select randomly among
>> the pool of empty blocks. This technique does not require any 'counting'
>> and is simple enough to implement (seeding the pseudo-random generator
>> with a monotonic parameter like time, in order to avoid repeating the
>> same 'pseudo-random' sequence).
>
> I'm curious: is random selection better than a simple increase (with
> wrap-around)?
In terms of spreading usage across the device, the main difference between a simple increase-with-wrap-around and a random choice shows up across power off/on cycles. In the first case you would need to store somewhere in your flash memory where the last pointer of your 'increase and wrap-around' algorithm was, while in the second case you simply don't. The reduced amount of metadata management can simplify the task and reduce complexity.

Moreover, while in a perfect world power outages are maybe not considered, the simple fact that power may be cut while you are storing that metadata can screw up your 'increase and wrap-around' algorithm and you'll have uneven usage, while a random approach simply would not care.
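Something like this C sketch is what I have in mind (hypothetical names; a 32-bit maximal-length Galois LFSR, taps 32/22/2/1): no persisted "next block" pointer, so a power cut cannot corrupt the allocator's metadata, and the seed comes from a monotonic source so the sequence differs across power cycles.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096u

extern bool block_is_free(uint32_t block);      /* from the bad-block table */
extern uint32_t mission_elapsed_seconds(void);  /* monotonic seed source    */

static uint32_t lfsr;

void allocator_init(void)
{
    lfsr = mission_elapsed_seconds() | 1u;      /* LFSR must not start at 0 */
}

uint32_t pick_block_random(void)
{
    for (uint32_t tries = 0; tries < 8u * NUM_BLOCKS; tries++) {
        lfsr = (lfsr >> 1) ^ (-(lfsr & 1u) & 0x80200003u);
        uint32_t candidate = lfsr % NUM_BLOCKS;
        if (block_is_free(candidate))
            return candidate;
    }
    /* Degenerate case (almost-full device): fall back to a linear scan. */
    for (uint32_t b = 0; b < NUM_BLOCKS; b++)
        if (block_is_free(b))
            return b;
    return NUM_BLOCKS;                          /* no free block at all */
}

Al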
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>> spread the maximum allowed number of E/P cycles all over the device
>> instead of a single block, since endurance is defined _per block_.
>
> In an ideal world (process), all blocks would be identical in terms of
> performance, durability, etc. In that case, you can trivially cycle
> blocks through the algorithm in any "fair" order and never see any
> difference in performance (i.e., durability/error resistance).
>
> However, processes aren't ideal. And, data "patterns" only add to the
> uncertainty. You can't *know* that each block -- having ALL been
> "recycled" exactly N times -- will exhibit the same sorts of "read
> failures" (again, just talking about reads to simplify the discussion)
> when asked to store a particular data pattern (in the context of the
> *other* data patterns located "around" it).
>
> You should be able to test this by deliberately exhausting some sample
> devices and/or partial writes (cut the write timer short,
> deliberately) to *some* blocks but not others. I.e., don't let
> entropy average all of that out across the device. *Then*, try to use
> the device with a "fair" block swapping algorithm and track how long
> each *actual* (physical) block "survives" (provides reliable data
> below ECC threshold that suggests a need to rescrub) before your
> controller decides that it needs to be "recycled" -- erased and
> rewritten (or, reused to replace another block).
The main point here is your reference to *entropy*. In the end you have to take *entropy* into account no matter what kind of data patterns you use. Blocks that show better performance now will be used *more*, and your /wear-leveling/ is not leveling anymore: it induces more wear in the preferred blocks, hence reducing their performance. (Actually, if we limit the discussion to the 'read disturb' issue, I'm not convinced that there's any wear going on.)
> If you can associate a metric with each physical block that represents
> its historical performance, you can "level" based on *performance*
> instead of *use*.
>
> I.e., one leveling algorithm just moves recycled blocks to the end of
> the queue and always picks the first block in the queue for use when
> some existing block needs to be recycled out of service. This ensures
> each block gets *erased* the same number of times -- regardless of how
> long a particular physical block "survived" IN ITS MOST RECENT (or sum
> of all preceding) USE.
>
> You could, instead, track the number of *reads*-before-error-limit for
> each block and arrange for those with the highest "durability" to be
> reused *first*. Over time, their performance will degrade (i.e., they
> will start to exhibit more errors, sooner) so that some other block
> that had previously been considered "less durable" is regarded as MORE
> durable than this "worn" block.
Then what would be the benefit w.r.t. equally spreading the load? Another point is that unfortunately we *cannot* avoid taking into account other failure modes (on write/erase cycles), since they *will* certainly induce wear. Again, any metric that would /rank/ blocks according to performance would lead to uneven wear, which you eventually need to compensate for if you want to keep the memory size /constant/ (i.e. no loss of blocks due to wear).
> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
I think I got your point, but I'm not seeing the added value of any ranking mechanism.
> I'm oversimplifying this as there are other issues related to geometry
> that play a part. My point is to show that *how* you choose to reuse
> blocks can be implemented in different ways with different criteria.
> The point being that you want to minimize the number of "recycle
> events" that you take on -- especially if the block you *pick* as a
> replacement block may not write properly... so, you end up with a
> "double fault" of sorts.
But you need to factor in that behavior anyhow, since betting on the best performance is not going to pay off in the long run, and sooner or later you'll be facing 'double fault' sorts of errors anyhow. If your algorithm is not ready to take that into account, you'll be in for a nasty surprise.
>> You count the number of E/P cycles each block has accumulated and you
>> choose to write data in the available blocks with the fewest accumulated
>> E/P cycles (for example). The E/P cycles available become effectively
>> multiplied by the number of available blocks.
>
> That's the "each block gets erased the same number of times" approach
> I described above. Wouldn't you rather find a set of blocks that
> *needs* to be erased the least number of times??
Assuming you find the necessary metric to /rank/ the blocks, what is the overall *gain* if you end up wearing those *more* than others and effectively losing that gap in performance more quickly? []
>> Depending on your ECC there might be a limited choice, i.e. assume
>> your ECC recovers only one bit flip and reports multiple ones.
>> Whenever I detect (and correct) a *single* error, I have no choice
>> but to act; otherwise, if I wait too long, my ECC won't help me
>> anymore.
>
> This depends on the nature of the source of your errors. E.g., do you
> replace the RAM in your PC when it reports *any* CORRECTED errors?
> (I've asked this question before and it's fun to watch people try to
> rationalize how they use that "feature" -- do they even *know* when
> there are error events being corrected? when do they decide there are
> "too many"?)
AFAIK the ECCs used in PCs are more like BCH codes, with low overhead and strong correction capability at the expense of a more complex algorithm. You could set a threshold on the amount of detected and corrected errors and take an action only when the threshold is crossed. You still have reliable data, but you'd better correct those errors before the capability of your ECC is exceeded.

In my case, in order to reduce complexity in the fabric (the algorithm will be implemented in an FPGA), I'll be more inclined towards a simple Hamming code that can correct *one* error and detect *two* (a minimal sketch follows below). Therefore I'll need to take action on *any* error.

[]
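To show how small such a SEC-DED (extended Hamming) really is, here is a 4-bit-payload sketch in C; a real implementation would work on whole pages and live in the fabric, and __builtin_parity is the GCC/Clang builtin. All names are illustrative only:

/* Codeword layout (bit7..bit0): [p0 p1 p2 d1 p4 d2 d3 d4];
 * p0 is the overall parity bit, p1/p2/p4 the Hamming check bits. */
#include <stdint.h>

typedef enum { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_result_t;

uint8_t secded_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 3) & 1, d2 = (nibble >> 2) & 1;
    uint8_t d3 = (nibble >> 1) & 1, d4 = nibble & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;              /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;              /* covers positions 2,3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;              /* covers positions 4,5,6,7 */
    uint8_t cw = (uint8_t)((p1 << 6) | (p2 << 5) | (d1 << 4) |
                           (p4 << 3) | (d2 << 2) | (d3 << 1) | d4);
    return (uint8_t)((__builtin_parity(cw) << 7) | cw);  /* overall parity */
}

ecc_result_t secded_decode(uint8_t cw, uint8_t *nibble)
{
    uint8_t b[8];                       /* b[i] = bit at Hamming position i */
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (7 - i)) & 1;

    uint8_t syndrome = (uint8_t)(((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2) |
                                 ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1) |
                                  (b[1] ^ b[3] ^ b[5] ^ b[7]));
    int parity_ok = (__builtin_parity(cw) == 0);   /* includes p0 itself */

    if (syndrome != 0 && parity_ok)
        return ECC_UNCORRECTABLE;       /* two flips: detected, not fixed */
    if (syndrome != 0)
        b[syndrome] ^= 1;               /* one flip: syndrome = position  */
    /* syndrome == 0 && !parity_ok: p0 itself flipped; data is intact */

    *nibble = (uint8_t)((b[3] << 3) | (b[5] << 2) | (b[6] << 1) | b[7]);
    return (parity_ok && syndrome == 0) ? ECC_CLEAN : ECC_CORRECTED;
}

With a code like this, any ECC_CORRECTED event has to trigger a refresh, exactly as argued above.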
>> There are attempts to show some sorts of error patterns that are more
>> likely to show up and hence can be managed accordingly, but I'm
>> fond of the K.I.S.S. principle and I'm not so much interested in
>> building a read/write-pattern-aware wear leveling algorithm :-).
>
> The resource that you can't exhaust is write-erase cycles. So, you
> really want to come up with an algorithm that minimizes the number of
> times you "give up" on a written block -- in favor of refreshing it
> here or elsewhere. That's why I was saying (earlier) to rethink your
> algorithm to track "measured durability" (i.e., how *long* before
> errors detected) and use that to govern the selection of replacement
> blocks -- instead of just a FIFO (or even random) algorithm.
The thing is that a metric 'accumulated' on read operations does not necessarily tell you anything about wear level. AFAIK there's no indication that your write-erase cycle count is *close* to the end of life. 'Read disturb' errors are not related to wear, and neither are errors induced by heavy ions, so it would be difficult to make an algorithm that ranks the /strength/ of each block.

[]
>> The advantage of the microcontroller is not the hardware it
>> represents but the languages that can be used to instruct it. I can
>> easily see how moving a buffer from one point to another, dealing
>> with the low-level protocol of the flash interface, is a perfect
>> task for an FPGA; but taking the decision about what to do on a
>> failure is where the software has an added value and can be
>> 'economical', and possibly more reliable, given that the complexity
>> it can handle is certainly superior (I doubt I'd be able to write a
>> journaled flash file system with bad block management, wear leveling,
>> scrubbing and garbage collection within an FPGA!).
>
> Exactly. As I said, hardware is great for parallel operations. But,
> when the algorithm being implemented is thought of in a serial
> fashion, it is often easier to resort to a sequencer -- a
> microcontroller.
>
> WHEN this happens, do that... UNLESS this other thing happens, in
> which case...
I've seen an interesting presentation (http://congrexprojects.com/docs/12c25_2510/15errico_tuccio_adcss2012_sitael_we.pdf?sfvrsn=2) on a memory controller implemented on an FPGA. It claims to fit in 10% of a 3MGate-equivalent FPGA (ProASIC3E A3PE3000), which seems a bit less than what I expected. It would be interesting to see how many resources a small embedded microcontroller implementing the FTL in a software stack would take.

Al