
[cross-post] nand flash bad blocks management

Started by alb January 12, 2015
Hi Don,

Don Y <this@is.not.me.com> wrote:
>> reading does not cause *fatigue* in the sense that does not wear the
>
> Yes, sorry -- I was being imprecise. My point was that it alters the
> data in the device in a manner that will eventually cause data LOSS.
> Of course, the effects are even more pronounced on MLC where the number
> of electrons is smaller for any given 'state'.
Correct. Since we will need to continuously scan for 'errors' (either induced by SEE or by 'Read Disturb' effects), it is unlikely that we will read 100K times before erasing. The main issue to handle is a bad block appearing while writing to a scrubbed area. We have lots of things to do and in the meantime we need to deal with a 'task' that involves a lot of push/pull activity on various busses.

[]
>> These sorts of problems though are showing up when we talk about a
>> number of reading cycles in the hundreds of thousands if not million
>> (google: The Inconvenient Truths of NAND Flash Memory).
>
> The numbers are only half the story. I can use a device for YEARS that
> exhibits problems after just *hundreds* of cycles -- if I don't burn
> those hundreds of cycles in those "years"! OTOH, something that will
> ONLY manifest after a million cycles can plague a design in *minutes*
> if the application hammers away at it.
>
> That's why:
>
>>> So, either KNOW that your access patterns (read and write) *won't*
>>> disturb the array. *Or*, actively manage it by "refreshing" content
>>> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
The advice is sound and we already assessed this in our feasibility analysis. It is not only the handling of the bad blocks that needs to be done, but also the continuous correction of errors popping up here and there due to specific read patterns. We foresee retrieving a few pages, or maybe a few blocks, every ~10 sec... that is already around 3M reads/year, and over a 7.5-year mission more than 20M: a big number. It is certainly impossible to avoid handling these effects.
>> We have to cope with bit flips anyway (low earth orbit), so we are
>> obliged to scrub the memory, in order to avoid errors' accumulation we
>> move the entire block, update the LBA and erase the one affected, so it
>> becomes again available.
>
> Bit flips can be handled probabilistically -- you can model how often
> you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH,
> complicate that with a *dubious* data image and the reliability and
> predictability falls markedly.
Bit flips are handled by the ECC. The probability of events will only tell me how often they will occur and how often I should scrub the cells. If it is too often for the system to cope with I would need to employ an ECC that gives me the possibility to correct *multiple* bit flips, so that I can allow the recovery to take *more* time and still be protected. []
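To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in C. All figures in it (codeword size, upset rate, correction capability) are made-up placeholders rather than our mission data; it just estimates how the probability of accumulating more flips than the ECC can correct grows with the scrub period:

#include <math.h>
#include <stdio.h>

/* P(exactly k flips) among n bits, per-bit flip probability p (binomial). */
static double binom_pmf(int n, int k, double p)
{
    double log_c = lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
    return exp(log_c + k * log(p) + (n - k) * log(1.0 - p));
}

/* P(more than t flips accumulate in one codeword over 'days' days). */
static double p_uncorrectable(int n_bits, int t, double upsets_per_bit_day,
                              double days)
{
    double p = upsets_per_bit_day * days;   /* per-bit flip probability */
    double correctable = 0.0;
    for (int k = 0; k <= t; k++)
        correctable += binom_pmf(n_bits, k, p);
    return 1.0 - correctable;
}

int main(void)
{
    /* Placeholders: 4224-bit codeword (512 B data + spare area), 1e-9
     * upsets/bit/day; compare single-error correction (t = 1) against a
     * 4-bit-correcting code (t = 4). */
    for (double days = 1.0; days <= 64.0; days *= 2.0)
        printf("%5.0f days: t=1 %.3e   t=4 %.3e\n", days,
               p_uncorrectable(4224, 1, 1e-9, days),
               p_uncorrectable(4224, 4, 1e-9, days));
    return 0;
}

The longer the allowed scrub period, the stronger the code has to be for the same uncorrectable-word probability; that is exactly the "take *more* time and still be protected" knob.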
>> Well according to our latest estimates we are about at 30% of cell usage
>> on an AX2000 (2MGates), without including any scrubbing (yet), but
>> including the bad block management.
>
> Remember, if you are too naive in your implementation, you can increase
> *wear*. I think to get a good algorithm, you probably want to track
> knowledge of the entire *device* -- not just the RECENT history of
> this block/page. (i.e., where did the page that *was* here go? and,
> why? if it had an unusually high error rate, you might not be so keen
> on bringing it back into the rotation -- ever!)
>
> I.e., it seems like a lot of "state" to manage in a dedicated piece of
> hardware (that you can't *service*!)
If I understand you correctly you're talking about wear leveling. This is another aspect that needs to be addressed and can only come out of our radiation analysis. If the error rate we need to cope with requires a mean number of erase/program cycles larger than the allowed one, we are obliged to have some sort of mechanism to spread usage throughout the whole pool of available blocks and not just adjacent ones.

That said, I'm starting to wonder whether it would be more reliable to include a small-footprint microcontroller (like an mb-lite) to handle the flash and provide a virtual device to the user(s).

Indeed the question could be made broader: what is more appropriate for the fpga to do and for the software to do in this specific use case? Considering the scenarios and the multiple decision points to implement, would a state machine be more appropriate than a software routine (procedure/function/algorithm/...)?

I could potentially delegate the fpga to routine tasks like moving/erasing/reading/writing blocks, while keeping the whole 'intelligence' (or stupidity!) in an easier-to-handle software stack.

I think the customer should be involved in this reasoning and we should share the risk of the technical choice. Including a software stack would make the budget skyrocket, but it should be put on the table somehow.

Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Hi Al,

[much elided; I've got a deadline this evening...]

On 1/14/2015 3:47 PM, alb wrote:

>>> We have to cope with bit flips anyway (low earth orbit), so we are
>>> obliged to scrub the memory, in order to avoid errors' accumulation we
>>> move the entire block, update the LBA and erase the one affected, so it
>>> becomes again available.
>>
>> Bit flips can be handled probabilistically -- you can model how often
>> you *expect* to encounter them on an OTHERWISE GOOD data image. OTOH,
>> complicate that with a *dubious* data image and the reliability and
>> predictability falls markedly.
>
> Bit flips are handled by the ECC. The probability of events will only
> tell me how often they will occur and how often I should scrub the
> cells. If it is too often for the system to cope with I would need to
> employ an ECC that gives me the possibility to correct *multiple* bit
> flips, so that I can allow the recovery to take *more* time and still be
> protected.
My point was, you can design your ECC to deal with the sorts of error rates you expect from *functional* devices. If the integrity of the device is already suspect (because of usage patterns that lead to disturb and fatigue events), then the calculus gets more "interesting"... now you have to factor in the reduced reliability of the basic device *plus* any information you have as to how those additional errors are likely to manifest.

E.g., decades ago, we would design memory tests with the actual device geometry in mind -- so we could deliberately tickle the aspects of the memory (devices and array) that were most likely to give us trouble.
>>> Well according to our latest estimates we are about at 30% of cell usage
>>> on an AX2000 (2MGates), without including any scrubbing (yet), but
>>> including the bad block management.
>>
>> Remember, if you are too naive in your implementation, you can increase
>> *wear*. I think to get a good algorithm, you probably want to track
>> knowledge of the entire *device* -- not just the RECENT history of
>> this block/page. (i.e., where did the page that *was* here go? and,
>> why? if it had an unusually high error rate, you might not be so keen
>> on bringing it back into the rotation -- ever!)
>>
>> I.e., it seems like a lot of "state" to manage in a dedicated piece of
>> hardware (that you can't *service*!)
>
> If I understand you correctly you're talking about wear leveling. This
Wear leveling is a different issue. It *hopes* to "spread the pain" around the device. But, consider this scenario:

A block gives you an excessive number of read errors (deal with the "write/erase" issue later). So, you schedule it to be "refreshed" (erased and rewritten). Note that this might also involve *replacement* depending on implementation. But, regardless, you (as an omnipotent human) can remember that it was *that* (physical) block that gave you the "problem".

Eventually, you reuse that block (again, depends on your algorithm; it might get reused for *this* "refreshed" data -- or, some other purpose). When it "gives you a problem" (again, we're just looking at READ errors to simplify the discussion), do you repeat the same handling practice that you did the *first* time you encountered a problem with the block? E.g., what if the *second* read from the block trips lots of ECC activity in that block? Do you use your knowledge of the fact that this block had previously "caused problems" to *bias* your decision to replace it *earlier* than you would some "less error prone" block? Do you ever mark the block as "bad" and take it out of service (because experience has taught you that it's just going to end up being replaced *quickly*, AGAIN, when put back into service)?

I.e., how much state can you afford to track AND BASE YOUR CONTROL ALGORITHMS ON before it becomes too complex for a "dedicated bit of control logic" (FPGA)?

I don't know the answer to this. I'm sure a lot of similar thinking goes into the design of flash controllers in SSD's, etc. But, they have an "out" that you don't -- the customer can replace an SSD that isn't living up to expectations "for very little money". You, OTOH, have a potential *asset* that they don't -- you *know* how you will be using YOUR memory and also know "what's important" therein. So, if pushed to the wall, you can take steps to ensure what's important is preserved, regardless.
> is another aspect that needs to be addressed and can only come out of
> our radiation analysis. If the error rate we need to cope with requires
> a mean number of erase/program cycles larger than the allowed one, we
> are obliged to have some sort of mechanism to spread usage throughout
> the whole pool of available blocks and not just adjacent ones.
>
> That said, I'm starting to wonder whether it would be more reliable to
> include a small-footprint microcontroller (like an mb-lite) to handle
> the flash and provide a virtual device to the user(s).
Well, anything that you could do with a microcontroller can also be done with logic (just design one and an instruction store! :> ) You might, however, find it easier (and more economical) to implement a "sequential machine" to make it easier for folks to implement these sorts of things (people think much better serially than in parallel; and, there probably is no real need for parallelism, here!)
> Indeed the question could be made broader: what is more appropriate for
> the fpga to do and for the software to do in this specific use case?
> Considering the scenarios and the multiple decision points to implement,
> would a state machine be more appropriate than a software routine
> (procedure/function/algorithm/...)?
>
> I could potentially delegate the fpga to routine tasks like
> moving/erasing/reading/writing blocks, while keeping the whole
> 'intelligence' (or stupidity!) in an easier-to-handle software stack.
>
> I think the customer should be involved in this reasoning and we should
> share the risk of the technical choice. Including a software stack would
> make the budget skyrocket, but it should be put on the table somehow.
The *first* thing I would do is get some big (technical) muck-a-mucks from the memory supplier in and hammer them with what-if's. Depending on how much of your design you can freely "share", they may be able to offer you some simple guidelines that make a world of difference in the performance you can expect *and* the methods by which you manage (and measure) that performance.

They may even have some IP that they can share with you. Or, put you in touch with some *other* customer that might be able to assist you better (e.g., an SSD manufacturer).

The first part of the journey is KNOWING the problem...
Hi Don,

Don Y <this@is.not.me.com> wrote:
> [much elided; I've got a deadline this evening...]
your follow-up is much appreciated. []
>> Bit flips are handled by the ECC. The probability of events will only
>> tell me how often they will occur and how often I should scrub the
>> cells. If it is too often for the system to cope with I would need to
>> employ an ECC that gives me the possibility to correct *multiple* bit
>> flips, so that I can allow the recovery to take *more* time and still be
>> protected.
>
> My point was, you can design your ECC to deal with the sorts of
> error rates you expect from *functional* devices. If the integrity
> of the device is already suspect (because of usage patterns that
> lead to disturb and fatigue events), then the calculus gets more
> "interesting"... now you have to factor in the reduced reliability
> of the basic device *plus* any information you have as to how those
> additional errors are likely to manifest.
Correct. The device nominally shows some sorts of 'failures' that I need to deal with, plus some which are A. difficult to categorize, because radiation tests on a *black box* don't provide details on the mechanism (manufacturers have the /tendency/ to keep secrets on what's inside), and B. short on available data so far (therefore we might need to perform additional radiation testing and analysis).
> E.g., decades ago, we would design memory tests with the actual
> device geometry in mind -- so we could deliberately tickle the
> aspects of the memory (devices and array) that were most likely
> to give us trouble.
IMO there are two aspects: the memory test and the memory controller test. Testing the memory to see whether it can handle our usage patterns is a different matter and is out of the context of this thread. We want to test the controller, because the component itself is qualified by the manufacturer. Moreover, we want to test the controller while the memory is showing all possible scenarios we have anticipated (and designed to handle), and maybe more. While injecting errors in the memory content is achieved simply by bypassing the ECC, simulating a bad block is not trivial unless a 'crash test' is foreseen and behavior is recorded and analyzed. I'd be interested to read some articles/opinions on the matter if anyone here has some experience.

[]
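To make that concrete, this is the kind of fault-injection scaffolding I have in mind for exercising the controller against a *simulated* NAND (a C sketch, all names and sizes made up; a real bench would also model timing and the spare areas):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_BLOCKS          64
#define SIM_PAGES_PER_BLOCK 16
#define SIM_PAGE_BYTES      512

typedef struct {
    uint8_t data[SIM_PAGES_PER_BLOCK][SIM_PAGE_BYTES];
    bool fail_program;  /* injected: next program reports "bad block" */
    bool fail_erase;    /* injected: next erase reports "bad block"   */
} sim_block_t;

static sim_block_t nand[SIM_BLOCKS];

/* Flip one bit behind the controller's back (bypasses any ECC). */
void inject_bitflip(int blk, int page, int byte, int bit)
{
    nand[blk].data[page][byte] ^= (uint8_t)(1u << bit);
}

/* Make a block behave as a grown bad block on its next program. */
void inject_program_fail(int blk) { nand[blk].fail_program = true; }

/* Called by the controller under test; returning false emulates a program
 * failure, which the controller must answer by remapping and retiring. */
bool sim_program_page(int blk, int page, const uint8_t *buf)
{
    if (nand[blk].fail_program) {
        nand[blk].fail_program = false;
        return false;
    }
    memcpy(nand[blk].data[page], buf, SIM_PAGE_BYTES);
    return true;
}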
>> If I understand you correctly you're talking about wear leveling. This
>
> Wear leveling is a different issue. It *hopes* to "spread the pain"
> around the device.
Uhm, wear leveling algorithms do not *hope* anything; actually they spread the maximum allowed number of E/P cycles all over the device instead of a single block, since endurance is defined _per block_. You count the number of E/P cycles each block has accumulated and you choose to write data in the available blocks with the fewest accumulated E/P cycles (for example). The E/P cycles available become effectively multiplied by the number of available blocks. (A minimal sketch of this counting approach follows below.)

Another simple wear leveling approach would be to select randomly among the pool of empty blocks. This technique does not require any 'counting' and is simple enough to implement (seeding the pseudo-random generator with a monotonic parameter like time, in order to avoid repeating the same 'pseudo-random' sequence).

[]
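The counting approach, sketched in C, assuming the controller keeps a RAM table of per-block erase counts (all names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096

typedef struct {
    uint32_t erase_count;   /* accumulated E/P cycles */
    bool     free;          /* erased and available   */
    bool     bad;           /* retired block          */
} block_info_t;

static block_info_t blocks[NUM_BLOCKS];

/* Return the index of the least-worn free block, or -1 if none. */
int pick_block_least_worn(void)
{
    int best = -1;
    uint32_t best_count = UINT32_MAX;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].free && !blocks[i].bad &&
            blocks[i].erase_count < best_count) {
            best_count = blocks[i].erase_count;
            best = i;
        }
    }
    return best;
}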
> A block gives you an excessive number of read errors (deal with
> the "write/erase" issue later). So, you schedule it to be "refreshed"
> (erased and rewritten). Note that this might also involve *replacement*
> depending on implementation. But, regardless, you (as an omnipotent
> human) can remember that it was *that* (physical) block that gave you
> the "problem".
>
> Eventually, you reuse that block (again, depends on your algorithm;
> it might get reused for *this* "refreshed" data -- or, some other
> purpose). When it "gives you a problem" (again, we're just looking at
> READ errors to simplify the discussion), do you repeat the same
> handling practice that you did the *first* time you encountered
> a problem with the block? E.g., what if the *second* read from the
> block trips lots of ECC activity in that block? Do you use your
> knowledge of the fact that this block had previously "caused
> problems" to *bias* your decision to replace it *earlier* than
> you would some "less error prone" block? Do you ever mark the block
> as "bad" and take it out of service (because experience has taught
> you that it's just going to end up being replaced *quickly*, AGAIN,
> when put back into service)?
Depending on your ECC there might be a limited choice, i.e. assume your ECC recovers only one bit flip and reports multiple ones. Whenever I detect (and correct) a *single* error, I have no choice but to act; otherwise, if I wait too long, my ECC won't help me anymore.

What to do then? Well, here as well there are not so many possible actions:
1. move the content somewhere else (contributes to wear leveling)
2. refresh the data in place

OTOH if the ECC allows you to recover multiple bits, you may still end up with the same logic, but instead of acting on the first error detected *and* corrected, you would act on the 27th! (A sketch of this threshold policy follows below.)

There are attempts to show some sorts of error patterns that are more likely to show up and hence can be managed accordingly, but I'm fond of the K.I.S.S. principle and I'm not so much interested in building a read/write-pattern-aware wear leveling algorithm :-).

Remember also that commercial use cases are far different from HiRel ones. Even though your usb flash is rated for 30K cycles, it may go way beyond that limit and nothing happens until something really stops working. We have the part stress analysis, which verifies whether we are using the device beyond the manufacturer's maximum limits.
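For illustration, the decision itself can be as small as this (C sketch; the threshold and the names are hypothetical -- 1 for a single-error-correcting code, higher for a multi-bit one):

#include <stdint.h>

#define CORRECTED_BITS_THRESHOLD 1   /* 1 for SEC Hamming; e.g. 27 for a
                                        hypothetical 32-bit-correcting code */

typedef enum { BLOCK_OK, BLOCK_NEEDS_REFRESH } block_action_t;

/* Called after each page read with the ECC's corrected-bit count. */
block_action_t on_page_read(uint32_t corrected_bits)
{
    /* Once a read needs this many corrections, the data must be rewritten
     * (in place or elsewhere) before the next flip exceeds the ECC's
     * correction capability. */
    if (corrected_bits >= CORRECTED_BITS_THRESHOLD)
        return BLOCK_NEEDS_REFRESH;
    return BLOCK_OK;
}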
> I.e., how much state can you afford to track AND BASE YOUR CONTROL
> ALGORITHMS ON before it becomes too complex for a "dedicated bit
> of control logic" (FPGA)?
>
> I don't know the answer to this. I'm sure a lot of similar thinking
> goes into the design of flash controllers in SSD's, etc. But, they
> have an "out" that you don't -- the customer can replace an SSD
> that isn't living up to expectations "for very little money". You,
> OTOH, have a potential *asset* that they don't -- you *know* how
> you will be using YOUR memory and also know "what's important" therein.
> So, if pushed to the wall, you can take steps to ensure what's important
> is preserved, regardless.
That is exactly why we want to clearly define the scope and the capabilities that we are able to implement in an FPGA and let the customer mitigate the risk of failure at his level (FDIR at system level). []
>> That said, I'm starting to wonder whether it would be more reliable to
>> include a small-footprint microcontroller (like an mb-lite) to handle
>> the flash and provide a virtual device to the user(s).
>
> Well, anything that you could do with a microcontroller can also be
> done with logic (just design one and an instruction store! :> )
> You might, however, find it easier (and more economical) to implement
> a "sequential machine" to make it easier for folks to implement
> these sorts of things (people think much better serially than in
> parallel; and, there probably is no real need for parallelism, here!)
The advantage of the microcontroller is not the hardware it represents but the languages that can be used to instruct it. I can easily see how moving a buffer from one point to another, dealing with the low-level protocol of the flash interface, is a perfect task for an FPGA; but taking the decision about what to do on a failure is where the software has an added value and can be 'economical', and possibly more reliable, given that the complexity it can handle is certainly superior (I doubt I'd be able to write a journaled flash file system with bad block management, wear leveling, scrubbing and garbage collection within an FPGA!).

But it's easy to reason in terms of extremes, where it's obvious what to do. What I find difficult is to find the right level of complexity where it'd be better to split the task in two subsystems, the sw and the fpga.
>> Indeed the question could be made broader: what is more appropriate for
>> the fpga to do and for the software to do in this specific use case?
>> Considering the scenarios and the multiple decision points to implement,
>> would a state machine be more appropriate than a software routine
>> (procedure/function/algorithm/...)?
[]
> The *first* thing I would do is get some big (technical) muck-a-mucks
> from the memory supplier in and hammer them with what-if's. Depending
> on how much of your design you can freely "share", they may be able to
> offer you some simple guidelines that make a world of difference in the
> performance you can expect *and* the methods by which you manage (and
> measure) that performance.
>
> They may even have some IP that they can share with you. Or, put you in
> touch with some *other* customer that might be able to assist you
> better (e.g., an SSD manufacturer).
>
> The first part of the journey is KNOWING the problem...
That is indeed an extremely good idea. Not sure though how much *they* are willing to share... but definitely worth asking.

Al
Hi Al,

On 1/15/2015 4:15 PM, alb wrote:
>>> If I understand you correctly you're talking about wear leveling. This
>>
>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>> around the device.
>
> Uhm, wear leveling algorithms do not *hope* anything; actually they
> spread the maximum allowed number of E/P cycles all over the device
> instead of a single block, since endurance is defined _per block_.
In an ideal world (process), all blocks would be identical in terms of performance, durability, etc. In that case, you can trivially cycle blocks through the algorithm in any "fair" order and never see any difference in performance (i.e., durability/error resistance).

However, processes aren't ideal. And, data "patterns" only add to the uncertainty. You can't *know* that each block -- having ALL been "recycled" exactly N times -- will exhibit the same sorts of "read failures" (again, just talking about reads to simplify the discussion) when asked to store a particular data pattern (in the context of the *other* data patterns located "around" it).

You should be able to test this by deliberately exhausting some sample devices and/or partial writes (cut the write timer short, deliberately) to *some* blocks but not others. I.e., don't let entropy average all of that out across the device. *Then*, try to use the device with a "fair" block swapping algorithm and track how long each *actual* (physical) block "survives" (provides reliable data below ECC threshold that suggests a need to rescrub) before your controller decides that it needs to be "recycled" -- erased and rewritten (or, reused to replace another block).

If you can associate a metric with each physical block that represents its historical performance, you can "level" based on *performance* instead of *use*.

I.e., one leveling algorithm just moves recycled blocks to the end of the queue and always picks the first block in the queue for use when some existing block needs to be recycled out of service. This ensures each block gets *erased* the same number of times -- regardless of how long a particular physical block "survived" IN ITS MOST RECENT (or sum of all preceding) USE.

You could, instead, track the number of *reads*-before-error-limit for each block and arrange for those with the highest "durability" to be reused *first*. Over time, their performance will degrade (i.e., they will start to exhibit more errors, sooner) so that some other block that had previously been considered "less durable" is regarded as MORE durable than this "worn" block.

[Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]

I'm oversimplifying this as there are other issues related to geometry that play a part. My point is to show that *how* you choose to reuse blocks can be implemented in different ways with different criteria. The point being that you want to minimize the number of "recycle events" that you take on -- especially if the block you *pick* as a replacement block may not write properly... so, you end up with a "double fault" of sorts.
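A sketch of what I mean, in C (the smoothing constant and all names are invented for illustration, not a recommendation):

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096

typedef struct {
    uint32_t reads_since_refresh;  /* reads survived in current use      */
    uint32_t durability;           /* smoothed reads-before-error metric */
    bool     free;
    bool     bad;
} block_perf_t;

static block_perf_t blk[NUM_BLOCKS];

/* Called when block b finally trips the ECC threshold and is recycled. */
void on_recycle(int b)
{
    /* Exponential smoothing: 3/4 old history, 1/4 latest observation. */
    blk[b].durability = (3 * blk[b].durability +
                         blk[b].reads_since_refresh) / 4;
    blk[b].reads_since_refresh = 0;
    blk[b].free = true;            /* after the actual erase, of course */
}

/* Pick the free block with the best measured durability, or -1 if none. */
int pick_most_durable(void)
{
    int best = -1;
    uint32_t best_score = 0;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blk[i].free && !blk[i].bad && blk[i].durability >= best_score) {
            best_score = blk[i].durability;
            best = i;
        }
    }
    return best;
}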
> You count the number of E/P cycles each block has accumulated and you
> choose to write data in the available blocks with the fewest accumulated
> E/P cycles (for example). The E/P cycles available become effectively
> multiplied by the number of available blocks.
That's the "each block gets erased the same number of times" approach I described above. Wouldn't you rather find a set of blocks that *needs* to be erased the least number of times??
> Another simple wear leveling approach would be to select randomly among
> the pool of empty blocks. This technique does not require any 'counting'
> and is simple enough to implement (seeding the pseudo-random generator
> with a monotonic parameter like time, in order to avoid repeating the
> same 'pseudo-random' sequence).
>
> []
>
>> A block gives you an excessive number of read errors (deal with
>> the "write/erase" issue later). So, you schedule it to be "refreshed"
>> (erased and rewritten). Note that this might also involve *replacement*
>> depending on implementation. But, regardless, you (as an omnipotent
>> human) can remember that it was *that* (physical) block that gave you
>> the "problem".
>>
>> Eventually, you reuse that block (again, depends on your algorithm;
>> it might get reused for *this* "refreshed" data -- or, some other
>> purpose). When it "gives you a problem" (again, we're just looking at
>> READ errors to simplify the discussion), do you repeat the same
>> handling practice that you did the *first* time you encountered
>> a problem with the block? E.g., what if the *second* read from the
>> block trips lots of ECC activity in that block? Do you use your
>> knowledge of the fact that this block had previously "caused
>> problems" to *bias* your decision to replace it *earlier* than
>> you would some "less error prone" block? Do you ever mark the block
>> as "bad" and take it out of service (because experience has taught
>> you that it's just going to end up being replaced *quickly*, AGAIN,
>> when put back into service)?
>
> Depending on your ECC there might be a limited choice, i.e. assume your
> ECC recovers only one bit flip and reports multiple ones. Whenever I
> detect (and correct) a *single* error, I have no choice but to act;
> otherwise, if I wait too long, my ECC won't help me anymore.
This depends on the nature of the source of your errors. E.g., do you replace the RAM in your PC when it reports *any* CORRECTED errors? (I've asked this question before and it's fun to watch people try to rationalize how they use that "feature" -- do they even *know* when there are error events being corrected? when do they decide there are "too many"?)

I.e., if you are looking for *any* error to indicate a lack of trust in your data, then you might want an ECC that could correct one and *detect* ANY NUMBER (many polynomials will only reliably detect a certain number of errors; beyond that, the data "looks" OK -- even if it is rife with errors!)
> What to do then? Well, here as well there are not so many possible actions:
> 1. move the content somewhere else (contributes to wear leveling)
> 2. refresh the data in place
"In place" is essentially preselecting the *same* block for use with the "new" (which is actually "old"!) data.
> OTOH if the ECC allows you to recover multiple bits, you may still end
> up with the same logic, but instead of acting on the first error
> detected *and* corrected, you would act on the 27th!
>
> There are attempts to show some sorts of error patterns that are more
> likely to show up and hence can be managed accordingly, but I'm fond
> of the K.I.S.S. principle and I'm not so much interested in building a
> read/write-pattern-aware wear leveling algorithm :-).
The resource that you can't exhaust is write-erase cycles. So, you really want to come up with an algorithm that minimizes the number of times you "give up" on a written block -- in favor of refreshing it here or elsewhere. That's why I was saying (earlier) to rethink your algorithm to track "measured durability" (i.e., how *long* before errors detected) and use that to govern the selection of replacement blocks -- instead of just a FIFO (or even random) algorithm.
> Remember also that commercial use cases are far different from HiRel
> ones. Even though your usb flash is rated for 30K cycles, it may go way
> beyond that limit and nothing happens until something really stops
> working.
>
> We have the part stress analysis, which verifies whether we are using
> the device beyond the manufacturer's maximum limits.
>>> That said, I'm starting to wonder whether it would be more reliable to
>>> include a small-footprint microcontroller (like an mb-lite) to handle
>>> the flash and provide a virtual device to the user(s).
>>
>> Well, anything that you could do with a microcontroller can also be
>> done with logic (just design one and an instruction store! :> )
>> You might, however, find it easier (and more economical) to implement
>> a "sequential machine" to make it easier for folks to implement
>> these sorts of things (people think much better serially than in
>> parallel; and, there probably is no real need for parallelism, here!)
>
> The advantage of the microcontroller is not the hardware it represents
> but the languages that can be used to instruct it. I can easily see how
> moving a buffer from one point to another, dealing with the low-level
> protocol of the flash interface, is a perfect task for an FPGA; but
> taking the decision about what to do on a failure is where the software
> has an added value and can be 'economical', and possibly more reliable,
> given that the complexity it can handle is certainly superior (I doubt
> I'd be able to write a journaled flash file system with bad block
> management, wear leveling, scrubbing and garbage collection within an
> FPGA!).
Exactly. As I said, hardware is great for parallel operations. But, when the algorithm being implemented is thought of in a serial fashion, it is often easier to resort to a sequencer -- a microcontroller.

WHEN this happens, do that... UNLESS this other thing happens, in which case...
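In software, that sequencer is just an event loop -- trivially readable, which is the whole attraction (a C sketch; the event and helper names are hypothetical):

typedef enum {
    EV_READ_DONE, EV_ECC_CORRECTED, EV_PROGRAM_FAIL, EV_SCRUB_TICK
} event_t;

/* Provided by the FPGA datapath / platform layer (hypothetical). */
extern event_t wait_next_event(void);
extern int  current_block(void);
extern void deliver_data(void);
extern void schedule_refresh(int block);
extern void retire_block(int block);
extern void retry_elsewhere(void);
extern void scrub_next_region(void);

void ftl_policy_loop(void)
{
    for (;;) {
        switch (wait_next_event()) {
        case EV_READ_DONE:     deliver_data();                    break;
        case EV_ECC_CORRECTED: schedule_refresh(current_block()); break;
        case EV_PROGRAM_FAIL:  retire_block(current_block());     /* UNLESS... */
                               retry_elsewhere();                 break;
        case EV_SCRUB_TICK:    scrub_next_region();               break;
        }
    }
}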
> But it's easy to reason in terms of extremes, where it's obvious what to
> do. What I find difficult is to find the right level of complexity where
> it'd be better to split the task in two subsystems, the sw and the fpga.
Good luck! I've got to bring another server on-line tonight so I can dispose of its predecessor this weekend!
On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
> Don Y <this@is.not.me.com> wrote:
>
> []
>
> Another simple wear leveling approach would be to select randomly among
> the pool of empty blocks. This technique does not require any 'counting'
> and is simple enough to implement (seeding the pseudo-random generator
> with a monotonic parameter like time, in order to avoid repeating the
> same 'pseudo-random' sequence).
I'm curious: is random selection better than a simple increase (with wrap-around)?

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
Hi Boudewijn,

On 1/16/2015 1:43 AM, Boudewijn Dijkstra wrote:
> On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
>> Don Y <this@is.not.me.com> wrote:
>>
>> Another simple wear leveling approach would be to select randomly among
>> the pool of empty blocks. This technique does not require any 'counting'
>> and is simple enough to implement (seeding the pseudo-random generator
>> with a monotonic parameter like time, in order to avoid repeating the
>> same 'pseudo-random' sequence).
>
> I'm curious: is random selection better than a simple increase (with
> wrap-around)?
I would suspect *not*.

First, what constitutes "random"? How do you ensure that the results truly *are* "random"? Over what interval? And, what are the consequences of "random but unfortunate" picks? E.g., 587111111111111111110345... can be random digits. OTOH, if the selection indicated by each digit were applied to something that had a life expectancy of "10 invocations", that string of 17 ones could prove problematic!

I think Al is better served by a more deterministic algorithm. Esp in light of the fact that he doesn't really get a second chance at this!

[Randomness is an interesting concept. In gam[bl]ing devices, there's always the issue of "how random is random". If a video poker game dealt four successive royal flushes, you'd be hard pressed convincing ANYONE that this was a "random" event! OTOH, if the player was going to be seated there for another 1,000,000 hands (in which he NEVER sees another royal flush), this might be acceptable!]
On Thu, 15 Jan 2015 21:46:27 -0700, Don Y wrote:

> Hi Al,
>
> On 1/15/2015 4:15 PM, alb wrote:
>>>> If I understand you correctly you're talking about wear leveling.
>>>> This
>>>
>>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>>> around the device.
>>
>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>> spread the maximum allowed number of E/P cycles all over the device
>> instead of a single block, since endurance is defined _per block_.
>
> In an ideal world (process), all blocks would be identical in terms of
> performance, durability, etc. In that case, you can trivially cycle
> blocks through the algorithm in any "fair" order and never see any
> difference in performance (i.e., durability/error resistance).
>
> However, processes aren't ideal. And, data "patterns" only add to the
> uncertainty. You can't *know* that each block -- having ALL been
> "recycled" exactly N times -- will exhibit the same sorts of "read
> failures" (again, just talking about reads to simplify the discussion)
> when asked to store a particular data pattern (in the context of the
> *other* data patterns located "around" it).
>
> You should be able to test this by deliberately exhausting some sample
> devices and/or partial writes (cut the write timer short, deliberately)
> to *some* blocks but not others. I.e., don't let entropy average all of
> that out across the device. *Then*, try to use the device with a "fair"
> block swapping algorithm and track how long each *actual*
> (physical) block "survives" (provides reliable data below ECC threshold
> that suggests a need to rescrub) before your controller decides that it
> needs to be "recycled" -- erased and rewritten (or, reused to replace
> another block).
>
> If you can associate a metric with each physical block that represents
> its historical performance, you can "level" based on *performance*
> instead of *use*.
>
> I.e., one leveling algorithm just moves recycled blocks to the end of
> the queue and always picks the first block in the queue for use when
> some existing block needs to be recycled out of service. This ensures
> each block gets *erased* the same number of times -- regardless of how
> long a particular physical block "survived" IN ITS MOST RECENT (or sum
> of all preceding) USE.
>
> You could, instead, track the number of *reads*-before-error-limit for
> each block and arrange for those with the highest "durability" to be
> reused *first*. Over time, their performance will degrade (i.e., they
> will start to exhibit more errors, sooner) so that some other block that
> had previously been considered "less durable" is regarded as MORE
> durable than this "worn" block.
>
> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
Clear or not, the message is getting through. :)

If I get it: instead of assuming all the blocks are created equal, assume blocks are different and have the algorithm treat different blocks differently.

Perhaps, instead of tracking block history (for every block - with info stored where?), an algorithm can be made that uses a "descriptor" of sorts to inform its thinking about blocks. So, instead of the algorithm making notes about unreliable blocks, it gets them hardcoded from the programmer.

I'm absolutely not sure how you would generate that descriptor. Especially if you assume every device is different.
> <sniped a huge amount>
Hi Aleksandar,

On 1/17/2015 10:35 AM, Aleksandar Kuktin wrote:
> On Thu, 15 Jan 2015 21:46:27 -0700, Don Y wrote:
>> On 1/15/2015 4:15 PM, alb wrote:
>>>>> If I understand you correctly you're talking about wear leveling.
>>>>> This
>>>>
>>>> Wear leveling is a different issue. It *hopes* to "spread the pain"
>>>> around the device.
>>>
>>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>>> spread the maximum allowed number of E/P cycles all over the device
>>> instead of a single block, since endurance is defined _per block_.
>>
>> In an ideal world (process), all blocks would be identical in terms of
>> performance, durability, etc. In that case, you can trivially cycle
>> blocks through the algorithm in any "fair" order and never see any
>> difference in performance (i.e., durability/error resistance).
>> If you can associate a metric with each physical block that represents
>> its historical performance, you can "level" based on *performance*
>> instead of *use*.
>> You could, instead, track the number of *reads*-before-error-limit for
>> each block and arrange for those with the highest "durability" to be
>> reused *first*. Over time, their performance will degrade (i.e., they
>> will start to exhibit more errors, sooner) so that some other block that
>> had previously been considered "less durable" is regarded as MORE
>> durable than this "worn" block.
>>
>> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
> If I get it: instead of assuming all the blocks are created equal, assume
> blocks are different and have the algorithm treat different blocks
> differently.
Well, all blocks are *never* completely "equal". There are always variations. My point is to monitor performance (with <some> metric) and use that to bias block selection to favor those that (have) "perform(ed) best" -- from among those available for use.
> Perhaps, instead of tracking block history (for every block - with info
> stored where?), an algorithm can be made that uses a "descriptor" of
> sorts to inform its thinking about blocks. So, instead of the algorithm
> making notes about unreliable blocks, it gets them hardcoded from the
> programmer.
I don't think it can be as simple as that. It's not like you know that "this block is BAD". Rather, this block is *better* (than those others). Furthermore, it isn't a static quality.

The data *in* the block will determine how likely it is to create a particular error (again, looking just at read operations). And, how "worn" the individual cells within that block are -- how well they can hold/isolate charge on their respective gates.

But, also, the access patterns of blocks "electrically adjacent" (poorly defined concept) will determine the likelihood for a read disturb event to corrupt the data in *this* block. As those read patterns may change over time, *one* (actually identical in terms of reliability!) block may appear worse than another *now*; yet, when the access pattern changes, the "other" may appear worse. Despite the fact that both are actually "equally reliable". Likewise, the number of write-erase cycles fatigues the cells in the block.

I.e., there are lots of factors that influence how well a given set of cells can hold data. You would need to know a *lot* about your application's access patterns *AND* couple that with knowledge of your block replacement algorithm to be able to determine "at design time" where the "most vulnerable" blocks were (physically) located.

Instead, I'm suggesting you simply measure how well this block is performing with *these* contents, some previous number of write-erase cycles as well as the "neighboring data access patterns", happening *now*. Assuming failures *slowly* creep into the data, then just *wait* until you get your first (Nth?) ECC event before recycling the block. But, recall how well it performed (how many read cycles before its contents were noticeably disturbed -- to the point where the ECC had to step in!). Use this information to adjust your "block selection" algorithm when you need to recycle/refresh a block -- later.

As conditions (local to each particular block) change over time, the "performance" of individual blocks will also change. They will rise and fall in the list of "best performing blocks" to reflect that performance.

I've no idea how well this would perform in any *particular* application. You can bet the folks who build SSD's try hard to refine these sorts of algorithms so they can keep their margins high, costs low and performance up! (and, hopefully, without the complexity of the housekeeping leading to lots of errors in the implementation!)

Another approach is simply to over-provision the hell out of things and hope for the best! :>
> I'm absolutely not sure how you would generate that descriptor.
> Especially if you assume every device is different.
Hi Boudewijn,

Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
> On Fri, 16 Jan 2015 00:15:03 +0100, alb <al.basili@gmail.com> wrote:
>> Don Y <this@is.not.me.com> wrote:
>>
>> []
>>
>> Another simple wear leveling approach would be to select randomly among
>> the pool of empty blocks. This technique does not require any 'counting'
>> and is simple enough to implement (seeding the pseudo-random generator
>> with a monotonic parameter like time, in order to avoid repeating the
>> same 'pseudo-random' sequence).
>
> I'm curious: is random selection better than a simple increase (with
> wrap-around)?
In terms of spreading usage across the device, the main difference between a simple increase-with-wrap-around and a random choice shows up across power off/on cycles. In the first case you would need to store somewhere in your flash memory where the last pointer of your 'increase and wrap-around' algorithm was, while in the second case you simply don't. The reduced amount of metadata management can simplify the task and reduce complexity.

Moreover, while in a perfect world power outages are maybe not considered, the simple fact that power may be cut while you are storing that metadata can screw up your 'increase and wrap-around' algorithm and you'll have uneven usage, while a random approach simply would not care.
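Something like this C sketch is what I have in mind (hypothetical names; a 32-bit maximal-length Galois LFSR, taps 32/22/2/1): no persisted "next block" pointer, so a power cut cannot corrupt the allocator's metadata, and the seed comes from a monotonic source so the sequence differs across power cycles.

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4096u

extern bool block_is_free(uint32_t block);      /* from the bad-block table */
extern uint32_t mission_elapsed_seconds(void);  /* monotonic seed source    */

static uint32_t lfsr;

void allocator_init(void)
{
    lfsr = mission_elapsed_seconds() | 1u;      /* LFSR must not start at 0 */
}

uint32_t pick_block_random(void)
{
    for (uint32_t tries = 0; tries < 8u * NUM_BLOCKS; tries++) {
        lfsr = (lfsr >> 1) ^ (-(lfsr & 1u) & 0x80200003u);
        uint32_t candidate = lfsr % NUM_BLOCKS;
        if (block_is_free(candidate))
            return candidate;
    }
    /* Degenerate case (almost-full device): fall back to a linear scan. */
    for (uint32_t b = 0; b < NUM_BLOCKS; b++)
        if (block_is_free(b))
            return b;
    return NUM_BLOCKS;                          /* no free block at all */
}

Al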
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
>> Uhm, wear leveling algorithms do not *hope* anything; actually they
>> spread the maximum allowed number of E/P cycles all over the device
>> instead of a single block, since endurance is defined _per block_.
>
> In an ideal world (process), all blocks would be identical in terms of
> performance, durability, etc. In that case, you can trivially cycle
> blocks through the algorithm in any "fair" order and never see any
> difference in performance (i.e., durability/error resistance).
>
> However, processes aren't ideal. And, data "patterns" only add to the
> uncertainty. You can't *know* that each block -- having ALL been
> "recycled" exactly N times -- will exhibit the same sorts of "read
> failures" (again, just talking about reads to simplify the discussion)
> when asked to store a particular data pattern (in the context of the
> *other* data patterns located "around" it).
>
> You should be able to test this by deliberately exhausting some sample
> devices and/or partial writes (cut the write timer short,
> deliberately) to *some* blocks but not others. I.e., don't let
> entropy average all of that out across the device. *Then*, try to use
> the device with a "fair" block swapping algorithm and track how long
> each *actual* (physical) block "survives" (provides reliable data
> below ECC threshold that suggests a need to rescrub) before your
> controller decides that it needs to be "recycled" -- erased and
> rewritten (or, reused to replace another block).
The main point here is your reference to *entropy*. In the end you have to take *entropy* into account no matter what kind of data patterns you use. Blocks that show better performance now will be used *more*, and your /wear-leveling/ is not leveling anymore: it induces more wear in the preferred blocks, hence reducing their performance. (Actually, if we limit the discussion to the 'read disturb' issue, I'm not convinced that there's any wear going on.)
> If you can associate a metric with each physical block that represents
> its historical performance, you can "level" based on *performance*
> instead of *use*.
>
> I.e., one leveling algorithm just moves recycled blocks to the end of
> the queue and always picks the first block in the queue for use when
> some existing block needs to be recycled out of service. This ensures
> each block gets *erased* the same number of times -- regardless of how
> long a particular physical block "survived" IN ITS MOST RECENT (or sum
> of all preceding) USE.
>
> You could, instead, track the number of *reads*-before-error-limit for
> each block and arrange for those with the highest "durability" to be
> reused *first*. Over time, their performance will degrade (i.e., they
> will start to exhibit more errors, sooner) so that some other block
> that had previously been considered "less durable" is regarded as MORE
> durable than this "worn" block.
Then what would be the benefit w.r.t. equally spreading the load? Another point is that unfortunately we *cannot* avoid taking into account other failure modes (on write/erase cycles), since they *will* certainly induce wear. Again, any metric that would /rank/ blocks according to performance would lead to uneven wear, which you eventually need to compensate for if you want to keep the memory size /constant/ (i.e. no loss of blocks due to wear).
> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
I think I got your point, but I'm not seeing the added value of any ranking mechanism.
> I'm oversimplifying this as there are other issues related to geometry
> that play a part. My point is to show that *how* you choose to reuse
> blocks can be implemented in different ways with different criteria.
> The point being that you want to minimize the number of "recycle
> events" that you take on -- especially if the block you *pick* as a
> replacement block may not write properly... so, you end up with a
> "double fault" of sorts.
But you need to factor in that behavior anyhow, since betting on the best performance is not going to pay off in the long run, and sooner or later you'll be facing 'double fault' sorts of errors anyhow. If your algorithm is not ready to take that into account, you'll be in for a nasty surprise.
>> You count the number of E/P cycles each block has accumulated and you
>> choose to write data in the available blocks with the fewest accumulated
>> E/P cycles (for example). The E/P cycles available become effectively
>> multiplied by the number of available blocks.
>
> That's the "each block gets erased the same number of times" approach
> I described above. Wouldn't you rather find a set of blocks that
> *needs* to be erased the least number of times??
Assuming you find the necessary metric to /rank/ the blocks, what is the overall *gain* if you end up wearing those *more* than others and effectively losing that gap in performance more quickly? []
>> Depending on your ECC there might be a limited choice, i.e. assume
>> your ECC recovers only one bit flip and reports multiple ones.
>> Whenever I detect (and correct) a *single* error, I have no choice
>> but to act; otherwise, if I wait too long, my ECC won't help me
>> anymore.
>
> This depends on the nature of the source of your errors. E.g., do you
> replace the RAM in your PC when it reports *any* CORRECTED errors?
> (I've asked this question before and it's fun to watch people try to
> rationalize how they use that "feature" -- do they even *know* when
> there are error events being corrected? when do they decide there are
> "too many"?)
AFAIK the ECCs used in PCs are more like BCH codes, with low overhead and strong correction capability at the expense of a more complex algorithm. You could set a threshold on the amount of detected and corrected errors and take an action only when the threshold is crossed. You still have reliable data, but you'd better correct those errors before the capability of your ECC is exceeded.

In my case, in order to reduce complexity in the fabric (the algorithm will be implemented in an FPGA), I'll be more inclined towards a simple Hamming code that can correct *one* error and detect *two* (a minimal sketch follows below). Therefore I'll need to take action on *any* error.

[]
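To show how small such a SEC-DED (extended Hamming) really is, here is a 4-bit-payload sketch in C; a real implementation would work on whole pages and live in the fabric, and __builtin_parity is the GCC/Clang builtin. All names are illustrative only:

/* Codeword layout (bit7..bit0): [p0 p1 p2 d1 p4 d2 d3 d4];
 * p0 is the overall parity bit, p1/p2/p4 the Hamming check bits. */
#include <stdint.h>

typedef enum { ECC_CLEAN, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_result_t;

uint8_t secded_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 3) & 1, d2 = (nibble >> 2) & 1;
    uint8_t d3 = (nibble >> 1) & 1, d4 = nibble & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;              /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;              /* covers positions 2,3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;              /* covers positions 4,5,6,7 */
    uint8_t cw = (uint8_t)((p1 << 6) | (p2 << 5) | (d1 << 4) |
                           (p4 << 3) | (d2 << 2) | (d3 << 1) | d4);
    return (uint8_t)((__builtin_parity(cw) << 7) | cw);  /* overall parity */
}

ecc_result_t secded_decode(uint8_t cw, uint8_t *nibble)
{
    uint8_t b[8];                       /* b[i] = bit at Hamming position i */
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (7 - i)) & 1;

    uint8_t syndrome = (uint8_t)(((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2) |
                                 ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1) |
                                  (b[1] ^ b[3] ^ b[5] ^ b[7]));
    int parity_ok = (__builtin_parity(cw) == 0);   /* includes p0 itself */

    if (syndrome != 0 && parity_ok)
        return ECC_UNCORRECTABLE;       /* two flips: detected, not fixed */
    if (syndrome != 0)
        b[syndrome] ^= 1;               /* one flip: syndrome = position  */
    /* syndrome == 0 && !parity_ok: p0 itself flipped; data is intact */

    *nibble = (uint8_t)((b[3] << 3) | (b[5] << 2) | (b[6] << 1) | b[7]);
    return (parity_ok && syndrome == 0) ? ECC_CLEAN : ECC_CORRECTED;
}

With a code like this, any ECC_CORRECTED event has to trigger a refresh, exactly as argued above.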
>> There are attempts to show some sorts of error patterns that are more
>> likely to show up and hence can be managed accordingly, but I'm
>> fond of the K.I.S.S. principle and I'm not so much interested in
>> building a read/write-pattern-aware wear leveling algorithm :-).
>
> The resource that you can't exhaust is write-erase cycles. So, you
> really want to come up with an algorithm that minimizes the number of
> times you "give up" on a written block -- in favor of refreshing it
> here or elsewhere. That's why I was saying (earlier) to rethink your
> algorithm to track "measured durability" (i.e., how *long* before
> errors detected) and use that to govern the selection of replacement
> blocks -- instead of just a FIFO (or even random) algorithm.
The thing is that a metric 'accumulated' on read operations does not necessarily tell you anything about wear level. AFAIK there's no indication that your write-erase cycle count is *close* to the end of life. 'Read disturb' errors are not related to wear, and neither are errors induced by heavy ions, so it would be difficult to make an algorithm that ranks the /strength/ of each block.

[]
>> The advantage of the microcontroller is not the hardware it
>> represents but the languages that can be used to instruct it. I can
>> easily see how moving a buffer from one point to another, dealing
>> with the low-level protocol of the flash interface, is a perfect
>> task for an FPGA; but taking the decision about what to do on a
>> failure is where the software has an added value and can be
>> 'economical', and possibly more reliable, given that the complexity
>> it can handle is certainly superior (I doubt I'd be able to write a
>> journaled flash file system with bad block management, wear leveling,
>> scrubbing and garbage collection within an FPGA!).
>
> Exactly. As I said, hardware is great for parallel operations. But,
> when the algorithm being implemented is thought of in a serial
> fashion, it is often easier to resort to a sequencer -- a
> microcontroller.
>
> WHEN this happens, do that... UNLESS this other thing happens, in
> which case...
I've seen an interesting presentation (http://congrexprojects.com/docs/12c25_2510/15errico_tuccio_adcss2012_sitael_we.pdf?sfvrsn=2) on a memory controller implemented on an FPGA. It claims to fit in 10% of a 3MGate-equivalent FPGA (ProASIC3E A3PE3000), which seems a bit less than what I expected. It would be interesting to see how many resources a small embedded microcontroller implementing the FTL in a software stack would take.

Al