Reply by Don Y January 20, 2015
Hi Al,

On 1/20/2015 3:08 AM, alb wrote:
> Don Y <this@is.not.me.com> wrote: > If the only aim is to preserve write-erase cycle than I wouldn't care, > the aim would be to guarantee equal number of cycles. But if the aim is > different (as it sounds like) than it becomes more important to rank the > usage with some metrics and level performance instead of write-erasure > cycles. > > In the end you'll have an uneven distribution of write-erase cycles, but > an equally distributed 'probability' to show errors.
Exactly. What you're interested in is preserving data. If the data exhibits no errors, why "do anything" *to* it? And, if you have one portion that exhibits a higher error rate, then *it* (all else being equal) drives the performance of your store. If you can effectively *improve* its performance at the expense of portions that are currently performing *better*...
>>> Again any metric that would /rank/ blocks according to performance >>> would lead to uneven wear that you need to compensate at the end if >>> you want to keep memory size /constant/ (i.e. no loss of blocks due >>> to wear). >> >> As a block wears, it's performance metric *degrades* (assuming wear >> causes ECC events to be triggered more frequently). So, a "preferred >> block" eventually loses that status to other blocks that previously >> had *worse* "performance". I.e., over time, the *performance* of >> blocks is what you end up leveling -- not the number of write-erase >> cycles! > > Exactly. Whether this is a more preferrable approach I have hard time to > judge.
It's not as simple as this argument has been, to date. I've deliberately avoided other "real" issues that add even more layers of complexity to the analysis, design, argument, etc. E.g., imagine the existing data have settled into nice, "very reliable" blocks (i.e., no ECC events). But, you need to do a write (because some data has changed). And, by chance, your spare block list is exhausted! Which block do you choose -- from among those *reliable* blocks?
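To make the two selection policies concrete, here is a minimal C sketch (the data structure and names are hypothetical, not anyone's actual FTL): the conventional rule picks the least-erased spare, while the "performance leveling" rule under discussion picks the spare with the best measured reads-before-ECC history. The -1 return is exactly the exhausted-spare-pool case above.

/* Minimal sketch of replacement-block selection.  The per-block record
 * is assumed to be kept by the FTL; nothing here is a real controller's
 * API.
 */
#include <stdint.h>
#include <stddef.h>

struct block_stats {
    uint32_t erase_count;        /* write-erase cycles incurred           */
    uint32_t reads_last_life;    /* reads survived before last ECC trip   */
    uint8_t  is_spare;           /* currently in the over-provision pool  */
    uint8_t  is_bad;             /* retired permanently                   */
};

/* by_performance == 0: classic wear leveling (fewest erases wins).
 * by_performance != 0: "performance leveling" (best measured durability
 *                      wins).                                            */
static int pick_replacement(const struct block_stats *b, size_t nblocks,
                            int by_performance)
{
    int best = -1;

    for (size_t i = 0; i < nblocks; i++) {
        if (!b[i].is_spare || b[i].is_bad)
            continue;
        if (best < 0) { best = (int)i; continue; }

        if (by_performance) {
            if (b[i].reads_last_life > b[best].reads_last_life)
                best = (int)i;
        } else {
            if (b[i].erase_count < b[best].erase_count)
                best = (int)i;
        }
    }
    return best;   /* -1: spare pool exhausted (the case mentioned above) */
}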
> What I'm sure about is that in HiRel applications you'll have hard time > to convince the customer that your algorithm is not guaranteeing to be > within spec (max number of write-erase cycles) even if the aim is to get > the most out of the memory.
There are two ways to look at this:

- if the memory is performing correctly, why do you care about the number of write cycles that it has incurred? Do you deliberately end the product's life when the number of write cycles has been attained -- even if it *seems* to be working properly?? "Hello, Mars Rover... turn yourself OFF, now; your design life has been attained!"

- qualify the device so you *know* what sorts of ACTUAL performance you can get from it and design your algorithm(s) around that data.

I routinely encounter consumer kit that has been designed with *way* too little margin (9VDC on 10V caps, etc.). For the consumer, this sucks because the devices invariably have short lives. OTOH, for the manufacturer, it's "getting the most out of the least" (cost). They've *obviously* done the math and realized that these design choices will allow the device to survive the warranty interval (so, *they* never see the cost of repair or replacement).

There is also nothing that prevents you from addressing multiple criteria simultaneously. *Except* the complexity of such a solution! (i.e., again returning to the software vs hardware approach and consequences/opportunities of each)
> It would be like trying to show that your power mosfet is used beyond > max. ratings but 'believe me that this is where I get the most out of > it'. A respectable QA would never allow to slip that out of the house > and if that happens a respectable customer would never allow that to > slip in.
Even if your circuit had provisions to automatically replace the MOSFET when/if it failed? Again, do you automatically decommission the product when it has achieved its *specified* life expectancy? I.e., under nominal loads, each MOSFET will last XXX POHrs. So, *at* XXX hours, we will replace the MOSFET, discarding it as "used" -- even if it has usable life remaining. When we replace the N-th of these devices (after N * XXX hours), we will have run out of MOSFETs and, by design, have to decommission the device. Said another way: "We'll select a bigger MOSFET to handle the increased load and determine the PDF over time that it will survive." What if it *doesn't*?
>>> But you need to factor in that behavior anyhow, since betting on the >>> best performance is not going to pay on the long run and sooner or >>> later you'll be facing 'double fault' sorts of errors anyhow. If your >>> algorithm is not ready to take that into account you'll be in for a >>> nasty surprise. >> >> All I'm changing in the algorithm is the "block selection algorithm" >> -- how you pick *which* block to use from your overprovisioning store >> to handle the *next* "recycle" event. If *it* fails the write, then >> you pick another block -- using the same criteria. I.e., how you >> handle an ECC event *or* a write-fault doesn't change -- just how you >> *pick* which block to use for your recovery attempt after such an >> event. > > And within your selection essentially lies your goal. Leveling > 'performance' might be another strategy which might be the winning one > on some applications, but I don't see how that can be levereged in a > field where it will be never accepted to 'go beyond' the quoted limits.
I'm not claiming this is the right solution for you. Rather, I am indicating how the strategy for managing the flash array can easily become complex. Citing a portion of the conversation several posts back:

>>> All these 'functions' to handle the Flash seem to me very suited
>>> for software but not for hardware. Does anyone here have a
>>> different opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).

followed, in a subsequent post, by:

> Remember, if you are too naive in your implementation, you can increase
> *wear*. I think to get a good algorithm, you probably want to track
> knowledge of the entire *device* -- not just the RECENT history of
> this block/page. (i.e., where did the page that *was* here go? and,
> why? if it had an unusually high error rate, you might not be so keen
> on bringing it back into the rotation -- ever!)
>
> I.e., it seems like a lot of "state" to manage in a dedicated piece of
> hardware (that you can't *service*!)

I.e., your choice to implement this *in* an FPGA places some (realistic) constraints on what you can effectively do in your algorithm.
> I realize now (sorry for having been so hard headed!) that essentially > your thinking might have been driven by another goal.
Figure out what your driving priorities are as well as how much leeway you have in achieving them. Then, think about the consequences of your implementation if something *isn't* what you expected it to be (i.e., if a device does NOT meet its stated performance).

E.g., in my application, I want the array to *look* like an ideal disk. I don't want the application to have to be aware of the implementation. Yet, I want the implementation to accommodate the needs of the application without undue wear on the array. So, the application shouldn't have to consider whether it has recently written "this value" to "this sector" -- or not. Yet, the array shouldn't naively do what it is told and, in this case, incur unnecessary wear. All while trying to stay "inexpensive" (i.e., replacing the device because the flash wore out would very significantly increase the TCO)

[Sorry, had to elide the rest as I'm late for a meeting this morning...]
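As a tiny illustration of the "array shouldn't naively do what it is told" point -- a sketch only, with flash_read()/flash_write() standing in for whatever driver hooks actually exist -- the controller can compare before writing and skip the wear event entirely when the data is unchanged:

/* Suppress a rewrite when the sector content is identical.  The driver
 * hooks below are hypothetical placeholders, not a real API.
 */
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

extern int flash_read(uint32_t lba, uint8_t *buf);        /* assumed hook */
extern int flash_write(uint32_t lba, const uint8_t *buf); /* assumed hook */

int write_sector(uint32_t lba, const uint8_t *data)
{
    uint8_t current[SECTOR_SIZE];

    if (flash_read(lba, current) == 0 &&
        memcmp(current, data, SECTOR_SIZE) == 0)
        return 0;                  /* identical: skip the wear event */

    return flash_write(lba, data); /* FTL handles relocation/recycle */
}

The application above this layer never needs to know the comparison happened; it just sees an "ideal disk".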
Reply by Don Y January 20, 2015
Hi Al,

On 1/20/2015 3:31 AM, alb wrote:
> Don Y <this@is.not.me.com> wrote: > [] >>> In the first case you would need to store somewhere in your flash memory >>> where was the last pointer for your 'increase and wrap-around' >>> algorithm, while in the second case you simply don't. >> >> That assumes you have a reliable source of entropy that is not affected by >> loss of power. I'd be nervous about just how "random" any such source >> would actually be; if it isn't, then it introduces an irrecoverable >> bias to the algorithm. Unexpected "patterns" in supposedly random >> events have a nasty tendency to manifest. E.g., power application >> introduces a voltage spike to some external circuit that attempts to >> measure thermal noise on a diode junction, etc. >> >> Any *digital* machine without state would obviously not be "random". > > search for 'random number generator fpga' and you'll find plenty of > hits! I think you may achieve a flat distribution without much of a > hassle.
Those are *pseudo* random. How do you "seed" it when power reappears? I.e., it will start up in the same place that it *started* up last time power was cycled -- and the time before that, etc. -- because it doesn't remember where it *was* (in its pseudo-random sequence) when it was powered *off*. Even if you clock it at some relatively high rate (wrt "replacement block selection rate"), you still need some external source of entropy to ensure your *digital* system doesn't simply pick the same iteration of the RNG each time it gets up and running. I.e., here's a pseudo-random (over *some* interval) string of digits: 3593450671875346641049135080357695788035424... But, if the next time I apply power causes the same series to be regenerated (because there is no source of entropy in the system), then the first block chosen for replacement will always be the same -- '3'. [If you clock the RNG at a high frequency, you'll move the selection of *which* digit you will examine -- but, will always move it by the same amount relative to power-up. Unless there is some external event that is asynchronous wrt your system's operation on which you can rely to introduce some sense of "randomness" to the selection of this first digit]
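A short C illustration of that point (the LFSR is just the textbook 16-bit example, not a recommendation): with nothing persisted and no external entropy, every power-up replays the same sequence, so the first "randomly" selected block is always the same one.

/* 16-bit Fibonacci LFSR (taps 16,14,13,11).  With a fixed reset value
 * and no stored state, each "boot" prints the same first block number.
 */
#include <stdint.h>
#include <stdio.h>

static uint16_t lfsr16(uint16_t s)
{
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}

int main(void)
{
    for (int boot = 0; boot < 3; boot++) {
        uint16_t state = 0xACE1u;   /* fixed reset value: every power-up
                                       starts the sequence over again   */
        printf("boot %d: first 'random' block = %u\n",
               boot, lfsr16(state) % 1024u);   /* same block, every boot */
    }
    return 0;
}

Persisting the state (which reintroduces the metadata you were trying to avoid) or mixing in a genuine entropy source at power-up are the usual ways out.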
> We will most probably need to implement it as well for our dither > generator, so the may function can serve multiple uses. > >>> Moreover, while in a perfect world power outages are maybe not >>> considered, the simple fact that power is cut while you want to store >>> your metadata may screw your 'increase and wrap-around' algorithm and >>> you'll have uneven usage, while a random approach would simply not care. >> >> You have to deal with outages with *any* write-erase cycles. > > Are you implying a journalized file system? I certainly don't want to go > that far. Metadata like bad block list (BBL) and Logical block address > (LBA) shall be confined in one location only and preferrably be > 'atomic'.
If accesses are NOT "atomic", then how will you verify the integrity of that data? How will you *know* which updates (of that data) were "in progress" when the power died? Everything that you expect to be "persistent" has to be preserved in some form or another.

I.e., if you have just decided that block #N must be recycled and are in the process of erasing it, you must have made a durable reference to this fact *somewhere* so that if power goes away before you have completed the erasure, you will know which block (N) was in the process of being erased (and can restart the erase operation). Similarly, when you are *done* erasing it, you need to make a durable note of this fact so that you don't re-erase it, unnecessarily.

The same applies to write cycles. Otherwise, an interrupted write can result in you repeating the write when power is restored -- only to encounter a write failure (because you had previously *almost* completed the write and are now restarting it -- effectively lengthening the write time). Likewise, if you are tracking block read counts (for read fatigue), each of those counts must be persistent.
> If you find a new bad block and copy the data nothing will > happen if a power cut happens since original data is still accessible.
But you've now got a write that is unaccounted for! I.e., that block is now no longer "erased and ready to be rewritten" nor "written and ready to be used". You have to remember that you were in the process of writing it and something has prevented that write from completing. The only safe recourse is to assume it WAS completed for the sake of tracking "program cycles", yet assume it was NOT completed for the sake of the validity of the data it contains -- thus, re-erase it (and count that erasure).
> Certainly if the power cut happens while you are updating the BBL and > LBA there's not much that can help you unless you guarantee enough > energy storage in bypass caps to hold on until the operation is > complete.
Exactly. This is true of *anything* in the array that must be persistent.
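One common way to get that kind of persistence for the BBL/LBA metadata without going all the way to a journal -- offered only as a sketch, with made-up names and sizes -- is a ping-pong pair of metadata pages, each carrying a sequence number and a CRC. An interrupted update can only corrupt the copy being written; boot-time recovery keeps the newest copy whose CRC checks.

/* Two alternating metadata copies; updates always go to the older slot. */
#include <stdint.h>
#include <stddef.h>

#define META_SLOTS 2

struct meta_page {
    uint32_t sequence;        /* monotonically increasing update counter */
    uint32_t payload_len;
    uint8_t  payload[1024];   /* packed BBL + logical-to-physical map    */
    uint32_t crc32;           /* over everything above                   */
};

extern uint32_t crc32_calc(const void *p, size_t len);   /* assumed helper */

/* Decide which copy to trust after power-up. */
int select_valid_meta(const struct meta_page slot[META_SLOTS])
{
    int best = -1;

    for (int i = 0; i < META_SLOTS; i++) {
        if (crc32_calc(&slot[i], offsetof(struct meta_page, crc32))
                != slot[i].crc32)
            continue;                     /* interrupted or garbled copy */
        if (best < 0 || slot[i].sequence > slot[best].sequence)
            best = i;
    }
    return best;   /* -1 only if both copies are bad (e.g. first boot) */
}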
> Losing the BBL is not a major issue since you can rebuild it from the > device (at the expense of rescanning the whole memory), losing LBA is > more of an issue since your data are scrumbled and may not be simple to > recover (but again you may conceive a mechanism to recover this).
How do you know (for sure) which blocks are corrupt, retired, etc.?
> Certainly these functionalities are more appropriate to software rather > than hardware.
I think the only sort of algorithm you can hope to implement (in hardware) has to rely on very limited information about the state of the memory array. I.e., a FIFO structure for recycling blocks, an "open loop" criterion for when those blocks get recycled, etc.

You'd be well served to research the current literature in this regard. Unlike magnetic media, whose wear/failure modes are reasonably well understood, there seems to be a lot of hit-or-miss activity when it comes to effective management of flash.

If you can "guarantee", a priori, that you need a fixed life from the design *and* the components will behave in a guaranteed manner (in those conditions), then you can probably come up with a solution that "fits" and forget about it. The research I've been doing has been geared towards high usage in a consumer grade device; as cheap as possible and as *durable* as possible -- yet trying not to impose too many constraints on how it is *used* (to give the other aspects of the system as much flexibility as possible in their design)
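For reference, the "very limited information" policy really is tiny -- something like the following FIFO of erased blocks (a sketch with invented sizes), which keeps no per-block history at all and is the sort of state that maps comfortably onto dedicated logic:

/* FIFO of erased (spare) blocks: recycled blocks go in at the tail,
 * replacements come out at the head.
 */
#include <stdint.h>
#include <stdbool.h>

#define FREE_Q_DEPTH 64u   /* over-provisioned blocks (made-up figure) */

struct free_queue {
    uint16_t block[FREE_Q_DEPTH];
    uint16_t head, tail, count;
};

bool fq_push(struct free_queue *q, uint16_t blk)   /* after erase completes */
{
    if (q->count == FREE_Q_DEPTH)
        return false;
    q->block[q->tail] = blk;
    q->tail = (uint16_t)((q->tail + 1u) % FREE_Q_DEPTH);
    q->count++;
    return true;
}

bool fq_pop(struct free_queue *q, uint16_t *blk)   /* on a recycle event */
{
    if (q->count == 0)
        return false;                              /* spare pool exhausted */
    *blk = q->block[q->head];
    q->head = (uint16_t)((q->head + 1u) % FREE_Q_DEPTH);
    q->count--;
    return true;
}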
Reply by alb January 20, 2015
Hi Dimiter,

Dimiter_Popoff <dp@tgi-sci.com> wrote:
[]
> I think everyone so far has been dramatically overdoing the > complexity. The simplest solution would likely be most efficient, I > think they use it on HDD-s for ages.
Ideally, any solution that is not the simplest one that reaches the goal is either too complex or too simplistic. Still, I'd be more comfortable with something that overdoes it a bit than with something that would not meet the requirement.
> Reserve some virgin space to relocate bad sectors (blocks, whatever).
This is the first flaw. You cannot anticipate how many bad sectors you are allowed to have. Under worst-case scenarios, manufacturers guarantee only a minimum number of valid blocks over the endurance life of the memory, which is indeed a per-block parameter. So if you're allowed 100K write-erase cycles and you have 4K blocks, with a wear leveling algorithm you may extend the total amount of cycles to nearly 4K * 100K. At this point the total number of bad blocks quoted by the manufacturer is meaningless. But even if you can estimate the total number of bad blocks you are going to have, putting a cap on it would be rather pessimistic. Once the mission is in orbit it may last longer than anticipated, while your cap will limit its functionality once the reserved area is exhausted.
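To put rough numbers on the point (purely illustrative, using the figures above and an assumed 2% fixed reserve):

    endurance per block (quoted):   100,000 E/P cycles
    blocks in the device:           4,096
    budget with ideal leveling:     4,096 x 100,000 = ~4.1e8 block erases

    fixed reserve of 2%:            82 spare blocks
    device is "full up" as soon as the 83rd block fails -- even if the
    surviving ~4,000 blocks still have most of their cycle budget left.

Which is exactly why sizing the reserve from the manufacturer's bad-block figure alone is pessimistic once wear leveling is in play.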
> Then write to the non-reserved with no further consideration; verify > after each write. Once a write fails, relocate the block (obviously > you have to keep a backup copy of the block prior to writing).
So if you have a bad block in one of those reserved ones you still need to move it to yet another one. Would the replacement for it be in the same reserved area? What will happen when the reserved area is worn out? As stated already (IIRC in the OP), the data are mainly configuration, meaning you are reading continuously from these blocks with very few modifications. If you move every faulty block to the reserved area you'll soon be confined to that reserved area and can never get out. []
> Obviously you must ensure you have enough power to complete the write, > verify and potentially the relocate& write operations in the power > capacitors so the thing can survive a blackout at any moment.
Power management is something to take into account anyhow.
> Then on a space mission you might want to "RAID" the data across say 3 > flash chips.
In the past we've triplicated the data stored in the same chip/component. Since metadata are stored locally in RAM (protected by scrubbing) there's no need to triplicate the hardware. The system is redundant on its own, so a hard failure of the component will lead to switching to the redundant one. Al
Reply by alb January 20, 2015
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
>> In the first case you would need to store somewhere in your flash memory >> where was the last pointer for your 'increase and wrap-around' >> algorithm, while in the second case you simply don't. > > That assumes you have a reliable source of entropy that is not affected by > loss of power. I'd be nervous about just how "random" any such source > would actually be; if it isn't, then it introduces an irrecoverable > bias to the algorithm. Unexpected "patterns" in supposedly random > events have a nasty tendency to manifest. E.g., power application > introduces a voltage spike to some external circuit that attempts to > measure thermal noise on a diode junction, etc. > > Any *digital* machine without state would obviously not be "random".
Search for 'random number generator fpga' and you'll find plenty of hits! I think you may achieve a flat distribution without much of a hassle. We will most probably need to implement it as well for our dither generator, so the same function can serve multiple uses.
>> Moreover, while in a perfect world power outages are maybe not >> considered, the simple fact that power is cut while you want to store >> your metadata may screw your 'increase and wrap-around' algorithm and >> you'll have uneven usage, while a random approach would simply not care. > > You have to deal with outages with *any* write-erase cycles.
Are you implying a journalized file system? I certainly don't want to go that far. Metadata like the bad block list (BBL) and logical block addresses (LBA) shall be confined to one location only and preferably be 'atomic'. If you find a new bad block and copy the data, nothing will happen if a power cut occurs, since the original data is still accessible. Certainly if the power cut happens while you are updating the BBL and LBA there's not much that can help you unless you guarantee enough energy storage in bypass caps to hold on until the operation is complete. Losing the BBL is not a major issue since you can rebuild it from the device (at the expense of rescanning the whole memory); losing the LBA is more of an issue since your data are scrambled and may not be simple to recover (but again you may conceive a mechanism to recover this). Certainly these functionalities are more appropriate to software rather than hardware. Al
Reply by alb January 20, 2015
Hi Don,

Don Y <this@is.not.me.com> wrote: 
[]
>> The main point here is about your reference to *entropy*. In the end >> you have to take *entropy* into accound no matter what kind of data >> patterns you use. Blocks that show better performance now will be >> used *more* and your /wear-leveling/ is not leveling anymore, >> inducing more wear in the preferred blocks, hence reducing their >> performance (actually if we limit the discussion to the 'read >> disturb' issue I'm not convinced that there's any wear going on.) > > The "wear" (from the "read disturb") comes because those blocks > eventually get recycled and incur write-erase cycles.
You have a point. Indeed, a block that shows fewer ECC errors will see fewer recycling events, hence fewer write-erase cycles.
> You can't claim that the "preferred" blocks will see more wear -- > there are other blocks IN USE that are "less preferred". Presumably, > they are less preferred because they trip the ECC *EARLIER*. As such, > they will experience a recycle event *sooner* than the "more > preferred" blocks (because the more preferred blocks trip the ECC less > often).
I simply rephrased your statement 'Over time, their [preferred cells] performance will degrade'. OTOH you're implying that write-erase cycle limits are not made *equal* for all cells, i.e. more robust cells will likely hold up beyond the quoted (conservative) limit. If this is the case (and I'm not sure whether data exist to back the reasoning), then your approach will level performance instead of write-erase cycles and make the storage last longer. []
>> Than what would be the benefit w.r.t. to equally spreading load? > > Think of it this way: you have N blocks in the device. You need M > (M<N). ASSUME all blocks are NOT created equal (process variation, > usage patterns, etc.). You can now *choose* which blocks you want to > use to STORE YOUR DATA (i.e., that's the goal -- storing information > and later being able to confidently RETRIEVE it!). Would you choose > the blocks that tend to exhibit the MOST errors (over time)? The > least errors? Or, with no concern at all for their past performance??
If the only aim is to preserve write-erase cycles then I wouldn't care; the aim would be to guarantee an equal number of cycles. But if the aim is different (as it sounds like) then it becomes more important to rank the usage with some metric and level performance instead of write-erase cycles. In the end you'll have an uneven distribution of write-erase cycles, but an equally distributed 'probability' of showing errors.
> ASSUME you *can* recover from an ECC event (i.e., the first such event > is a "warning" -- and no data has been lost prior to this event). Do > you *welcome* the possibility of having to take an exception, *now*, > and rewrite that data that you've now been warned about? Esp given > that the rewrite event has some cost *and* probability of failing? > > Wouldn't you rather *avoid* this possibility, where possible?
An ECC warning, as you referred to it, should be acted upon to avoid the possibility of being unable to recover data integrity later. Now the only choice left is which block I should use first. If the faulty block is nevertheless my top one, I may consider rewriting in place; otherwise I'll move the data to yet another place. Rewriting in place is not so trivial since you would need a block-sized buffer at your disposal, rather than the page-sized one you would need in the event of a move to another location. If the algorithm needs to be twisted because of other constraints, the end result might be biased as well. []
>> Again any metric that would /rank/ blocks according to performance >> would lead to uneven wear that you need to compensate at the end if >> you want to keep memory size /constant/ (i.e. no loss of blocks due >> to wear). > > As a block wears, it's performance metric *degrades* (assuming wear > causes ECC events to be triggered more frequently). So, a "preferred > block" eventually loses that status to other blocks that previously > had *worse* "performance". I.e., over time, the *performance* of > blocks is what you end up leveling -- not the number of write-erase > cycles!
Exactly. Whether this is a preferable approach I have a hard time judging. What I'm sure about is that in HiRel applications you'll have a hard time convincing the customer that your algorithm does not guarantee staying within spec (max number of write-erase cycles), even if the aim is to get the most out of the memory. It would be like trying to show that your power MOSFET is used beyond max ratings but 'believe me, this is where I get the most out of it'. A respectable QA would never allow that to slip out of the house, and if that happens a respectable customer would never allow it to slip in. []
>> But you need to factor in that behavior anyhow, since betting on the >> best performance is not going to pay on the long run and sooner or >> later you'll be facing 'double fault' sorts of errors anyhow. If your >> algorithm is not ready to take that into account you'll be in for a >> nasty surprise. > > All I'm changing in the algorithm is the "block selection algorithm" > -- how you pick *which* block to use from your overprovisioning store > to handle the *next* "recycle" event. If *it* fails the write, then > you pick another block -- using the same criteria. I.e., how you > handle an ECC event *or* a write-fault doesn't change -- just how you > *pick* which block to use for your recovery attempt after such an > event.
And within your selection essentially lies your goal. Leveling 'performance' might be another strategy, which might be the winning one in some applications, but I don't see how that can be leveraged in a field where it will never be accepted to 'go beyond' the quoted limits. I realize now (sorry for having been so hard-headed!) that essentially your thinking might have been driven by another goal.
>>>> You count the number of E/P cycles each block has accumulated and >>>> you choose to write data in available blocks with the least lived >>>> E/P cycles (for example). The E/P cycles available become >>>> effectively multiplied by the number of available blocks. >>> >>> That's the "each block gets erased the same number of times" >>> approach I described above. Wouldn't you rather find a set of >>> blocks that *needs* to be erased the least number of times?? >> >> Assuming you find the necessary metric to /rank/ the blocks, what is >> the overall *gain* if you end up wearing those *more* than others and >> effectively losing that gap in performance more quickly? > > You're not "wearing" them faster (see above). > > Given: You are going to perform N reads in some unit of time/effort. > The goal is to minimize the number of read errors that you encounter.
This goal is certainly not wrong, but if you add that you shall maintain the total number of write-erase cycles within spec (for each block), your choice does not really matter, since a better-performing block will last longer (more reads before a recycle event) independently of the strategy used to pick it.
> Read errors lead to write-erase cycles (recycle events) which cost > resources (time/performance) AND reduce the overall durability of the > device. Wouldn't you *naturally* want to pick the blocks that give > you the least number of read errors AND, thus, recycle events?
A block that is less likely to show errors than others *will* be used more and will eventually degrade. If I pick the best-performing block before others it wouldn't change its performance, so why bother with the ranking (considering that I cannot go beyond the quoted write-erase cycles)?
>> AFAIK the ECC used in PCs are more like BHC, with low overhead and >> strong correction capability at the expense of a more complex >> algorithm. > > The point of my parenthetical comment was: "How do you decide when the > ECC is *too* active? What criteria do you use to determine that > errors may be slipping through undetected? How do you (PC user) even > KNOW that this is happening??
This is typically handled by the FTL in my filesystem of choice. Pick the right one and you may even be free to modify the wear-leveling algorithm! There's a myriad of flash file systems popping up, in all sorts of flavors. If you are really interested, just dig deep enough. If you are not interested and want to live in a -not so well- fenced world, just select any closed system and pray someone has taken your worries into account. The choice, as usual, is in your hands ;-). []
>> The thing is that metric 'accumulated' on read operations do not >> necessarily provide any wear level. AFAIK there's no indication that >> your write-erase cycle is *close* to the end of life. > > When you run out of write-erase cycles, your "recycle" events (or, > write events) start failing. The point of my "read history" metric is > to keep blocks from going *through* write-erase cycles needlessly > (i.e., if a particular block isn't giving you ECC errors, why rewrite > it?
I'm not sure I understand your point here. Why would I ever want to rewrite a block if it is not giving ECC errors?
> Why incur that "wear event" if not needed? Similarly, why not pick > blocks that don't need to be rewritten as frequently if you have such > a metric?
Because if the goal is to not go beyond the max write-erase count on each block it does not really matter how you choose them.
> [The point of my initial comment was to suggest that these algorithms > are too complex to implement in dedicated logic. Esp if you are > trying to get the best (most reliable) performance from the array.]
I think I'll try (in my spare time) to see what the footprint of my FTL implemented in software would be (LOC and memory), in order to have a cost/benefit ratio between the two approaches. Performance should be taken into account as well (throughput, latency, ...).
>> 'Read disturb' errors are not related to wear, as well as errors >> induced by heavy ions so it would be difficult to make an algorithm >> that ranks the /strenght/ of each block. > > Each write-erase cycle alters the characteristics of the cells in the > block. I.e., "wear" is nonreversible. As a cell wears, it is less > capable of preserving the correct state of charge (i.e., storing the > desired "data") This manifests as a greater susceptibility to disturb > events (read/write) and an earlier ECC event. I.e., "time til ECC > event gives you a practical way of measuring how "reliable" a > particular block is GIVEN YOUR USAGE PATTERNS. > > This should be easy for you to test. Pick a block and run a bunch of > write-erase cycles on it. I.e., pre-wear THAT block so it is expected > to have less useful life than other blocks in the array. Now, apply > an algorithm and watch to see how well *it* performs compared to other > "unworn" blocks (accelerated wear testing). See which blocks get > marked as permanently bad, *first* using a "fair" algorithm. > Similarly, track the number of recycle events that you incur. > > Then, tweek the algorithm to alter the recycled block selection choice > and see how the performance changes (do you get more use from the > array? fewer recycle events? more total life before blocks are > considered "bad"?)
That's an interesting approach: artificially inducing *wear* would amplify the effect that 'not all cells are made equal' and magnify the impact of my algorithm. But I believe I would need to repeat the exercise with the tweaked algorithm on a different device with the same starting conditions of induced *wear*, otherwise the results may easily be misinterpreted. []
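A sketch of that experiment as a harness (all the flash_* and run_workload hooks are hypothetical placeholders for whatever test fixture is actually available): pre-wear one block, then run the same workload under each selection policy and compare recycle counts and retirements -- ideally, as noted above, on identically pre-worn devices.

/* Accelerated-wear A/B comparison harness (sketch only). */
#include <stdint.h>
#include <stdio.h>

extern void flash_erase(uint16_t blk);                 /* assumed hooks */
extern int  flash_program_pattern(uint16_t blk);       /* 0 on success  */
extern void run_workload(int policy, uint32_t ops,
                         uint32_t *recycle_events,
                         uint32_t *blocks_retired);

#define PREWEAR_BLOCK   17u        /* arbitrary victim block          */
#define PREWEAR_CYCLES  50000u     /* arbitrary ageing dose           */
#define WORKLOAD_OPS    10000000u

enum { POLICY_FIFO = 0, POLICY_BY_DURABILITY = 1 };

int main(void)
{
    /* Step 1: age one block artificially so the array is visibly unequal. */
    for (uint32_t i = 0; i < PREWEAR_CYCLES; i++) {
        flash_erase(PREWEAR_BLOCK);
        (void)flash_program_pattern(PREWEAR_BLOCK);
    }

    /* Step 2: run the same workload under each selection policy and
     * compare the figures Don lists (recycle events, blocks retired).  */
    for (int policy = POLICY_FIFO; policy <= POLICY_BY_DURABILITY; policy++) {
        uint32_t recycles = 0, retired = 0;
        run_workload(policy, WORKLOAD_OPS, &recycles, &retired);
        printf("policy %d: %u recycle events, %u blocks retired\n",
               policy, recycles, retired);
    }
    return 0;
}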
>> It would be interesting to see how much resources would take a small >> embedded microcontroller which would implement the FTL in a software >> stack. > > The difference that I am highlighting is the range of algorithms that > are EFFECTIVELY available to each type of implementation (software FTL > vs. dedicated hardware). Duality ensures you can implement any in > *each* approach. But, getting it *correct* is a different matter...
Ensuring correctness of the algorithm is a separate subject, I agree. Still, you can imagine that while it is *doable* in either type of implementation, the effort (development time, verification time) and the costs associated with hardware/software/tools/equipment may lead to preferring one over the other. Al
Reply by Dimiter_Popoff January 20, 2015
On 19.1.2015 г. 08:44, alb wrote:
> Hi Boudewijn, > > Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote: >> Op Fri, 16 Jan 2015 00:15:03 +0100 schreef alb <al.basili@gmail.com>: >>> Don Y <this@is.not.me.com> wrote: >>> >>> [] >>> Another simple wear leveling approach would be to select randomly among >>> the pool of empty blocks. This technique does not require any 'counting' >>> and is simple enough to implement (seeding the pseudo-random generator >>> with a monotonic parameter like time, in order to avoid to repeat the >>> same 'pseudo-random' sequence). >> >> I'm curious: is random selection better than a simple increase (with >> wrap-around)? > > In terms of spreading usage across the device the main difference > between a simple increase and wrap-around and a random choice is across > power off/on cycles.
I think everyone so far has been dramatically overdoing the complexity. The simplest solution would likely be most efficient; I think they have used it on HDDs for ages.

Reserve some virgin space to relocate bad sectors (blocks, whatever). Then write to the non-reserved space with no further consideration; verify after each write. Once a write fails, relocate the block (obviously you have to keep a backup copy of the block prior to writing). This will take care of variations between blocks, writing frequency etc.

Obviously you must ensure you have enough power in the power capacitors to complete the write, verify and potentially the relocate & write operations, so the thing can survive a blackout at any moment.

Then on a space mission you might want to "RAID" the data across say 3 flash chips.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by Don Y January 19, 2015
On 1/18/2015 11:44 PM, alb wrote:

>>> Another simple wear leveling approach would be to select randomly among >>> the pool of empty blocks. This technique does not require any 'counting' >>> and is simple enough to implement (seeding the pseudo-random generator >>> with a monotonic parameter like time, in order to avoid to repeat the >>> same 'pseudo-random' sequence). >> >> I'm curious: is random selection better than a simple increase (with >> wrap-around)? > > In terms of spreading usage across the device the main difference > between a simple increase and wrap-around and a random choice is across > power off/on cycles. > > In the first case you would need to store somewhere in your flash memory > where was the last pointer for your 'increase and wrap-around' > algorithm, while in the second case you simply don't.
That assumes you have a reliable source of entropy that is not affected by loss of power. I'd be nervous about just how "random" any such source would actually be; if it isn't, then it introduces an irrecoverable bias to the algorithm. Unexpected "patterns" in supposedly random events have a nasty tendency to manifest. E.g., power application introduces a voltage spike to some external circuit that attempts to measure thermal noise on a diode junction, etc. Any *digital* machine without state would obviously not be "random".
> The reduced amount of metadata management can simplify the task and > reduce complexity. > > Moreover, while in a perfect world power outages are maybe not > considered, the simple fact that power is cut while you want to store > your metadata may screw your 'increase and wrap-around' algorithm and > you'll have uneven usage, while a random approach would simply not care.
You have to deal with outages with *any* write-erase cycles.
Reply by Don Y January 19, 2015
Hi Al,

On 1/19/2015 10:24 AM, alb wrote:
> Don Y <this@is.not.me.com> wrote: [] >>> Uhm, wear leveling algorithms do not *hope* anything, actually they >>> spread the maximum allowed number of E/P cycles all over the device >>> instead of a single block, since endurance is defined _per block_. >> >> In an ideal world (process), all blocks would be identical in terms of >> performance, durability, etc. In that case, you can trivially cycle >> locks through the algorithm in any "fair" order and never see any >> difference in performance (i.e., durability/error resistance). >> >> However, processes aren't ideal. And, data "patterns" only add to the >> uncertainty. You can't *know* that each block -- having ALL been >> "recycled" exactly N times -- will exhibit the same sorts of "read >> failures" (again, just talking about reads to simplify the discussion) >> when asked to store a particular data pattern (in the context of the >> *other* data patterns located "around" it). >> >> You should be able to test this by deliberately exhausting some sample >> devices and/or partial writes (cut the write timer short, >> deliberately) to *some* blocks but not others. I.e., don't let >> entropy average all of that out across the device. *Then*, try to use >> the device with a "fair" block swapping algorithm and track how long >> each *actual* (physical) block "survives" (provides reliable data >> below ECC threshold that suggests a need to rescrub) before your >> controller decides that it needs to be "recycled" -- erased and >> rewritten (or, reused to replace another block). > > The main point here is about your reference to *entropy*. In the end you > have to take *entropy* into accound no matter what kind of data patterns > you use. Blocks that show better performance now will be used *more* and > your /wear-leveling/ is not leveling anymore, inducing more wear in the > preferred blocks, hence reducing their performance (actually if we limit > the discussion to the 'read disturb' issue I'm not convinced that > there's any wear going on.)
The "wear" (from the "read disturb") comes because those blocks eventually get recycled and incur write-erase cycles. You can't claim that the "preferred" blocks will see more wear -- there are other blocks IN USE that are "less preferred". Presumably, they are less preferred because they trip the ECC *EARLIER*. As such, they will experience a recycle event *sooner* than the "more preferred" blocks (because the more preferred blocks trip the ECC less often).
>> If you can associate a metric with each physical block that represents >> its historical performance, you can "level" based on *performance* >> instead of *use*. >> >> I.e., one leveling algorithm just moves recycled blocks to the end of >> the queue and always picks the first block in the queue for use when >> some existing block needs to be recycled out of service. This ensures >> each block gets *erased* the same number of times -- regardless of how >> long a particular physical block "survived" IN ITS MOST RECENT (or sum >> of all preceding) USE. >> >> You could, instead, track the number of *reads*-before-error-limit for >> each block and arrange for those with the highest "durability" to be >> reused *first*. Over time, their performance will degrade (i.e., they >> will start to exhibit more errors, sooner) so that some other block >> that had previously been considered "less durable" is regarded as MORE >> durable than this "worn" block. > > Than what would be the benefit w.r.t. to equally spreading load?
Think of it this way: you have N blocks in the device. You need M (M<N). ASSUME all blocks are NOT created equal (process variation, usage patterns, etc.). You can now *choose* which blocks you want to use to STORE YOUR DATA (i.e., that's the goal -- storing information and later being able to confidently RETRIEVE it!). Would you choose the blocks that tend to exhibit the MOST errors (over time)? The least errors? Or, with no concern at all for their past performance?? ASSUME you *can* recover from an ECC event (i.e., the first such event is a "warning" -- and no data has been lost prior to this event). Do you *welcome* the possibility of having to take an exception, *now*, and rewrite that data that you've now been warned about? Esp given that the rewrite event has some cost *and* probability of failing? Wouldn't you rather *avoid* this possibility, where possible?
> Another point is that unfortunately we *cannot* avoid to take into > account other failure modes (on write/erase cycles), since they *will* > certainly induce wear. > > Again any metric that would /rank/ blocks according to performance would > lead to uneven wear that you need to compensate at the end if you want > to keep memory size /constant/ (i.e. no loss of blocks due to wear).
As a block wears, its performance metric *degrades* (assuming wear causes ECC events to be triggered more frequently). So, a "preferred block" eventually loses that status to other blocks that previously had *worse* "performance". I.e., over time, the *performance* of blocks is what you end up leveling -- not the number of write-erase cycles!
>> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ] > > I think I got your point, but I'm not seeing the added value of any > ranking mechanism. > >> I'm oversimplifying this as there are other issues related to geometry >> that play a part. My point is to show that *how* you choose to reuse >> blocks can be implemented in different ways with different criteria. >> The point being that you want to minimize the number of "recycle >> events" that you take on -- especially if the block you *pick* as a >> replacement block may not write properly... so, you end up with a >> "double fault" of sorts. > > But you need to factor in that behavior anyhow, since betting on the > best performance is not going to pay on the long run and sooner or later > you'll be facing 'double fault' sorts of errors anyhow. If your > algorithm is not ready to take that into account you'll be in for a > nasty surprise.
All I'm changing in the algorithm is the "block selection algorithm" -- how you pick *which* block to use from your overprovisioning store to handle the *next* "recycle" event. If *it* fails the write, then you pick another block -- using the same criteria. I.e., how you handle an ECC event *or* a write-fault doesn't change -- just how you *pick* which block to use for your recovery attempt after such an event.
>>> You count the number of E/P cycles each block has accumulated and you >>> choose to write data in available blocks with the least lived E/P >>> cycles (for example). The E/P cycles available become effectively >>> multiplied by the number of available blocks. >> >> That's the "each block gets erased the same number of times" approach >> I described above. Wouldn't you rather find a set of blocks that >> *needs* to be erased the least number of times?? > > Assuming you find the necessary metric to /rank/ the blocks, what is the > overall *gain* if you end up wearing those *more* than others and > effectively losing that gap in performance more quickly?
You're not "wearing" them faster (see above). Given: You are going to perform N reads in some unit of time/effort. The goal is to minimize the number of read errors that you encounter. Read errors lead to write-erase cycles (recycle events) which cost resources (time/performance) AND reduce the overall durability of the device. Wouldn't you *naturally* want to pick the blocks that give you the least number of read errors AND, thus, recycle events?
>>> Depending on your ECC there might be a limited choice, i.e. assume >>> your ECC recovers only one bit flip and report multiple ones. >>> Whenever I detect (and correct) *one* single error, I have no other >>> choice but to act otherwise, if I wait too long my ECC won't help me >>> anymore. >> >> This depends on the nature of the source of your errors. E.g., do you >> replace the RAM in your PC when it reports *any* CORRECTED errors? >> (I've asked this question before and its fun to watch people try to >> rationalize how they use that "feature" -- do they even *know* when >> there are error events being corrected? when do they decide there are >> "too many"?) > > AFAIK the ECC used in PCs are more like BHC, with low overhead and > strong correction capability at the expense of a more complex algorithm.
The point of my parenthetical comment was: "How do you decide when the ECC is *too* active? What criteria do you use to determine that errors may be slipping through undetected? How do you (PC user) even KNOW that this is happening??"
> You could set a threshold on the amount of detected and corrected and > take an action only when the threshold is crossed. You still have > reliable data but you better correct those errors before the capability > of your ECC is not sufficient anymore. > > In my case, in order to reduce complexity in the fabric (the algorithm > will be implemented in an FPGA), I'll be more inclined towards a simple > hamming that can correct *one* and detect more than *two*. Therefore > I'll need to take action on *any* error. > >>> There are attempts to show some sort of error patterns that are more >>> likely to show up and hence can be managed accordingly, but I'm a >>> fond of the K.I.S.S. principle and I'm not so much interested in >>> building a read/write patterns aware wear leveling algorithm :-). >> >> The resource that you can't exhaust is write-erase cycles. So, you >> really want to come up with an algorithm that minimizes the number of >> times you "give up" on a written block -- in favor of refreshing it >> here or elsewhere. That's why I was saying (earlier) to rethink your >> algorithm to track "measured durability" (i.e., how *long* before >> errors detected) and use that to govern the selection of replacement >> blocks -- instead of just a FIFO (or even random) algorithm. > > The thing is that metric 'accumulated' on read operations do not > necessarily provide any wear level. AFAIK there's no indication that > your write-erase cycle is *close* to the end of life.
When you run out of write-erase cycles, your "recycle" events (or, write events) start failing. The point of my "read history" metric is to keep blocks from going *through* write-erase cycles needlessly (i.e., if a particular block isn't giving you ECC errors, why rewrite it? Why incur that "wear event" if not needed?). Similarly, why not pick blocks that don't need to be rewritten as frequently, if you have such a metric? [The point of my initial comment was to suggest that these algorithms are too complex to implement in dedicated logic. Esp if you are trying to get the best (most reliable) performance from the array.]
> 'Read disturb' errors are not related to wear, as well as errors induced > by heavy ions so it would be difficult to make an algorithm that ranks > the /strenght/ of each block.
Each write-erase cycle alters the characteristics of the cells in the block. I.e., "wear" is nonreversible. As a cell wears, it is less capable of preserving the correct state of charge (i.e., storing the desired "data"). This manifests as a greater susceptibility to disturb events (read/write) and an earlier ECC event. I.e., "time til ECC event" gives you a practical way of measuring how "reliable" a particular block is GIVEN YOUR USAGE PATTERNS.

This should be easy for you to test. Pick a block and run a bunch of write-erase cycles on it. I.e., pre-wear THAT block so it is expected to have less useful life than other blocks in the array. Now, apply an algorithm and watch to see how well *it* performs compared to other "unworn" blocks (accelerated wear testing). See which blocks get marked as permanently bad *first*, using a "fair" algorithm. Similarly, track the number of recycle events that you incur.

Then, tweak the algorithm to alter the recycled block selection choice and see how the performance changes (do you get more use from the array? fewer recycle events? more total life before blocks are considered "bad"?)
>>> The advantage of the microcontroller is not the hardware it >>> represents but the languages that can be used to instruct it. I see >>> easily how moving a buffer from one point to another is a perfect >>> task for an FPGA, dealing with low level protocol with the Flash >>> Interface, but taking the decision about what to do on a failure is >>> where the software has an added value and can be 'economical' and >>> possibly more reliable given that the complexity it can handle is >>> certainly superior (I doubt I'd be able to write a journalized flash >>> file system with bad block management, wear leveling, scrubbing and >>> garbage collection within an FPGA!). >> >> Exactly. As I said, hardware is great for parallel operations. But, >> when the algorithm being implemented is thought of in a serial >> fashion, it is often easier to resort to a sequencer -- a >> microcontroller. >> WHEN this happens, do that... UNLESS this other thing happens, in >> which case... > > I've seen an interesting presentation
[snipped line length]
> on a memory controller implemented on an FPGA. It claims it fits in a > 10% of a 3MGates equivalent FPGA (ProAsic3E A3PE3000), which seems a bit > less than what I expected. > > It would be interesting to see how much resources would take a small > embedded microcontroller which would implement the FTL in a software > stack.
The difference that I am highlighting is the range of algorithms that are EFFECTIVELY available to each type of implementation (software FTL vs. dedicated hardware). Duality ensures you can implement any in *each* approach. But, getting it *correct* is a different matter...
Reply by alb January 19, 2015
Hi Don,

Don Y <this@is.not.me.com> wrote: []
>> Uhm, wear leveling algorithms do not *hope* anything, actually they >> spread the maximum allowed number of E/P cycles all over the device >> instead of a single block, since endurance is defined _per block_. > > In an ideal world (process), all blocks would be identical in terms of > performance, durability, etc. In that case, you can trivially cycle > locks through the algorithm in any "fair" order and never see any > difference in performance (i.e., durability/error resistance). > > However, processes aren't ideal. And, data "patterns" only add to the > uncertainty. You can't *know* that each block -- having ALL been > "recycled" exactly N times -- will exhibit the same sorts of "read > failures" (again, just talking about reads to simplify the discussion) > when asked to store a particular data pattern (in the context of the > *other* data patterns located "around" it). > > You should be able to test this by deliberately exhausting some sample > devices and/or partial writes (cut the write timer short, > deliberately) to *some* blocks but not others. I.e., don't let > entropy average all of that out across the device. *Then*, try to use > the device with a "fair" block swapping algorithm and track how long > each *actual* (physical) block "survives" (provides reliable data > below ECC threshold that suggests a need to rescrub) before your > controller decides that it needs to be "recycled" -- erased and > rewritten (or, reused to replace another block).
The main point here is about your reference to *entropy*. In the end you have to take *entropy* into account no matter what kind of data patterns you use. Blocks that show better performance now will be used *more* and your /wear-leveling/ is not leveling anymore, inducing more wear in the preferred blocks, hence reducing their performance (actually if we limit the discussion to the 'read disturb' issue I'm not convinced that there's any wear going on.)
> If you can associate a metric with each physical block that represents > its historical performance, you can "level" based on *performance* > instead of *use*. > > I.e., one leveling algorithm just moves recycled blocks to the end of > the queue and always picks the first block in the queue for use when > some existing block needs to be recycled out of service. This ensures > each block gets *erased* the same number of times -- regardless of how > long a particular physical block "survived" IN ITS MOST RECENT (or sum > of all preceding) USE. > > You could, instead, track the number of *reads*-before-error-limit for > each block and arrange for those with the highest "durability" to be > reused *first*. Over time, their performance will degrade (i.e., they > will start to exhibit more errors, sooner) so that some other block > that had previously been considered "less durable" is regarded as MORE > durable than this "worn" block.
Then what would be the benefit w.r.t. equally spreading the load?

Another point is that unfortunately we *cannot* avoid taking into account other failure modes (on write/erase cycles), since they *will* certainly induce wear.

Again, any metric that would /rank/ blocks according to performance would lead to uneven wear that you need to compensate for at the end if you want to keep the memory size /constant/ (i.e. no loss of blocks due to wear).
> [Sorry, I'm being verbose but I don't feel I'm being *clear*. :< ]
I think I got your point, but I'm not seeing the added value of any ranking mechanism.
> I'm oversimplifying this as there are other issues related to geometry > that play a part. My point is to show that *how* you choose to reuse > blocks can be implemented in different ways with different criteria. > The point being that you want to minimize the number of "recycle > events" that you take on -- especially if the block you *pick* as a > replacement block may not write properly... so, you end up with a > "double fault" of sorts.
But you need to factor in that behavior anyhow, since betting on the best performance is not going to pay off in the long run and sooner or later you'll be facing 'double fault' sorts of errors anyhow. If your algorithm is not ready to take that into account you'll be in for a nasty surprise.
>> You count the number of E/P cycles each block has accumulated and you >> choose to write data in available blocks with the least lived E/P >> cycles (for example). The E/P cycles available become effectively >> multiplied by the number of available blocks. > > That's the "each block gets erased the same number of times" approach > I described above. Wouldn't you rather find a set of blocks that > *needs* to be erased the least number of times??
Assuming you find the necessary metric to /rank/ the blocks, what is the overall *gain* if you end up wearing those *more* than others and effectively losing that gap in performance more quickly? []
>> Depending on your ECC there might be a limited choice, i.e. assume >> your ECC recovers only one bit flip and report multiple ones. >> Whenever I detect (and correct) *one* single error, I have no other >> choice but to act otherwise, if I wait too long my ECC won't help me >> anymore. > > This depends on the nature of the source of your errors. E.g., do you > replace the RAM in your PC when it reports *any* CORRECTED errors? > (I've asked this question before and its fun to watch people try to > rationalize how they use that "feature" -- do they even *know* when > there are error events being corrected? when do they decide there are > "too many"?)
AFAIK the ECC used in PCs is more like BCH, with low overhead and strong correction capability at the expense of a more complex algorithm. You could set a threshold on the number of detected-and-corrected errors and take action only when the threshold is crossed. You still have reliable data, but you had better correct those errors before the capability of your ECC is no longer sufficient. In my case, in order to reduce complexity in the fabric (the algorithm will be implemented in an FPGA), I'll be more inclined towards a simple Hamming code that can correct *one* error and detect *two*. Therefore I'll need to take action on *any* error. []
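For concreteness, a SEC-DED (extended Hamming) code of the kind Al describes, shown here on a 4-bit payload purely as an illustration (a real flash ECC would cover a page plus spare area, and would likely live in the fabric rather than in C): it corrects any single-bit error and flags, without miscorrecting, any double-bit error.

/* SEC-DED sketch: Hamming(7,4) plus an overall parity bit in bit 0. */
#include <stdint.h>

/* Encode data bits d0..d3 into an 8-bit word:
 * bit0 = overall parity, bits 1..7 = Hamming(7,4) codeword (1-indexed). */
uint8_t secded_encode(uint8_t d)
{
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;           /* covers positions 3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;           /* covers positions 3,6,7 */
    uint8_t p3 = d1 ^ d2 ^ d3;           /* covers positions 5,6,7 */
    uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d0 << 3) |
                           (p3 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7));
    uint8_t p0 = 0;                      /* overall parity over bits 1..7 */
    for (int i = 1; i <= 7; i++)
        p0 ^= (cw >> i) & 1;
    return (uint8_t)(cw | p0);
}

/* Returns 0 = ok/corrected (nibble in *out), -1 = double error detected. */
int secded_decode(uint8_t w, uint8_t *out)
{
    uint8_t s = 0, parity = 0;
    for (int i = 1; i <= 7; i++)
        if ((w >> i) & 1)
            s ^= (uint8_t)i;             /* syndrome = XOR of set positions */
    for (int i = 0; i <= 7; i++)
        parity ^= (w >> i) & 1;          /* overall parity check */

    if (s != 0 && parity == 0)
        return -1;                       /* two errors: detect, don't fix */
    if (s != 0)
        w ^= (uint8_t)(1u << s);         /* single error: flip it back */

    *out = (uint8_t)(((w >> 3) & 1) | (((w >> 5) & 1) << 1) |
                     (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3));
    return 0;
}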
>> There are attempts to show some sort of error patterns that are more >> likely to show up and hence can be managed accordingly, but I'm a >> fond of the K.I.S.S. principle and I'm not so much interested in >> building a read/write patterns aware wear leveling algorithm :-). > > The resource that you can't exhaust is write-erase cycles. So, you > really want to come up with an algorithm that minimizes the number of > times you "give up" on a written block -- in favor of refreshing it > here or elsewhere. That's why I was saying (earlier) to rethink your > algorithm to track "measured durability" (i.e., how *long* before > errors detected) and use that to govern the selection of replacement > blocks -- instead of just a FIFO (or even random) algorithm.
The thing is that a metric 'accumulated' on read operations does not necessarily indicate any level of wear. AFAIK there's no indication that a block's write-erase cycle count is *close* to the end of life. 'Read disturb' errors are not related to wear, and neither are errors induced by heavy ions, so it would be difficult to make an algorithm that ranks the /strength/ of each block. []
>> The advantage of the microcontroller is not the hardware it >> represents but the languages that can be used to instruct it. I see >> easily how moving a buffer from one point to another is a perfect >> task for an FPGA, dealing with low level protocol with the Flash >> Interface, but taking the decision about what to do on a failure is >> where the software has an added value and can be 'economical' and >> possibly more reliable given that the complexity it can handle is >> certainly superior (I doubt I'd be able to write a journalized flash >> file system with bad block management, wear leveling, scrubbing and >> garbage collection within an FPGA!). > > Exactly. As I said, hardware is great for parallel operations. But, > when the algorithm being implemented is thought of in a serial > fashion, it is often easier to resort to a sequencer -- a > microcontroller. > WHEN this happens, do that... UNLESS this other thing happens, in > which case...
I've seen an interesting presentation (http://congrexprojects.com/docs/12c25_2510/15errico_tuccio_adcss2012_sitael_we.pdf?sfvrsn=2) on a memory controller implemented in an FPGA. It claims to fit in 10% of a 3-Mgate-equivalent FPGA (ProAsic3E A3PE3000), which seems a bit less than what I expected.

It would be interesting to see how many resources a small embedded microcontroller implementing the FTL as a software stack would take. Al
Reply by alb January 19, 2015
Hi Boudewijn,

Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
> Op Fri, 16 Jan 2015 00:15:03 +0100 schreef alb <al.basili@gmail.com>: >> Don Y <this@is.not.me.com> wrote: >> >> [] >> Another simple wear leveling approach would be to select randomly among >> the pool of empty blocks. This technique does not require any 'counting' >> and is simple enough to implement (seeding the pseudo-random generator >> with a monotonic parameter like time, in order to avoid to repeat the >> same 'pseudo-random' sequence). > > I'm curious: is random selection better than a simple increase (with > wrap-around)?
In terms of spreading usage across the device, the main difference between a simple increase-and-wrap-around and a random choice is across power off/on cycles.

In the first case you would need to store, somewhere in your flash memory, where the last pointer for your 'increase and wrap-around' algorithm was, while in the second case you simply don't. The reduced amount of metadata management can simplify the task and reduce complexity.

Moreover, while in a perfect world power outages are maybe not considered, the simple fact that power is cut while you want to store your metadata may screw up your 'increase and wrap-around' algorithm and you'll have uneven usage, while a random approach would simply not care. Al