On 7/25/2021 11:53 AM, Kent Dickey wrote:
> In article <sdhb29$ibh$1@gioia.aioe.org>, Dave Nadler <drn@nadler.com> wrote:
>> Hi All - I'm wondering what other folks do about this issue...
>>
>> Consumer flash storage devices (USB memory stick, SD card, etc)
>> have a nice internal wear-leveling controller. When one does a write
>> operation, lots of sectors may be internally rejiggered to provide
>> uniform wear (so things that are never rewritten from the application
>> point of view are actually moved around and rewritten). This activity is
>> point of view are actually moved around and rewritten). This activity is
>> invisible from the application's point of view, but takes some time.
>> If the device is powered off during such activity, data corruption...
>> So, it's necessary to provide the device some time after the last write
>> before it is powered off. Plenty of embedded stuff I've seen
>> corrupts memory at power-off, and of course the device manufacturers
>> blame the memory manufacturers...
>>
>> So, how do you avoid this kind of corruption in your designs?
>>
>> Thanks,
>> Best Regards, Dave
>
> It is not necessary to do this. Here's a very simple way to recover with
> no data loss ever. It is simply "log-only" or "journal-only" storage.
>
> The device keeps 10% of its capacity reserved and pre-erased (the exact
> figure is just a performance knob). The driver has enough RAM to track where every logical
> block is (this isn't much RAM). And then writing is always done as a
> logfile--write whatever the user data is starting at block 0, and move up.
> After writing user data, write a log entry (this can be just right after the
> user data). Writes never go to the block the user indicates, they always go
> only to the log pointer. Reading looks up where that block's latest copy is,
> and reads that.
>
> When you have less than 10% space left (near the end of the device for
> first-pass writes), you simply read starting 10% ahead of the current log
> write position and re-log that data (skipping any blocks which are now
> obsolete) until you free up space. Since many blocks have been rewritten,
> compacting them frees space. Once copied, erase the
> blocks which were compacted. Then just keep growing the "log" through the
> newly erased blocks.
>
> Upon power up, scan entire device for the log info, and rebuild the RAM index.
>
> Power can be removed at any time and the state fully recovered. Obviously the
> most recent data may be lost, but we can roll back to the last consistent
> state, just like a journaled file system (the log entries need a checksum so
> we can validate them). Wear leveling is achieved by design.
>
> There are lots of small ways to improve the above, so do that.
>
> I have no idea what actual devices do, I suspect it's something much more
> complex and less robust.
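For concreteness, the scheme Kent describes above -- append-only writes, a RAM
index, compaction ahead of the write pointer, and a full-device scan at power-up
-- can be sketched as a toy model. The names and structure here are mine, not
Kent's; a real implementation must respect erase-block granularity, and the
logical capacity must stay below the physical capacity minus the reserve or
compaction can never make progress.

```python
class LogStore:
    """Toy model of the "log-only" flash scheme described above.

    Assumes: number of live logical blocks < nblocks - reserve, and
    reserve >= 1; erase granularity is one block for simplicity.
    """

    def __init__(self, nblocks, reserve):
        self.flash = [None] * nblocks   # None = erased; else (seq, lba, data)
        self.nblocks = nblocks
        self.reserve = reserve          # blocks kept pre-erased
        self.map = {}                   # RAM index: logical block -> physical slot
        self.head = 0                   # log write pointer
        self.tail = 0                   # oldest not-yet-reclaimed slot
        self.seq = 0                    # stands in for a checksummed log entry

    def write(self, lba, data):
        # Writes never go where the user says; they go to the log head.
        while self.flash.count(None) <= self.reserve:
            self._reclaim_one()
        self._append(lba, data)

    def read(self, lba):
        # Look up where this block's latest copy lives.
        return self.flash[self.map[lba]][2]

    def _append(self, lba, data):
        assert self.flash[self.head] is None   # head always sits on an erased block
        self.flash[self.head] = (self.seq, lba, data)
        self.map[lba] = self.head
        self.seq += 1
        self.head = (self.head + 1) % self.nblocks

    def _reclaim_one(self):
        # Compact one block at the log tail: re-log it if it is still
        # the live copy, then erase it either way.
        entry = self.flash[self.tail]
        if entry is not None:
            _, lba, data = entry
            if self.map.get(lba) == self.tail:
                self._append(lba, data)
            self.flash[self.tail] = None
        self.tail = (self.tail + 1) % self.nblocks

    def rebuild(self):
        # Power-up: scan the entire device and rebuild the RAM index;
        # the highest sequence number for each logical block wins.
        self.map.clear()
        best = {}
        for slot, entry in enumerate(self.flash):
            if entry is not None:
                seq, lba, _ = entry
                if lba not in best or seq > best[lba]:
                    best[lba] = seq
                    self.map[lba] = slot
```

Because every entry carries a sequence number (standing in for the checksummed
log entry), rebuild() converges on the last consistent state no matter where
power was lost mid-write.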
You're missing the point/role of the internal controller in the mix.
Imagine 8 blocks of memory in the device. Blocks 1, 2 and 3 (no special
significance) contain highly static code/data. They get written *once*
when the device is manufactured. So, each has seen *1* program/erase/write
cycle.
Meanwhile, blocks 4-8 are constantly hammered on. The code is constantly
updating the "files" that occupy those blocks (the files don't TOUCH the
first 3 blocks, by definition).
After 1,000 updates, blocks 4-8 have seen 1,000 program/erase/write cycles.
The underlying FLASH device will have a MAXCYCLES specified, based on the
technology used (SLC, MLC, TLC, QLC, etc.), process geometry, etc.
For consumer devices, this number tends to be lower -- because consumers
tend to want to purchase "big" instead of "durable" (so, MLC/TLC/etc.
technology).
Assume MAXCYCLES is 1,002 (the actual numbers are unimportant -- what matters
is whether Dave expects to be taxing the medium over the life of his design).
So, after two more updates, the device is *broken* -- blocks 4-8 will
each have hit their 1,002 cycle limit and stopped working. (Yeah, I know it's
not a brick wall; but there is *some* number at which ECC errors prove
unmanageable and the controller marks those blocks (4-8) as bad.)
Meanwhile, 1, 2 and 3 each have 1,001 cycles of wear available -- that they
aren't (and WON'T be!) using. Had those 3,003 cycles been "shared" among
all of the blocks, then the files being stored in 4-8 would still be
writable (even though the next write might be into blocks 1, 3, 5, 6 and 8).
So, the controller has to deliberately *move* a copy of the data in blocks
1, 2 and/or 3 into 4-8 -- consuming one cycle in 4-8 but freeing up
1,001 cycles each in 1, 2 and 3!
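Back-of-envelope, using the numbers from the example above (and ignoring the
cycles the migrations themselves consume -- the real gain is somewhat lower):

```python
MAXCYCLES = 1002        # per-block endurance (illustrative, from the example)
NBLOCKS = 8
STATIC = 3              # blocks 1-3: written once at manufacture
HOT = NBLOCKS - STATIC  # blocks 4-8: touched by every update

# Without static wear leveling: each update burns one cycle on every
# hot block, so the device dies when the hot blocks reach MAXCYCLES.
updates_without = MAXCYCLES

# With ideal static wear leveling: the controller migrates the static
# data so the whole cycle budget is pooled across all eight blocks.
pooled_budget = NBLOCKS * MAXCYCLES - STATIC  # minus the one-time static writes
updates_with = pooled_budget // HOT           # each update still costs HOT cycles
```

So the same medium survives roughly 60% more updates (1,602 vs. 1,002) purely
by sharing wear -- which is exactly why the controller moves data you never
asked it to move.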
You, on the outside, can't tell how it is doing this, when it is doing this
or even *if* it is doing this! So, you can't predict what parts of the flash
will be corrupted as power goes away. Maybe the VTOC gets hosed (so you can't
even FIND the files). Or, some bookkeeping metadata used by the controller.
The logging idea works for magnetic media where what's there, stays there,
and all you have to worry about is whether or not the *new* stuff made it in
"under the wire", or not.
As an analogy, next time your RAID array is rebuilding, cycle power.
See how easily it sorts out WHERE it was in the process -- and think
about the resources it spends to make that possible. Now, do you expect
all of that in a $5 thumb drive??