On 6/19/2017 6:42 PM, George Neuner wrote:
> On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 6/16/2017 10:33 PM, George Neuner wrote:
>>>
>>> Normally, the only way [corruption] can happen is if the OS/hardware
>>> lies to the DBMS about whether data really has been written to the
>>> media. If the media is such that a failure during a write can corrupt
>>> more than the one block actually being written, then the media has to
>>> be protected against failures during writes.
>>
>> That's exactly the problem with these types of media. If you violate
>> the parameters of the write/erase cycle, all bets are off -- especially
>> if the power to the device may be at a dubious level, etc.
>
> No software can be guaranteed to work correctly in the face of
> byzantine failures. Failing "gracefully" - for some definition - even
> if possible, still is failing.
But that's the nature of the "power down problem" (the subject of the OP's
question)! It requires special attention in hardware *and* software.
To think you can magically execute a sequence of commands and be
guaranteed NOT to have "corruption" is naive.
>> Devices with built-in controllers (e.g., SD cards, SSD's, etc.) do lots
>> of behind the scenes juggling when you do a "simple" write to them so
>> you don't know when the ramifications of your "write" are done. I.e.,
>> the nonvolatile components *in* the device may be accessed differently
>> than your mental model of the device would expect.
>
> The erase "block" size != write "page" size of SSDs is a known
> problem.
>
> A DBMS can't address this by itself in software: "huge" VMM pages
> [sometimes] are good for in-memory performance - but for reliable i/o,
> huge file blocks *suck* both for performance and for space efficiency.
>
> A professional DBMS hosting databases on SSD requires the SSDs to be
> battery/supercap backed so power can't fail during a write, and also
> that multiple SSDs be configured in a RAID ... not for speed, but for
> increased reliability.
>
> Caching on SSD is not really an issue, because if the "fast copy" is
> unavailable the DBMS can go back to the [presumably slower] primary
> store. But *hosting* databases completely on SSD is a problem.
Exactly. I suspect the OP isn't using (spinning) rust to store his data
(and other corruptible items). So, he's using some nonvolatile
*semiconductor* medium. Most probably FLASH. Most probably NAND
Flash. Most probably MLC NAND Flash.
(i.e., one of the more problematic media to use reliably in this
situation)
>> Without knowing where each datum resides on the individual memory
>> components at any given time, you can't predict what can be corrupted
>> by a botched write/erase. Data that was committed three weeks ago
>> (and not "touched" in the intervening time) can be clobbered -- how
>> will you know?
>>
>>>> Given a record (in the DBMS) that conceptually looks like:
>>>> char name[20];
>>>> ...
>>>> char address[40];
>>>> ...
>>>> time_t birthdate;
>>>> an access to "address" that is in progress when the power falls
>>>> into the realm that isn't guaranteed to yield reliable operation
>>>> can corrupt ANY of these stored values. Similarly, an access to
>>>> some datum not shown, above, can corrupt any of *these*! You need
>>>> to understand where each datum resides if you want to risk an
>>>> "interrupted write".
>
> My issue with these statements is that they are misleading, and that
> in the sense where they are true, the problem can only be handled at
> the system level by use of additional hardware - there's no way it can
> be addressed locally, entirely in software.
>
> No DBMS will seek into the middle of a file block and try to write 40
> bytes. Storage always is block oriented.
>
> It's *true* that in your example above, e.g., updating the address
> field will result in rewriting the entire file block (or blocks if
> spanning) that contains the target data.
>
> But it's *misleading* to say that you need to know, e.g., where is the
> name field relative to the address field because the name field might
> be corrupted by updating the address field. It's true, but irrelevant
> because the DBMS deals with that possibility automatically.
No, you're still missing the point. You need to know physically, on the
*medium*, where the actual cells holding the bits of data for these
"variables" (records, etc.) reside because "issues" that cause the
memory devices to be corrupted have ramifications based on chip
topography (geography).
I.e., if a cell that you ("you" being the file system and FTL layers WELL
BELOW the DBMS) *think* you are altering happens to be adjacent to some
other cell (which it will almost assuredly be), then that adjacent cell
can be corrupted by the malformed actions consequential to the power
transition putting the chip(s) in a compromised operating state.
E.g., you go to circle a name in a (deadtree) phone book and your
hand (or the book) shudders in the process (because you're feeling faint
and on the verge of passing out). Can you guarantee that you will
circle the *name* that you intended? Or, some nearby name? Or,
maybe an address or phone number somewhere in that vicinity?
It doesn't matter that you double or triple checked the spelling of the
name to be sure you'd found the right one. Or, that you deliberately
chose a very fine point pen to ensure you would ONLY circle the item of
interest (i.e., that your software has been robustly designed). When
the actual time comes for the pen to touch the paper, if you're not
"fully operational", all bets are off.
I.e., some MECHANISM is needed (not software) that will block your hand
from marking the page if you are unsteady.
Absent that (or, in the presence of a poorly conceived mechanism),
you have no way of knowing *later*, when you've "recovered", if you
may have done some damage (corruption) during that event. Indeed,
you may not even be aware that you were unsteady at the time!
> A proper DBMS always will work with a dynamic copy of the target file
> block (the reason it should be run from RAM instead of entirely from
> r/w Flash). The journal (WAL) records the original block(s)
> containing the record(s) to be changed, and the modifications made to
> them. If the write to stable storage fails, the journal allows
> recovering either the original data or the modified data.
But, as I noted above, you can't KNOW that the journal hasn't been
collaterally damaged (by your shaky hands).
In *normal* operation, writing (and to a lesser extent, READING) to
FLASH disturbs the data in nearby (i.e., NOT BEING ACCESSED) memory
cells. When power and signals (levels and timing) are suspect
(i.e., as power is failing), this problem is magnified.
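The journaling discipline being described can be sketched in miniature -- a toy Python model, not any real DBMS's code -- and the sketch also makes the hidden assumption plain: recovery only works if the journal itself (and every block *not* listed in it) survived intact.

```python
# Toy write-ahead journal: every block update is recorded in the
# journal *before* the stable store is touched.  On recovery, any
# update whose journal entry completed is redone; a torn update is
# rolled back to the old image.

store = {0: b"old-A", 1: b"old-B"}   # the "stable" block store
journal = []                         # entries: blk, old image, new image, done?

def journaled_write(blk, new):
    entry = {"blk": blk, "old": store[blk], "new": new, "done": False}
    journal.append(entry)            # the journal hits the media first
    store[blk] = new                 # then the stable copy is modified
    entry["done"] = True             # finally the update is marked complete

def recover():
    # Replay the journal: redo completed writes, undo torn ones.
    for e in journal:
        store[e["blk"]] = e["new"] if e["done"] else e["old"]

journaled_write(0, b"new-A")         # a clean, completed update
# Simulate a torn write: journal entry exists, but the store copy
# was mangled before the entry could be marked done.
journal.append({"blk": 1, "old": store[1], "new": b"new-B", "done": False})
store[1] = b"garbage"
recover()                            # block 0 redone, block 1 rolled back
```

Note what the model *cannot* express: a failing write that corrupts a block the journal never mentions, or the journal itself. That is exactly the hole being argued about here.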
> The journal always is written prior to modifying the stable store. If
> the journal writes fail, the write to the stable copy never will be
> attempted: a "journal crash" is a halting error.
>
> A DBMS run without journaling enabled is unsafe.
>
> The longwinded point is that the DBMS *expects* that any file block it
> tries to change may be corrupted during i/o, and it takes steps to
> protect against losing data because of that.
What if I corrupt two blocks at the same time -- two UNRELATED (by any
notion that *you*, the developer, can fathom) blocks. ANY two that I
want. Can you recover? Can you even guarantee to KNOW that this
has happened?
I.e., some other table in the same tablespace has been whacked as a
consequence of this errant write. A table that hasn't been written
in months (no record of the most recent changes in the journal/WAL).
Will you *know* that it has been whacked? How? WHEN??
> But SSDs - even when working (more or less) properly - introduce a
> failure mode where updating a single "file block" (SSD page) drops an
> atomic bomb in the middle of the file system, with fallout affecting
> other, possibly unrelated, "file blocks" (pages) as well.
Exactly. You can't know -- nor predict -- which blocks/pages/cells
of the medium will be corrupted. You probably won't even know which
of these were being *targeted* when the event occurred, let alone which
are affected by "collateral damage".
The whole point is that the system isn't operating as *intended*
(by the naive software developer) during these periods. The hardware
and system designers have to provide guidance for THAT SPECIFIC SYSTEM
so the software developer knows what he can, can't and shouldn't do
as power failure approaches (along with recovery therefrom).
Early nonvolatile semiconductor memory (discounting WAROM) was
typically implemented as BBSRAM. It was often protected by gating the
write line with a "POWER_OK" signal. Obvious, right? Power failing
should block writes!
But, that led to data being corrupted -- because the POWER_OK
(write inhibit) signal was asynchronous with the memory cycle.
So, a write could be prematurely terminated and corrupt the
data that was intended to be written leading to different outcomes:
- old data remains
- new data overwrites
- bogus data results
But, it tended to be just *the* location that was addressed
(unless the write inhibit happened too late in the power loss
scenario)
Moving to bigger blocks of memory, BBDRAM replaced the BBSRAM,
DRAM requiring less power per bit to operate (sustain).
It's a bit more complicated to implement, as the refresh controller
has to remain active in the absence of (primary) power. The flaw, here,
would be failing to synchronize the "inhibit" and potentially
aborting a RAS or CAS -- and clobbering an entire *row* in the
device (leaving it with unpredictable contents).
SRAM is now bigger *and* lower power -- and folks understand the
need to synchronously protect accesses. So, it's trivial to design
a (large!) block of BBSRAM that operates on lower power. As you
can't "synchronize with the future", it's easier to just give an
early warning to the processor (e.g., NMI) and have it deliberately
toggle a "protect memory" latch, thereafter KNOWING that it shouldn't
even bother trying to write to that memory!
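The "early warning + protect latch" idea can be sketched as follows -- a toy Python simulation standing in for the NMI handler and the gated write path (the names are made up for illustration). The key property: the latch is checked *before* a write cycle starts, so no cycle can be torn by a late inhibit.

```python
# Simulated "protect memory" latch: the power-fail early warning sets
# the latch, and the write path refuses to *begin* a cycle once it is
# set -- so a write is either completed cleanly or never started.

import threading

protect = threading.Event()       # the "protect memory" latch

def power_fail_nmi():
    # Called from the early-warning interrupt: latch first, ask later.
    protect.set()

bbsram = bytearray(16)            # the battery-backed memory

def nv_write(addr, value):
    """Refuse to start a write cycle once the latch is set."""
    if protect.is_set():
        return False              # write never begins, so it can't be torn
    bbsram[addr] = value
    return True

assert nv_write(0, 0xAA)          # normal operation: write succeeds
power_fail_nmi()                  # early warning arrives
assert not nv_write(1, 0xBB)      # write refused; cell 1 untouched
```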
But FLASH devices (esp. SSDs and "memory cards") have progressed to the
point where they have their own controllers, etc. on board. So,
from the outside, you can neither tell where (physically) a
particular write will affect the contents of the chip(s) packaged
within, nor can you know for sure what is happening (at a signal
level) inside the device.
So, how can you know how far in advance to stop writing?
How can you know, for sure, that your last write will actually
manage to end up being committed to the memory chip(s) within
the device (what if its controller encounters a write error
and opts to retry the write on a different block of memory,
adjusting its bookkeeping in the process)?
You do all the power calculations assuming the bulk capacity
in your power supply is at the *low* end of its rating -- for
the current temperature -- and assume your electronics are using
the *maximum* amount of power (including the memory card!)
and predict how much "up time" you have before the voltage(s)
in the system fall out of spec. Then, back that off by some
amount of derating to make it easier to sleep at night.
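That holdup calculation, worked through with the stored-energy formula E = C(V1^2 - V2^2)/2 (all of the numbers below are illustrative, not from the OP's design): low-end capacitance at temperature, worst-case load, then the derating factor on top.

```python
# Worked holdup-time example: bulk capacitor at its *low* tolerance,
# load at its *maximum* draw, and a derating factor for good sleep.

C_nom   = 4700e-6        # bulk capacitance, farads (nominal)
C_tol   = 0.20           # -20% low-end tolerance at temperature
V_start = 12.0           # rail voltage when mains drops, volts
V_min   = 9.0            # threshold below which operation is out of spec
P_max   = 2.5            # worst-case load, watts (CPU + memory card!)
derate  = 0.5            # back-off factor to make it easier to sleep

C_low  = C_nom * (1.0 - C_tol)                   # worst-case capacitance
energy = 0.5 * C_low * (V_start**2 - V_min**2)   # usable joules above V_min
t_hold = energy / P_max                          # seconds before out-of-spec
t_safe = t_hold * derate                         # budget for the last writes

print(f"{t_hold*1000:.1f} ms raw, {t_safe*1000:.1f} ms derated")
# → roughly 47 ms raw, 24 ms derated for these illustrative numbers
```

With MLC erase/write cycles running a millisecond or more apiece, a budget like this tells you how many outstanding operations you can afford when the early warning fires.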
> It's like: what should the flight computer do if the wings fall off?
> There's absolutely nothing it can do, so the developer of the flight
> software should not waste time worrying about it. It's a *system*
> level issue.
And, the OP is the system designer, as far as we are concerned.
Or, at least the *conduit* from USENET to that designer!
>> Remember that a FLASH write (parallels exist for other technologies) is
>> actually an *erase* operation followed by a write. And, that you're
>> actually dealing with "pages"/blocks of data, not individual bytes.
>
> Block based i/o is not the issue. The issue is that SSDs do the
> equivalent of rewriting a whole platter track to change a single
> sector.
The salient point in the above is that a write is TWO operations:
erase followed by write. And, depending on the controller and the
state of wear in the actual underlying medium, possibly some
housekeeping as blocks are remapped.
The issue is that there is a window of time in which the operation is
"in progress". But, in a VULNERABLE STATE!
If I issue a write to a magnetic disk, the "process" begins the
moment that I issue the *write*. But, there are lots of delays
built into that process (rotational delay, access delay, etc.).
So, the actual window of vulnerability is very small: when the
heads are actually positioned over the correct portion of the medium
to alter the magnetic domains therein.
And, if this event is "interfered with", the consequences are
confined to that portion of the medium -- not some other track
or platter or sector.
That's not the case with the current types of semiconductor
nonvolatile memory. The "window of vulnerability" extends
throughout the duration of the write operation (erase, write,
internal verify and possible remapping/rewriting).
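Why the erase-then-write sequence matters can be shown with a toy model (Python, purely illustrative; real controllers add verify and remapping phases on top): power loss *between* the two phases leaves the page holding neither the old data nor the new.

```python
# Toy model of the flash "window of vulnerability": a write is
# erase-then-program, so dying after the erase but before the
# program destroys the old contents without delivering the new.

PAGE = 8
flash = [0x12] * PAGE                     # page currently holds old data

def flash_write(page, new, die_after_erase=False):
    # Phase 1: erase -- every cell forced to the erased state (0xFF).
    for i in range(PAGE):
        page[i] = 0xFF
    if die_after_erase:                   # power fails mid-operation
        return
    # Phase 2: program the new contents into the erased cells.
    for i in range(PAGE):
        page[i] = new[i]

flash_write(flash, [0x34] * PAGE, die_after_erase=True)
# → the page is all 0xFF: old data gone, new data never arrived
```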
>> The erase may take a fraction of a millisecond ("milli", not "micro")
>> to be followed by a slightly shorter time to actually write the new
>> contents back into the cells.
>>
>> [The times are often doubled for MLC devices!]
>>
>> During this "window of vulnerability", if the power supplies (or signal
>> voltages) go out of spec, the device can misbehave in unpredictable ways.
>>
>> [This assumes the CPU itself isn't ALSO misbehaving as a result of the same
>> issues!]
>>
>> This can manifest as:
>> - the wrong value getting written
>> - the right value getting written to the wrong location
>> - the wrong value getting written to the wrong location
>> - the entire page being partially erased
>> - some other page being erased
>> etc.
>
> Byzantine failure.
>
> The duration of the "window of vulnerability" is not the issue. The
> issue is the unpredictability of the result.
Of course the size of the window is important! The software can't do
*squat* while the operation is in progress. It can't decide that
it doesn't *really* want to do the write, please restore the previous
contents of that memory (block!).
And, the software can do nothing about the power remaining in
the power supply's bulk filter. It's like skidding on black ice
and just *hoping* things come to a graceful conclusion BEFORE
you slam into the guardrail!
> DBMS were designed at a time when disks were unreliable and operating
> systems [if even present] were primitive. Early DBMS often included
> their own device code and took direct control of disk and tape devices
> so that they could guarantee operation.
How can I "accidentally" alter block 0 of a mounted tape when we're
at EOT (or any other place physically removed from block 0)?
A semiconductor memory can alter ANYTHING at any time! Signal
levels inside the die determine which rows are strobed. It's
possible for NO row to be strobed, two rows, 5 rows, etc. -- the
decoders are only designed to generate "unique" outputs when
they are operating within their specified parameters.
Let Vcc sag, ground bounce, signals shift in amplitude/offset/timing
and you can't predict how they will affect the CHARGE stored in the
device.
>> And, a week later, the data that I had stored in "birthdate" is no longer
>> present in the journal as it has been previously committed to the store.
>> So, when it gets corrupted by an errant "name update", you'll have no
>> record of what it *should* have been.
>
> We've had this conversation previously also: database terminology
> today is almost universally misunderstood and misused by everyone.
>
> The file(s) on the disk are not the "database" but merely a point in
> time snapshot of the database.
>
> The "database" really is the historical evolution of the stable store.
> To recover from a failure, you need a point in time basis, and the
> journal from that instance to the point of failure. If you have every
> journal entry from the beginning, you can reconstruct the data just
> prior to the failure starting from an empty basis.
>
> The point here being that you always need backups unless you can
> afford to rebuild from scratch.
This is c.a.e. Do you really think the OP has a second copy of the
data set hiding on the medium? And, that it is somehow magically
protected from the sorts of corruption described, here?
>> The DBMS counts on the store having "integrity" -- so, all the DBMS has
>> to do is get the correct data into it THE FIRST TIME and it expects it
>> to remain intact thereafter. It *surely* doesn't expect a write of
>> one value to one record to alter some other value in some other
>> (unrelated) record!
>
> DBMS are designed to maintain logically consistent data ... the "I" in
> "ACID" stands for "isolation", not for "integrity".
>
> No DBMS can operate reliably if the underlying storage is faulty.
Exactly. And, there are no commands/opcodes that the OP can execute
that will "avoid corruption on power off". If there were, the DBMS
would employ them and make that guarantee REGARDLESS OF STORAGE MEDIUM!
:>
>> [Recall this was one of the assets I considered when opting to use a
>> DBMS in my design instead of just a "memory device"; it lets me perform
>> checks on the data going *in* so I don't have to check the data coming
>> *out* (the output is known to be "valid" -- unless the medium has
>> failed -- which is the case for power sequencing on many (most?) nonvolatile
>> storage media.]
>
> And recall that I warned you about the problems of trying to run a
> reliable DBMS on an unattended appliance. We didn't really discuss
> the issue of SSDs per se, but we did discuss journaling (logging) and
> trying to run the DBMS out of Flash rather than from RAM.
The firmware in SSD's has to address all types of potential users
and deployments.
I don't. I have ONE application that is accessing the nonvolatile
memory pool, so I can tailor the design of that store to fit the
needs and expected usage of its one "client". I.e., ensure that
the hardware behaves as the DBMS expects it to.
The bigger problem is addressing applications that screw up their
own datasets. There is NOTHING that I can do to solve that
problem -- even hiring someone to babysit the system 24/7/365.
A buggy application is a buggy application. Fix it.
I *can* ensure ApplicationA can't mess with ApplicationB's
dataset(s). I *can* put triggers and admittance criteria
on data going *into* the tables (to try to intercept
stuff that doesn't pass the smell test). But, a "determined"
application can still write bogus data to the objects to
which it is granted access.
Just like it could write bogus data in raw "files".
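The "admittance criteria on data going in" idea looks something like this in miniature -- a Python sketch with made-up field names and rules, not the actual schema -- where the store vets records on the way *in* so readers can trust what comes *out*:

```python
# Sketch of admission checks at the store boundary: records that fail
# the "smell test" are rejected before they ever reach the table.

import time

def admit(record):
    """Reject records that fail basic sanity checks before storage."""
    name = record.get("name", "")
    if not name.strip():
        return False                  # empty/blank name
    if len(name) > 20:                # exceeds the schema's field width
        return False
    if record.get("birthdate", 0) > time.time():
        return False                  # born in the future?
    return True

table = []                            # stand-in for the real table

def insert(record):
    if admit(record):
        table.append(record)
        return True
    return False

assert insert({"name": "Don", "birthdate": 0})        # admitted
assert not insert({"name": "", "birthdate": 0})       # rejected: blank
assert not insert({"name": "X", "birthdate": 2**40})  # rejected: future
```

As the post says, though: a "determined" application can still write *plausible-looking* bogus data that sails past any such check.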