On 6/19/2017 6:42 PM, George Neuner wrote:
> On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 6/16/2017 10:33 PM, George Neuner wrote:
>>>
>>> Normally, the only way [corruption] can happen is if the OS/hardware
>>> lies to the DBMS about whether data really has been written to the
>>> media. If the media is such that a failure during a write can corrupt
>>> more than the one block actually being written, then the media has to
>>> be protected against failures during writes.
>>
>> That's exactly the problem with these types of media. If you violate
>> the parameters of the write/erase cycle, all bets are off -- especially
>> if the power to the device may be at a dubious level, etc.
>
> No software can be guaranteed to work correctly in the face of
> byzantine failures. Failing "gracefully" - for some definition - even
> if possible, still is failing.
But that's the nature of the "power down problem" (the subject of the OP's
question)! It requires special attention in hardware *and* software.
To think you can magically execute a sequence of commands and be
guaranteed NOT to have "corruption" is naive.
>> Devices with built-in controllers (e.g., SD cards, SSD's, etc.) do lots
>> of behind the scenes juggling when you do a "simple" write to them so
>> you don't know when the ramifications of your "write" are done. I.e.,
>> the nonvolatile components *in* the device may be accessed differently
>> than your mental model of the device would expect.
>
> The erase "block" size != write "page" size of SSDs is a known
> problem.
>
> A DBMS can't address this by itself in software: "huge" VMM pages
> [sometimes] are good for in-memory performance - but for reliable i/o,
> huge file blocks *suck* both for performance and for space efficiency.
>
> A professional DBMS hosting databases on SSD requires the SSDs to be
> battery/supercap backed so power can't fail during a write, and also
> that multiple SSDs be configured in a RAID ... not for speed, but for
> increased reliability.
>
> Caching on SSD is not really an issue, because if the "fast copy" is
> unavailable the DBMS can go back to the [presumably slower] primary
> store. But *hosting* databases completely on SSD is a problem.
Exactly. I suspect the OP isn't using (spinning) rust to store his data
(and other corruptible items). So, he's using some nonvolatile
*semiconductor* medium. Most probably FLASH. Most probably NAND
Flash. Most probably MLC NAND Flash.
(i.e., one of the more problematic media to use reliably in this
situation)
>> Without knowing where each datum resides on the individual memory
>> components at any given time, you can't predict what can be corrupted
>> by a botched write/erase. Data that was committed three weeks ago
>> (and not "touched" in the intervening time) can be clobbered -- how
>> will you know?
>>
>>>> Given a record (in the DBMS) that conceptually looks like:
>>>> char name[20];
>>>> ...
>>>> char address[40];
>>>> ...
>>>> time_t birthdate;
>>>> an access to "address" that is in progress when the power falls
>>>> into the realm that isn't guaranteed to yield reliable operation
>>>> can corrupt ANY of these stored values. Similarly, an access to
>>>> some datum not shown, above, can corrupt any of *these*! You need
>>>> to understand where each datum resides if you want to risk an
>>>> "interrupted write".
>
> My issue with these statements is that they are misleading, and that
> in the sense where they are true, the problem can only be handled at
> the system level by use of additional hardware - there's no way it can
> be addressed locally, entirely in software.
>
> No DBMS will seek into the middle of a file block and try to write 40
> bytes. Storage always is block oriented.
>
> It's *true* that in your example above, e.g., updating the address
> field will result in rewriting the entire file block (or blocks if
> spanning) that contains the target data.
>
> But it's *misleading* to say that you need to know, e.g., where is the
> name field relative to the address field because the name field might
> be corrupted by updating the address field. It's true, but irrelevant
> because the DBMS deals with that possibility automatically.
No, you're still missing the point. You need to know physically, on the
*medium*, where the actual cells holding the bits of data for these
"variables" (records, etc.) reside because "issues" that cause the
memory devices to be corrupted have ramifications based on chip
topography (geography).
I.e., if a cell that you ("you" being the file system and FTL layers WELL
BELOW the DBMS) *think* you are altering happens to be adjacent to some
other cell (which it will almost assuredly be), then that adjacent cell
can be corrupted by the malformed actions consequential to the power
transition putting the chip(s) in a compromised operating state.
E.g., you go to circle a name in a (deadtree) phone book and your
hand (or the book) shudders in the process (because you're feeling faint
and on the verge of passing out). Can you guarantee that you will
circle the *name* that you intended? Or, some nearby name? Or,
maybe an address or phone number somewhere in that vicinity?
It doesn't matter that you double or triple checked the spelling of the
name to be sure you'd found the right one. Or, that you deliberately
chose a very fine point pen to ensure you would ONLY circle the item of
interest (i.e., that your software has been robustly designed). When
the actual time comes for the pen to touch the paper, if you're not
"fully operational", all bets are off.
I.e., some MECHANISM is needed (not software) that will block your hand
from marking the page if you are unsteady.
Absent that (or, in the presence of a poorly conceived mechanism),
you have no way of knowing *later*, when you've "recovered", if you
may have done some damage (corruption) during that event. Indeed,
you may not even be aware that you were unsteady at the time!
> A proper DBMS always will work with a dynamic copy of the target file
> block (the reason it should be run from RAM instead of entirely from
> r/w Flash). The journal (WAL) records the original block(s)
> containing the record(s) to be changed, and the modifications made to
> them. If the write to stable storage fails, the journal allows
> recovering either the original data or the modified data.
But, as I noted above, you can't KNOW that the journal hasn't been
collaterally damaged (by your shaky hands).
In *normal* operation, writing (and to a lesser extent, READING) to
FLASH disturbs the data in nearby (i.e., NOT BEING ACCESSED) memory
cells. When power and signals (levels and timing) are suspect
(i.e., as power is failing), this problem is magnified.
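The journaling discipline being described can be sketched in miniature -- a toy Python model, not any real DBMS's code -- and the sketch also makes the hidden assumption plain: recovery only works if the journal itself (and every block *not* listed in it) survived intact.

```python
# Toy write-ahead journal: every block update is recorded in the
# journal *before* the stable store is touched.  On recovery, any
# update whose journal entry completed is redone; a torn update is
# rolled back to the old image.

store = {0: b"old-A", 1: b"old-B"}   # the "stable" block store
journal = []                         # entries: blk, old image, new image, done?

def journaled_write(blk, new):
    entry = {"blk": blk, "old": store[blk], "new": new, "done": False}
    journal.append(entry)            # the journal hits the media first
    store[blk] = new                 # then the stable copy is modified
    entry["done"] = True             # finally the update is marked complete

def recover():
    # Replay the journal: redo completed writes, undo torn ones.
    for e in journal:
        store[e["blk"]] = e["new"] if e["done"] else e["old"]

journaled_write(0, b"new-A")         # a clean, completed update
# Simulate a torn write: journal entry exists, but the store copy
# was mangled before the entry could be marked done.
journal.append({"blk": 1, "old": store[1], "new": b"new-B", "done": False})
store[1] = b"garbage"
recover()                            # block 0 redone, block 1 rolled back
```

Note what the model *cannot* express: a failing write that corrupts a block the journal never mentions, or the journal itself. That is exactly the hole being argued about here.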
> The journal always is written prior to modifying the stable store. If
> the journal writes fail, the write to the stable copy never will be
> attempted: a "journal crash" is a halting error.
>
> A DBMS run without journaling enabled is unsafe.
>
> The longwinded point is that the DBMS *expects* that any file block it
> tries to change may be corrupted during i/o, and it takes steps to
> protect against losing data because of that.
What if I corrupt two blocks at the same time -- two UNRELATED (by any
notion that *you*, the developer, can fathom) blocks. ANY two that I
want. Can you recover? Can you even guarantee to KNOW that this
has happened?
I.e., some other table in the same tablespace has been whacked as a
consequence of this errant write. A table that hasn't been written
in months (no record of the most recent changes in the journal/WAL).
Will you *know* that it has been whacked? How? WHEN??
> But SSDs - even when working (more or less) properly - introduce a
> failure mode where updating a single "file block" (SSD page) drops an
> atomic bomb in the middle of the file system, with fallout affecting
> other, possibly unrelated, "file blocks" (pages) as well.
Exactly. You can't know -- nor predict -- which blocks/pages/cells
of the medium will be corrupted. You probably won't even know which
of these were being *targeted* when the event occurred, let alone which
are affected by "collateral damage".
The whole point is that the system isn't operating as *intended*
(by the naive software developer) during these periods. The hardware
and system designers have to provide guidance for THAT SPECIFIC SYSTEM
so the software developer knows what he can, can't and shouldn't do
as power failure approaches (along with recovery therefrom).
Early nonvolatile semiconductor memory (discounting WAROM) was
typically implemented as BBSRAM. It was often protected by gating the
write line with a "POWER_OK" signal. Obvious, right? Power failing
should block writes!
But, that led to data being corrupted -- because the POWER_OK
(write inhibit) signal was asynchronous with the memory cycle.
So, a write could be prematurely terminated and corrupt the
data that was intended to be written leading to different outcomes:
- old data remains
- new data overwrites
- bogus data results
But, it tended to be just *the* location that was addressed
(unless the write inhibit happened too late in the power loss
scenario)
Moving to bigger blocks of memory, BBDRAM replaced the BBSRAM,
DRAM requiring less power per bit to operate (sustain).
It's a bit more complicated to implement, as the refresh controller
has to remain active in the absence of (primary) power. The flaw, here,
would be failing to synchronize the "inhibit" and potentially
aborting a RAS or CAS -- and clobbering an entire *row* in the
device (leaving it with unpredictable contents).
SRAM is now bigger *and* lower power -- and folks understand the
need to synchronously protect accesses. So, it's trivial to design
a (large!) block of BBSRAM that operates on lower power. As you
can't "synchronize with the future", it's easier to just give an
early warning to the processor (e.g., NMI) and have it deliberately
toggle a "protect memory" latch, thereafter KNOWING that it shouldn't
even bother trying to write to that memory!
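The "early warning + protect latch" idea can be sketched as follows -- a toy Python simulation standing in for the NMI handler and the gated write path (the names are made up for illustration). The key property: the latch is checked *before* a write cycle starts, so no cycle can be torn by a late inhibit.

```python
# Simulated "protect memory" latch: the power-fail early warning sets
# the latch, and the write path refuses to *begin* a cycle once it is
# set -- so a write is either completed cleanly or never started.

import threading

protect = threading.Event()       # the "protect memory" latch

def power_fail_nmi():
    # Called from the early-warning interrupt: latch first, ask later.
    protect.set()

bbsram = bytearray(16)            # the battery-backed memory

def nv_write(addr, value):
    """Refuse to start a write cycle once the latch is set."""
    if protect.is_set():
        return False              # write never begins, so it can't be torn
    bbsram[addr] = value
    return True

assert nv_write(0, 0xAA)          # normal operation: write succeeds
power_fail_nmi()                  # early warning arrives
assert not nv_write(1, 0xBB)      # write refused; cell 1 untouched
```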
But FLASH devices (esp. SSDs and "memory cards") have progressed to the
point where they have their own controllers, etc. on board. So,
from the outside, you can neither tell where (physically) a
particular write will affect the contents of the chip(s) packaged
within, nor can you know for sure what is happening (at a signal
level) inside the device.
So, how can you know how far in advance to stop writing?
How can you know, for sure, that your last write will actually
manage to end up being committed to the memory chip(s) within
the device (what if its controller encounters a write error
and opts to retry the write on a different block of memory,
adjusting its bookkeeping in the process)?
You do all the power calculations assuming the bulk capacity
in your power supply is at the *low* end of its rating -- for
the current temperature -- and assume your electronics are using
the *maximum* amount of power (including the memory card!)
and predict how much "up time" you have before the voltage(s)
in the system fall out of spec. Then, back that off by some
amount of derating to make it easier to sleep at night.
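That holdup calculation, worked through with the stored-energy formula E = C(V1^2 - V2^2)/2 (all of the numbers below are illustrative, not from the OP's design): low-end capacitance at temperature, worst-case load, then the derating factor on top.

```python
# Worked holdup-time example: bulk capacitor at its *low* tolerance,
# load at its *maximum* draw, and a derating factor for good sleep.

C_nom   = 4700e-6        # bulk capacitance, farads (nominal)
C_tol   = 0.20           # -20% low-end tolerance at temperature
V_start = 12.0           # rail voltage when mains drops, volts
V_min   = 9.0            # threshold below which operation is out of spec
P_max   = 2.5            # worst-case load, watts (CPU + memory card!)
derate  = 0.5            # back-off factor to make it easier to sleep

C_low  = C_nom * (1.0 - C_tol)                   # worst-case capacitance
energy = 0.5 * C_low * (V_start**2 - V_min**2)   # usable joules above V_min
t_hold = energy / P_max                          # seconds before out-of-spec
t_safe = t_hold * derate                         # budget for the last writes

print(f"{t_hold*1000:.1f} ms raw, {t_safe*1000:.1f} ms derated")
# → roughly 47 ms raw, 24 ms derated for these illustrative numbers
```

With MLC erase/write cycles running a millisecond or more apiece, a budget like this tells you how many outstanding operations you can afford when the early warning fires.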
> It's like: what should the flight computer do if the wings fall off?
> There's absolutely nothing it can do, so the developer of the flight
> software should not waste time worrying about it. It's a *system*
> level issue.
And, the OP is the system designer, as far as we are concerned.
Or, at least the *conduit* from USENET to that designer!
>> Remember that a FLASH write (parallels exist for other technologies) is
>> actually an *erase* operation followed by a write. And, that you're
>> actually dealing with "pages"/blocks of data, not individual bytes.
>
> Block based i/o is not the issue. The issue is that SSDs do the
> equivalent of rewriting a whole platter track to change a single
> sector.
The salient point in the above is that a write is TWO operations:
erase followed by write. And, depending on the controller and the
state of wear in the actual underlying medium, possibly some
housekeeping as blocks are remapped.
The issue is that there is a window of time in which the operation is
"in progress". But, in a VULNERABLE STATE!
If I issue a write to a magnetic disk, the "process" begins the
moment that I issue the *write*. But, there are lots of delays
built into that process (rotational delay, access delay, etc.).
So, the actual window of vulnerability is very small: when the
heads are actually positioned over the correct portion of the medium
to alter the magnetic domains therein.
And, if this event is "interfered with", the consequences are
confined to that portion of the medium -- not some other track
or platter or sector.
That's not the case with the current types of semiconductor
nonvolatile memory. The "window of vulnerability" extends
throughout the duration of the write operation (erase, write,
internal verify and possible remapping/rewriting).
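Why the erase-then-write sequence matters can be shown with a toy model (Python, purely illustrative; real controllers add verify and remapping phases on top): power loss *between* the two phases leaves the page holding neither the old data nor the new.

```python
# Toy model of the flash "window of vulnerability": a write is
# erase-then-program, so dying after the erase but before the
# program destroys the old contents without delivering the new.

PAGE = 8
flash = [0x12] * PAGE                     # page currently holds old data

def flash_write(page, new, die_after_erase=False):
    # Phase 1: erase -- every cell forced to the erased state (0xFF).
    for i in range(PAGE):
        page[i] = 0xFF
    if die_after_erase:                   # power fails mid-operation
        return
    # Phase 2: program the new contents into the erased cells.
    for i in range(PAGE):
        page[i] = new[i]

flash_write(flash, [0x34] * PAGE, die_after_erase=True)
# → the page is all 0xFF: old data gone, new data never arrived
```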
>> The erase may take a fraction of a millisecond ("milli", not "micro")
>> to be followed by a slightly shorter time to actually write the new
>> contents back into the cells.
>>
>> [The times are often doubled for MLC devices!]
>>
>> During this "window of vulnerability", if the power supplies (or signal
>> voltages) go out of spec, the device can misbehave in unpredictable ways.
>>
>> [This assumes the CPU itself isn't ALSO misbehaving as a result of the same
>> issues!]
>>
>> This can manifest as:
>> - the wrong value getting written
>> - the right value getting written to the wrong location
>> - the wrong value getting written to the wrong location
>> - the entire page being partially erased
>> - some other page being erased
>> etc.
>
> Byzantine failure.
>
> The duration of the "window of vulnerability" is not the issue. The
> issue is the unpredictability of the result.
Of course the size of the window is important! The software can't do
*squat* while the operation is in progress. It can't decide that
it doesn't *really* want to do the write, please restore the previous
contents of that memory (block!).
And, the software can do nothing about the power remaining in
the power supply's bulk filter. It's like skidding on black ice
and just *hoping* things come to a graceful conclusion BEFORE
you slam into the guardrail!
> DBMS were designed at a time when disks were unreliable and operating
> systems [if even present] were primitive. Early DBMS often included
> their own device code and took direct control of disk and tape devices
> so that they could guarantee operation.
How can I "accidentally" alter block 0 of a mounted tape when we're
at EOT (or any other place physically removed from block 0)?
A semiconductor memory can alter ANYTHING at any time! Signal
levels inside the die determine which rows are strobed. It's
possible for NO row to be strobed, two rows, 5 rows, etc. -- the
decoders are only designed to generate "unique" outputs when
they are operating within their specified parameters.
Let Vcc sag, ground bounce, signals shift in amplitude/offset/timing
and you can't predict how they will affect the CHARGE stored in the
device.
>> And, a week later, the data that I had stored in "birthdate" is no longer
>> present in the journal as it has been previously committed to the store.
>> So, when it gets corrupted by an errant "name update", you'll have no
>> record of what it *should* have been.
>
> We've had this conversation previously also: database terminology
> today is almost universally misunderstood and misused by everyone.
>
> The file(s) on the disk are not the "database" but merely a point in
> time snapshot of the database.
>
> The "database" really is the historical evolution of the stable store.
> To recover from a failure, you need a point in time basis, and the
> journal from that instance to the point of failure. If you have every
> journal entry from the beginning, you can reconstruct the data just
> prior to the failure starting from an empty basis.
>
> The point here being that you always need backups unless you can
> afford to rebuild from scratch.
This is c.a.e. Do you really think the OP has a second copy of the
data set hiding on the medium? And, that it is somehow magically
protected from the sorts of corruption described, here?
>> The DBMS counts on the store having "integrity" -- so, all the DBMS has
>> to do is get the correct data into it THE FIRST TIME and it expects it
>> to remain intact thereafter. It *surely* doesn't expect a write of
>> one value to one record to alter some other value in some other
>> (unrelated) record!
>
> DBMS are designed to maintain logically consistent data ... the "I" in
> "ACID" stands for "isolation", not for "integrity".
>
> No DBMS can operate reliably if the underlying storage is faulty.
Exactly. And, there are no commands/opcodes that the OP can execute
that will "avoid corruption on power off". If there were, the DBMS
would employ them and make that guarantee REGARDLESS OF STORAGE MEDIUM!
:>
>> [Recall this was one of the assets I considered when opting to use a
>> DBMS in my design instead of just a "memory device"; it lets me perform
>> checks on the data going *in* so I don't have to check the data coming
>> *out* (the output is known to be "valid" -- unless the medium has
>> failed -- which is the case for power sequencing on many (most?) nonvolatile
>> storage media.]
>
> And recall that I warned you about the problems of trying to run a
> reliable DBMS on an unattended appliance. We didn't really discuss
> the issue of SSDs per se, but we did discuss journaling (logging) and
> trying to run the DBMS out of Flash rather than from RAM.
The firmware in SSD's has to address all types of potential users
and deployments.
I don't. I have ONE application that is accessing the nonvolatile
memory pool, so I can tailor the design of that store to fit the
needs and expected usage of its one "client". I.e., ensure that
the hardware behaves as the DBMS expects it to.
The bigger problem is addressing applications that screw up their
own datasets. There is NOTHING that I can do to solve that
problem -- even hiring someone to babysit the system 24/7/365.
A buggy application is a buggy application. Fix it.
I *can* ensure ApplicationA can't mess with ApplicationB's
dataset(s). I *can* put triggers and admittance criteria
on data going *into* the tables (to try to intercept
stuff that doesn't pass the smell test). But, a "determined"
application can still write bogus data to the objects to
which it is granted access.
Just like it could write bogus data in raw "files".
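The "admittance criteria on data going in" idea looks something like this in miniature -- a Python sketch with made-up field names and rules, not the actual schema -- where the store vets records on the way *in* so readers can trust what comes *out*:

```python
# Sketch of admission checks at the store boundary: records that fail
# the "smell test" are rejected before they ever reach the table.

import time

def admit(record):
    """Reject records that fail basic sanity checks before storage."""
    name = record.get("name", "")
    if not name.strip():
        return False                  # empty/blank name
    if len(name) > 20:                # exceeds the schema's field width
        return False
    if record.get("birthdate", 0) > time.time():
        return False                  # born in the future?
    return True

table = []                            # stand-in for the real table

def insert(record):
    if admit(record):
        table.append(record)
        return True
    return False

assert insert({"name": "Don", "birthdate": 0})        # admitted
assert not insert({"name": "", "birthdate": 0})       # rejected: blank
assert not insert({"name": "X", "birthdate": 2**40})  # rejected: future
```

As the post says, though: a "determined" application can still write *plausible-looking* bogus data that sails past any such check.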