
Linux embedded: how to avoid corruption on power off

Started by pozz June 16, 2017
On Fri, 16 Jun 2017 12:10:26 +0200, pozz wrote:

> I'm playing with a Raspberry system, however I think my question is
> about Linux embedded in general.
>
> We all know that the OS (Linux or Windows or whatever) *should* be
> gracefully powered down with a shutdown procedure (shutdown command in
> Linux). We must avoid cutting the power abruptly.
>
> If this is possible for desktop systems, IMHO it's impossible to achieve
> in embedded systems. The user usually switches off a small box by
> pressing an OFF button that is usually connected to the main power
> supply input. In any case, he could immediately unplug the power cord
> without waiting for the end of the shutdown procedure.
>
> I'm interested to know what methods can be used to reduce the
> probability of corruption.
>
> For example, I chose to use a sqlite database to save non-volatile,
> user-configurable settings. sqlite is transaction based, so a power
> interruption in the middle of a transaction shouldn't corrupt the entire
> database. With normal text files this would be more difficult.
>
> I know the write requests on non-volatile memories (HDD, embedded Flash
> memories) are usually buffered by the OS and we don't know when they
> will really be executed by the kernel. Is there a method to force the
> buffered write requests to be flushed immediately?
>
> Other aspects to consider?
Google "raspberry pi ups". There are several options if you don't want to
roll your own.

--
Chisolm
Republic of Texas
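On the question buried in the quoted post -- forcing buffered writes out
immediately -- Linux gives you fsync()/fdatasync() per file descriptor and
sync() for everything. A minimal, untested sketch in C (the function name
and arguments are made up for illustration); note that none of this reaches
past the storage device's own write cache unless the kernel/driver issues a
cache flush to the medium:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a small settings file and push it to the medium before returning. */
int save_setting(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, text, strlen(text)) != (ssize_t)strlen(text) ||
        fsync(fd) != 0) {          /* flush data and metadata for this file */
        close(fd);
        return -1;
    }
    return close(fd);              /* fdatasync() skips metadata; sync() flushes everything */
}

For the usual write-new-file-then-rename() trick, you also want to fsync()
the containing directory.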
On 6/16/2017 11:24 AM, John Speth wrote:
>>>> I'm interested to know what methods can be used to reduce the
>>>> probability of corruption.
>>>
>>> You put a big capacitor on the power line so that when the main power
>>> is cut, the capacitor keeps the system running for the time necessary
>>> to gracefully shut down.
>
>> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
>> storage capacitor and an SMPS with an input voltage range of at least
>> 8-28 V; this should be enough time to sync out the data.
>
> You'll also need some hardware to signal the software that external power
> has turned off. That event will tell your system that a little bit of
> time remains to shut down before the caps drain and the world ends.
IME, you need TWO signals (if you have a physical power switch):
- one to indicate that the user has turned the power off (perhaps because
  it is easier than using whatever "soft" mechanisms have been put in
  place; or because the system has crashed or is otherwise not responsive
  ENOUGH to use that primary approach)
- one to indicate that the user had hoped to keep the device running but
  the mains power has been removed

Depending on the topology of the power supply, you can get a lot of
advance warning (e.g., monitoring the "rectified mains" on a switcher) or
very little (e.g., monitoring the regulated output(s) to detect when they
are falling below their minimum required voltages).

Note that your advance warning has to tell you when ANY of the supplies
you will be *requiring* will fall out of spec.
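A minimal, untested sketch of the two-signal idea, using the legacy sysfs
GPIO interface for brevity (libgpiod is the modern route). The pin numbers,
their polarity and save_state() are hypothetical -- adapt to however your
board wires the "user hit the power switch" and "input power has dropped"
signals:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int gpio_high(const char *path)
{
    int ch = '0';
    FILE *f = fopen(path, "r");
    if (f) {
        ch = fgetc(f);
        fclose(f);
    }
    return ch == '1';
}

extern void save_state(void);            /* commit settings, fsync, etc. */

int main(void)
{
    for (;;) {
        /* signal 1: the user operated the power switch */
        if (gpio_high("/sys/class/gpio/gpio17/value")) {
            save_state();
            system("shutdown -h now");   /* orderly, user-requested shutdown */
            break;
        }
        /* signal 2: input power dropped; the hold-up caps are all we have */
        if (gpio_high("/sys/class/gpio/gpio27/value")) {
            save_state();                /* resume vs. halt is decided later */
        }
        usleep(100 * 1000);              /* 100 ms poll; poll(2) on the "value"
                                            file with edge events is nicer */
    }
    return 0;
}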
Hi Don,

I know *you* know this (because we've been over it before) ... I'm
just providing information for the group.


On Fri, 16 Jun 2017 11:26:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/16/2017 3:10 AM, pozz wrote:
>
>> For example, I chose to use a sqlite database to save non-volatile,
>> user-configurable settings. sqlite is transaction based, so a power
>> interruption in the middle of a transaction shouldn't corrupt the entire
>> database. With normal text files this would be more difficult.
>
>That's not true. If the DBMS software is in the process of writing to
>persistent store *as* the power falls into the "not guaranteed to work"
>realm, it can corrupt OTHER memory beyond that related to your current
>transaction -- memory for transactions that have already been *committed*.
That's correct as far as it goes, but from the DBMS point of view,
transient data in RAM is expected to be lost in a failure ... the
objective of the DBMS is to protect data on stable storage.

If you are using a journal (WAL) - and you need your head examined if you
aren't - transactions already committed are guaranteed to be recoverable
unless the journal copy and the stable copy BOTH are corrupted.

Normally, the only way this can happen is if the OS/hardware lies to the
DBMS about whether data really has been written to the media. If the
media is such that a failure during a write can corrupt more than the one
block actually being written, then the media has to be protected against
failures during writes.

The journal copy and the stable copy of any given record are never
written simultaneously, so even if the journal and the stable file both
are being written to at the time of failure, the data involved in the
respective writes will be different.
>Given a record (in the DBMS) that conceptually looks like:
>    char name[20];
>    ...
>    char address[40];
>    ...
>    time_t birthdate;
>an access to "address" that is in progress when the power falls into the
>realm that isn't guaranteed to yield reliable operation can corrupt
>ANY of these stored values. Similarly, an access to some datum not
>shown, above, can corrupt any of *these*! You need to understand where
>each datum resides if you want to risk an "interrupted write".
No. Records may be written or updated in pieces if the data involved is
not contiguous, but the journal process guarantees that, seen from on
high, the update occurs atomically.

The *entire* record to be updated will 1st be copied unmodified into the
journal. Then the journal will record the changes to be made to the
record. After journaling is complete, the changes will be made to the
record in the stable file.

At every point in the update process, the DBMS has a consistent version
of the *entire* record available to be recovered. The consistent version
may not be the latest from a client's perspective, but it won't be a
corrupt, partially updated version.

Ideally, journaling will be done at the file block level, so as to
protect data that is co-resident in the same block(s) as the target
record. But not all DBMSs do this because a block journal can become
very large, very quickly. Sqlite (and MySQL and others) journal at a
meta level: recording just the original data and the changes to be made
to it - which isn't quite as safe as block journaling, but is acceptable
for most purposes.

Note that OS/filesystem journaling is not a replacement for DBMS
journaling. The addition of filesystem journaling has no effect on DBMS
reliability, but it does affect DBMS i/o performance.

YMMV,
George
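For the OP's sqlite case, "using a journal" boils down to a couple of
pragmas at open time. An untested sketch against the sqlite3 C API (the
database name is made up):

#include <sqlite3.h>

/* Open the settings database so that a commit is not acknowledged until
 * the write-ahead log has been fsync()ed. */
int open_settings_db(sqlite3 **db)
{
    if (sqlite3_open("settings.db", db) != SQLITE_OK)
        return -1;

    /* The rollback journal is sqlite's default; WAL gives equivalent crash
     * guarantees and appends instead of rewriting, which flash tends to
     * prefer. */
    sqlite3_exec(*db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
    sqlite3_exec(*db, "PRAGMA synchronous=FULL;", NULL, NULL, NULL);
    return 0;
}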
On 6/16/2017 10:33 PM, George Neuner wrote:
>>> For example, I chose to use a sqlite database to save non-volatile,
>>> user-configurable settings. sqlite is transaction based, so a power
>>> interruption in the middle of a transaction shouldn't corrupt the entire
>>> database. With normal text files this would be more difficult.
>>
>> That's not true. If the DBMS software is in the process of writing to
>> persistent store *as* the power falls into the "not guaranteed to work"
>> realm, it can corrupt OTHER memory beyond that related to your current
>> transaction -- memory for transactions that have already been *committed*.
>
> That's correct as far as it goes, but from the DBMS point of view,
> transient data in RAM is expected to be lost in a failure ... the
> objective of the DBMS is to protect data on stable storage.
>
> If you are using a journal (WAL) - and you need your head examined if
> you aren't - transactions already committed are guaranteed to be
> recoverable unless the journal copy and the stable copy BOTH are
> corrupted.
>
> Normally, the only way this can happen is if the OS/hardware lies to
> the DBMS about whether data really has been written to the media.
> If the media is such that a failure during a write can corrupt more
> than the one block actually being written, then the media has to be
> protected against failures during writes.
That's exactly the problem with these types of media. If you violate the
parameters of the write/erase cycle, all bets are off -- especially if
the power to the device may be at a dubious level, etc.

Devices with built-in controllers (e.g., SD cards, SSDs, etc.) do lots of
behind-the-scenes juggling when you do a "simple" write to them, so you
don't know when the ramifications of your "write" are done. I.e., the
nonvolatile components *in* the device may be accessed differently than
your mental model of the device would expect.

Without knowing where each datum resides on the individual memory
components at any given time, you can't predict what can be corrupted by
a botched write/erase. Data that was committed three weeks ago (and not
"touched" in the intervening time) can be clobbered -- how will you know?
> The journal copy and the stable copy of any given record are never
> written simultaneously, so even if the journal and the stable file
> both are being written to at the time of failure, the data involved in
> the respective writes will be different.
>
>> Given a record (in the DBMS) that conceptually looks like:
>>     char name[20];
>>     ...
>>     char address[40];
>>     ...
>>     time_t birthdate;
>> an access to "address" that is in progress when the power falls into the
>> realm that isn't guaranteed to yield reliable operation can corrupt
>> ANY of these stored values. Similarly, an access to some datum not
>> shown, above, can corrupt any of *these*! You need to understand where
>> each datum resides if you want to risk an "interrupted write".
>
> No. Records may be written or updated in pieces if the data involved
> is not contiguous, but the journal process guarantees that, seen from
> on high, the update occurs atomically.
You're missing the point.

Imagine you are writing to <something> -- but I reach in and twiddle with
the electrical signals that control that write, and do so in a manner
that isn't easily predicted. You (the software and hardware) THINK that
you are doing one thing but, in fact, something else is happening
entirely. The hardware doesn't think anything "wrong" is happening (its
notion of right and wrong is suspect!).

Remember that a FLASH write (parallels exist for other technologies) is
actually an *erase* operation followed by a write. And, that you're
actually dealing with "pages"/blocks of data, not individual bytes. The
erase may take a fraction of a millisecond ("milli", not "micro"), to be
followed by a slightly shorter time to actually write the new contents
back into the cells.

[The times are often doubled for MLC devices!]

During this "window of vulnerability", if the power supplies (or signal
voltages) go out of spec, the device can misbehave in unpredictable ways.

[This assumes the CPU itself isn't ALSO misbehaving as a result of the
same issues!]

This can manifest as:
- the wrong value getting written
- the right value getting written to the wrong location
- the wrong value getting written to the wrong location
- the entire page being partially erased
- some other page being erased
etc.

Change the manufacturer -- or part number -- and the behavior can change
as well.

What's even worse is that this sort of failure can have FUTURE
consequences for reliability. Given that power was failing at the time of
the event, the system probably doesn't know (nor can it easily remember)
what it was in the process of doing when the power failed. So, it doesn't
know that a certain page may have only been partially programmed -- even
if it has the correct "values", there might not be the required amount of
charge stored in each cell (you can't count electrons from the pin
interface!). So, data that *looks* good may actually be more susceptible
to things like read (or write) disturbances LATER. You operated the
device outside its specified operating conditions, so you can't rely on
the data retention that the manufacturer implies you'd have!

If your "memory device" has any "smarts" in it (SSD, SD card, etc.), then
it can perform many of these operations for *each* operation that you
THINK it is performing (e.g., as it shuffles data around for wear
leveling and to accommodate failing pages/blocks). *You* think you're
updating the "name" portion of a record but the actual memory device
decides to stuff some bogus value in the "birthdate" portion -- perhaps
of a different record!
> The *entire* record to be updated will 1st be copied unmodified into
> the journal. Then the journal will record the changes to be made to
> the record. After journaling is complete, the changes will be made to
> the record in the stable file.
And, a week later, the data that I had stored in "birthdate" is no longer
present in the journal as it has been previously committed to the store.
So, when it gets corrupted by an errant "name update", you'll have no
record of what it *should* have been.

The DBMS counts on the store having "integrity" -- so, all the DBMS has
to do is get the correct data into it THE FIRST TIME and it expects it to
remain intact thereafter. It *surely* doesn't expect a write of one value
to one record to alter some other value in some other (unrelated) record!

[Recall this was one of the assets I considered when opting to use a DBMS
in my design instead of just a "memory device"; it lets me perform checks
on the data going *in* so I don't have to check the data coming *out*
(the output is known to be "valid" -- unless the medium has failed --
which is the case for power sequencing on many (most?) nonvolatile
storage media).]
> At every point in the update process, the DBMS has a consistent
> version of the *entire* record available to be recovered. The
> consistent version may not be the latest from a client's perspective,
> but it won't be a corrupt, partially updated version.
>
> Ideally, journaling will be done at the file block level, so as to
> protect data that is co-resident in the same block(s) as the target
> record. But not all DBMSs do this because a block journal can become
> very large, very quickly. Sqlite (and MySQL and others) journal at a
> meta level: recording just the original data and the changes to be made
> to it - which isn't quite as safe as block journaling, but is acceptable
> for most purposes.
>
> Note that OS/filesystem journaling is not a replacement for DBMS
> journaling. The addition of filesystem journaling has no effect on
> DBMS reliability, but it does affect DBMS i/o performance.
On Fri, 16 Jun 2017 11:24:52 -0700, John Speth <johnspeth@yahoo.com>
wrote:

>>>> I'm interested to know what methods can be used to reduce the
>>>> probability of corruption.
>>>
>>> You put a big capacitor on the power line so that when the main power
>>> is cut, the capacitor keeps the system running for the time necessary
>>> to gracefully shut down.
>
>> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
>> storage capacitor and an SMPS with an input voltage range of at least
>> 8-28 V; this should be enough time to sync out the data.
>
>You'll also need some hardware to signal the software that external
>power has turned off. That event will tell your system that a little
>bit of time remains to shut down before the caps drain and the world ends.
>
>JJS
Very little extra hardware is required. Use an optoisolator and you can
easily monitor the primary input voltage, including mains voltage. Some
comparator functionality and it will drive some of the RS-232 handshake
pins, which then generates an interrupt.

One issue with big storage capacitors is how fast they are charged when
the power is restored. The SMPS reset circuitry should operate reliably
even when the input voltage is slowly restored.

With big diesels, especially at cold temperatures, the starting current
will be huge, dropping the battery voltage momentarily. This voltage drop
will initiate the data save, but after that, the routines should check
whether the input voltage has returned and, based on this, continue
normal operation or perform a full shutdown.
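The "check whether the input voltage has returned" step might look like
this in user space. Untested sketch: power_good() and save_state() stand
in for whatever the monitoring hardware above provides, and reboot()
needs root and does not sync on its own:

#include <unistd.h>
#include <sys/reboot.h>

extern int  power_good(void);    /* 1 when the input is back within spec */
extern void save_state(void);    /* commit settings, fsync, etc. */

void on_power_fail(void)
{
    save_state();                    /* do this first: the caps are draining */

    for (int i = 0; i < 20; i++) {   /* ~2 s grace, e.g. for engine cranking */
        if (power_good())
            return;                  /* brown-out only: resume normal operation */
        usleep(100 * 1000);
    }

    sync();                          /* reboot(2) does not flush buffers itself */
    reboot(RB_POWER_OFF);            /* or: system("shutdown -h now") */
}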
On Saturday, June 17, 2017 at 16:35:51 UTC+2, upsid...@downunder.com wrote:
> On Fri, 16 Jun 2017 11:24:52 -0700, John Speth <johnspeth@yahoo.com>
> wrote:
>
> >>>> I'm interested to know what methods can be used to reduce the
> >>>> probability of corruption.
> >>>
> >>> You put a big capacitor on the power line so that when the main power
> >>> is cut, the capacitor keeps the system running for the time necessary
> >>> to gracefully shut down.
> >
> >> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
> >> storage capacitor and an SMPS with an input voltage range of at least
> >> 8-28 V; this should be enough time to sync out the data.
> >
> >You'll also need some hardware to signal the software that external
> >power has turned off. That event will tell your system that a little
> >bit of time remains to shut down before the caps drain and the world ends.
> >
> >JJS
>
> Very little extra hardware is required. Use an optoisolator and you
> can easily monitor the primary input voltage, including mains voltage.
> Some comparator functionality and it will drive some of the RS-232
> handshake pins, which then generates an interrupt.
>
> One issue with big storage capacitors is how fast they are charged when
> the power is restored. The SMPS reset circuitry should operate
> reliably even when the input voltage is slowly restored.
>
> With big diesels, especially at cold temperatures, the starting current
> will be huge, dropping the battery voltage momentarily. This voltage
> drop will initiate the data save, but after that, the routines should
> check whether the input voltage has returned and, based on this,
> continue normal operation or perform a full shutdown.
I've done a UPS for a Linux system with supercaps; the CPU is held in
reset until the supercaps are fully charged. One pin is used to monitor
the input supply and issue a shutdown if it goes away, and another pin is
used with the gpio-poweroff driver to issue a reset in case the power has
returned by the time the shutdown is complete.

-Lasse
On 16/06/2017 13:31, David Brown wrote:
> And some file types are more susceptible to
> problems - sqlite databases are notorious for being corrupted if writes
> are interrupted.
What? I chose sqlite because they say corruption of a database is a very
rare event.

https://www.sqlite.org/howtocorrupt.html
On Monday, June 19, 2017 at 12:47:33 UTC+2, pozz wrote:
> On 16/06/2017 13:31, David Brown wrote:
> > And some file types are more susceptible to
> > problems - sqlite databases are notorious for being corrupted if writes
> > are interrupted.
>
> What? I chose sqlite because they say corruption of a database is a very
> rare event.
>
> https://www.sqlite.org/howtocorrupt.html
Chapters 3.1 and 4.

Bye,
Jack
On 19/06/17 12:47, pozz wrote:
> On 16/06/2017 13:31, David Brown wrote:
>> And some file types are more susceptible to
>> problems - sqlite databases are notorious for being corrupted if writes
>> are interrupted.
>
> What? I chose sqlite because they say corruption of a database is a very
> rare event.
>
> https://www.sqlite.org/howtocorrupt.html
Sorry for causing alarm, I was mixing this up with something else. You are, of course, still reliant on the OS and the hardware acting together to get the right behaviour here. The OS will tell the sqlite library when it believes the data has been written to the disk, but a memory card may still mess around with writes, garbage collection, etc., after that.
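After an unclean power-down it is also cheap to ask sqlite itself whether
the file survived, e.g. once at boot. Untested sketch (the database name
is made up):

#include <sqlite3.h>
#include <string.h>

static int check_cb(void *ok, int ncols, char **vals, char **names)
{
    (void)ncols; (void)names;
    if (vals[0] && strcmp(vals[0], "ok") == 0)
        *(int *)ok = 1;          /* integrity_check reports a single "ok" row */
    return 0;
}

int settings_db_healthy(void)
{
    sqlite3 *db;
    int ok = 0;

    if (sqlite3_open("settings.db", &db) != SQLITE_OK)
        return 0;
    sqlite3_exec(db, "PRAGMA integrity_check;", check_cb, &ok, NULL);
    sqlite3_close(db);
    return ok;
}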
On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/16/2017 10:33 PM, George Neuner wrote:
>>
>> Normally, the only way [corruption] can happen is if the OS/hardware
>> lies to the DBMS about whether data really has been written to the
>> media. If the media is such that a failure during a write can corrupt
>> more than the one block actually being written, then the media has to
>> be protected against failures during writes.
>
>That's exactly the problem with these types of media. If you violate
>the parameters of the write/erase cycle, all bets are off -- especially
>if the power to the device may be at a dubious level, etc.
No software can be guaranteed to work correctly in the face of byzantine failures. Failing "gracefully" - for some definition - even if possible, still is failing.
>Devices with built-in controllers (e.g., SD cards, SSDs, etc.) do lots
>of behind-the-scenes juggling when you do a "simple" write to them, so
>you don't know when the ramifications of your "write" are done. I.e.,
>the nonvolatile components *in* the device may be accessed differently
>than your mental model of the device would expect.
The erase "block" size != write "page" size of SSDs is a known problem. A DBMS can't address this by itself in software: "huge" VMM pages [sometimes] are good for in-memory performance - but for reliable i/o, huge file blocks *suck* both for performance and for space efficiency. A professional DBMS hosting databases on SSD requires the SSDs to be battery/supercap backed so power can't fail during a write, and also that multiple SSDs be configured in a RAID ... not for speed, but for increased reliability. Caching on SSD is not really an issue, because if the "fast copy" is unavailable the DBMS can go back to the [presumably slower] primary store. But *hosting* databases completely on SSD is a problem.
>Without knowing where each datum resides on the individual memory
>components at any given time, you can't predict what can be corrupted
>by a botched write/erase. Data that was committed three weeks ago
>(and not "touched" in the intervening time) can be clobbered -- how
>will you know?
>
>>> Given a record (in the DBMS) that conceptually looks like:
>>>     char name[20];
>>>     ...
>>>     char address[40];
>>>     ...
>>>     time_t birthdate;
>>> an access to "address" that is in progress when the power falls
>>> into the realm that isn't guaranteed to yield reliable operation
>>> can corrupt ANY of these stored values. Similarly, an access to
>>> some datum not shown, above, can corrupt any of *these*! You need
>>> to understand where each datum resides if you want to risk an
>>> "interrupted write".
My issue with these statements is that they are misleading, and that in
the sense where they are true, the problem can only be handled at the
system level by use of additional hardware - there's no way it can be
addressed locally, entirely in software.

No DBMS will seek into the middle of a file block and try to write 40
bytes. Storage always is block oriented. It's *true* that in your example
above, e.g., updating the address field will result in rewriting the
entire file block (or blocks if spanning) that contains the target data.
But it's *misleading* to say that you need to know, e.g., where the name
field is relative to the address field because the name field might be
corrupted by updating the address field. It's true, but irrelevant,
because the DBMS deals with that possibility automatically.

A proper DBMS always will work with a dynamic copy of the target file
block (the reason it should be run from RAM instead of entirely from r/w
Flash). The journal (WAL) records the original block(s) containing the
record(s) to be changed, and the modifications made to them. If the write
to stable storage fails, the journal allows recovering either the
original data or the modified data.

The journal always is written prior to modifying the stable store. If the
journal writes fail, the write to the stable copy never will be
attempted: a "journal crash" is a halting error. A DBMS run without
journaling enabled is unsafe.

The long-winded point is that the DBMS *expects* that any file block it
tries to change may be corrupted during i/o, and it takes steps to
protect against losing data because of that.

But SSDs - even when working (more or less) properly - introduce a
failure mode where updating a single "file block" (SSD page) drops an
atomic bomb in the middle of the file system, with fallout affecting
other, possibly unrelated, "file blocks" (pages) as well.

It's like: what should the flight computer do if the wings fall off?
There's absolutely nothing it can do, so the developer of the flight
software should not waste time worrying about it. It's a *system* level
issue.
>Remember that a FLASH write (parallels exist for other technologies) is
>actually an *erase* operation followed by a write. And, that you're
>actually dealing with "pages"/blocks of data, not individual bytes.
Block based i/o is not the issue. The issue is that SSDs do the equivalent of rewriting a whole platter track to change a single sector.
>The erase may take a fraction of a millisecond ("milli", not "micro")
>to be followed by a slightly shorter time to actually write the new
>contents back into the cells.
>
>[The times are often doubled for MLC devices!]
>
>During this "window of vulnerability", if the power supplies (or signal
>voltages) go out of spec, the device can misbehave in unpredictable ways.
>
>[This assumes the CPU itself isn't ALSO misbehaving as a result of the
>same issues!]
>
>This can manifest as:
>- the wrong value getting written
>- the right value getting written to the wrong location
>- the wrong value getting written to the wrong location
>- the entire page being partially erased
>- some other page being erased
>etc.
Byzantine failure. The duration of the "window of vulnerability" is not
the issue. The issue is the unpredictability of the result.

DBMS were designed at a time when disks were unreliable and operating
systems [if even present] were primitive. Early DBMS often included their
own device code and took direct control of disk and tape devices so that
they could guarantee operation.
>And, a week later, the data that I had stored in "birthdate" is no longer
>present in the journal as it has been previously committed to the store.
>So, when it gets corrupted by an errant "name update", you'll have no
>record of what it *should* have been.
We've had this conversation previously also: database terminology today
is almost universally misunderstood and misused by everyone.

The file(s) on the disk are not the "database" but merely a point-in-time
snapshot of the database. The "database" really is the historical
evolution of the stable store. To recover from a failure, you need a
point-in-time basis, and the journal from that instance to the point of
failure. If you have every journal entry from the beginning, you can
reconstruct the data just prior to the failure starting from an empty
basis.

The point here being that you always need backups unless you can afford
to rebuild from scratch.
>The DBMS counts on the store having "integrity" -- so, all the DBMS has
>to do is get the correct data into it THE FIRST TIME and it expects it
>to remain intact thereafter. It *surely* doesn't expect a write of
>one value to one record to alter some other value in some other
>(unrelated) record!
DBMS are designed to maintain logically consistent data ... the "I" in "ACID" stands for "isolation", not for "integrity". No DBMS can operate reliably if the underlying storage is faulty.
>[Recall this was one of the assets I considered when opting to use a
>DBMS in my design instead of just a "memory device"; it lets me perform
>checks on the data going *in* so I don't have to check the data coming
>*out* (the output is known to be "valid" -- unless the medium has
>failed -- which is the case for power sequencing on many (most?)
>nonvolatile storage media).]
And recall that I warned you about the problems of trying to run a
reliable DBMS on an unattended appliance. We didn't really discuss the
issue of SSDs per se, but we did discuss journaling (logging) and trying
to run the DBMS out of Flash rather than from RAM.

YMMV,
George
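To make the write ordering described above concrete -- journal the old
image first, sync, only then touch the stable copy, then retire the
journal -- here is a deliberately stripped-down, untested sketch. File
names and the fixed-size record are made up, and a real DBMS additionally
checksums journal entries and journals whole blocks:

#include <fcntl.h>
#include <unistd.h>

#define REC_SIZE 64

/* Write a whole small file and fsync() it before returning. */
static int write_file_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int ok = (write(fd, buf, len) == (ssize_t)len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}

int update_record(const char old_rec[REC_SIZE], const char new_rec[REC_SIZE])
{
    /* 1. Journal the unmodified record.  Die here and recovery finds the
     *    stable copy untouched. */
    if (write_file_sync("settings.journal", old_rec, REC_SIZE) != 0)
        return -1;               /* "journal crash": never touch the store */

    /* 2. Only now modify the stable copy.  Die mid-write and recovery
     *    restores the journaled image. */
    if (write_file_sync("settings.dat", new_rec, REC_SIZE) != 0)
        return -1;

    /* 3. Retire the journal entry (here: truncate the journal file). */
    return write_file_sync("settings.journal", "", 0);
}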