
Linux embedded: how to avoid corruption on power off

Started by pozz June 16, 2017
On Fri, 16 Jun 2017 12:10:26 +0200, pozz wrote:

> I'm playing with a Raspberry system, however I think my question is
> about Linux embedded in general.
>
> We all know that the OS (Linux or Windows or whatever) *should* be
> gracefully powered down with a shutdown procedure (shutdown command in
> Linux). We must avoid cutting the power abruptly.
>
> If this is possible for desktop systems, IMHO it's impossible to achieve
> in embedded systems. The user usually switches off a small box by
> pressing an OFF button that is usually connected to the main power
> supply input. In any case, he could immediately unplug the power cord
> without waiting for the end of the shutdown procedure.
>
> I'm interested to know what methods can be used to reduce the
> probability of corruption.
>
> For example, I chose to use a sqlite database to save non-volatile,
> user-configurable settings. sqlite is transaction based, so a power
> interruption in the middle of a transaction shouldn't corrupt the entire
> database. With normal text files this would be more difficult.
>
> I know the write requests on non-volatile memories (HDD, embedded Flash
> memories) are usually buffered by the OS and we don't know when they
> will really be executed by the kernel. Is there a method to force the
> buffered write requests to be flushed immediately?
>
> Other aspects to consider?
Google "raspberry pi ups". There are several options if you don't want to
roll your own.

--
Chisolm
Republic of Texas
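On the question buried in the quoted post -- forcing buffered writes out
immediately -- Linux gives you fsync()/fdatasync() per file descriptor and
sync() for everything. A minimal, untested sketch in C (the function name
and arguments are made up for illustration); note that none of this reaches
past the storage device's own write cache unless the kernel/driver issues a
cache flush to the medium:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a small settings file and push it to the medium before returning. */
int save_setting(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, text, strlen(text)) != (ssize_t)strlen(text) ||
        fsync(fd) != 0) {          /* flush data and metadata for this file */
        close(fd);
        return -1;
    }
    return close(fd);              /* fdatasync() skips metadata; sync() flushes everything */
}

For the usual write-new-file-then-rename() trick, you also want to fsync()
the containing directory.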
On 6/16/2017 11:24 AM, John Speth wrote:
>>>> I'm interested to know what methods can be used to reduce the
>>>> probability of corruption.
>>>
>>> You put a big capacitor on the power line so that when the main power
>>> is cut, the capacitor keeps the system running for the time necessary
>>> to gracefully shut down.
>
>> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
>> storage capacitor and an SMPS with an input voltage range of at least
>> 8-28 V; this should be enough time to sync out the data.
>
> You'll also need some hardware to signal the software that external power
> has turned off. That event will tell your system that a little bit of
> time remains to shut down before the caps drain and the world ends.
IME, you need TWO signals (if you have a physical power switch):
- one to indicate that the user has turned the power off (perhaps because
  it is easier than using whatever "soft" mechanisms have been put in
  place; or because the system has crashed or is otherwise not responsive
  ENOUGH to use that primary approach)
- one to indicate that the user had hoped to keep the device running but
  the mains power has been removed

Depending on the topology of the power supply, you can get a lot of
advance warning (e.g., monitoring the "rectified mains" on a switcher) or
very little (e.g., monitoring the regulated output(s) to detect when they
are falling below their minimum required voltages).

Note that your advance warning has to tell you when ANY of the supplies
you will be *requiring* will fall out of spec.
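A minimal, untested sketch of the two-signal idea, using the legacy sysfs
GPIO interface for brevity (libgpiod is the modern route). The pin numbers,
their polarity and save_state() are hypothetical -- adapt to however your
board wires the "user hit the power switch" and "input power has dropped"
signals:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int gpio_high(const char *path)
{
    int ch = '0';
    FILE *f = fopen(path, "r");
    if (f) {
        ch = fgetc(f);
        fclose(f);
    }
    return ch == '1';
}

extern void save_state(void);            /* commit settings, fsync, etc. */

int main(void)
{
    for (;;) {
        /* signal 1: the user operated the power switch */
        if (gpio_high("/sys/class/gpio/gpio17/value")) {
            save_state();
            system("shutdown -h now");   /* orderly, user-requested shutdown */
            break;
        }
        /* signal 2: input power dropped; the hold-up caps are all we have */
        if (gpio_high("/sys/class/gpio/gpio27/value")) {
            save_state();                /* resume vs. halt is decided later */
        }
        usleep(100 * 1000);              /* 100 ms poll; poll(2) on the "value"
                                            file with edge events is nicer */
    }
    return 0;
}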
Hi Don,

I know *you* know this (because we've been over it before) ... I'm
just providing information for the group.


On Fri, 16 Jun 2017 11:26:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/16/2017 3:10 AM, pozz wrote:
>
>> For example, I chose to use a sqlite database to save non-volatile,
>> user-configurable settings. sqlite is transaction based, so a power
>> interruption in the middle of a transaction shouldn't corrupt the entire
>> database. With normal text files this would be more difficult.
>
>That's not true. If the DBMS software is in the process of writing to
>persistent store *as* the power falls into the "not guaranteed to work"
>realm, it can corrupt OTHER memory beyond that related to your current
>transaction -- memory for transactions that have already been *committed*.
That's correct as far as it goes, but from the DBMS point of view,
transient data in RAM is expected to be lost in a failure ... the
objective of the DBMS is to protect data on stable storage.

If you are using a journal (WAL) - and you need your head examined if you
aren't - transactions already committed are guaranteed to be recoverable
unless the journal copy and the stable copy BOTH are corrupted.

Normally, the only way this can happen is if the OS/hardware lies to the
DBMS about whether data really has been written to the media. If the
media is such that a failure during a write can corrupt more than the one
block actually being written, then the media has to be protected against
failures during writes.

The journal copy and the stable copy of any given record are never
written simultaneously, so even if the journal and the stable file both
are being written to at the time of failure, the data involved in the
respective writes will be different.
>Given a record (in the DBMS) that conceptually looks like:
>    char name[20];
>    ...
>    char address[40];
>    ...
>    time_t birthdate;
>an access to "address" that is in progress when the power falls into the
>realm that isn't guaranteed to yield reliable operation can corrupt
>ANY of these stored values. Similarly, an access to some datum not
>shown, above, can corrupt any of *these*! You need to understand where
>each datum resides if you want to risk an "interrupted write".
No. Records may be written or updated in pieces if the data involved is
not contiguous, but the journal process guarantees that, seen from on
high, the update occurs atomically.

The *entire* record to be updated will 1st be copied unmodified into the
journal. Then the journal will record the changes to be made to the
record. After journaling is complete, the changes will be made to the
record in the stable file.

At every point in the update process, the DBMS has a consistent version
of the *entire* record available to be recovered. The consistent version
may not be the latest from a client's perspective, but it won't be a
corrupt, partially updated version.

Ideally, journaling will be done at the file block level, so as to
protect data that is co-resident in the same block(s) as the target
record. But not all DBMSs do this because a block journal can become
very large, very quickly. Sqlite (and MySQL and others) journal at a
meta level: recording just the original data and the changes to be made
to it - which isn't quite as safe as block journaling, but is acceptable
for most purposes.

Note that OS/filesystem journaling is not a replacement for DBMS
journaling. The addition of filesystem journaling has no effect on DBMS
reliability, but it does affect DBMS i/o performance.

YMMV,
George
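For the OP's sqlite case, "using a journal" boils down to a couple of
pragmas at open time. An untested sketch against the sqlite3 C API (the
database name is made up):

#include <sqlite3.h>

/* Open the settings database so that a commit is not acknowledged until
 * the write-ahead log has been fsync()ed. */
int open_settings_db(sqlite3 **db)
{
    if (sqlite3_open("settings.db", db) != SQLITE_OK)
        return -1;

    /* The rollback journal is sqlite's default; WAL gives equivalent crash
     * guarantees and appends instead of rewriting, which flash tends to
     * prefer. */
    sqlite3_exec(*db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
    sqlite3_exec(*db, "PRAGMA synchronous=FULL;", NULL, NULL, NULL);
    return 0;
}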
On 6/16/2017 10:33 PM, George Neuner wrote:
>>> For example, I chose to use a sqlite database to save non-volatile,
>>> user-configurable settings. sqlite is transaction based, so a power
>>> interruption in the middle of a transaction shouldn't corrupt the entire
>>> database. With normal text files this would be more difficult.
>>
>> That's not true. If the DBMS software is in the process of writing to
>> persistent store *as* the power falls into the "not guaranteed to work"
>> realm, it can corrupt OTHER memory beyond that related to your current
>> transaction -- memory for transactions that have already been *committed*.
>
> That's correct as far as it goes, but from the DBMS point of view,
> transient data in RAM is expected to be lost in a failure ... the
> objective of the DBMS is to protect data on stable storage.
>
> If you are using a journal (WAL) - and you need your head examined if
> you aren't - transactions already committed are guaranteed to be
> recoverable unless the journal copy and the stable copy BOTH are
> corrupted.
>
> Normally, the only way this can happen is if the OS/hardware lies to
> the DBMS about whether data really has been written to the media.
> If the media is such that a failure during a write can corrupt more
> than the one block actually being written, then the media has to be
> protected against failures during writes.
That's exactly the problem with these types of media. If you violate the
parameters of the write/erase cycle, all bets are off -- especially if
the power to the device may be at a dubious level, etc.

Devices with built-in controllers (e.g., SD cards, SSDs, etc.) do lots of
behind-the-scenes juggling when you do a "simple" write to them, so you
don't know when the ramifications of your "write" are done. I.e., the
nonvolatile components *in* the device may be accessed differently than
your mental model of the device would expect.

Without knowing where each datum resides on the individual memory
components at any given time, you can't predict what can be corrupted by
a botched write/erase. Data that was committed three weeks ago (and not
"touched" in the intervening time) can be clobbered -- how will you know?
> The journal copy and the stable copy of any given record are never
> written simultaneously, so even if the journal and the stable file
> both are being written to at the time of failure, the data involved in
> the respective writes will be different.
>
>> Given a record (in the DBMS) that conceptually looks like:
>>     char name[20];
>>     ...
>>     char address[40];
>>     ...
>>     time_t birthdate;
>> an access to "address" that is in progress when the power falls into the
>> realm that isn't guaranteed to yield reliable operation can corrupt
>> ANY of these stored values. Similarly, an access to some datum not
>> shown, above, can corrupt any of *these*! You need to understand where
>> each datum resides if you want to risk an "interrupted write".
>
> No. Records may be written or updated in pieces if the data involved
> is not contiguous, but the journal process guarantees that, seen from
> on high, the update occurs atomically.
You're missing the point.

Imagine you are writing to <something> -- but I reach in and twiddle with
the electrical signals that control that write, and do so in a manner
that isn't easily predicted. You (the software and hardware) THINK that
you are doing one thing but, in fact, something else is happening
entirely. The hardware doesn't think anything "wrong" is happening (its
notion of right and wrong is suspect!).

Remember that a FLASH write (parallels exist for other technologies) is
actually an *erase* operation followed by a write. And, that you're
actually dealing with "pages"/blocks of data, not individual bytes. The
erase may take a fraction of a millisecond ("milli", not "micro"), to be
followed by a slightly shorter time to actually write the new contents
back into the cells.

[The times are often doubled for MLC devices!]

During this "window of vulnerability", if the power supplies (or signal
voltages) go out of spec, the device can misbehave in unpredictable ways.

[This assumes the CPU itself isn't ALSO misbehaving as a result of the
same issues!]

This can manifest as:
- the wrong value getting written
- the right value getting written to the wrong location
- the wrong value getting written to the wrong location
- the entire page being partially erased
- some other page being erased
etc.

Change the manufacturer -- or part number -- and the behavior can change
as well.

What's even worse is that this sort of failure can have FUTURE
consequences for reliability. Given that power was failing at the time of
the event, the system probably doesn't know (nor can it easily remember)
what it was in the process of doing when the power failed. So, it doesn't
know that a certain page may have only been partially programmed -- even
if it has the correct "values", there might not be the required amount of
charge stored in each cell (you can't count electrons from the pin
interface!). So, data that *looks* good may actually be more susceptible
to things like read (or write) disturbances LATER. You operated the
device outside its specified operating conditions, so you can't rely on
the data retention that the manufacturer implies you'd have!

If your "memory device" has any "smarts" in it (SSD, SD card, etc.), then
it can perform many of these operations for *each* operation that you
THINK it is performing (e.g., as it shuffles data around for wear
leveling and to accommodate failing pages/blocks). *You* think you're
updating the "name" portion of a record but the actual memory device
decides to stuff some bogus value in the "birthdate" portion -- perhaps
of a different record!
> The *entire* record to be updated will 1st be copied unmodified into
> the journal. Then the journal will record the changes to be made to
> the record. After journaling is complete, the changes will be made to
> the record in the stable file.
And, a week later, the data that I had stored in "birthdate" is no longer
present in the journal as it has been previously committed to the store.
So, when it gets corrupted by an errant "name update", you'll have no
record of what it *should* have been.

The DBMS counts on the store having "integrity" -- so, all the DBMS has
to do is get the correct data into it THE FIRST TIME and it expects it to
remain intact thereafter. It *surely* doesn't expect a write of one value
to one record to alter some other value in some other (unrelated) record!

[Recall this was one of the assets I considered when opting to use a DBMS
in my design instead of just a "memory device"; it lets me perform checks
on the data going *in* so I don't have to check the data coming *out*
(the output is known to be "valid" -- unless the medium has failed --
which is the case for power sequencing on many (most?) nonvolatile
storage media).]
> At every point in the update process, the DBMS has a consistent
> version of the *entire* record available to be recovered. The
> consistent version may not be the latest from a client's perspective,
> but it won't be a corrupt, partially updated version.
>
> Ideally, journaling will be done at the file block level, so as to
> protect data that is co-resident in the same block(s) as the target
> record. But not all DBMSs do this because a block journal can become
> very large, very quickly. Sqlite (and MySQL and others) journal at a
> meta level: recording just the original data and the changes to be made
> to it - which isn't quite as safe as block journaling, but is acceptable
> for most purposes.
>
> Note that OS/filesystem journaling is not a replacement for DBMS
> journaling. The addition of filesystem journaling has no effect on
> DBMS reliability, but it does affect DBMS i/o performance.
On Fri, 16 Jun 2017 11:24:52 -0700, John Speth <johnspeth@yahoo.com>
wrote:

>>>> I'm interested to know what methods can be used to reduce the
>>>> probability of corruption.
>>>
>>> You put a big capacitor on the power line so that when the main power
>>> is cut, the capacitor keeps the system running for the time necessary
>>> to gracefully shut down.
>
>> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
>> storage capacitor and an SMPS with an input voltage range of at least
>> 8-28 V; this should be enough time to sync out the data.
>
>You'll also need some hardware to signal the software that external
>power has turned off. That event will tell your system that a little
>bit of time remains to shut down before the caps drain and the world ends.
>
>JJS
Very little extra hardware is required. Use an optoisolator and you can
easily monitor the primary input voltage, including mains voltage. Some
comparator functionality and it will drive some of the RS-232 handshake
pins, which then generates an interrupt.

One issue with big storage capacitors is how fast they are charged when
the power is restored. The SMPS reset circuitry should operate reliably
even when the input voltage is slowly restored.

With big diesels, especially at cold temperatures, the starting current
will be huge, dropping the battery voltage momentarily. This voltage drop
will initiate the data save, but after that, the routines should check
whether the input voltage has returned and, based on this, continue
normal operation or perform a full shutdown.
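The "check whether the input voltage has returned" step might look like
this in user space. Untested sketch: power_good() and save_state() stand
in for whatever the monitoring hardware above provides, and reboot()
needs root and does not sync on its own:

#include <unistd.h>
#include <sys/reboot.h>

extern int  power_good(void);    /* 1 when the input is back within spec */
extern void save_state(void);    /* commit settings, fsync, etc. */

void on_power_fail(void)
{
    save_state();                    /* do this first: the caps are draining */

    for (int i = 0; i < 20; i++) {   /* ~2 s grace, e.g. for engine cranking */
        if (power_good())
            return;                  /* brown-out only: resume normal operation */
        usleep(100 * 1000);
    }

    sync();                          /* reboot(2) does not flush buffers itself */
    reboot(RB_POWER_OFF);            /* or: system("shutdown -h now") */
}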
On Saturday, June 17, 2017 at 16:35:51 UTC+2, upsid...@downunder.com wrote:
> On Fri, 16 Jun 2017 11:24:52 -0700, John Speth <johnspeth@yahoo.com>
> wrote:
>
> >>>> I'm interested to know what methods can be used to reduce the
> >>>> probability of corruption.
> >>>
> >>> You put a big capacitor on the power line so that when the main power
> >>> is cut, the capacitor keeps the system running for the time necessary
> >>> to gracefully shut down.
> >
> >> E.g. for 24 V (typically diesel engines) use a series diode, a _big_
> >> storage capacitor and an SMPS with an input voltage range of at least
> >> 8-28 V; this should be enough time to sync out the data.
> >
> >You'll also need some hardware to signal the software that external
> >power has turned off. That event will tell your system that a little
> >bit of time remains to shut down before the caps drain and the world ends.
> >
> >JJS
>
> Very little extra hardware is required. Use an optoisolator and you
> can easily monitor the primary input voltage, including mains voltage.
> Some comparator functionality and it will drive some of the RS-232
> handshake pins, which then generates an interrupt.
>
> One issue with big storage capacitors is how fast they are charged when
> the power is restored. The SMPS reset circuitry should operate
> reliably even when the input voltage is slowly restored.
>
> With big diesels, especially at cold temperatures, the starting current
> will be huge, dropping the battery voltage momentarily. This voltage
> drop will initiate the data save, but after that, the routines should
> check whether the input voltage has returned and, based on this,
> continue normal operation or perform a full shutdown.
I've done a UPS for a Linux system with supercaps; the CPU is held in
reset until the supercaps are fully charged. One pin is used to monitor
the input supply and issue a shutdown if it goes away, and another pin is
used with the gpio-poweroff driver to issue a reset in case the power has
returned by the time the shutdown is complete.

-Lasse
On 16/06/2017 13:31, David Brown wrote:
> And some file types are more susceptible to
> problems - sqlite databases are notorious for being corrupted if writes
> are interrupted.
What? I chose sqlite because they say corruption of a database is a very
rare event.

https://www.sqlite.org/howtocorrupt.html
On Monday, June 19, 2017 at 12:47:33 UTC+2, pozz wrote:
> On 16/06/2017 13:31, David Brown wrote:
> > And some file types are more susceptible to
> > problems - sqlite databases are notorious for being corrupted if writes
> > are interrupted.
>
> What? I chose sqlite because they say corruption of a database is a very
> rare event.
>
> https://www.sqlite.org/howtocorrupt.html
Chapters 3.1 and 4.

Bye,
Jack
On 19/06/17 12:47, pozz wrote:
> On 16/06/2017 13:31, David Brown wrote:
>> And some file types are more susceptible to
>> problems - sqlite databases are notorious for being corrupted if writes
>> are interrupted.
>
> What? I chose sqlite because they say corruption of a database is a very
> rare event.
>
> https://www.sqlite.org/howtocorrupt.html
Sorry for causing alarm, I was mixing this up with something else. You are, of course, still reliant on the OS and the hardware acting together to get the right behaviour here. The OS will tell the sqlite library when it believes the data has been written to the disk, but a memory card may still mess around with writes, garbage collection, etc., after that.
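After an unclean power-down it is also cheap to ask sqlite itself whether
the file survived, e.g. once at boot. Untested sketch (the database name
is made up):

#include <sqlite3.h>
#include <string.h>

static int check_cb(void *ok, int ncols, char **vals, char **names)
{
    (void)ncols; (void)names;
    if (vals[0] && strcmp(vals[0], "ok") == 0)
        *(int *)ok = 1;          /* integrity_check reports a single "ok" row */
    return 0;
}

int settings_db_healthy(void)
{
    sqlite3 *db;
    int ok = 0;

    if (sqlite3_open("settings.db", &db) != SQLITE_OK)
        return 0;
    sqlite3_exec(db, "PRAGMA integrity_check;", check_cb, &ok, NULL);
    sqlite3_close(db);
    return ok;
}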
On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 6/16/2017 10:33 PM, George Neuner wrote:
>>
>> Normally, the only way [corruption] can happen is if the OS/hardware
>> lies to the DBMS about whether data really has been written to the
>> media. If the media is such that a failure during a write can corrupt
>> more than the one block actually being written, then the media has to
>> be protected against failures during writes.
>
>That's exactly the problem with these types of media. If you violate
>the parameters of the write/erase cycle, all bets are off -- especially
>if the power to the device may be at a dubious level, etc.
No software can be guaranteed to work correctly in the face of byzantine failures. Failing "gracefully" - for some definition - even if possible, still is failing.
>Devices with built-in controllers (e.g., SD cards, SSDs, etc.) do lots
>of behind-the-scenes juggling when you do a "simple" write to them, so
>you don't know when the ramifications of your "write" are done. I.e.,
>the nonvolatile components *in* the device may be accessed differently
>than your mental model of the device would expect.
The erase "block" size != write "page" size of SSDs is a known problem. A DBMS can't address this by itself in software: "huge" VMM pages [sometimes] are good for in-memory performance - but for reliable i/o, huge file blocks *suck* both for performance and for space efficiency. A professional DBMS hosting databases on SSD requires the SSDs to be battery/supercap backed so power can't fail during a write, and also that multiple SSDs be configured in a RAID ... not for speed, but for increased reliability. Caching on SSD is not really an issue, because if the "fast copy" is unavailable the DBMS can go back to the [presumably slower] primary store. But *hosting* databases completely on SSD is a problem.
>Without knowing where each datum resides on the individual memory
>components at any given time, you can't predict what can be corrupted
>by a botched write/erase. Data that was committed three weeks ago
>(and not "touched" in the intervening time) can be clobbered -- how
>will you know?
>
>>> Given a record (in the DBMS) that conceptually looks like:
>>>     char name[20];
>>>     ...
>>>     char address[40];
>>>     ...
>>>     time_t birthdate;
>>> an access to "address" that is in progress when the power falls
>>> into the realm that isn't guaranteed to yield reliable operation
>>> can corrupt ANY of these stored values. Similarly, an access to
>>> some datum not shown, above, can corrupt any of *these*! You need
>>> to understand where each datum resides if you want to risk an
>>> "interrupted write".
My issue with these statements is that they are misleading, and that in
the sense where they are true, the problem can only be handled at the
system level by use of additional hardware - there's no way it can be
addressed locally, entirely in software.

No DBMS will seek into the middle of a file block and try to write 40
bytes. Storage always is block oriented. It's *true* that in your example
above, e.g., updating the address field will result in rewriting the
entire file block (or blocks if spanning) that contains the target data.
But it's *misleading* to say that you need to know, e.g., where the name
field is relative to the address field because the name field might be
corrupted by updating the address field. It's true, but irrelevant,
because the DBMS deals with that possibility automatically.

A proper DBMS always will work with a dynamic copy of the target file
block (the reason it should be run from RAM instead of entirely from r/w
Flash). The journal (WAL) records the original block(s) containing the
record(s) to be changed, and the modifications made to them. If the write
to stable storage fails, the journal allows recovering either the
original data or the modified data.

The journal always is written prior to modifying the stable store. If the
journal writes fail, the write to the stable copy never will be
attempted: a "journal crash" is a halting error. A DBMS run without
journaling enabled is unsafe.

The long-winded point is that the DBMS *expects* that any file block it
tries to change may be corrupted during i/o, and it takes steps to
protect against losing data because of that.

But SSDs - even when working (more or less) properly - introduce a
failure mode where updating a single "file block" (SSD page) drops an
atomic bomb in the middle of the file system, with fallout affecting
other, possibly unrelated, "file blocks" (pages) as well.

It's like: what should the flight computer do if the wings fall off?
There's absolutely nothing it can do, so the developer of the flight
software should not waste time worrying about it. It's a *system* level
issue.
>Remember that a FLASH write (parallels exist for other technologies) is
>actually an *erase* operation followed by a write. And, that you're
>actually dealing with "pages"/blocks of data, not individual bytes.
Block based i/o is not the issue. The issue is that SSDs do the equivalent of rewriting a whole platter track to change a single sector.
>The erase may take a fraction of a millisecond ("milli", not "micro")
>to be followed by a slightly shorter time to actually write the new
>contents back into the cells.
>
>[The times are often doubled for MLC devices!]
>
>During this "window of vulnerability", if the power supplies (or signal
>voltages) go out of spec, the device can misbehave in unpredictable ways.
>
>[This assumes the CPU itself isn't ALSO misbehaving as a result of the
>same issues!]
>
>This can manifest as:
>- the wrong value getting written
>- the right value getting written to the wrong location
>- the wrong value getting written to the wrong location
>- the entire page being partially erased
>- some other page being erased
>etc.
Byzantine failure. The duration of the "window of vulnerability" is not
the issue. The issue is the unpredictability of the result.

DBMS were designed at a time when disks were unreliable and operating
systems [if even present] were primitive. Early DBMS often included their
own device code and took direct control of disk and tape devices so that
they could guarantee operation.
>And, a week later, the data that I had stored in "birthdate" is no longer
>present in the journal as it has been previously committed to the store.
>So, when it gets corrupted by an errant "name update", you'll have no
>record of what it *should* have been.
We've had this conversation previously also: database terminology today
is almost universally misunderstood and misused by everyone.

The file(s) on the disk are not the "database" but merely a point-in-time
snapshot of the database. The "database" really is the historical
evolution of the stable store. To recover from a failure, you need a
point-in-time basis, and the journal from that instance to the point of
failure. If you have every journal entry from the beginning, you can
reconstruct the data just prior to the failure starting from an empty
basis.

The point here being that you always need backups unless you can afford
to rebuild from scratch.
>The DBMS counts on the store having "integrity" -- so, all the DBMS has
>to do is get the correct data into it THE FIRST TIME and it expects it
>to remain intact thereafter. It *surely* doesn't expect a write of
>one value to one record to alter some other value in some other
>(unrelated) record!
DBMS are designed to maintain logically consistent data ... the "I" in "ACID" stands for "isolation", not for "integrity". No DBMS can operate reliably if the underlying storage is faulty.
>[Recall this was one of the assets I considered when opting to use a
>DBMS in my design instead of just a "memory device"; it lets me perform
>checks on the data going *in* so I don't have to check the data coming
>*out* (the output is known to be "valid" -- unless the medium has
>failed -- which is the case for power sequencing on many (most?)
>nonvolatile storage media).]
And recall that I warned you about the problems of trying to run a
reliable DBMS on an unattended appliance. We didn't really discuss the
issue of SSDs per se, but we did discuss journaling (logging) and trying
to run the DBMS out of Flash rather than from RAM.

YMMV,
George
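To make the write ordering described above concrete -- journal the old
image first, sync, only then touch the stable copy, then retire the
journal -- here is a deliberately stripped-down, untested sketch. File
names and the fixed-size record are made up, and a real DBMS additionally
checksums journal entries and journals whole blocks:

#include <fcntl.h>
#include <unistd.h>

#define REC_SIZE 64

/* Write a whole small file and fsync() it before returning. */
static int write_file_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int ok = (write(fd, buf, len) == (ssize_t)len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}

int update_record(const char old_rec[REC_SIZE], const char new_rec[REC_SIZE])
{
    /* 1. Journal the unmodified record.  Die here and recovery finds the
     *    stable copy untouched. */
    if (write_file_sync("settings.journal", old_rec, REC_SIZE) != 0)
        return -1;               /* "journal crash": never touch the store */

    /* 2. Only now modify the stable copy.  Die mid-write and recovery
     *    restores the journaled image. */
    if (write_file_sync("settings.dat", new_rec, REC_SIZE) != 0)
        return -1;

    /* 3. Retire the journal entry (here: truncate the journal file). */
    return write_file_sync("settings.journal", "", 0);
}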