
Linux embedded: how to avoid corruption on power off

Started by pozz June 16, 2017
On Mon, 19 Jun 2017 12:47:32 +0200, pozz <pozzugno@gmail.com> wrote:

>On 16/06/2017 13:31, David Brown wrote:
>> And some file types are more susceptible to
>> problems - sqlite databases are notorious for being corrupted if writes
>> are interrupted.
>
>What?  I chose sqlite because they say the corruption of a database is
>a very rare event.
>
>https://www.sqlite.org/howtocorrupt.html
It depends.  Sqlite is designed to be embedded in a foreign program, so it is vulnerable to developer errors in ways that server based DBMS are not.  But that's only part of it.  There also are hardware - primarily storage - reliability issues to consider.

As part of the embedded focus, by default, Sqlite does not enable "WAL", which stands for "write ahead logging".  More generally, this capability is referred to as "journaling".  You must enable WAL for the database if you want maximum safety.  But even with WAL enabled, the database can be corrupted by hardware hiccups during a write.

The point I've been debating with DonY in another thread is that, when using WAL, a database corrupted during normal operation ALWAYS SHOULD BE RECOVERABLE.  The problem is that tuning WAL use is a balancing act: the safer you want to be, the more extra file space you need for the log.

SSDs mess with the DBMS's ability to recover from a write failure: the update/erase "block" size != write "page" size issue is a serious problem.  It's a failure mode that [essentially] can't happen with other types of storage systems, and something that DBMS never were designed to handle.

If the SSD erase block (not "page") size is not ridiculously large, you can configure the DBMS's "file block" size to match it.  It will require additional RAM, may slow down i/o (depends on the SSD), and it will require lots of additional file space for the WAL, but it will greatly improve the odds that your database won't get corrupted.

If you can't match block sizes, then you have a problem.  The best solution is multiple SSDs in a self repairing RAID.  If that isn't possible, then the only viable solution is frequent offline backups of the database.

George
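For reference, enabling WAL and full synchronous mode comes down to two pragmas.  A minimal sketch using SQLite's C API (the database path here is just a placeholder, and error handling is trimmed):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;

        /* placeholder path -- point this at the real settings database */
        if (sqlite3_open("/var/lib/myapp/settings.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        /* journal_mode=WAL enables write-ahead logging;
           synchronous=FULL forces a sync at the critical points */
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA synchronous=FULL;", NULL, NULL, NULL);

        /* ... normal work, wrapped in explicit transactions ... */

        sqlite3_close(db);
        return 0;
    }

With WAL, synchronous=FULL syncs the log on every commit; NORMAL defers the sync to checkpoints, trading durability of the last few transactions for speed.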
On 6/19/2017 6:42 PM, George Neuner wrote:
> On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 6/16/2017 10:33 PM, George Neuner wrote:
>>>
>>> Normally, the only way [corruption] can happen is if the OS/hardware
>>> lies to the DBMS about whether data really has been written to the
>>> media.  If the media is such that a failure during a write can corrupt
>>> more than the one block actually being written, then the media has to
>>> be protected against failures during writes.
>>
>> That's exactly the problem with these types of media.  If you violate
>> the parameters of the write/erase cycle, all bets are off -- especially
>> if the power to the device may be at a dubious level, etc.
>
> No software can be guaranteed to work correctly in the face of
> byzantine failures.  Failing "gracefully" - for some definition - even
> if possible, still is failing.
But that's the nature of the "power down problem" (the nature of the OP's question)! It requires special attention in hardware *and* software. To think you can magically execute a sequence of commands and be guaranteed NOT to have "corruption" is naive.
>> Devices with built-in controllers (e.g., SD cards, SSD's, etc.) do lots
>> of behind the scenes juggling when you do a "simple" write to them so
>> you don't know when the ramifications of your "write" are done.  I.e.,
>> the nonvolatile components *in* the device may be accessed differently
>> than your mental model of the device would expect.
>
> The erase "block" size != write "page" size of SSDs is a known
> problem.
>
> A DBMS can't address this by itself in software: "huge" VMM pages
> [sometimes] are good for in-memory performance - but for reliable i/o,
> huge file blocks *suck* both for performance and for space efficiency.
>
> A professional DBMS hosting databases on SSD requires the SSDs to be
> battery/supercap backed so power can't fail during a write, and also
> that multiple SSDs be configured in a RAID ... not for speed, but for
> increased reliability.
>
> Caching on SSD is not really an issue, because if the "fast copy" is
> unavailable the DBMS can go back to the [presumably slower] primary
> store.  But *hosting* databases completely on SSD is a problem.
Exactly. I suspect the OP isn't using rust to store his data (and other corruptible items). So, he's using some nonvolatile *semiconductor* medium. Most probably FLASH. Most probably NAND Flash. Most probably MLC NAND Flash. (i.e., one of the more problematic media to use reliably in this situation)
>> Without knowing where each datum resides on the individual memory
>> components at any given time, you can't predict what can be corrupted
>> by a botched write/erase.  Data that was committed three weeks ago
>> (and not "touched" in the intervening time) can be clobbered -- how
>> will you know?
>>
>>>> Given a record (in the DBMS) that conceptually looks like:
>>>>     char name[20];
>>>>     ...
>>>>     char address[40];
>>>>     ...
>>>>     time_t birthdate;
>>>> an access to "address" that is in progress when the power falls
>>>> into the realm that isn't guaranteed to yield reliable operation
>>>> can corrupt ANY of these stored values.  Similarly, an access to
>>>> some datum not shown, above, can corrupt any of *these*!  You need
>>>> to understand where each datum resides if you want to risk an
>>>> "interrupted write".
>
> My issue with these statements is that they are misleading, and that
> in the sense where they are true, the problem can only be handled at
> the system level by use of additional hardware - there's no way it can
> be addressed locally, entirely in software.
>
> No DBMS will seek into the middle of a file block and try to write 40
> bytes.  Storage always is block oriented.
>
> It's *true* that in your example above, e.g., updating the address
> field will result in rewriting the entire file block (or blocks if
> spanning) that contains the target data.
>
> But it's *misleading* to say that you need to know, e.g., where is the
> name field relative to the address field because the name field might
> be corrupted by updating the address field.  It's true, but irrelevant
> because the DBMS deals with that possibility automatically.
No, you're still missing the point.  You need to know physically, on the *medium*, where the actual cells holding the bits of data for these "variables" (records, etc.) reside because "issues" that cause the memory devices to be corrupted have ramifications based on chip topography (geography).  I.e., if a cell that you ("you" being the file system and FTL layers WELL BELOW the DBMS) *think* you are altering happens to be adjacent to some other cell (which it will almost assuredly be), then that adjacent cell can be corrupted by the malformed actions consequential to the power transition putting the chip(s) in a compromised operating state.

E.g., you go to circle a name in a (deadtree) phone book and your hand (or the book) shudders in the process (because you're feeling faint and on the verge of passing out).  Can you guarantee that you will circle the *name* that you intended?  Or, some nearby name?  Or, maybe an address or phone number somewhere in that vicinity?

It doesn't matter that you double or triple checked the spelling of the name to be sure you'd found the right one.  Or, that you deliberately chose a very fine point pen to ensure you would ONLY circle the item of interest (i.e., that your software has been robustly designed).  When the actual time comes for the pen to touch the paper, if you're not "fully operational", all bets are off.

I.e., some MECHANISM is needed (not software) that will block your hand from marking the page if you are unsteady.  Absent that (or, in the presence of a poorly conceived mechanism), you have no way of knowing *later*, when you've "recovered", if you may have done some damage (corruption) during that event.  Indeed, you may not even be aware that you were unsteady at the time!
> A proper DBMS always will work with a dynamic copy of the target file
> block (the reason it should be run from RAM instead of entirely from
> r/w Flash).  The journal (WAL) records the original block(s)
> containing the record(s) to be changed, and the modifications made to
> them.  If the write to stable storage fails, the journal allows
> recovering either the original data or the modified data.
But, as I noted above, you can't KNOW that the journal hasn't been collaterally damaged (by your shaky hands).  In *normal* operation, writing (and to a lesser extent, READING) to FLASH disturbs the data in nearby (i.e., NOT BEING ACCESSED) memory cells.  When power and signals (levels and timing) are suspect (i.e., as power is failing), this problem is magnified.
> The journal always is written prior to modifying the stable store.  If
> the journal writes fail, the write to the stable copy never will be
> attempted: a "journal crash" is a halting error.
>
> A DBMS run without journaling enabled is unsafe.
>
> The longwinded point is that the DBMS *expects* that any file block it
> tries to change may be corrupted during i/o, and it takes steps to
> protect against losing data because of that.
What if I corrupt two blocks at the same time -- two UNRELATED (by any notion that *you*, the developer, can fathom) blocks.  ANY two that I want.  Can you recover?  Can you even guarantee to KNOW that this has happened?  I.e., some other table in the same tablespace has been whacked as a consequence of this errant write.  A table that hasn't been written in months (no record of the most recent changes in the journal/WAL).  Will you *know* that it has been whacked?  How?  WHEN??
> But SSDs - even when working (more or less) properly - introduce a
> failure mode where updating a single "file block" (SSD page) drops an
> atomic bomb in the middle of the file system, with fallout affecting
> other, possibly unrelated, "file blocks" (pages) as well.
Exactly.  You can't know -- nor predict -- which blocks/pages/cells of the medium will be corrupted.  You probably won't even know which of these were being *targeted* when the event occurred, let alone which are affected by "collateral damage".

The whole point is that the system isn't operating as *intended* (by the naive software developer) during these periods.  The hardware and system designers have to provide guidance for THAT SPECIFIC SYSTEM so the software developer knows what he can, can't and shouldn't do as power failure approaches (along with recovery therefrom).

Early nonvolatile semiconductor memory (discounting WAROM) was typically implemented as BBSRAM.  It was often protected by gating the write line with a "POWER_OK" signal.  Obvious, right?  Power failing should block writes!  But, that led to data being corrupted -- because the POWER_OK (write inhibit) signal was asynchronous with the memory cycle.  So, a write could be prematurely terminated and corrupt the data that was intended to be written, leading to different outcomes:
- old data remains
- new data overwrites
- bogus data results
But, it tended to be just *the* location that was addressed (unless the write inhibit happened too late in the power loss scenario).

As memory blocks grew, BBDRAM replaced the BBSRAM, DRAM requiring less power per bit to operate (sustain).  It is a bit more complicated to implement, as the refresh controller has to remain active in the absence of power.  The flaw, here, would be failing to synchronize the "inhibit" and potentially aborting a RAS or CAS -- and clobbering an entire *row* in the device (leaving it with unpredictable contents).

SRAM is now bigger *and* lower power -- and folks understand the need to synchronously protect accesses.  So, it's trivial to design a (large!) block of BBSRAM that operates on lower power.  As you can't "synchronize with the future", it's easier to just give an early warning to the processor (e.g., NMI) and have it deliberately toggle a "protect memory" latch thereafter, KNOWING that it shouldn't even bother trying to write to that memory!

But FLASH (esp. SSDs and "memory cards") has progressed to the point where the devices have their own controllers, etc. on board.  So, from the outside, you can neither tell where (physically) a particular write will affect the contents of the chip(s) packaged within, nor can you know for sure what is happening (at a signal level) inside the device.  So, how can you know how far in advance to stop writing?  How can you know, for sure, that your last write will actually manage to end up being committed to the memory chip(s) within the device (what if its controller encounters a write error and opts to retry the write on a different block of memory, adjusting its bookkeeping in the process)?

You do all the power calculations assuming the bulk capacity in your power supply is at the *low* end of its rating -- for the current temperature -- and assume your electronics are using the *maximum* amount of power (including the memory card!) and predict how much "up time" you have before the voltage(s) in the system fall out of spec.  Then, back that off by some amount of derating to make it easier to sleep at night.
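For a concrete feel of that calculation, here is a minimal sketch assuming a constant-power load, with no allowance for regulator efficiency or ESR; the numbers are made-up examples, not a real design:

    #include <stdio.h>

    /* Estimate hold-up time of a bulk capacitor feeding a constant-power
     * load.  Usable energy is 0.5*C*(Vstart^2 - Vmin^2); time = energy/power.
     */
    static double holdup_seconds(double cap_farads, double v_start,
                                 double v_min, double load_watts)
    {
        double energy = 0.5 * cap_farads * (v_start * v_start - v_min * v_min);
        return energy / load_watts;
    }

    int main(void)
    {
        /* e.g. 4700uF (already derated), 5.0V nominal, 4.5V dropout, 2.5W load */
        printf("hold-up: %.1f ms\n",
               1000.0 * holdup_seconds(4700e-6, 5.0, 4.5, 2.5));
        return 0;
    }

Derate the capacitance for temperature and aging, and pad the load figure, before trusting the result.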
> It's like: what should the flight computer do if the wings fall off?
> There's absolutely nothing it can do, so the developer of the flight
> software should not waste time worrying about it.  It's a *system*
> level issue.
And, the OP is the system designer, as far as we are concerned. Or, at least the *conduit* from USENET to that designer!
>> Remember that a FLASH write (parallels exist for other technologies) is
>> actually an *erase* operation followed by a write.  And, that you're
>> actually dealing with "pages"/blocks of data, not individual bytes.
>
> Block based i/o is not the issue.  The issue is that SSDs do the
> equivalent of rewriting a whole platter track to change a single
> sector.
The salient point in the above is that a write is TWO operations: erase followed by write.  And, depending on the controller and the state of wear in the actual underlying medium, possibly some housekeeping as blocks are remapped.  The issue is that there is a window of time in which the operation is "in progress".  But, in a VULNERABLE STATE!

If I issue a write to a magnetic disk, the "process" begins the moment that I issue the write.  But, there are lots of delays built into that process (rotational delay, access delay, etc.).  So, the actual window of vulnerability is very small: only when the heads are actually positioned over the correct portion of the medium to alter the magnetic domains therein.  And, if this event is "interfered with", the consequences are confined to that portion of the medium -- not some other track or platter or sector.

That's not the case with the current types of semiconductor nonvolatile memory.  The "window of vulnerability" extends throughout the duration of the write operation (erase, write, internal verify and possible remapping/rewriting).
>> The erase may take a fraction of a millisecond ("milli", not "micro")
>> to be followed by a slightly shorter time to actually write the new
>> contents back into the cells.
>>
>> [The times are often doubled for MLC devices!]
>>
>> During this "window of vulnerability", if the power supplies (or signal
>> voltages) go out of spec, the device can misbehave in unpredictable ways.
>>
>> [This assumes the CPU itself isn't ALSO misbehaving as a result of the same
>> issues!]
>>
>> This can manifest as:
>> - the wrong value getting written
>> - the right value getting written to the wrong location
>> - the wrong value getting written to the wrong location
>> - the entire page being partially erased
>> - some other page being erased
>> etc.
>
> Byzantine failure.
>
> The duration of the "window of vulnerability" is not the issue.  The
> issue is the unpredictability of the result.
Of course the size of the window is important! The software can't do *squat* while the operation is in process. It can't decide that it doesn't *really* want to do the write, please restore the previous contents of that memory (block!). And, the software can do nothing about the power remaining in the power supply's bulk filter. It's like skidding on black ice and just *hoping* things come to a graceful conclusion BEFORE you slam into the guardrail!
> DBMS were designed at a time when disks were unreliable and operating
> systems [if even present] were primitive.  Early DBMS often included
> their own device code and took direct control of disk and tape devices
> so that they could guarantee operation.
How can I "accidentally" alter block 0 of a mounted tape when we're at EOT (or any other place physically removed from block 0)?

A semiconductor memory can alter ANYTHING at any time!  Signal levels inside the die determine which rows are strobed.  It's possible for NO row to be strobed, two rows, 5 rows, etc. -- the decoders are only designed to generate "unique" outputs when they are operating within their specified parameters.  Let Vcc sag, ground bounce, signals shift in amplitude/offset/timing, and you can't predict how they will affect the CHARGE stored in the device.
>> And, a week later, the data that I had stored in "birthdate" is no longer
>> present in the journal as it has been previously committed to the store.
>> So, when it gets corrupted by an errant "name update", you'll have no
>> record of what it *should* have been.
>
> We've had this conversation previously also: database terminology
> today is almost universally misunderstood and misused by everyone.
>
> The file(s) on the disk are not the "database" but merely a point in
> time snapshot of the database.
>
> The "database" really is the historical evolution of the stable store.
> To recover from a failure, you need a point in time basis, and the
> journal from that instance to the point of failure.  If you have every
> journal entry from the beginning, you can reconstruct the data just
> prior to the failure starting from an empty basis.
>
> The point here being that you always need backups unless you can
> afford to rebuild from scratch.
This is c.a.e. Do you really think the OP has a second copy of the data set hiding on the medium? And, that it is somehow magically protected from the sorts of corruption described, here?
>> The DBMS counts on the store having "integrity" -- so, all the DBMS has
>> to do is get the correct data into it THE FIRST TIME and it expects it
>> to remain intact thereafter.  It *surely* doesn't expect a write of
>> one value to one record to alter some other value in some other
>> (unrelated) record!
>
> DBMS are designed to maintain logically consistent data ... the "I" in
> "ACID" stands for "isolation", not for "integrity".
>
> No DBMS can operate reliably if the underlying storage is faulty.
Exactly. And, there are no commands/opcodes that the OP can execute that will "avoid corruption on power off". If there were, the DBMS would employ them and make that guarantee REGARDLESS OF STORAGE MEDIUM! :>
>> [Recall this was one of the assets I considered when opting to use a
>> DBMS in my design instead of just a "memory device"; it lets me perform
>> checks on the data going *in* so I don't have to check the data coming
>> *out* (the output is known to be "valid" -- unless the medium has
>> failed -- which is the case for power sequencing on many (most?) nonvolatile
>> storage media.]
>
> And recall that I warned you about the problems of trying to run a
> reliable DBMS on an unattended appliance.  We didn't really discuss
> the issue of SSDs per se, but we did discuss journaling (logging) and
> trying to run the DBMS out of Flash rather than from RAM.
The firmware in SSDs has to address all types of potential users and deployments.  I don't.  I have ONE application that is accessing the nonvolatile memory pool, so I can tailor the design of that store to fit the needs and expected usage of its one "client".  I.e., ensure that the hardware behaves as the DBMS expects it to.

The bigger problem is addressing applications that screw up their own datasets.  There is NOTHING that I can do to solve that problem -- even hiring someone to babysit the system 24/7/365.  A buggy application is a buggy application.  Fix it.

I *can* ensure ApplicationA can't mess with ApplicationB's dataset(s).  I *can* put triggers and admittance criteria on data going *into* the tables (to try to intercept stuff that doesn't pass the smell test).  But, a "determined" application can still write bogus data to the objects to which it is granted access.  Just like it could write bogus data in raw "files".
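As an illustration of such admittance criteria (the schema below is invented purely for the example, not anything from the actual system), SQLite accepts CHECK constraints and BEFORE INSERT triggers that abort on data that doesn't pass the smell test:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Invented example schema: reject out-of-range temperatures and
     * nonsensical timestamps at the database boundary.
     */
    static const char *schema =
        "CREATE TABLE IF NOT EXISTS sensor_log ("
        "  id         INTEGER PRIMARY KEY,"
        "  temp_c     REAL NOT NULL CHECK (temp_c BETWEEN -40.0 AND 125.0),"
        "  sampled_at INTEGER NOT NULL);"
        "CREATE TRIGGER IF NOT EXISTS sensor_log_admit "
        "BEFORE INSERT ON sensor_log "
        "WHEN NEW.sampled_at <= 0 "
        "BEGIN SELECT RAISE(ABORT, 'bogus timestamp rejected'); END;";

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("example.db", &db) != SQLITE_OK)
            return 1;
        if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "schema error: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }

An INSERT that violates the CHECK or the trigger fails with an error instead of silently polluting the table.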
On 6/19/2017 8:36 PM, Don Y wrote:
> On 6/19/2017 6:42 PM, George Neuner wrote:
>> On Fri, 16 Jun 2017 23:40:21 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> On 6/16/2017 10:33 PM, George Neuner wrote:
>>>>
>>>> Normally, the only way [corruption] can happen is if the OS/hardware
>>>> lies to the DBMS about whether data really has been written to the
>>>> media.  If the media is such that a failure during a write can corrupt
>>>> more than the one block actually being written, then the media has to
>>>> be protected against failures during writes.
>>>
>>> That's exactly the problem with these types of media.  If you violate
>>> the parameters of the write/erase cycle, all bets are off -- especially
>>> if the power to the device may be at a dubious level, etc.
>>
>> No software can be guaranteed to work correctly in the face of
>> byzantine failures.  Failing "gracefully" - for some definition - even
>> if possible, still is failing.
>
> But that's the nature of the "power down problem" (the nature of the OP's
> question)!  It requires special attention in hardware *and* software.
> To think you can magically execute a sequence of commands and be
> guaranteed NOT to have "corruption" is naive.
I can't afford (thermal budget) to power up yet another server to access my literature archive (did I mention it is hot, here?  119F, today).  But, some looleg-ing turns up a few practical references:

<https://cseweb.ucsd.edu/~swanson/papers/DAC2011PowerCut.pdf>
<http://www.embedded.com/design/prototyping-and-development/4006422/Avoid-corruption-in-nonvolatile-memory>
<https://hackaday.com/2016/08/03/single-board-revolution-preventing-flash-memory-corruption/>

Remember, even WITHIN a PCB, conditions on and in each chip can differ from moment-to-moment due to the reactive nature of the traces and dynamics of power consumption "around" the board.  So, just because power is "good" at your "power supervisory circuit" doesn't mean it's good throughout (and through-in?) the circuit.
On 2017-06-16 at 12:10, pozz wrote:
> I'm playing with a Raspberry system, however I think my question is about Linux embedded in general.
>
> We all know that the OS (Linux or Windows or whatever) *should* be gracefully powered down with a
> shutdown procedure (shutdown command in Linux).  We must avoid cutting the power abruptly.
>
> If this is possible for desktop systems, IMHO it's impossible to achieve in embedded systems.  The
> user usually switches off a small box by pressing an OFF button that usually is connected to the main
> power supply input.  In any case, he could immediately unplug the power cord without waiting for the
> end of the shutdown procedure.
>
> I'm interested to know what methods can be used to reduce the probability of corruption.
>
> For example, I chose to use a sqlite database to save non-volatile user configurable settings.
> sqlite is transaction based, so a power interruption in the middle of a transaction shouldn't
> corrupt the entire database.  With normal text files this would be more difficult.
>
> I know the write requests to non-volatile memories (HDD, embedded Flash memories) are usually
> buffered by the OS and we don't know when they will really be executed by the kernel.  Is there a
> method to force the buffered write requests immediately?
>
> Other aspects to consider?
I think that in this matter, the most reliable approach is a UPS.  For example, in our RB300 we are using a UPS based on supercaps.  The microcontroller monitors the power supply voltage and controls the system shutdown in case of a power failure.  More details: http://pigeoncomputers.com/documentation/hardware/ups/
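On the Linux side, reacting to such a power-fail signal can be as simple as a small daemon watching a GPIO and triggering an orderly shutdown.  The sketch below is only illustrative: the GPIO number and sysfs path are placeholders, and real code would likely use libgpiod and proper error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Placeholder pin: assume the UPS "power fail" output is wired to a GPIO
     * already exported and configured as an input (reads '1' on failure).
     */
    #define PFAIL_VALUE "/sys/class/gpio/gpio17/value"

    int main(void)
    {
        char v;

        for (;;) {
            int fd = open(PFAIL_VALUE, O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }
            if (read(fd, &v, 1) == 1 && v == '1') {
                close(fd);
                sync();                              /* flush dirty buffers  */
                system("/sbin/shutdown -h now");     /* orderly shutdown     */
                break;
            }
            close(fd);
            usleep(100 * 1000);                      /* poll at ~10 Hz       */
        }
        return 0;
    }

The supercap hold-up time has to cover the poll interval plus however long the shutdown itself takes.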
On 19/06/2017 13:27, Jack wrote:
> On Monday, 19 June 2017 at 12:47:33 UTC+2, pozz wrote:
>> On 16/06/2017 13:31, David Brown wrote:
>>> And some file types are more susceptible to
>>> problems - sqlite databases are notorious for being corrupted if writes
>>> are interrupted.
>>
>> What?  I chose sqlite because they say the corruption of a database is
>> a very rare event.
>>
>> https://www.sqlite.org/howtocorrupt.html
OK, however those problems (memories that lie about whether the write has really finished) are common to all other file types, not only sqlite.
On 20/06/2017 09:34, Krzysztof Kajstura wrote:
> On 2017-06-16 at 12:10, pozz wrote:
>> I'm playing with a Raspberry system, however I think my question is
>> about Linux embedded in general.
>>
>> We all know that the OS (Linux or Windows or whatever) *should* be
>> gracefully powered down with a shutdown procedure (shutdown command in
>> Linux).  We must avoid cutting the power abruptly.
>>
>> If this is possible for desktop systems, IMHO it's impossible to
>> achieve in embedded systems.  The user usually switches off a small box
>> by pressing an OFF button that usually is connected to the main power
>> supply input.  In any case, he could immediately unplug the power cord
>> without waiting for the end of the shutdown procedure.
>>
>> I'm interested to know what methods can be used to reduce the
>> probability of corruption.
>>
>> For example, I chose to use a sqlite database to save non-volatile
>> user configurable settings.  sqlite is transaction based, so a power
>> interruption in the middle of a transaction shouldn't corrupt the
>> entire database.  With normal text files this would be more difficult.
>>
>> I know the write requests to non-volatile memories (HDD, embedded
>> Flash memories) are usually buffered by the OS and we don't know when
>> they will really be executed by the kernel.  Is there a method to force
>> the buffered write requests immediately?
>>
>> Other aspects to consider?
>
>
> I think that in this matter, the most reliable approach is a UPS.  For
> example, in our RB300 we are using a UPS based on supercaps.  The
> microcontroller monitors the power supply voltage and controls the
> system shutdown in case of a power failure.  More details:
> http://pigeoncomputers.com/documentation/hardware/ups/
Does the UPS supply only the CM, so at low voltage?  For how long are the supercaps able to correctly supply the CM after the input voltage is cut?
>> I think that in this matter, the most reliable approach is a UPS.  For example, in our RB300 we are using a UPS
>> based on supercaps.  The microcontroller monitors the power supply voltage and controls the system
>> shutdown in case of a power failure.  More details:
>> http://pigeoncomputers.com/documentation/hardware/ups/
>
> Does the UPS supply only the CM, so at low voltage?  For how long are the supercaps able to correctly supply
> the CM after the input voltage is cut?
The UPS is 5V, so it supplies all devices with the 5V, 3.3V and 1.8V power supply voltages.  The Compute Module is properly powered for about 140 seconds, without devices connected to the USB ports.  If USB-connected devices have high energy requirements, it is possible to unmount the devices and turn off the USB power supply (this can be controlled by a GPIO).
pozz <pozzugno@gmail.com> wrote:
> I'm interested to know what methods can be used to reduce the
> probability of corruption.
In a Linux-based product from another part of the corporation I work for, when power loss was detected, all non-essential services were killed, and all in-flight data was written out to a log that could be replayed on system startup. This was powered by a supercapacitor dimensioned to last some small number of seconds. However, this was a custom, bare-bone distribution running only their own software, and the filesystem was mounted read-only. IIRC the log used dedicated storage. My understanding is that this scheme worked well for them. -a
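On the related question of forcing buffered writes out immediately: the usual approach on Linux is fsync()/fdatasync() on the file, plus an fsync() on the containing directory after creating or renaming files.  A minimal sketch of the classic write-to-temp-then-rename pattern, with invented file names, might look like this (whether the data truly reaches the flash cells still depends on the drive's own caching, as discussed upthread):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Atomically replace a small settings file: write a temp file, flush it,
     * rename over the original, then flush the directory entry.
     * File names are placeholders for the example.
     */
    static int save_settings(const char *data, size_t len)
    {
        int fd = open("/data/settings.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        if (rename("/data/settings.tmp", "/data/settings.cfg") != 0)
            return -1;

        /* flush the directory so the rename itself is durable */
        int dfd = open("/data", O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }

    int main(void)
    {
        const char *cfg = "volume=7\nbrightness=3\n";
        return save_settings(cfg, strlen(cfg)) ? 1 : 0;
    }

fsync() only guarantees the data reached the device; whether the device's own controller has actually committed it to the flash is exactly the problem discussed earlier in the thread.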
