EEPROM guarantees after power loss during a write

Started by John Devereux February 5, 2008
On Feb 6, 5:25 pm, John Devereux <jdREM...@THISdevereux.me.uk> wrote:
> ssubbarayan <ssu...@gmail.com> writes:
> > On Feb 5, 11:28 pm, John Devereux <jdREM...@THISdevereux.me.uk> wrote:
> >> Hi,
> >>
> >> I am wondering what guarantees are there for existing EEPROM data,
> >> after power is lost during a write operation?
> >>
> >> I am writing a datalogging routine that writes records to an
> >> EEPROM. It's an Atmel 24C1024, although the question is probably
> >> applicable to other devices too. This uses "page mode" for writes -
> >> the device seems to be organised as 256 byte pages.
> >>
> >> Say power is lost during a write to a single byte in a page. What can
> >> I assume? Is just that byte suspect, or the whole page (or the whole
> >> device)?
> >>
> >> The microcontroller has brownout protection, so isn't going to run
> >> wild - but what about the EEPROM internal state machine? Are they
> >> generally protected against brownout?
> >>
> >> If I write a single byte, does this in fact involve a hidden
> >> erase/write of the whole page?
> >>
> >> I can't find any information on this stuff.
> >>
> >> --
> >> John Devereux
> >
> > John,
> > We encountered the same problem with our product (and are still
> > encountering it!). Even though we did not have a proper fix, this is
> > the workaround we used:
> > We implemented a checksum in our software to detect data corruption in
> > the eeprom, and in case we find corruption, we have a known good copy
> > of the eeprom data backed up in ROM (external flash). This data is
> > copied back to the eeprom during bootup. So this ensures the customer
> > has good data when he boots up.
> > When wrong data is written due to brownouts, the checksum is likely to
> > differ.
> > We back up good data whenever we conclude that at least one known set
> > of good data exists. (This can be ascertained again by comparing with
> > the known checksum.)
>
> This is equivalent to what I was planning. Although I don't think I
> need a checksum. I was going to have "valid" markers, separate from
> the data blocks. So it would go
>
>   mark copy 1 invalid
>   write new copy 1
>   mark copy 1 valid
>   mark copy 2 invalid
>   write new copy 2
>   mark copy 2 valid
>
> On power up both copy valid flags would be checked, and any "invalid"
> copy overwritten with the valid one. The "copy valid" markers would be
> stored on separate pages from the data (and each other), so hopefully
> will not get corrupted at the same time as the data they refer to.
>
> Only problem with this is it requires 4 pages to be written instead of
> one. Using a checksum to replace the separate flags could mean just
> two pages - perhaps that is better after all.
>
> > We have used this workaround, and after it was implemented we never
> > faced any problems with the content of the eeprom. Even though
> > brownouts still continue to happen, the impact was greatly minimised.
> > As for the brownout, like your situation we also did not have either
> > an external capacitor or a brownout protection pin on our board. We
> > use ST's eeprom. I raised a similar query a couple of months back.
> > Given below is the link:
> > 1)http://groups.google.co.in/group/comp.arch.embedded/browse_thread/thr...
>
> I will look at these.
>
> By the way, long links often get scrambled up on usenet. You can make
> it easier for some people if you enclose in angle brackets
>
> <http://groups.google.co.in/group/comp.arch.embedded/browse_thread/thr...>
>
> This seems to stop them getting split up by news readers.
>
> > 2)Regarding checksum:
> > http://groups.google.co.in/group/comp.arch.embedded/browse_thread/thread/7bb610e206733fdf/70757e6c50a8dfb6?hl=en&lnk=gst&q=subbarayan#70757e6c50a8dfb6
> >
> > P.S: ours is a consumer electronics product. Processor: ST, EEPROM: ST's
> > M24128BW.
> >
> > This solution may or may not be suitable to you depending on your
> > product.
>
> Thank you.
>
> > Hope this helps,
> > Regards,
> > s.subbarayan
>
> --
> John Devereux
Hi John,

I will continue using brackets while posting long links.

By the way, I have a question regarding your implementation. In your algorithm to make two copies of any data in nvram, what if you encounter a power brownout while updating both copies? Since brownouts are unpredictable, how are we going to guarantee that at least one good copy exists? The scenario I am referring to is the first time you update the data. During the first update, you won't be able to ascertain whether the copy is good or bad.

Another question: at what point would you update the validity flag for the data? On what basis would you know the data is valid, given that you don't have a checksum?

I am sorry if these questions look amateur. I am trying to understand it, and felt your algorithm is simpler than mine except for the extra memory needed for the copies.

Looking forward to your reply, and thanks in advance,
Regards,
s.subbarayan
ssubbarayan wrote:
>
... snip ...
> > I will continue using brackets while posting long links.
FYI the proper way to transmit links is within <> pairs. See the page URL in my sig. below for an example. Another would be: <http://cbfalconer.home.att.net/download/>

--
 [mail]: Chuck F (cbfalconer at maineline dot net)
 [page]: <http://cbfalconer.home.att.net>
            Try the download section.
ssubbarayan <ssubba@gmail.com> writes:

> On Feb 6, 5:25 pm, John Devereux <jdREM...@THISdevereux.me.uk> wrote:
[...]
>> This is equivalent to what I was planning. Although I don't think I
>> need a checksum. I was going to have "valid" markers, separate from
>> the data blocks. So it would go
>>
>>   mark copy 1 invalid
>>   write new copy 1
>>   mark copy 1 valid
>>   mark copy 2 invalid
>>   write new copy 2
>>   mark copy 2 valid
>>
>> On power up both copy valid flags would be checked, and any "invalid"
>> copy overwritten with the valid one. The "copy valid" markers would be
>> stored on separate pages from the data (and each other), so hopefully
>> will not get corrupted at the same time as the data they refer to.
>>
>> Only problem with this is it requires 4 pages to be written instead of
>> one. Using a checksum to replace the separate flags could mean just
>> two pages - perhaps that is better after all.
[...]
> Hi John,
> I will continue using brackets while posting long links.
They have to be *angle* brackets, < >. But you are using google groups, which usually scrambles everything up anyway.
> By the way, I have a question regarding your implementation.
> In your algorithm to make two copies of any data in nvram, what if
> you encounter a power brownout while updating both copies? Since
> brownouts are unpredictable, how are we going to guarantee that
> at least one good copy exists?
> The scenario I am referring to is the first time you update the
> data. During the first update, you won't be able to ascertain
> whether the copy is good or bad.
> Another question: at what point would you update the validity
> flag for the data?
update():

  a) mark copy 1 invalid
  b) write new copy 1
  c) mark copy 1 valid

     [same again for copy 2]

startup():
  any copy marked invalid is replaced by the copy marked valid.

The steps happen in strict order. Each previous step must complete successfully before the next is started. So the only way the valid flag can be set is if the data has been successfully written, without interruption.
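The update() and startup() steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not firmware for the 24C1024: the `Eeprom` class, the flag values `VALID`/`INVALID`, and the page layout in `FLAG`/`DATA` are all hypothetical, and every write is assumed to complete atomically.

```python
VALID, INVALID = 0xA5, 0x00   # hypothetical flag encodings


class Eeprom:
    """Toy stand-in for a paged EEPROM: page number -> bytes."""

    def __init__(self):
        self.pages = {}

    def write(self, page, data):
        self.pages[page] = bytes(data)

    def read(self, page):
        # An unwritten page reads back as erased (0xFF).
        return self.pages.get(page, b"\xff")


# Hypothetical layout: each flag and each data copy on its own page.
FLAG = {1: 0, 2: 1}
DATA = {1: 2, 2: 3}


def write_copy(ee, copy, record):
    ee.write(FLAG[copy], [INVALID])   # a) mark copy invalid
    ee.write(DATA[copy], record)      # b) write new copy
    ee.write(FLAG[copy], [VALID])     # c) mark copy valid


def update(ee, record):
    for copy in (1, 2):               # strict order: copy 1, then copy 2
        write_copy(ee, copy, record)


def startup(ee):
    """Replace any copy not marked valid; return the record (or None)."""
    good = [c for c in (1, 2) if ee.read(FLAG[c]) == bytes([VALID])]
    if not good:
        return None                   # blank device: nothing to restore
    record = ee.read(DATA[good[0]])   # copy 1 is written first, so prefer it
    for copy in (1, 2):
        if copy not in good:
            write_copy(ee, copy, record)
    return record
```

Calling `update(ee, b"rec1")` and then `startup(ee)` returns `b"rec1"`; if copy 1 is later left flagged invalid with garbage data, `startup(ee)` restores it from copy 2.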
> On what basis would you know the data is valid, given that you don't
> have a checksum?
The data is marked valid only *after* it has been successfully written. If writing of data is interrupted, then the flag is never set either. So next time it powers up we know that copy may be bad, and restore from the good one.

There is always at least one good copy.

Let us look at what happens if programming is interrupted during a, b, c above.

a) The copy 1 valid *flag* is left in an unknown state. But the actual data is valid. So either the startup will see it invalid and restore the data, or it sees it valid and all is OK.

b) The data is marked invalid, and the *data* is left in an unknown state. This is OK, the startup will see the invalid flag and restore the data.

c) The data has been correctly written, but the valid flag is left in an unknown state. If the startup sees the flag as valid, that is OK, because the data is in fact valid. If it sees it as invalid, the data will be restored from the other copy. Still OK.

Obviously this makes a few assumptions: the eeprom has not worn out, and that there is some brownout protection so that the CPU does not go crazy and erase everything.

Another assumption is that the flags are either programmed or not programmed. But what if the flag programming gets interrupted so that the flag state is not only unknown, but is actually *unreliable*? That is, it is only "half programmed" (or half erased), so sometimes reads "valid" and sometimes "invalid"? In this condition the state read could depend on temperature, age or supply noise.

It would require a very unlikely sequence of events, but you could have:

  update()
    ...
    mark copy 2 invalid
    write copy 2
    mark copy 2 valid <interrupted>

Then on power up, copy 2 valid flag is unreliable. But at startup happens to read OK.

Then next time we do an update, we get *another* power cut, this time during copy 1 update. And at power up, this time copy 2 reads *invalid*. So we have no valid copies.

I think the solution is to reprogram the "valid" flags every startup.
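The case analysis above can be checked mechanically with a short sketch. It assumes the simple model from the post: an interrupted flag write reads back as either valid or invalid (but reads the same way every time), and an interrupted data write leaves garbage. The names `steps`, `interrupted_update` and `check_all_interruptions` are made up for this illustration, and the "unreliable flag" case is deliberately not modelled.

```python
VALID, INVALID, GARBAGE = "valid", "invalid", "garbage"


def steps(new):
    """The six writes of update(), in strict order."""
    for copy in (1, 2):
        yield ("flag", copy), INVALID   # a)
        yield ("data", copy), new       # b)
        yield ("flag", copy), VALID     # c)


def interrupted_update(state, new, cut_at, torn_value):
    """Complete every step before cut_at; step cut_at leaves torn_value."""
    state = dict(state)
    for i, (cell, value) in enumerate(steps(new)):
        if i == cut_at:
            state[cell] = torn_value
            break
        state[cell] = value
    return state


def startup(state):
    good = [c for c in (1, 2) if state[("flag", c)] == VALID]
    return state[("data", good[0])] if good else None


def check_all_interruptions():
    """Cut power during each step; collect what startup() would recover."""
    start = {("flag", 1): VALID, ("data", 1): "old",
             ("flag", 2): VALID, ("data", 2): "old"}
    all_steps = list(steps("new"))
    results = set()
    for cut_at, (cell, _) in enumerate(all_steps):
        # A torn flag may read either way; torn data is never trusted.
        torn_options = (VALID, INVALID) if cell[0] == "flag" else (GARBAGE,)
        for torn in torn_options:
            results.add(startup(interrupted_update(start, "new", cut_at, torn)))
    return results
```

Under these assumptions `check_all_interruptions()` only ever yields the old or the new record, never garbage and never "no valid copy". The double-power-cut scenario with a flag that reads differently on different power-ups falls outside this model, which is why reprogramming the flags at startup is suggested.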
> I am sorry if these questions look amateur. I am trying to understand
> it, and felt your algorithm is simpler than mine except for the extra
> memory needed for the copies.
I find it a difficult area, too. (And it gets harder if you start thinking about wear-levelling or if you don't want to allocate a whole page to a record, or if the record does not fit in a single page...)
> Looking forward to your reply, and thanks in advance,
> Regards,
> s.subbarayan
-- John Devereux
John Devereux wrote:

> update():
>
>   a) mark copy 1 invalid
>   b) write new copy 1
>   c) mark copy 1 valid
>
>      [same again for copy 2]
>
> startup():
>   any copy marked invalid is replaced by the copy marked valid.
>
> [...]
A better method is to have a version stamp along with your data. You have two blocks, each structured as "version stamp, data". At startup, you verify each block based on having a valid version (and possibly a checksum as well, if you are particularly paranoid). The latest valid version shows which block you use as your data.

For an update, you erase the block containing the older version of the data. Then you save your data to this block, then you write your new version stamp. There is no need to write your data a second time - it gives no advantages, and halves your eeprom/flash life expectancy.
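A minimal simulation of this two-block, version-stamped scheme might look like the following. The block representation (a dict with `version` and `data`), the `ERASED` stamp value of 0xFFFFFFFF, and the `cut_after` parameter used to simulate a power cut are all assumptions made for the sketch; the optional checksum is left out.

```python
ERASED = 0xFFFFFFFF   # assumed stamp value of an erased block


def newest(blocks):
    """Index of the block with the highest valid version, or None."""
    valid = [i for i, b in enumerate(blocks) if b["version"] != ERASED]
    if not valid:
        return None
    return max(valid, key=lambda i: blocks[i]["version"])


def save(blocks, data, cut_after=None):
    """Write data to the older block, stamping it last.

    cut_after=1 simulates a power cut after the erase,
    cut_after=2 a cut after the data write but before the stamp.
    """
    cur = newest(blocks)
    target = 0 if cur is None else 1 - cur
    next_version = 1 if cur is None else blocks[cur]["version"] + 1

    blocks[target] = {"version": ERASED, "data": None}   # 1) erase older block
    if cut_after == 1:
        return
    blocks[target]["data"] = data                         # 2) save the data
    if cut_after == 2:
        return
    blocks[target]["version"] = next_version              # 3) write the stamp


def load(blocks):
    cur = newest(blocks)
    return None if cur is None else blocks[cur]["data"]
```

Because the stamp is written last and only the older block is ever touched, an interrupted `save` leaves the previous version readable: after saving "a" then "b", a `save(blocks, "c", cut_after=2)` still loads "b".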
David Brown <david@westcontrol.removethisbit.com> writes:

> John Devereux wrote:
>
>> update():
>>
>>   a) mark copy 1 invalid
>>   b) write new copy 1
>>   c) mark copy 1 valid
>>
>>      [same again for copy 2]
>>
>> startup(): any copy marked invalid is replaced by the copy marked
>> valid.
>>
>> The steps happen in strict order. Each previous step must complete
>> successfully before the next is started. So the only way the valid
>> flag can be set is if the data has been successfully written, without
>> interruption.
[...]
> A better method is to have a version stamp along with your data. You
> have two blocks, each structured as "version stamp, data". At
> startup, you verify each block based on having a valid version (and
> possibly a checksum as well, if you are particularly paranoid). The
> latest valid version shows which block you use as your data.
>
> For an update, you erase the block containing the older version of the
> data. Then you save your data to this block, then you write your new
> version stamp. There is no need to write your data a second time - it
> gives no advantages, and halves your eeprom/flash life expectancy.
That does seem a better idea. I have used versioned structures before, for a flash based system. So I don't know why I did not suggest it here too.

--
John Devereux
On Feb 7, 4:25 pm, John Devereux <jdREM...@THISdevereux.me.uk> wrote:
> [...]
John,

My only worry was getting at least one good copy. In your whole algorithm, you have assumed at least one good copy exists. I was wondering what the situation would be the first time (no copy available, you are freshly writing data) if you encounter a power brownout. I guess in this scenario there is nothing you can do about it. However, if you have any solutions in mind for this, please let me know.

Regards,
s.subbarayan
ssubbarayan <ssubba@gmail.com> writes:


[...]

> John,
> My only worry was getting at least one good copy. In your whole
> algorithm, you have assumed at least one good copy exists. I was
> wondering what the situation would be the first time (no copy
> available, you are freshly writing data) if you encounter a power
> brownout.
Firstly, David's algorithm is better - use a version number based system like he describes.

For any possible algorithm, if the power fails during writing of data, you are always going to lose *that version*. Just as if the power failed before you started to write it.

Assuming your eeprom is initially filled with 0xff, and a 32 bit version number, then a version number of 0xffffffff (or -1) would indicate a missing copy.
> I guess in this scenario there is nothing you can do about it.
> However, if you have any solutions in mind for this, please let me
> know.
>
> Regards,
> s.subbarayan
-- John Devereux
John Devereux wrote:
> ssubbarayan <ssubba@gmail.com> writes:
>
> [...]
>
>> John,
>> My only worry was getting at least one good copy. In your whole
>> algorithm, you have assumed at least one good copy exists. I was
>> wondering what the situation would be the first time (no copy
>> available, you are freshly writing data) if you encounter a power
>> brownout.
>
> Firstly, David's algorithm is better - use a version number based
> system like he describes.
>
> For any possible algorithm, if the power fails during writing of data,
> you are always going to lose *that version*. Just as if the power
> failed before you started to write it.
>
> Assuming your eeprom is initially filled with 0xff, and a 32 bit
> version number, then a version number of 0xffffffff (or -1) would
> indicate a missing copy.
It's actually enough for the versioning stamp to distinguish between invalid, older and newer versions. All you really need are versions 1, 2, and 3, and wrap to 1 again after 3. Anything other than 1, 2, or 3 is invalid.

One thing to watch out for, however, is the possibility of corruption at addresses other than the one you are writing. External serial eeproms generally have protection against this, but Atmel AVRs are known to be able to corrupt byte 0 of the eeprom if they get a reset during a write (the address register gets cleared to 0, but the write continues - thus the data at address 0 may be half overwritten). The same problem can probably occur on many other eeproms - I don't know if the AVRs are a particularly high risk, or if Atmel is just unusually honest!
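The wrapping 1, 2, 3 stamp can be compared with a few lines of logic. This is a sketch only; `next_stamp` and `newer_block` are hypothetical helper names, and it assumes the two blocks are always written alternately so that two valid stamps are always neighbours in the 1 -> 2 -> 3 -> 1 cycle.

```python
VALID_STAMPS = (1, 2, 3)


def next_stamp(stamp):
    """Successor in the cycle 1 -> 2 -> 3 -> 1."""
    return stamp % 3 + 1


def newer_block(stamp_a, stamp_b):
    """Return 'a' or 'b' for the block to use, or None if neither is valid."""
    a_ok = stamp_a in VALID_STAMPS
    b_ok = stamp_b in VALID_STAMPS
    if a_ok and b_ok:
        # Block a is newer exactly when its stamp follows block b's.
        return "a" if stamp_a == next_stamp(stamp_b) else "b"
    if a_ok:
        return "a"
    if b_ok:
        return "b"
    return None
```

Three values are enough because, with two blocks, the newer stamp is always the successor of the older one, so the wrap from 3 back to 1 is unambiguous. An erased byte (0xFF) or any other value simply counts as invalid.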
David Brown <david@westcontrol.removethisbit.com> writes:

> John Devereux wrote:
>> ssubbarayan <ssubba@gmail.com> writes:
>>
>> [...]
>>
>> Assuming your eeprom is initially filled with 0xff, and a 32 bit
>> version number, then a version number of 0xffffffff (or -1) would
>> indicate a missing copy.
>
> It's actually enough for the versioning stamp to distinguish between
> invalid, older and newer versions. All you really need are
> versions 1, 2, and 3, and wrap to 1 again after 3. Anything other
> than 1, 2, or 3 is invalid.
Cool - I was thinking of avoiding the wrap entirely by having a range so high it would never happen :)
> One thing to watch out for, however, is the possibility of corruption
> at addresses other than the one you are writing. External serial
> eeproms generally have protection against this, but Atmel AVRs are
> known to be able to corrupt byte 0 of the eeprom if they get a reset
> during a write (the address register gets cleared to 0, but the write
> continues - thus the data at address 0 may be half overwritten). The
> same problem can probably occur on many other eeproms - I don't know
> if the AVRs are a particularly high risk, or if Atmel is just unusually
> honest!
I would still love to know, for sure, that a write to part of a page does not involve an internal erasure of the entire page. Without knowing this each version stamp needs a page of its own as far as I can see. The act of writing the version number must be guaranteed not to upset the data it refers to, if it gets interrupted.

I think I will have to try and test this.

--
John Devereux
On Feb 8, 11:58 am, John Devereux <jdREM...@THISdevereux.me.uk> wrote:

> I would still love to know, for sure, that a write to part of a page
> does not involve an internal erasure of the entire page. Without
> knowing this each version stamp needs a page of its own as far as I
> can see. The act of writing the version number must be guaranteed not
> to upset the data it refers to, if it gets interrupted.
>
> I think I will have to try and test this.
To test the system, you could make a simple test jig that switches the power to your board. Use another controller to switch the power at random intervals. The random interval timing should match the discharge rate of the power supply capacitors, so that the board suffers a lot of brownout conditions. Add an extra R/C filter if necessary.

On the device you're testing, set up some special firmware that continuously writes updates to the EEPROM. Instead of real data, write a verifiable test pattern, and have the software check it regularly. If it finds corrupted data in a 'valid' block, trigger an alarm.

Then leave the test setup in a corner of the lab, 24/7.
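One possible shape for such a verifiable test pattern is sketched below: each record carries a sequence number, a payload that can be regenerated from that number, and a CRC. The record layout, the 32-byte payload size and the function names `make_record` and `record_ok` are assumptions for illustration; real test firmware would do the same thing in whatever language the target uses.

```python
import struct
import zlib


def expected_payload(seq, size):
    """Payload that can be regenerated from the sequence number alone."""
    return bytes((seq + i) & 0xFF for i in range(size))


def make_record(seq, size=32):
    """Record = 32-bit sequence number + payload + CRC-32 of both."""
    body = struct.pack("<I", seq) + expected_payload(seq, size)
    return body + struct.pack("<I", zlib.crc32(body))


def record_ok(record, size=32):
    """True only if length, CRC and regenerated payload all match."""
    if len(record) != size + 8:
        return False
    body = record[:-4]
    (crc,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(body) != crc:
        return False
    (seq,) = struct.unpack("<I", body[:4])
    return body[4:] == expected_payload(seq, size)
```

The test firmware would write `make_record(seq)` with an increasing `seq`, and after every power-up run `record_ok` on each block marked valid, raising the alarm if it returns False.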