writing program memory programmatically in MB-lite

Started by alb November 28, 2014
Hi there,

I'm trying to understand if in an MB-lite I could use the program to 
read/write itself.

The main idea behind is the implementation of a 'memory scrubber' which 
detects and corrects errors in its own instruction memory (thanks to an 
EDAC).

I know that Hardware Architecture has program memory and data memory 
separated and I believe that instruction memory can only be fetched in 
the pipeline. Is my understanding correct?

Any hint/suggestion/pointer is appreciated,

Al

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On 11/28/2014 7:07 AM, alb wrote:
> Hi there,
>
> I'm trying to understand if in an MB-lite I could use the program to read/write itself.
>
> The main idea behind is the implementation of a 'memory scrubber' which detects and corrects errors in its own instruction memory (thanks to an EDAC).
>
> I know that Hardware Architecture has program memory and data memory separated and I believe that instruction memory can only be fetched in the pipeline. Is my understanding correct?
>
> Any hint/suggestion/pointer is appreciated,
It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".

However, you can always add whatever external logic (data paths) you deem appropriate to allow this to happen -- outside the scope of the processor itself.

Dealing with cache/pipeline issues would have to be addressed based on your resulting hardware design and software implementation.

You're better off designing a self-correcting program store (with some feedback to the processor/support electronics to handle "insurmountable problems" if/when they occur) and handling all the error detection and correction outside of the scope of the software itself.

This sure sounds like "homework" and not "production hardware"...
On 14-11-28 17:50 , Don Y wrote:
> On 11/28/2014 7:07 AM, alb wrote:
>> Hi there,
>>
>> I'm trying to understand if in an MB-lite I could use the program to read/write itself.
>>
>> The main idea behind is the implementation of a 'memory scrubber' which detects and corrects errors in its own instruction memory (thanks to an EDAC).
>>
>> I know that Hardware Architecture has program memory and data memory separated and I believe that instruction memory can only be fetched in the pipeline. Is my understanding correct?
>>
>> Any hint/suggestion/pointer is appreciated,
>
> It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".
Yes - but some very traditional Harvards have the ability to write to the program store with special instructions which interpret the address as an address in the program store (but I don't know if MB has such).
> However, you can always add whatever external logic (data paths) you deem appropriate to allow this to happen -- outside the scope of the processor itself.
Often, if the data address space is large enough, a part of the data address space is mapped to the program store, for write accesses.
> Dealing with cache/pipeline issues would have to be addressed based on your resulting hardware design and software implementation.
In the OP's case, if I understood correctly, it is a matter of writing back to the program store the same value as was read, in order to refresh the EDAC error-correction bits. If a value read earlier from the same program store address is in the caches/pipelines, that's OK because it is the same value.
> You're better off designing a self-correcting program store (with some feedback to the processor/support electronics to handle "insurmountable problems" if/when they occur) and handling all the error detection and correction outside of the scope of the software itself.
In the space-based EDAC-equipped memory systems with which I am familiar, the EDAC HW usually only corrects errors on the fly, as the data are read, but does not itself write back the corrected data. That is done by SW scrubbing, as I understand the OP intends.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
On 11/28/2014 11:52 AM, Niklas Holsti wrote:
> On 14-11-28 17:50 , Don Y wrote:
>> On 11/28/2014 7:07 AM, alb wrote:
>>> Hi there,
>>>
>>> I'm trying to understand if in an MB-lite I could use the program to read/write itself.
>>>
>>> The main idea behind is the implementation of a 'memory scrubber' which detects and corrects errors in its own instruction memory (thanks to an EDAC).
>>>
>>> I know that Hardware Architecture has program memory and data memory separated and I believe that instruction memory can only be fetched in the pipeline. Is my understanding correct?
>>>
>>> Any hint/suggestion/pointer is appreciated,
>>
>> It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".
>
> Yes - but some very traditional Harvards have the ability to write to the program store with special instructions which interpret the address as an address in the program store (but I don't know if MB has such).
It's not a traditional Harvard arch -- it's an MB-lite. AFAICT, there are no provisions in the standard core to provide "data" paths to the program store.
>> However, you can always add whatever external logic (data paths) you deem appropriate to allow this to happen -- outside the scope of the processor itself.
>
> Often, if the data address space is large enough, a part of the data address space is mapped to the program store, for write accesses.
The OP would be better served (?) just adding the ancillary hardware to the core as he's already got the VHDL for the core; assuming the interface needn't be "terribly fast", adding an autoincrement register at a specific place in the (data) address space to which he can write a specific "starting (program) address"... and, another from which he can read the contents of the program memory and *write* back to it (letting the autoincrement register advance him to the next address) seems to be the easiest interface.
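A rough behavioural model of the interface described above may help picture it. This is only a sketch in Python, not VHDL for the MB-lite: the register offsets `ADDR_REG` and `DATA_REG`, the class `ProgramStoreWindow` and the helper `scrub_pass` are hypothetical names chosen for illustration, and the assumption is that a read of the data register does not advance the address while a write does.

```python
ADDR_REG = 0x0   # hypothetical offset: write the starting program address here
DATA_REG = 0x4   # hypothetical offset: read/write the addressed program word

class ProgramStoreWindow:
    """Behavioural model of a small window onto the program store."""

    def __init__(self, program_memory):
        self.pm = program_memory      # list of program words
        self.addr = 0                 # current (auto-incrementing) address
        self.write_log = []           # addresses written, for inspection

    def write(self, reg, value):
        if reg == ADDR_REG:
            self.addr = value % len(self.pm)
        elif reg == DATA_REG:
            self.pm[self.addr] = value
            self.write_log.append(self.addr)
            # The write advances the address register (assumed behaviour).
            self.addr = (self.addr + 1) % len(self.pm)
        else:
            raise ValueError("unknown register")

    def read(self, reg):
        if reg == ADDR_REG:
            return self.addr
        if reg == DATA_REG:
            return self.pm[self.addr]  # a read does not advance the address
        raise ValueError("unknown register")

def scrub_pass(window, n_words, start=0):
    """Read each word through the window and write the same value back."""
    window.write(ADDR_REG, start)
    for _ in range(n_words):
        word = window.read(DATA_REG)
        window.write(DATA_REG, word)
```

In real hardware the value read back would already have passed through the EDAC, so the write stores the corrected word; the model above only shows the addressing sequence.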
>> Dealing with cache/pipeline issues would have to be addressed based on your resulting hardware design and software implementation.
>
> In the OP's case, if I understood correctly, it is a matter of writing back to the program store the same value as was read, in order to refresh the EDAC error-correction bits. If a value read earlier from the same program store address is in the caches/pipelines, that's OK because it is the same value.
Presumably, the OP will be *designing* the EDAC -- over in another corner of the same die that implements the processor, itself. As such, he can build the scrubbing functionality into it directly -- and just report status to the processor (i.e., instead of correcting the bits fed to the CPU, implement a RMW cycle in place of the "opcode fetch").

He may want to be able to control this as it can slow down execution (or, cause the pipeline to starve). But, it seems more prudent to "fix" the bad read *now* (automatically) rather than hope some software routine gets around to it "eventually".

[the hooks to read/write "arbitrary" program memory locations still seem to be needed if he truly wants to walk *all* of program memory to ensure it is periodically scrubbed.

The bigger issue I would pose (I've posed this to folks running server farms) is: what do you do when you get an error (corrected or otherwise)? And, when do you lose confidence in the memory subsystem? How many undetected errors are creeping in if some number of corrected/*uncorrected* errors occur??]
>> You're better off designing a self-correcting program store (with some feedback to the processor/support electronics to handle "insurmountable problems" if/when they occur) and handling all the error detection and correction outside of the scope of the software itself.
>
> In the space-based EDAC-equipped memory systems with which I am familiar, the EDAC HW usually only corrects errors on the fly, as the data are read, but does not itself write back the corrected data. That is done by SW scrubbing, as I understand the OP intends.
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
> It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".
That confirms what I was thinking. In the past I've worked with an ADSP218x that allows you to write and read PM 'at your leisure'. We implemented a CRC over the program memory to detect memory corruption that would eventually have compromised the behavior of the program, but we couldn't do anything more than flag the issue to the upper-level system, which would take care of the recovery.
> However, you can always add whatever external logic (data paths) you deem appropriate to allow this to happen -- outside the scope of the processor itself.
>
> Dealing with cache/pipeline issues would have to be addressed based on your resulting hardware design and software implementation.
Likely the core allows you to 'freeze' the data and instruction memory accesses, so the external logic could perform scrubbing while the processor is frozen and then let it resume from where it was. Scrubbing is just a read and, potentially, a write operation (in the event the EDAC reports an error), so a few clock cycles every now and then should be totally transparent.

At some stage I thought the scrubber could be implemented by the software itself, but the Harvard architecture excludes that option.
> You're better off designing a self-correcting program store (with some feedback to the processor/support electronics to handle "insurmountable problems" if/when they occur) and handling all the error detection and correction outside of the scope of the software itself.
The EDAC unit is on the memory path: your data is 32 bits wide, but you store 40 bits with the Hamming code. On a read, a single error is corrected on the 32-bit side but not in the memory (that is why you need scrubbing, to refresh the correct content when appropriate). In the unlikely event of a double error, which is not correctable, the scrubber shall report the event to the upper level, but the software can move on: it is still possible that the double error is in a part of the code which is not used at the moment.
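For readers who want to see the mechanism, here is a minimal software sketch of a single-error-correcting, double-error-detecting (SECDED) extended Hamming code over 32 data bits, plus a scrub loop that writes corrected words back. It is not the EDAC from this thread: the sketch uses 39 bits per word (6 Hamming parity bits plus one overall parity bit), so how the 40th stored bit is used is left open, and the names `encode`, `decode` and `scrub` are hypothetical.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32]   # Hamming parity positions (1-based)
CODE_LEN = 38                        # 32 data bits + 6 Hamming parity bits
DATA_POS = [p for p in range(1, CODE_LEN + 1) if p not in PARITY_POS]

def encode(data):
    """Return a 39-bit word: bit 0 = overall parity, bits 1..38 = Hamming."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p and (word >> pos) & 1:
                parity ^= 1
        if parity:
            word |= 1 << p
    if bin(word).count("1") & 1:     # make the total number of ones even
        word |= 1
    return word

def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if (word >> pos) & 1:
            syndrome ^= pos
    odd = bin(word).count("1") & 1   # 1 when the overall parity check fails
    status = "ok"
    if syndrome and (not odd or syndrome > CODE_LEN):
        return None, "uncorrectable"  # double error (or worse) detected
    if syndrome:                      # single error at position `syndrome`
        word ^= 1 << syndrome
        status = "corrected"
    elif odd:                         # the overall parity bit itself flipped
        word ^= 1
        status = "corrected"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status

def scrub(memory):
    """Read every word through the decoder; write back any corrected word."""
    corrected = uncorrectable = 0
    for addr, word in enumerate(memory):
        data, status = decode(word)
        if status == "corrected":
            memory[addr] = encode(data)   # refresh the stored content
            corrected += 1
        elif status == "uncorrectable":
            uncorrectable += 1            # report to the upper level
    return corrected, uncorrectable
```

A single flipped bit is repaired in the stored word by `scrub`, while a word with two flipped bits is only counted and left untouched, which matches the "report and move on" behaviour described above.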
> This sure sounds like "homework" and not "production hardware"...
Uhm...nope, I passed my 'homework' phase long ago, and while I don't mean to imply that I shouldn't go back and do some more, the question is certainly not for homework.

Al
Hi Niklas,

Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
[]
>> It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".
>
> Yes - but some very traditional Harvards have the ability to write to the program store with special instructions which interpret the address as an address in the program store (but I don't know if MB has such).
Any reference for those ones? It doesn't seem to me the MB has instructions which provide such functionality.
>> However, you can always add whatever external logic (data paths) you deem appropriate to allow this to happen -- outside the scope of the processor itself.
>
> Often, if the data address space is large enough, a part of the data address space is mapped to the program store, for write accesses.
This is what I also thought, but you'd probably need a dual port memory to handle the two different address busses. How would you handle fetching an instruction from address N while writing it (scrubbing) to address M?
>> Dealing with cache/pipeline issues would have to be addressed based on your resulting hardware design and software implementation.
>
> In the OP's case, if I understood correctly, it is a matter of writing back to the program store the same value as was read, in order to refresh the EDAC error-correction bits. If a value read earlier from the same program store address is in the caches/pipelines, that's OK because it is the same value.
Exactly. The software could perform scrubbing if it had access to the PM.

[]
> In the space-based EDAC-equipped memory systems with which I am familiar, the EDAC HW usually only corrects errors on the fly, as the data are read, but does not itself write back the corrected data. That is done by SW scrubbing, as I understand the OP intends.
This was indeed the main intent. But considering the features of such an architecture, it would be simpler to have the scrubber implemented at the hardware level (in the FPGA fabric). The processor would be 'frozen' (simply by using the 'busy' bit in the memory interfaces) and the scrubber would get access to the memory (instruction and data) in order to perform the read/write operations necessary to refresh the correct value in the memory (1-bit correction).

In the event of a double error the scrubber can flag the error to the upper level and let the processor continue, since it is unlikely that the error is right where the software is running. Even in the event of corrupted software, there are enough hardware protections and Failure Detection and Isolation mechanisms in place that no harm would be caused to the unit or to external ones.

In a past project on the International Space Station, the units had PM protected with a simple CRC, and we routinely ran a software routine to verify the CRC. Every 2 or 3 days we would have a failed CRC among ~250 units, and a simple reboot would recover the faulty memory content. Over two years I've seen maybe once a unit with a bit flip which caused the onboard software to go bananas.

Al
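The CRC check over program memory mentioned above can be pictured roughly as follows. This is only an illustrative sketch: the thread does not say which CRC polynomial or word layout was used, so CRC-32 from Python's `zlib`, 32-bit little-endian words, and the function names `image_crc` and `program_memory_ok` are all assumptions.

```python
import zlib

def image_crc(words):
    """CRC-32 over a program image given as a list of 32-bit words."""
    image = b"".join(w.to_bytes(4, "little") for w in words)
    return zlib.crc32(image)

def program_memory_ok(words, reference_crc):
    """Return True when the image still matches the CRC taken at load time."""
    return image_crc(words) == reference_crc

# Usage: compute the reference once from the pristine image, then check
# periodically; a mismatch can only be reported (e.g. to trigger a reboot),
# since a CRC detects corruption but cannot locate or correct it.
program = [0x80000000, 0x20210001, 0xB800FFFC]   # made-up words
reference = image_crc(program)
```

Unlike the EDAC approach, a failed check here says nothing about which word is bad, which is why recovery in that setup meant reloading the whole image.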
Hi Don,

Don Y <this@is.not.me.com> wrote:
[]
> It's not a traditional Harvard arch -- it's an MB-lite.
AFAIK the MB-lite *is* a traditional Harvard arch. Separate instruction memory from data memory, simultaneous access to both instruction *and* data.
> AFAICT, there are no provisions in the standard core to provide "data" paths to the program store.
This is also my conclusion.
>> Often, if the data address space is large enough, a part of the data address space is mapped to the program store, for write accesses.
>
> The OP would be better served (?) just adding the ancillary hardware to the core as he's already got the VHDL for the core; assuming the interface needn't be "terribly fast", adding an autoincrement register at a specific place in the (data) address space to which he can write a specific "starting (program) address"... and, another from which he can read the contents of the program memory and *write* back to it (letting the autoincrement register advance him to the next address) seems to be the easiest interface.
Even easier would be to let an external scrubber (an FSM with a few states) take control of the memory buses (instruction and data) while the processor is on hold. Since the scrubbing rate is not that high, it would be rather transparent to the processor, which would not even realize something has happened.

[]
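As a rough picture of such an external scrubber, the following Python model steps a three-state FSM one clock at a time and reports when the processor would have to be held. It is a behavioural sketch only: the state names, the `period` parameter and the `faulty` set (standing in for addresses where the EDAC would flag a correctable error) are assumptions, not details of the MB-lite memory interface.

```python
from enum import Enum

class State(Enum):
    IDLE = 0    # processor owns the memory bus
    READ = 1    # processor held, scrubber reads one word through the EDAC
    WRITE = 2   # processor held, scrubber writes the corrected word back

class Scrubber:
    def __init__(self, n_words, period, faulty):
        self.n_words = n_words   # size of the scrubbed memory, in words
        self.period = period     # idle clocks between two scrub accesses
        self.faulty = faulty     # addresses with a correctable error (model)
        self.state = State.IDLE
        self.addr = 0
        self.timer = 0

    def _advance(self):
        self.addr = (self.addr + 1) % self.n_words

    def tick(self):
        """Advance one clock; return True while the processor must be held."""
        if self.state is State.IDLE:
            self.timer += 1
            if self.timer >= self.period:
                self.timer = 0
                self.state = State.READ
            return False
        if self.state is State.READ:
            if self.addr in self.faulty:
                self.state = State.WRITE   # error flagged: write back next
            else:
                self._advance()
                self.state = State.IDLE
            return True
        # WRITE: the corrected word is stored and the error disappears.
        self.faulty.discard(self.addr)
        self._advance()
        self.state = State.IDLE
        return True

def run(scrubber, n_ticks):
    """Run for n_ticks clocks and return how many of them held the CPU."""
    return sum(scrubber.tick() for _ in range(n_ticks))
```

With `period=10`, four words and one faulty address, a full pass takes 45 clocks in this model, of which 5 hold the processor (four reads plus one write-back).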
> Presumably, the OP will be *designing* the EDAC -- over in another corner of the same die that implements the processor, itself. As such, he can build the scrubbing functionality into it directly -- and just report status to the processor (i.e., instead of correcting the bits fed to the CPU, implement a RMW cycle in place of the "opcode fetch").
Replacing the 'fetch' with such machinery would be rather inefficient. Scrubbing is only necessary to avoid the accumulation of errors, which would eventually lead to a double error (not correctable in our case). So scrubbing should not be done on *every* fetch, but rather once every 10K fetches or even more.
> He may want to be able to control this as it can slow down execution (or, cause the pipeline to starve). But, it seems more prudent to "fix" the bad read *now* (automatically) rather than hope some software routine gets around to it "eventually".
It is a matter of how much tolerance to soft errors you want to achieve. I don't have the numbers off the top of my head, but scrubbing one memory cell every 500 us is something that would get you going without any impact for quite some time (read: years). The processor, running at 40 MHz, does not even know the memory has been scrubbed.
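The figures above can be turned into a quick back-of-the-envelope budget. In this sketch the 40 MHz clock and the 500 us period come from the post, while the number of stall cycles per scrub and the memory size are illustrative assumptions; `scrub_budget` is a hypothetical helper name.

```python
def scrub_budget(clock_hz, period_us, stall_cycles, memory_words):
    """Return (cycles between scrubs, CPU overhead fraction, full-pass time in s)."""
    cycles_between = clock_hz * period_us // 1_000_000
    overhead = stall_cycles / cycles_between
    full_pass_s = memory_words * period_us / 1_000_000
    return cycles_between, overhead, full_pass_s

# 40 MHz and 500 us are from the thread; 2 stall cycles per scrub and a
# 16 Ki-word memory are assumptions made only for this example.
cycles, overhead, full_pass = scrub_budget(40_000_000, 500, 2, 16 * 1024)
print(cycles, overhead, full_pass)
```

Under these assumptions there are 20,000 clock cycles between two scrub accesses, the processor loses about 0.01% of its cycles, and one complete pass over the memory takes roughly 8.2 seconds.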
> [the hooks to read/write "arbitrary" program memory locations still seem to be needed if he truly wants to walk *all* of program memory to ensure it is periodically scrubbed.
As said elsewhere in this thread, I've come to the conclusion that it would be easier to implement the necessary hardware *around* the processor in order to get the scrubbing done.
> The bigger issue I would pose (I've posed this to folks running server farms) is: what do you do when you get an error (corrected or otherwise)?
If the error is corrected, you need to write the corrected value back in order to avoid the accumulation of errors, since your Hamming code cannot do more than detect a double error. At that point you'd need to reboot or rewrite the memory from a pristine area you trust (typically some sort of non-volatile memory).
> And, when do you lose confidence in the memory subsystem?
That's a tough question. How do you measure confidence? The way I approach the problem is very simple: how many errors per unit of time can you tolerate? Then, from that number, work out the probability that your system encounters an error and apply the necessary mitigation techniques in order to meet the desired goal *within* a certain level of confidence (3 sigmas, 5 sigmas, 7 sigmas...typically this factor is proportional to the level of criticality of your system).
> How many undetected errors are creeping in if some number of corrected/*uncorrected* errors occur??]
There's no such thing as 'undetected errors'. Your code provides you with a means to *detect* and/or *correct* a certain class of errors. Given the class of errors you want to protect against, because in terms of probability they correspond to the bulk of your errors, your code implementation will provide the necessary protection.

In the case of a server, which is supposedly using non-rad-hard components, a reasonable estimate of the soft error rate for DRAM would range between a few tens and a few thousand FIT per Mb, while SRAM would be much more sensitive, around a few hundred thousand FIT. This is why today's caches are ECC protected; otherwise they'd experience a fault every half an hour! On the contrary, protecting DRAM may not be strictly necessary (one error in two years for a 10 Gib system). [1]

In the space market, EDACs are pretty much the standard way to go when it comes to memory, but alone they would not suffice. You need a scrubber that reads the corrected data and writes it back. In the write operation the faulty bit is reset to the correct value and no accumulation occurs.

Al

[1] a good reference for all those numbers: http://lambda-diode.com/opinion/investigations/transmutations/.../ecc-memory-3
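For anyone wanting to redo this kind of estimate: one FIT is one failure per 10^9 device-hours, so a rate in FIT per Mb converts to a mean time between errors as sketched below. The function name `mean_hours_between_errors` and the example inputs are made up for illustration and are not taken from the reference above.

```python
FIT_HOURS = 1e9   # 1 FIT = 1 failure per 10^9 device-hours

def mean_hours_between_errors(fit_per_mbit, megabits):
    """Mean time between soft errors, in hours, for a memory of the given size."""
    total_fit = fit_per_mbit * megabits
    return FIT_HOURS / total_fit

# Illustrative inputs only: 1000 FIT/Mb over an 8 Mb memory.
hours = mean_hours_between_errors(1000, 8)
print(hours, "hours, i.e. about", round(hours / 8760, 1), "years")
```

With these example inputs the memory as a whole would see one upset roughly every 125,000 hours, a bit over 14 years; plugging in a larger memory or a higher FIT rate shortens that interval proportionally.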
On 11/29/14, 4:39 PM, alb wrote:
> Hi Niklas,
>
> Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
> []
>>> It's a Harvard Architecture. As such, a traditional deployment would not support direct reading or writing of the program store *as* "data".
>>
>> Yes - but some very traditional Harvards have the ability to write to the program store with special instructions which interpret the address as an address in the program store (but I don't know if MB has such).
>
> Any reference for those ones? It doesn't seem to me the MB has instructions which provide such functionality.
The Microchip PICs (at least the PIC24 and dsPIC) have this ability. There are special instructions to read and write to the program memory (since program memory is flash on most of these parts, the write is slow, and really is expected to be done in a block). Note that data memory is 16 bits wide, while program memory is 24 bits wide, so there are two read and two write instructions, one for the low 16 bits, and one for the upper 8 bits. It can also map a section of the program memory into a window in the data memory address space (lower 16 bits only).
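The low/high split described above amounts to the following bit manipulation, shown here as a plain Python sketch rather than PIC code; `split_word` and `join_word` are hypothetical helper names.

```python
def split_word(instr24):
    """Split a 24-bit program word into (low 16 bits, upper 8 bits)."""
    if not 0 <= instr24 < (1 << 24):
        raise ValueError("program word must fit in 24 bits")
    return instr24 & 0xFFFF, instr24 >> 16

def join_word(low16, high8):
    """Rebuild the 24-bit program word from its two halves."""
    return ((high8 & 0xFF) << 16) | (low16 & 0xFFFF)
```

Copying a program word through 16-bit data registers therefore takes two reads and two writes, one pair for each half.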
On 28 Nov 2014 14:07:15 GMT, al.basili@gmail.com (alb) wrote:

>Hi there,
>
>I'm trying to understand if in an MB-lite I could use the program to read/write itself.
>
>The main idea behind is the implementation of a 'memory scrubber' which detects and corrects errors in its own instruction memory (thanks to an EDAC).
What kind of memory are you using for program storage, Flash or (D)RAM?

If RAM, how do you initially load it from some non-volatile storage?

If some non-volatile storage + RAM is used, why not periodically reload the program from non-volatile storage to RAM?

By 'memory scrubber' do you mean something similar to the memory 'flusher' they had to use on the Hubble telescope (HST) each time it flew through the South Atlantic Anomaly (SAA)?
Hi upsidedown,

upsidedown@downunder.com wrote:
[]
> What kind of memory are you using for program storage, Flash or (D)RAM?
It's an SRAM.
> If RAM, how do you initially load it from some non-volatile storage ?
The FPGA logic takes care of copying the content from an EEPROM to the SRAM before releasing the reset of the processor core. The memory interface of the core should also make it possible to hold off memory accesses and let the external logic perform the scrubbing.
> If some non-volatile storage + RAM is used, why not periodically reload the program from non-volatile storage to RAM.
Because I do not have the time to reload it (the running program implements a PID for a motor controller and I cannot suspend it for such a long time).
> By 'memory scrubber' do you mean something similar to memory 'flusher' as they had to use on the Hubble telescope (HST) each time it flew through the South Atlantic Anomaly (SAA) ?
Nope, a 'memory scrubber' is a mechanism to read and write back the *same* content into a memory cell. The EDAC that is on the data path will correct the value being read, while writing it back through the EDAC corrects the memory content.

I do not know the details of the HST, but onboard the ISS we had a few errors per week due to SEU, while passing several times per day through the SAA. There was neither EDAC nor scrubbing on board and none of the logic was triplicated, yet we managed to get quite valuable results (http://www.google.com/url?sa=t&rct=j&q=ams02%20article%20basili&source=web&cd=1&ved=0CCEQFjAA&url=https%3A%2F%2Fphysics.aps.org%2Ffeatured-article-pdf%2F10.1103%2FPhysRevLett.110.141102&ei=uSJ8VJb7H8vcPYbUgLAB&usg=AFQjCNHC0dWEuQ3vAIUjKnlWs9y1uctM2w&bvm=bv.80642063,d.ZWU).

The only mechanism was to verify the CRC of the program memory for *most* (not all) of the processing units every half an hour (or thereabouts); if a CRC failure was found we would stop the acquisition run, reboot the node and start a new run.

The SAA (as well as the poles) was to be avoided at all costs during calibrations, given the high cosmic-ray rate, which would have caused the instrument to saturate and report completely wrong backgrounds in nominal observations. We would usually calibrate around the equator crossings, with a timed task set with a periodicity of ~92 minutes and updated every once in a while due to the drag.

Al