Reply by Richard Damon February 14, 2013
On 1/4/13 5:32 PM, Jon Kirwan wrote:
> On Fri, 04 Jan 2013 14:28:25 -0800, I wrote:
>
>> <snip>
>
>> Well, even the 8051 does that. (Four sets, I think.) And that
>> goes back a long ways. Good idea and simple. But I see this
>> patent from Microchip in 2010:
>>
>> http://www.pat2pdf.org/patents/pat20100262805.pdf
>
> Or patent application, anyway. No prior art mentioned in it.
> I'm not savvy enough about patents to know if this was
> actually issued. Looks like an application to me, though.
>
> Jon
>
Reading the application, they don't seem to be claiming a patent on the use of shadow registers, but on methods of assigning shadow registers to various interrupts.
Reply by Jon Kirwan January 4, 2013
On Sat, 5 Jan 2013 01:38:20 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Jon Kirwan <jonk@infinitefactors.org> wrote:
>> From your points and the above, if a particular
>> implementation of the core is chosen (a specific part from a
>> specific manufacturer) then would it be possible to establish
>> timer interrupts together with crafted software in order to
>> drive I/O pins with guaranteed known latencies?
>
>It may be, provided you are able to avoid all pipeline stalls and pay
>careful attention to the system memory interface. Not all multicycle
>instructions are explicitly stated to be interruptible in the
>architecture documentation, and it is also not stated what happens in
>the presence of pipeline stalls. You'd have to verify that with the part
>manufacturer. There is an upper bound of 12 cycles (assuming no memory
>wait states), but that's not really helpful here.
Thanks again, Anders. I appreciate the carefully crafted advice -- it is worth money.

Jon
Reply by January 4, 2013
Jon Kirwan <jonk@infinitefactors.org> wrote:
> From your points and the above, if a particular
> implementation of the core is chosen (a specific part from a
> specific manufacturer) then would it be possible to establish
> timer interrupts together with crafted software in order to
> drive I/O pins with guaranteed known latencies?
It may be, provided you are able to avoid all pipeline stalls and pay careful attention to the system memory interface. Not all multicycle instructions are explicitly stated to be interruptible in the architecture documentation, and it is also not stated what happens in the presence of pipeline stalls. You'd have to verify that with the part manufacturer. There is an upper bound of 12 cycles (assuming no memory wait states), but that's not really helpful here.

-a
Reply by Jon Kirwan January 4, 2013
On Fri, 04 Jan 2013 14:28:25 -0800, I wrote:

><snip>
>Well, even the 8051 does that. (Four sets, I think.) And that
>goes back a long ways. Good idea and simple. But I see this
>patent from Microchip in 2010:
>
>http://www.pat2pdf.org/patents/pat20100262805.pdf
Or patent application, anyway. No prior art mentioned in it. I'm not savvy enough about patents to know if this was actually issued. Looks like an application to me, though.

Jon
Reply by Jon Kirwan January 4, 2013
On Fri, 4 Jan 2013 14:20:49 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <faiee8hui7b32r2934r00vm6q0je0onvte@4ax.com>,
>jonk@infinitefactors.org says...
>>
>> On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
>> <jonk@infinitefactors.org> wrote:
>>
>> ><snip>
>>
>> >>Mark B. wrote:
>> >>Doesn't the ability to rotate right by 1 to 32 bits in a single
>> >>cycle imply a barrel shifter?
>> >
>> >I suppose. The one in the ADSP-21xx requires much more logic.
>> >The ADSP-21xx barrel shifter can do both normalization and
>> >denormalization in a single cycle. Lane changes alone is, in
>> >my mind, only part of the job. Once you have the ability to
>> >do a 0-31 lane change, it's a shame to not add the gates for
>> >normalization.
>>
>> See this:
>> http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf
>>
>> It covers the 2100 Family barrel shifter unit, starting on
>> page 2-22 (section 2.4).
>>
>> The overview says,
>>
>> "The shifter provides a complete set of shifting functions
>> for 16-bit inputs, yielding a 32-bit output. These include
>> arithmetic shift, logical shift and normalization. The
>> shifter also performs derivation of exponent and derivation
>> of common exponent for an entire block of numbers. These
>> basic functions can be combined to efficiently implement any
>> degree of numerical format control, including full
>> floating-point representation."
>>
>> My kind of barrel shifter module. Wouldn't mind a 32x64. But
>> this is quite tolerable.
>
>Hmmm. I don't know that I'd call that a barrel shifter. I've
>always considered the old visual image of a circle with
>arrows from each position to all other positions. That implies
>that the output is exactly as many bits wide as the input.
>
>What you're describing seems to be something else---with variable
>width inputs and outputs and some other combinatorial logic.
The additional logic goes a LONG WAY.
>> For interrupt latency (and I was using the timer here and had
>> complete control over the memory system), see:
>> http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf
>>
>> In this case, section 3.4.3.1, page 3-19ff.
>>
>> "For the timer interrupt on these processors, the latency
>> from when the interrupt occurs to when the first instruction
>> of the service routine is executed is only one cycle. This is
>> shown in Figure 3.3. The single cycle of latency is needed to
>> fetch the instruction stored at the interrupt vector
>> location."
>>
>> My kind of interrupt latency variability.
>
>Does the CPU stack used registers and status in that single
>clock---or does it use some sort of register-map switch which
>would imply some limits on nesting?
>
>
>Oops... found a partial answer in your reference:
>
>"The ALU contains a duplicate bank of registers, shown in Figure 2.2
>behind the primary registers. There are actually two sets of AR, AF, AX,
>and AY register files. Only one bank is accessible at a time. The
>additional bank of registers can be activated (such as during an
>interrupt service routine) for extremely fast context switching. A new
>task, like an interrupt service routine, can be
>executed without transferring current states to storage."
Bingo!
>IOW, no nesting of interrupts for one-cycle response. I suppose you
>could get invariant timing on a Cortex interrupt if no nesting was
>allowed--but it would still take some cycles to stack registers.
In my application I didn't need nesting. It was carefully crafted to avoid it and it didn't impair the application in any way.
>Don't some of the PIC chips do register swaps at interrupts? I have
>vague memories (or perhaps shadowy nightmares) from a decade or
>so back when I worked with one of the PIC chips.
Well, even the 8051 does that. (Four sets, I think.) And that goes back a long ways. Good idea and simple. But I see this patent from Microchip in 2010:

http://www.pat2pdf.org/patents/pat20100262805.pdf

Jon
Reply by Jon Kirwan January 4, 2013
On Fri, 4 Jan 2013 14:06:22 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <v1eee8lqemj1eqn0gd4bg86r2914gvq7rv@4ax.com>,
>jonk@infinitefactors.org says...
>>
>> On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
>> <mborgerson@comcast.net> wrote:
>>
>> >In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>,
>> >jonk@infinitefactors.org says...
>> >>
>> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
>> >> <mborgerson@comcast.net> wrote:
>> >>
>> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>,
>> >> >jonk@infinitefactors.org says...
><<SNIP>>
>> >> >
>> >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and
>> >> >single-cycle multiply. Integer divides can take a few cycles.
>> >>
>> >> I don't think the M4 has a barrel shifter -- not one that is
>> >> available to the instruction set. The ADSP-21xx could find
>> >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word
>> >> in two clocks (two separate instructions.) But during that
>> >> time, I could also do two memory moves per cycle, as well.
>> >
>> >Doesn't the ability to rotate right by 1 to 32 bits in a single
>> >cycle imply a barrel shifter?
>>
>> I suppose. The one in the ADSP-21xx requires much more logic.
>> The ADSP-21xx barrel shifter can do both normalization and
>> denormalization in a single cycle. Lane changes alone is, in
>> my mind, only part of the job. Once you have the ability to
>> do a 0-31 lane change, it's a shame to not add the gates for
>> normalization.
>>
>> >I think the Cortex M4 can find the leading bit in a 32-bit register
>> >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
>>
>> If this is a processor with a floating point unit, it's not
>> something I care about. I'd be looking for integer units (as
>> I wouldn't want to waste power on clocking substantial die
>> space when not in use.)
>
>There are control bits that enable Cortex M4 FPU, but I don't know
>whether they control the FPU clocks or just access to the registers.
>
>The Cortex M3 chips also have the shift and CLZ instructions, but don't
>have the floating point unit. IIRC, they are code compatible (and some
>of the STM32s are pin-compatible). Peripheral registers and the
>memory map may be different---but the ARM cores are pretty similar.
>
>I replaced an MSP430 with a Cortex M3 in an instrument that measures the
>frequency output of a pressure sensor. I got about the same power
>dissipation but 8 times higher resolution due to the difference between
>measuring period with an 8 MHz clock and a 60 MHz clock. Battery drain
>was minimized by shutting down the CPU clock between the input capture
>interrupts and by shutting off all peripherals except the timers.
>When 64K of buffer RAM got filled, it was written to the SD card.
>The MSP430 had to write to SD much more often because it had only
>about 10K of RAM and a slower SPI-based SD card interface.
Interesting tidbit. Thanks, Mark.
>> A quick google tells me there is an M4 and an M4F, but then
>> looking at the web page below you point towards, I see that
>> there is a chapter (3.11) called "Floating-point
>> instructions" underneath the heading of "Cortex-M4 Devices
>> Generic User Guide"... so I don't know if all of them include
>> FP or if some do and some don't and which you may be
>> discussing here.
>
>I know there are (were) Kinetis M4 chips without the FPU, but I
>think all the STM32F4 chips have the FPU. The IAR compiler
>has flags that allow you to choose hardware or software floating
>point. I think the GCC compiler does the same. I don't know
>if you get lower power dissipation if the FPU is present
>but not used.
Understood. It remains to be determined.
>A lot of the web blurbs point out that you can make a choice
>between clocking the CPU for 10 microseconds or the CPU + FPU
>for 1 microsecond in applications where you can sleep
>between calculations. I haven't gotten to the point of
>calculating the power advantages either way in any of
>my apps. However, you can now get fancy JTAG debug modules
>that measure power almost on a cycle-by-cycle basis and
>compute the power stats for you.
I have a REALLY FANCY board that does that from Energy Micro. Got a great price on it and was very much impressed with all it offered. Never used such a board before.
>The ChibiOS RTOS I'm playing with has the option to
>turn off the CPU clock during the idle thread. I'll have
>to try that out once I get my hobby autonomous navigation
>app running. I think that app will spend a lot of time
>in the idle thread between 1Hz GPS updates.
I use that all the time with the MSP430, of course. It's kind of standard practice there, I suppose. I do that with an O/S I wrote, but I've not had reason to port it to the MSP430 yet. One can just sit on a halt instruction, if that's available, too.
>> >> So it normalized and denormalized in 1 to 2 clocks depending
>> >> on the word size I was using. The number of shifts required
>> >> (or used) was stored in another register.
>> >
>> >> If you know of the instructions on the M4 that do that,
>> >> please let me know.
>> >
>> >The ARM reference suggests a way to normalize a 32-bit word
>> >in 2 clocks using the CLZ and shift instructions:
>> >
>> >"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the
>> >resulting Rd value to normalize the value of register Rm. Use MOVS,
>> >rather than MOV, to flag the case where Rm is zero:
>> > CLZ r5, r9
>> > MOVS r9, r9, LSL r5"
>> >
>> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
>>
>> Thanks!
>>
>> >Of course, if you have an FPU, you would generally let it handle
>> >normalization and denormalization. IIRC, the CM4 can
>> >convert a 32-bit integer to IEEE-754 floating point with a
>> >single instruction. You may have to set some global rounding
>> >and saturation flags before that.
>>
>> I do specialized floating point which permits me to optimize
>> for the application. Generic FP is great for generic work.
>> Not great for some things where, for example, dynamic range
>> can be traded for precision or vice versa or where I know, a
>> priori, that an entire vector will all share the same
>> exponent. Just as a few real world examples in actual
>> applications already fielded.
>>
>Hmmm, that's a neat idea. I could see that happening
>with a lot of oceanographic instruments where the data doesn't
>vary by more than a factor of two over the interval
>of an FIR filter. (I do the FIR on the raw ADC counts.
>After demeaning, there's a bit more dynamic range.)
I used the idea a number of times for good benefit.
>> I want the core tools, but I want to write my own microcode
>> (in effect.) And I want small die space (better yield, lower
>> cost, lower power consumption.) Just give me the basic lower
>> level components of FP.
>
>As you've no doubt discovered---the basic lower-level component
>in much lower volume may cost much more per unit. Those
>billions of cell phones and tablets have driven ARM SOC chip
>prices to levels I wouldn't have imagined 5 years ago.
>
>Have you considered FPGAs? You could certainly get the
>chip you want---but the learning curve might be higher
>than you'd like.
I used the Xilinx 4000 series, years ago. Wrote in VHDL (and also Verilog) to design a CPU, for example, and test it out. I really enjoyed the experience. But no, they are (or were at the time) expensive, big, power hungry, never exactly the right size, etc. It wouldn't have been competitive.

I already enjoyed the learning curve. Probably the more difficult part for me, anyway, was the floorplanning. Maybe some folks enjoy that a lot. The automatic floor planner was horrible at the time, and even an idiot neophyte like me could do better.
>I never got past fairly simple CPLDs, but an undergrad
>that soldered boards for me for a few months told me that
>they used FPGAs in the control systems for the Oregon State
>Baja racer built as an ME student project.
>
>A properly sized and laid out FPGA seems to be the tool
>of choice for some applications requiring speed,
>deterministic behavior, etc. They used to be
>all over in TVs, cable boxes, DVD players, etc.
>The newer and faster ARM chips may have displaced many
>of the FPGAs and ASICs in consumer apps outside the direct
>video processing path for smaller companies. For Samsung,
>Apple, and Sony, I suppose custom chips and ASICs are still
>the way to go.
>
>IIRC, the ARM cores in iPhones and iPads have FPUs---
>although I don't know how much need those devices have
>for floating point. The burden in milliwatts and pennies
>must not be too high since Apple wants to squeeze out
>every possible minute of battery life and production cost.
It's an informed assumption of mine, based upon some knowledge and experience here, that an ALU with the basic tool box for FP work, but NOT the entire IEEE floating point support, will take up less die space with better yields and lower cost to the manufacturer and consume less power (given the same FAB and design rules) for a crafted application. Less is tied into the clock chain and, besides, I often do NOT want the IEEE FP, anyway.

IEEE FP sucks (power and money.) But it is great for people who have no clue and just want something they don't need to think much about. Just give me the functional units and let me decide how to use them for the application.

Jon
Reply by Mark Borgerson January 4, 2013
In article <faiee8hui7b32r2934r00vm6q0je0onvte@4ax.com>, 
jonk@infinitefactors.org says...
>
> On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
> <jonk@infinitefactors.org> wrote:
>
> ><snip>
>
> >>Mark B. wrote:
> >>Doesn't the ability to rotate right by 1 to 32 bits in a single
> >>cycle imply a barrel shifter?
> >
> >I suppose. The one in the ADSP-21xx requires much more logic.
> >The ADSP-21xx barrel shifter can do both normalization and
> >denormalization in a single cycle. Lane changes alone is, in
> >my mind, only part of the job. Once you have the ability to
> >do a 0-31 lane change, it's a shame to not add the gates for
> >normalization.
>
> See this:
> http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf
>
> It covers the 2100 Family barrel shifter unit, starting on
> page 2-22 (section 2.4).
>
> The overview says,
>
> "The shifter provides a complete set of shifting functions
> for 16-bit inputs, yielding a 32-bit output. These include
> arithmetic shift, logical shift and normalization. The
> shifter also performs derivation of exponent and derivation
> of common exponent for an entire block of numbers. These
> basic functions can be combined to efficiently implement any
> degree of numerical format control, including full
> floating-point representation."
>
> My kind of barrel shifter module. Wouldn't mind a 32x64. But
> this is quite tolerable.
Hmmm. I don't know that I'd call that a barrel shifter. I've always considered the old visual image of a circle with arrows from each position to all other positions. That implies that the output is exactly as many bits wide as the input.

What you're describing seems to be something else---with variable width inputs and outputs and some other combinatorial logic.
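As an aside, the "circle with arrows" picture corresponds to a plain rotate, where the output is exactly as wide as the input. A minimal C sketch of that operation follows. The function name `rotr32` is made up for illustration and does not come from any vendor library; the code describes the operation only, not how any particular core implements it in hardware.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n positions (illustrative helper).
 * Bits shifted out at the bottom re-enter at the top, so the input
 * and the output have the same width. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;                        /* rotation count is taken modulo 32 */
    if (n == 0u)
        return x;                    /* avoids an undefined shift by 32 */
    return (x >> n) | (x << (32u - n));
}
```

A shifter that takes a 16-bit input and produces a 32-bit output, as in the quoted ADSP-21xx overview, does more than this, which seems to be the distinction being drawn here.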
>
> For interrupt latency (and I was using the timer here and had
> complete control over the memory system), see:
> http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf
>
> In this case, section 3.4.3.1, page 3-19ff.
>
> "For the timer interrupt on these processors, the latency
> from when the interrupt occurs to when the first instruction
> of the service routine is executed is only one cycle. This is
> shown in Figure 3.3. The single cycle of latency is needed to
> fetch the instruction stored at the interrupt vector
> location."
>
> My kind of interrupt latency variability.
Does the CPU stack used registers and status in that single clock---or does it use some sort of register-map switch which would imply some limits on nesting?

Oops... found a partial answer in your reference:

"The ALU contains a duplicate bank of registers, shown in Figure 2.2 behind the primary registers. There are actually two sets of AR, AF, AX, and AY register files. Only one bank is accessible at a time. The additional bank of registers can be activated (such as during an interrupt service routine) for extremely fast context switching. A new task, like an interrupt service routine, can be executed without transferring current states to storage."

IOW, no nesting of interrupts for one-cycle response. I suppose you could get invariant timing on a Cortex interrupt if no nesting was allowed--but it would still take some cycles to stack registers.

Don't some of the PIC chips do register swaps at interrupts? I have vague memories (or perhaps shadowy nightmares) from a decade or so back when I worked with one of the PIC chips.

Mark Borgerson
Reply by Mark Borgerson January 4, 2013
In article <v1eee8lqemj1eqn0gd4bg86r2914gvq7rv@4ax.com>, 
jonk@infinitefactors.org says...
>
> On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
> <mborgerson@comcast.net> wrote:
>
> >In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>,
> >jonk@infinitefactors.org says...
> >>
> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
> >> <mborgerson@comcast.net> wrote:
> >>
> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>,
> >> >jonk@infinitefactors.org says...
<<SNIP>>
> >> >
> >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and
> >> >single-cycle multiply. Integer divides can take a few cycles.
> >>
> >> I don't think the M4 has a barrel shifter -- not one that is
> >> available to the instruction set. The ADSP-21xx could find
> >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word
> >> in two clocks (two separate instructions.) But during that
> >> time, I could also do two memory moves per cycle, as well.
> >
> >Doesn't the ability to rotate right by 1 to 32 bits in a single
> >cycle imply a barrel shifter?
>
> I suppose. The one in the ADSP-21xx requires much more logic.
> The ADSP-21xx barrel shifter can do both normalization and
> denormalization in a single cycle. Lane changes alone is, in
> my mind, only part of the job. Once you have the ability to
> do a 0-31 lane change, it's a shame to not add the gates for
> normalization.
>
> >I think the Cortex M4 can find the leading bit in a 32-bit register
> >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
>
> If this is a processor with a floating point unit, it's not
> something I care about. I'd be looking for integer units (as
> I wouldn't want to waste power on clocking substantial die
> space when not in use.)
There are control bits that enable Cortex M4 FPU, but I don't know whether they control the FPU clocks or just access to the registers.

The Cortex M3 chips also have the shift and CLZ instructions, but don't have the floating point unit. IIRC, they are code compatible (and some of the STM32s are pin-compatible). Peripheral registers and the memory map may be different---but the ARM cores are pretty similar.

I replaced an MSP430 with a Cortex M3 in an instrument that measures the frequency output of a pressure sensor. I got about the same power dissipation but 8 times higher resolution due to the difference between measuring period with an 8 MHz clock and a 60 MHz clock. Battery drain was minimized by shutting down the CPU clock between the input capture interrupts and by shutting off all peripherals except the timers. When 64K of buffer RAM got filled, it was written to the SD card. The MSP430 had to write to SD much more often because it had only about 10K of RAM and a slower SPI-based SD card interface.
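The resolution gain from the faster timer clock can be checked with a little arithmetic. The sketch below assumes the period is timed over a single cycle of the sensor signal with a one-count uncertainty; the post does not say how many periods were averaged, and the function name is made up for illustration.

```c
/* Approximate frequency uncertainty (Hz) caused by a one-count error
 * when timing a single period of f_sig_hz with a timer clocked at
 * f_clk_hz. One period spans N = f_clk / f_sig counts, so the
 * uncertainty is roughly f_sig / N = f_sig^2 / f_clk.
 * (Illustrative model, not taken from the post.) */
double freq_resolution_hz(double f_sig_hz, double f_clk_hz)
{
    return (f_sig_hz * f_sig_hz) / f_clk_hz;
}
```

For a hypothetical 30 kHz sensor output (a made-up figure), this gives 112.5 Hz with the 8 MHz clock and 15 Hz with the 60 MHz clock. The ratio is 60/8 = 7.5 regardless of the signal frequency, which is consistent with the "about 8 times" reported above.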
>
> A quick google tells me there is an M4 and an M4F, but then
> looking at the web page below you point towards, I see that
> there is a chapter (3.11) called "Floating-point
> instructions" underneath the heading of "Cortex-M4 Devices
> Generic User Guide"... so I don't know if all of them include
> FP or if some do and some don't and which you may be
> discussing here.
I know there are (were) Kinetis M4 chips without the FPU, but I think all the STM32F4 chips have the FPU. The IAR compiler has flags that allow you to choose hardware or software floating point. I think the GCC compiler does the same. I don't know if you get lower power dissipation if the FPU is present but not used.

A lot of the web blurbs point out that you can make a choice between clocking the CPU for 10 microseconds or the CPU + FPU for 1 microsecond in applications where you can sleep between calculations. I haven't gotten to the point of calculating the power advantages either way in any of my apps. However, you can now get fancy JTAG debug modules that measure power almost on a cycle-by-cycle basis and compute the power stats for you.

The ChibiOS RTOS I'm playing with has the option to turn off the CPU clock during the idle thread. I'll have to try that out once I get my hobby autonomous navigation app running. I think that app will spend a lot of time in the idle thread between 1Hz GPS updates.
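The 10 microsecond versus 1 microsecond trade-off comes down to energy per calculation, which is active power times active time. A rough C sketch of that comparison is below. It ignores sleep current and wake-up overhead, and the function names are hypothetical.

```c
/* Energy of one active burst in nanojoules.
 * Units: milliwatts * microseconds = nanojoules. */
double burst_energy_nj(double active_power_mw, double active_time_us)
{
    return active_power_mw * active_time_us;
}

/* Returns 1 if the CPU + FPU path uses less energy per calculation
 * than the CPU-only path, 0 otherwise. Sleep current and wake-up
 * overhead are deliberately left out of this simple model. */
int fpu_path_wins(double p_cpu_mw, double t_cpu_us,
                  double p_cpu_fpu_mw, double t_fpu_us)
{
    return burst_energy_nj(p_cpu_fpu_mw, t_fpu_us) <
           burst_energy_nj(p_cpu_mw, t_cpu_us);
}
```

With made-up figures of 30 mW for the CPU alone and 45 mW for CPU plus FPU, the two bursts cost 300 nJ and 45 nJ respectively. Under this model the FPU path wins as long as it draws less than ten times the CPU-only power. Real numbers would have to come from the datasheet or from a power-measuring debug probe of the kind mentioned above.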
>
> >> So it normalized and denormalized in 1 to 2 clocks depending
> >> on the word size I was using. The number of shifts required
> >> (or used) was stored in another register.
> >
> >> If you know of the instructions on the M4 that do that,
> >> please let me know.
> >
> >The ARM reference suggests a way to normalize a 32-bit word
> >in 2 clocks using the CLZ and shift instructions:
> >
> >"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the
> >resulting Rd value to normalize the value of register Rm. Use MOVS,
> >rather than MOV, to flag the case where Rm is zero:
> > CLZ r5, r9
> > MOVS r9, r9, LSL r5"
> >
> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
>
> Thanks!
>
> >Of course, if you have an FPU, you would generally let it handle
> >normalization and denormalization. IIRC, the CM4 can
> >convert a 32-bit integer to IEEE-754 floating point with a
> >single instruction. You may have to set some global rounding
> >and saturation flags before that.
>
> I do specialized floating point which permits me to optimize
> for the application. Generic FP is great for generic work.
> Not great for some things where, for example, dynamic range
> can be traded for precision or vice versa or where I know, a
> priori, that an entire vector will all share the same
> exponent. Just as a few real world examples in actual
> applications already fielded.
>
Hmmm, that's a neat idea. I could see that happening with a lot of oceanographic instruments where the data doesn't vary by more than a factor of two over the interval of an FIR filter. (I do the FIR on the raw ADC counts. After demeaning, there's a bit more dynamic range.)
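The shared-exponent idea is often called block floating point. A minimal C sketch under stated assumptions: 32-bit inputs, 16-bit mantissas, and a single right-shift count serving as the exponent for the whole vector. The function name and the storage layout are illustrative only and are not taken from either poster's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Block floating point: one shared exponent for a whole vector.
 * After the call, in[i] is approximately mant[i] * 2^exp, where exp
 * is the return value. Names and layout are illustrative. */
int block_float_encode(const int32_t *in, int16_t *mant, size_t n)
{
    uint32_t max_mag = 0;
    for (size_t i = 0; i < n; i++) {
        /* magnitude in unsigned arithmetic, so INT32_MIN is handled */
        uint32_t mag = (in[i] < 0) ? (0u - (uint32_t)in[i]) : (uint32_t)in[i];
        if (mag > max_mag)
            max_mag = mag;
    }

    int exp = 0;                        /* shared right-shift count */
    while ((max_mag >> exp) > 32767u)   /* largest value must fit in 15 bits */
        exp++;

    for (size_t i = 0; i < n; i++) {
        /* divide instead of shifting so that negative values truncate
         * toward zero in a portable way */
        mant[i] = (int16_t)(in[i] / ((int32_t)1 << exp));
    }
    return exp;
}
```

For the input {100000, -50000, 25000, 0} the shared exponent comes out as 2 and the mantissas as {25000, -12500, 6250, 0}. The cost is that small elements lose low-order bits when one large element forces a big shift, which is why the approach suits data with limited variation across the block.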
> I want the core tools, but I want to write my own microcode
> (in effect.) And I want small die space (better yield, lower
> cost, lower power consumption.) Just give me the basic lower
> level components of FP.
>
As you've no doubt discovered---the basic lower-level component in much lower volume may cost much more per unit. Those billions of cell phones and tablets have driven ARM SOC chip prices to levels I wouldn't have imagined 5 years ago.

Have you considered FPGAs? You could certainly get the chip you want---but the learning curve might be higher than you'd like.

I never got past fairly simple CPLDs, but an undergrad that soldered boards for me for a few months told me that they used FPGAs in the control systems for the Oregon State Baja racer built as an ME student project.

A properly sized and laid out FPGA seems to be the tool of choice for some applications requiring speed, deterministic behavior, etc. They used to be all over in TVs, cable boxes, DVD players, etc. The newer and faster ARM chips may have displaced many of the FPGAs and ASICs in consumer apps outside the direct video processing path for smaller companies. For Samsung, Apple, and Sony, I suppose custom chips and ASICs are still the way to go.

IIRC, the ARM cores in iPhones and iPads have FPUs---although I don't know how much need those devices have for floating point. The burden in milliwatts and pennies must not be too high since Apple wants to squeeze out every possible minute of battery life and production cost.

<<SNIP>>

Mark Borgerson
Reply by Jon Kirwan January 4, 2013
On Fri, 4 Jan 2013 19:19:44 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Mark Borgerson <mborgerson@comcast.net> wrote:
>> Hmmm. If the process requiring minimal variation was the highest
>> priority, it shouldn't have to worry about variations from
>> tail-chaining. Doesn't that only happen when an interrupt of lower
>> or equal priority is triggered and whose handler gets executed after
>> the handler of the higher priority interrupt is finished?
>
>A higher-priority interrupt can arrive during exception return. Quoting
>section B1.5.12 of the ARMv7-M ARM:
>"The ARMv7-M architecture does not specify the point at which the
>processor recognizes any asynchronous exception that arrives during an
>exception. If the processor recognizes a new exception while it is
>tail-chaining another exception, and the new exception has higher priority
>than the exception being tail-chained, then the processor can, instead,
>take the new exception, using late-arrival preemption. It is
>IMPLEMENTATION DEFINED what conditions, if any, lead to late arrival
>preemption."
Thanks, Anders. I can see that there is interesting reading ahead should I decide to use this architecture for certain applications. I don't mind nuance, so long as it is predictable.

From your points and the above, if a particular implementation of the core is chosen (a specific part from a specific manufacturer) then would it be possible to establish timer interrupts together with crafted software in order to drive I/O pins with guaranteed known latencies? ("Implementation defined" connotes to me that it may actually be defined for some specific implementation.)

To put the question in concrete terms, assume there is a background task running but that I want to use a timer to trigger an ADC sample and hold circuit, followed by another triggering the ADC conversion start, where an exact number of CPU cycles from one to the other is vital... and do this WITHOUT the use of a timer counter output module designed in hardware? (That isn't a real example. I would normally use the output module's features. But removing that possibility gets at the question I'm asking better without having to describe the real application in detail. So assume no hardware support except for the timer interrupt event.)

Thanks by the way for what you've already added!

Jon
Reply by Jon Kirwan January 4, 2013
On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
<jonk@infinitefactors.org> wrote:

><snip>
>>Mark B. wrote:
>>Doesn't the ability to rotate right by 1 to 32 bits in a single
>>cycle imply a barrel shifter?
>
>I suppose. The one in the ADSP-21xx requires much more logic.
>The ADSP-21xx barrel shifter can do both normalization and
>denormalization in a single cycle. Lane changes alone is, in
>my mind, only part of the job. Once you have the ability to
>do a 0-31 lane change, it's a shame to not add the gates for
>normalization.
See this:
http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf

It covers the 2100 Family barrel shifter unit, starting on page 2-22 (section 2.4).

The overview says,

"The shifter provides a complete set of shifting functions for 16-bit inputs, yielding a 32-bit output. These include arithmetic shift, logical shift and normalization. The shifter also performs derivation of exponent and derivation of common exponent for an entire block of numbers. These basic functions can be combined to efficiently implement any degree of numerical format control, including full floating-point representation."

My kind of barrel shifter module. Wouldn't mind a 32x64. But this is quite tolerable.

For interrupt latency (and I was using the timer here and had complete control over the memory system), see:
http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf

In this case, section 3.4.3.1, page 3-19ff.

"For the timer interrupt on these processors, the latency from when the interrupt occurs to when the first instruction of the service routine is executed is only one cycle. This is shown in Figure 3.3. The single cycle of latency is needed to fetch the instruction stored at the interrupt vector location."

My kind of interrupt latency variability.

Jon
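To make "normalization" and "derivation of exponent" concrete, here is a portable C sketch for unsigned 32-bit values, in the spirit of a count-leading-zeros step followed by a left shift. It does not attempt to reproduce the ADSP-21xx shifter instructions or their handling of signed and block data. The function names are made up for illustration.

```c
#include <stdint.h>

/* Count the leading zero bits of a 32-bit word with a portable loop.
 * Returns 32 for x == 0, the same convention the ARM CLZ
 * instruction uses. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0u)
        return 32u;
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Normalize x so that its most significant set bit lands in bit 31.
 * The shift count is written to *shift so the caller can use it as
 * an exponent adjustment. Zero stays zero with a shift count of 0. */
uint32_t normalize32(uint32_t x, unsigned *shift)
{
    unsigned n = clz32(x);
    if (n == 32u) {              /* zero has no leading one to align */
        *shift = 0u;
        return 0u;
    }
    *shift = n;
    return x << n;
}
```

For example, 0x00012345 has 15 leading zeros and normalizes to 0x91A28000 with a shift count of 15. In software the loop costs up to 31 iterations, whereas dedicated hardware of the kind described above produces the shift count and the shifted result in a cycle or two.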