On 1/4/13 5:32 PM, Jon Kirwan wrote:
> On Fri, 04 Jan 2013 14:28:25 -0800, I wrote:
> 
>> <snip>
> 
>> Well, even the 8051 does that. (Four sets, I think.) And that
>> goes back a long ways. Good idea and simple. But I see this
>> patent from Microchip in 2010:
>>
>> http://www.pat2pdf.org/patents/pat20100262805.pdf
> 
> Or patent application, anyway. No prior art mentioned in it.
> I'm not savvy enough about patents to know if this was
> actually issued. Looks like an application to me, though.
> 
> Jon
> 

Reading the application, they don't seem to be claiming a patent on the
use of shadow registers, but on methods of assigning shadow registers to
various interrupts.

On Sat, 5 Jan 2013 01:38:20 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Jon Kirwan <jonk@infinitefactors.org> wrote:
>> From your points and the above, if a particular
>> implementation of the core is chosen (a specific part from a
>> specific manufacturer) then would it be possible to establish
>> timer interrupts together with crafted software in order to
>> drive I/O pins with guaranteed known latencies?
>
>It may be, provided you are able to avoid all pipeline stalls and pay 
>careful attention to the system memory interface. Not all multicycle 
>instructions are explicitly stated to be interruptible in the 
>architecture documentation, and it is also not stated what happens in 
>the presence of pipeline stalls. You'd have to verify that with the part 
>manufacturer. There is an upper bound of 12 cycles (assuming no memory 
>wait states), but that's not really helpful here.

Thanks again, Anders. I appreciate the carefully crafted
advice -- it is worth money.

Jon

Jon Kirwan <jonk@infinitefactors.org> wrote:
> From your points and the above, if a particular
> implementation of the core is chosen (a specific part from a
> specific manufacturer) then would it be possible to establish
> timer interrupts together with crafted software in order to
> drive I/O pins with guaranteed known latencies?

It may be, provided you are able to avoid all pipeline stalls and pay 
careful attention to the system memory interface. Not all multicycle 
instructions are explicitly stated to be interruptible in the 
architecture documentation, and it is also not stated what happens in 
the presence of pipeline stalls. You'd have to verify that with the part 
manufacturer. There is an upper bound of 12 cycles (assuming no memory 
wait states), but that's not really helpful here.

-a

On Fri, 04 Jan 2013 14:28:25 -0800, I wrote:

><snip>

>Well, even the 8051 does that. (Four sets, I think.) And that
>goes back a long ways. Good idea and simple. But I see this
>patent from Microchip in 2010:
>
>http://www.pat2pdf.org/patents/pat20100262805.pdf

Or patent application, anyway. No prior art mentioned in it.
I'm not savvy enough about patents to know if this was
actually issued. Looks like an application to me, though.

Jon

On Fri, 4 Jan 2013 14:20:49 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <faiee8hui7b32r2934r00vm6q0je0onvte@4ax.com>, 
>jonk@infinitefactors.org says...
>> 
>> On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
>> <jonk@infinitefactors.org> wrote:
>> 
>> ><snip>
>> 
>> >>Mark B. wrote:
>> >>Doesn't the ability to rotate right by 1 to 32 bits in a single
>> >>cycle imply a barrel shifter?
>> >
>> >I suppose. The one in the ADSP-21xx requires much more logic.
>> >The ADSP-21xx barrel shifter can do both normalization and
>> >denormalization in a single cycle. Lane changes alone is, in
>> >my mind, only part of the job. Once you have the ability to
>> >do a 0-31 lane change, it's a shame to not add the gates for
>> >normalization.
>> 
>> See this:
>>  http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf
>> 
>> It covers the 2100 Family barrel shifter unit, starting on
>> page 2-22 (section 2.4).
>> 
>> The overview says,
>> 
>> "The shifter provides a complete set of shifting functions
>> for 16-bit inputs, yielding a 32-bit output. These include
>> arithmetic shift, logical shift and normalization. The
>> shifter also performs derivation of exponent and derivation
>> of common exponent for an entire block of numbers. These
>> basic functions can be combined to efficiently implement any
>> degree of numerical format control, including full
>> floating-point representation."
>> 
>> My kind of barrel shifter module. Wouldn't mind a 32x64. But
>> this is quite tolerable.
>
>Hmmm.  I don't know that I'd call that a barrel shifter.  I've
>always considered the old visual image of  a circle with
>arrows from each position to all other positions.  That implies
>that the output is exactly as many bits wide as the input.
>
>What you're describing seems to be something else---with variable
>width inputs and outputs and some other combinatorial logic.

The additional logic goes a LONG WAY.

>> For interrupt latency (and I was using the timer here and had
>> complete control over the memory system), see:
>>  http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf
>> 
>> In this case, section 3.4.3.1, page 3-19ff.
>> 
>> "For the timer interrupt on these processors, the latency
>> from when the interrupt occurs to when the first instruction
>> of the service routine is executed is only one cycle. This is
>> shown in Figure 3.3. The single cycle of latency is needed to
>> fetch the instruction stored at the interrupt vector
>> location."
>> 
>> My kind of interrupt latency variability.
>
>Does the CPU stack used registers and status in that single
>clock---or does it use some sort of register-map switch which
>would imply some limits on nesting?
>
>
>Oops... found a partial answer in your reference:
>
>"The ALU contains a duplicate bank of registers, shown in Figure 2.2 
>behind the primary registers. There are actually two sets of AR, AF, AX, 
>and AY register files. Only one bank is accessible at a time. The 
>additional bank of registers can be activated (such as during an 
>interrupt service routine) for extremely fast context switching. A new 
>task, like an interrupt service routine, can be
>executed without transferring current states to storage."

Bingo!

>IOW,  no nesting of interrupts for one-cycle response.   I suppose you 
>could get invariant timing on a Cortex interrupt if no nesting was 
>allowed--but it would still take some cycles to stack registers.

In my application I didn't need nesting. It was carefully
crafted to avoid it and it didn't impair the application in
any way.

>Don't some of the PIC chips do register swaps at interrupts?  I have
>vague memories (or perhaps shadowy nightmares) from a decade or
>so back when I worked with one of the PIC chips.

Well, even the 8051 does that. (Four sets, I think.) And that
goes back a long ways. Good idea and simple. But I see this
patent from Microchip in 2010:

http://www.pat2pdf.org/patents/pat20100262805.pdf

Jon

On Fri, 4 Jan 2013 14:06:22 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:

>In article <v1eee8lqemj1eqn0gd4bg86r2914gvq7rv@4ax.com>, 
>jonk@infinitefactors.org says...
>> 
>> On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
>> <mborgerson@comcast.net> wrote:
>> 
>> >In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>, 
>> >jonk@infinitefactors.org says...
>> >> 
>> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
>> >> <mborgerson@comcast.net> wrote:
>> >> 
>> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>, 
>> >> >jonk@infinitefactors.org says...
><<SNIP>>
>> >> >
>> >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and
>> >> >single-cycle multiply.   Integer divides can take a few cycles.
>> >> 
>> >> I don't think the M4 has a barrel shifter -- not one that is
>> >> available to the instruction set. The ADSP-21xx could find
>> >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word
>> >> in two clocks (two separate instructions.) But during that
>> >> time, I could also do two memory moves per cycle, as well.
>> >
>> >Doesn't the ability to rotate right by 1 to 32 bits in a single
>> >cycle imply a barrel shifter?
>> 
>> I suppose. The one in the ADSP-21xx requires much more logic.
>> The ADSP-21xx barrel shifter can do both normalization and
>> denormalization in a single cycle. Lane changes alone is, in
>> my mind, only part of the job. Once you have the ability to
>> do a 0-31 lane change, it's a shame to not add the gates for
>> normalization.
>> 
>> >I think the Cortex M4 can find the leading bit in a 32-bit register
>> >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
>> 
>> If this is a processor with a floating point unit, it's not
>> something I care about. I'd be looking for integer units (as
>> I wouldn't want to waste power on clocking substantial die
>> space when not in use.)
>
>There are control bits that enable Cortex M4  FPU, but I don't know
>whether they control the FPU clocks or just access to the registers.
>
>The Cortex M3 chips also have the shift and CLZ instructions, but don't
>have the floating point unit.  IIRC, they are code compatible (and some 
>of the STM32s are pin-compatible).   Peripheral registers and the 
>memory map may be different---but the ARM cores are pretty similar.
>
>I replaced an MSP430 with a Cortex M3 in an instrument that measures the 
>frequency output of a pressure sensor.  I got about the same power 
>dissipation but 8 times higher resolution due to the difference between
>measuring period with an 8MHz clock and a 60MHz clock. Battery drain
>was minimized by shutting down the CPU clock between the input capture
>interrupts and by shutting off all peripherals except the timers.
>When 64K of buffer RAM got filled, it was written to the SD card.
>The MSP430 had to write to SD much more often because it had only
>about 10K of RAM and a slower SPI-based SD card interface.

Interesting tidbit. Thanks, Mark.

>> A quick google tells me there is an M4 and an M4F, but then
>> looking at the web page below you point towards, I see that
>> there is a chapter (3.11) called "Floating-point
>> instructions" underneath the heading of "Cortex-M4 Devices
>> Generic User Guide"... so I don't know if all of them include
>> FP or if some do and some don't and which you may be
>> discussing here.
>
>I know there are (were) Kinetis M4 chips without the FPU, but I
>think all the STM32F4 chips have the FPU.  The IAR compiler
>has flags that allow you to choose hardware or software floating
>point.   I think the GCC compiler does the same.  I don't know
>if you get lower power dissipation if the FPU is present
>but not used.

Understood. It remains to be determined.

>A lot of the web blurbs point out that you can make a choice
>between clocking the CPU for 10 microseconds or the CPU + FPU
>for 1 microsecond in applications where you can sleep 
>between calculations.  I haven't gotten to the point of
>calculating the power advantages either way in any of
>my apps.  However, you can now get fancy JTAG debug modules
>that measure power almost on a cycle-by-cycle basis and
>compute the power stats for you.

I have a REALLY FANCY board that does that from Energy Micro.
Got a great price on it and was very much impressed with all
it offered. Never used such a board before.

>The ChibiOS RTOS I'm playing with has the option to
>turn off the CPU clock during the idle thread.  I'll have
>to try that out once I get my hobby autonomous navigation
>app running.  I think that app will spend a lot of time
>in the idle thread between 1Hz GPS updates.

I use that all the time with the MSP430, of course. It's kind
of standard practice there, I suppose. I do that with an O/S
I wrote, but I've not had reason to port it on the MSP430
yet. One can just sit on a halt instruction, if that's
available, too.

>> >> So it normalized and denormalized in 1 to 2 clocks depending
>> >> on the word size I was using. The number of shifts required
>> >> (or used) was stored in another register.
>> >
>> >> If you know of the instructions on the M4 that do that,
>> >> please let me know.
>> >
>> >The ARM reference suggests a way to normalize  a 32-bit word
>> >in 2 clocks using the CLZ and shift instructions:
>> >
>> >"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the 
>> >resulting Rd value to normalize the value of register Rm. Use MOVS, 
>> >rather than MOV, to flag the case where Rm is zero:
>> >    CLZ r5, r9
>> >    MOVS r9, r9, LSL r5"
>> >
>> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
>> 
>> Thanks!
>> 
>> >Of course, if you have an FPU, you would generally let it handle
>> >normalization and denormalization.  IIRC, the CM4 can 
>> >convert a 32-bit integer to IEEE-754 floating point with a 
>> >single instruction.  You may have to set some global rounding
>> >and saturation flags before that.
>> 
>> I do specialized floating point which permits me to optimize
>> for the application. Generic FP is great for generic work.
>> Not great for some things where, for example, dynamic range
>> can be traded for precision or vice versa or where I know, a
>> priori, that an entire vector will all share the same
>> exponent. Just as a few real world examples in actual
>> applications already fielded.
>> 
>Hmmm, that's a neat idea.  I could see that happening
>with a lot of oceanographic instruments where the data doesn't
>vary by more than a factor of two over the interval
>of an FIR filter. (I do the FIR on the raw ADC counts.
>After demeaning,  there's a bit more dynamic range.)

I used the idea a number of times for good benefit.
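
For anyone following along, here is a rough C sketch of the shared-exponent
(block floating-point) idea discussed above. It is only an illustration and
not the code from any fielded application: the names clz32 and
block_normalize are made up for this post, and the choice to keep one bit of
headroom for the sign is an assumption. On a Cortex-M3/M4 the clz32 loop
would presumably map to the CLZ instruction mentioned earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable count-leading-zeros for a 32-bit word; returns 32 for zero.
   (Hypothetical helper; a compiler intrinsic or CLZ could replace it.) */
static int clz32(uint32_t v)
{
    int n = 0;
    if (v == 0)
        return 32;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Scale every element of v[] by the same power of two so that the
   largest magnitude sits just below the sign bit.  Returns the left
   shift that was applied; the block's shared exponent is -shift. */
static int block_normalize(int32_t *v, size_t n)
{
    uint32_t max_mag = 0;
    for (size_t i = 0; i < n; i++) {
        /* Magnitude computed in unsigned arithmetic to avoid overflow. */
        uint32_t mag = v[i] < 0 ? 0u - (uint32_t)v[i] : (uint32_t)v[i];
        if (mag > max_mag)
            max_mag = mag;
    }
    if (max_mag == 0)
        return 0;                     /* all zeros: nothing to scale */

    int shift = clz32(max_mag) - 1;   /* keep one bit for the sign */
    if (shift <= 0)
        return 0;                     /* already at full scale */

    for (size_t i = 0; i < n; i++) {
        /* Shift as unsigned; the conversion back to int32_t assumes the
           usual two's-complement wraparound. */
        v[i] = (int32_t)((uint32_t)v[i] << shift);
    }
    return shift;
}
```

Calling block_normalize on {3, -5, 1} shifts every element left by 28 bits,
so the whole vector carries one exponent instead of one per element.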

>> I want the core tools, but I want to write my own microcode
>> (in effect.) And I want small die space (better yield, lower
>> cost, lower power consumption.) Just give me the basic lower
>> level components of FP.
>
>As you've no doubt discovered---the basic lower-level component
>in much lower volume may cost much more per unit.  Those
>billions of cell phones and tablets have driven ARM SOC chip
>prices to levels I wouldn't have imagined 5 years ago.
>
>Have you considered FPGAs?  You could certainly get the
>chip you want---but the learning curve might be higher
>than you'd like. 

I used the Xilinx 4000 series, years ago. Wrote in VHDL (and
also verilog) to design a CPU, for example, and test it out.
I really enjoyed the experiences, a lot. But no, they are (or
were at the time) expensive, big, power hungry, never exactly
the right size, etc. It wouldn't have been competitive.

I enjoyed the learning curve, already. Probably the more
difficult part for me, anyway, was the floorplanning part of
it. Maybe some folks enjoy that a lot. The automatic floor
planner was horrible at the time and even an idiot neophyte
like me could do better, at the time anyway.

>I never got past fairly simple CPLDs, but an undergrad
>that soldered boards for me for a few months told me that
>they used FPGAs in the control systems for the Oregon State
>Baja racer built as an ME student project.
>
>A properly sized and laid out FPGA seems to be the tool
>of choice for some applications requiring speed,
>deterministic behavior, etc. etc.  They used to be
>all over in TVs, cable boxes, DVD players, etc. etc.
>The newer and faster ARM chips may have displaced many
>of the FPGAs and ASICs in consumer apps outside the direct
>video processing path for smaller companies.  For Samsung,
>Apple, and Sony, I suppose custom chips and ASICs are still
>the way to go.
>
>IIRC, the ARM cores in iPhones and iPads have FPUs---
>although I don't know how much need those devices have
>for floating point.  The burden in milliWatts and pennies
>must not be too high since Apple wants to squeeze out 
>every possible minute of battery life and production cost.

It's an informed assumption of mine, based upon some
knowledge and experience here, that an ALU with the basic
tool box for FP work.. but NOT the entire IEEE floating point
support... will take up less die space with better yields and
lower cost to the manufacturer and consume less power (given
the same FAB and design rules) for a crafted application.
Less is tied into the clock chain and, besides, I often do
NOT want the IEEE FP, anyway.

IEEE FP sucks (power and money.) But it is great for people
who have no clue and just want something they don't need to
think much about.

Just give me the functional units and let me decide how to
use them for the application.

Jon

In article <faiee8hui7b32r2934r00vm6q0je0onvte@4ax.com>, 
jonk@infinitefactors.org says...
> 
> On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
> <jonk@infinitefactors.org> wrote:
> 
> ><snip>
> 
> >>Mark B. wrote:
> >>Doesn't the ability to rotate right by 1 to 32 bits in a single
> >>cycle imply a barrel shifter?
> >
> >I suppose. The one in the ADSP-21xx requires much more logic.
> >The ADSP-21xx barrel shifter can do both normalization and
> >denormalization in a single cycle. Lane changes alone is, in
> >my mind, only part of the job. Once you have the ability to
> >do a 0-31 lane change, it's a shame to not add the gates for
> >normalization.
> 
> See this:
>  http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf
> 
> It covers the 2100 Family barrel shifter unit, starting on
> page 2-22 (section 2.4).
> 
> The overview says,
> 
> "The shifter provides a complete set of shifting functions
> for 16-bit inputs, yielding a 32-bit output. These include
> arithmetic shift, logical shift and normalization. The
> shifter also performs derivation of exponent and derivation
> of common exponent for an entire block of numbers. These
> basic functions can be combined to efficiently implement any
> degree of numerical format control, including full
> floating-point representation."
> 
> My kind of barrel shifter module. Wouldn't mind a 32x64. But
> this is quite tolerable.

Hmmm.  I don't know that I'd call that a barrel shifter.  I've
always considered the old visual image of  a circle with
arrows from each position to all other positions.  That implies
that the output is exactly as many bits wide as the input.

What you're describing seems to be something else---with variable
width inputs and outputs and some other combinatorial logic.
> 
> For interrupt latency (and I was using the timer here and had
> complete control over the memory system), see:
>  http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf
> 
> In this case, section 3.4.3.1, page 3-19ff.
> 
> "For the timer interrupt on these processors, the latency
> from when the interrupt occurs to when the first instruction
> of the service routine is executed is only one cycle. This is
> shown in Figure 3.3. The single cycle of latency is needed to
> fetch the instruction stored at the interrupt vector
> location."
> 
> My kind of interrupt latency variability.

Does the CPU stack used registers and status in that single
clock---or does it use some sort of register-map switch which
would imply some limits on nesting?

Oops... found a partial answer in your reference:

"The ALU contains a duplicate bank of registers, shown in Figure 2.2 
behind the primary registers. There are actually two sets of AR, AF, AX, 
and AY register files. Only one bank is accessible at a time. The 
additional bank of registers can be activated (such as during an 
interrupt service routine) for extremely fast context switching. A new 
task, like an interrupt service routine, can be
executed without transferring current states to storage."

IOW,  no nesting of interrupts for one-cycle response.   I suppose you 
could get invariant timing on a Cortex interrupt if no nesting was 
allowed--but it would still take some cycles to stack registers.

Don't some of the PIC chips do register swaps at interrupts?  I have
vague memories (or perhaps shadowy nightmares) from a decade or
so back when I worked with one of the PIC chips.

Mark Borgerson

In article <v1eee8lqemj1eqn0gd4bg86r2914gvq7rv@4ax.com>, 
jonk@infinitefactors.org says...
> 
> On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
> <mborgerson@comcast.net> wrote:
> 
> >In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>, 
> >jonk@infinitefactors.org says...
> >> 
> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
> >> <mborgerson@comcast.net> wrote:
> >> 
> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>, 
> >> >jonk@infinitefactors.org says...
<<SNIP>>
> >> >
> >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and
> >> >single-cycle multiply.   Integer divides can take a few cycles.
> >> 
> >> I don't think the M4 has a barrel shifter -- not one that is
> >> available to the instruction set. The ADSP-21xx could find
> >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word
> >> in two clocks (two separate instructions.) But during that
> >> time, I could also do two memory moves per cycle, as well.
> >
> >Doesn't the ability to rotate right by 1 to 32 bits in a single
> >cycle imply a barrel shifter?
> 
> I suppose. The one in the ADSP-21xx requires much more logic.
> The ADSP-21xx barrel shifter can do both normalization and
> denormalization in a single cycle. Lane changes alone is, in
> my mind, only part of the job. Once you have the ability to
> do a 0-31 lane change, it's a shame to not add the gates for
> normalization.
> 
> >I think the Cortex M4 can find the leading bit in a 32-bit register
> >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
> 
> If this is a processor with a floating point unit, it's not
> something I care about. I'd be looking for integer units (as
> I wouldn't want to waste power on clocking substantial die
> space when not in use.)

There are control bits that enable Cortex M4  FPU, but I don't know
whether they control the FPU clocks or just access to the registers.

The Cortex M3 chips also have the shift and CLZ instructions, but don't
have the floating point unit.  IIRC, they are code compatible (and some 
of the STM32s are pin-compatible).   Peripheral registers and the 
memory map may be different---but the ARM cores are pretty similar.

I replaced an MSP430 with a Cortex M3 in an instrument that measures the 
frequency output of a pressure sensor.  I got about the same power 
dissipation but 8 times higher resolution due to the difference between
measuring period with an 8MHz clock and a 60MHz clock. Battery drain
was minimized by shutting down the CPU clock between the input capture
interrupts and by shutting off all peripherals except the timers.
When 64K of buffer RAM got filled, it was written to the SD card.
The MSP430 had to write to SD much more often because it had only
about 10K of RAM and a slower SPI-based SD card interface.
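
As a back-of-the-envelope check on the resolution figure, the arithmetic can
be sketched in C as below. The function names and the 30 kHz sensor
frequency used in the note afterwards are purely illustrative assumptions,
not values from the instrument; the sketch only assumes that the period is
counted in ticks of the timer clock.

```c
/* Duration of one timer tick in nanoseconds. */
static double tick_ns(double timer_hz)
{
    return 1e9 / timer_hz;
}

/* Timer counts accumulated over one period of the sensor signal. */
static double counts_per_period(double timer_hz, double signal_hz)
{
    return timer_hz / signal_hz;
}

/* Ratio of tick sizes, i.e. how much finer the faster clock resolves
   a single period measurement. */
static double resolution_gain(double slow_hz, double fast_hz)
{
    return fast_hz / slow_hz;
}
```

With an 8MHz timer clock a tick is 125 ns; with 60MHz it is about 16.7 ns,
a ratio of 7.5, which is roughly the "8 times" quoted above. For a
hypothetical 30 kHz sensor signal that is about 267 counts per period
versus 2000.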
> 
> A quick google tells me there is an M4 and an M4F, but then
> looking at the web page below you point towards, I see that
> there is a chapter (3.11) called "Floating-point
> instructions" underneath the heading of "Cortex-M4 Devices
> Generic User Guide"... so I don't know if all of them include
> FP or if some do and some don't and which you may be
> discussing here.

I know there are (were) Kinetis M4 chips without the FPU, but I
think all the STM32F4 chips have the FPU.  The IAR compiler
has flags that allow you to choose hardware or software floating
point.   I think the GCC compiler does the same.  I don't know
if you get lower power dissipation if the FPU is present
but not used.

A lot of the web blurbs point out that you can make a choice
between clocking the CPU for 10 microseconds or the CPU + FPU
for 1 microsecond in applications where you can sleep 
between calculations.  I haven't gotten to the point of
calculating the power advantages either way in any of
my apps.  However, you can now get fancy JTAG debug modules
that measure power almost on a cycle-by-cycle basis and
compute the power stats for you.

The ChibiOS RTOS I'm playing with has the option to
turn off the CPU clock during the idle thread.  I'll have
to try that out once I get my hobby autonomous navigation
app running.  I think that app will spend a lot of time
in the idle thread between 1Hz GPS updates.
> 
> >> So it normalized and denormalized in 1 to 2 clocks depending
> >> on the word size I was using. The number of shifts required
> >> (or used) was stored in another register.
> >
> >> If you know of the instructions on the M4 that do that,
> >> please let me know.
> >
> >The ARM reference suggests a way to normalize  a 32-bit word
> >in 2 clocks using the CLZ and shift instructions:
> >
> >"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the 
> >resulting Rd value to normalize the value of register Rm. Use MOVS, 
> >rather than MOV, to flag the case where Rm is zero:
> >    CLZ r5, r9
> >    MOVS r9, r9, LSL r5"
> >
> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
> 
> Thanks!
> 
> >Of course, if you have an FPU, you would generally let it handle
> >normalization and denormalization.  IIRC, the CM4 can 
> >convert a 32-bit integer to IEEE-754 floating point with a 
> >single instruction.  You may have to set some global rounding
> >and saturation flags before that.
> 
> I do specialized floating point which permits me to optimize
> for the application. Generic FP is great for generic work.
> Not great for some things where, for example, dynamic range
> can be traded for precision or vice versa or where I know, a
> priori, that an entire vector will all share the same
> exponent. Just as a few real world examples in actual
> applications already fielded.
> 
Hmmm, that's a neat idea.  I could see that happening
with a lot of oceanographic instruments where the data doesn't
vary by more than a factor of two over the interval
of an FIR filter. (I do the FIR on the raw ADC counts.
After demeaning,  there's a bit more dynamic range.)

> I want the core tools, but I want to write my own microcode
> (in effect.) And I want small die space (better yield, lower
> cost, lower power consumption.) Just give me the basic lower
> level components of FP.
> 
As you've no doubt discovered---the basic lower-level component
in much lower volume may cost much more per unit.  Those
billions of cell phones and tablets have driven ARM SOC chip
prices to levels I wouldn't have imagined 5 years ago.

Have you considered FPGAs?  You could certainly get the
chip you want---but the learning curve might be higher
than you'd like. 

I never got past fairly simple CPLDs, but an undergrad
that soldered boards for me for a few months told me that
they used FPGAs in the control systems for the Oregon State
Baja racer built as an ME student project.

A properly sized and laid out FPGA seems to be the tool
of choice for some applications requiring speed,
deterministic behavior, etc. etc.  They used to be
all over in TVs, cable boxes, DVD players, etc. etc.
The newer and faster ARM chips may have displaced many
of the FPGAs and ASICs in consumer apps outside the direct
video processing path for smaller companies.  For Samsung,
Apple, and Sony, I suppose custom chips and ASICs are still
the way to go.

IIRC, the ARM cores in iPhones and iPads have FPUs---
although I don't know how much need those devices have
for floating point.  The burden in milliWatts and pennies
must not be too high since Apple wants to squeeze out 
every possible minute of battery life and production cost.

<<SNIP>>

Mark Borgerson

On Fri, 4 Jan 2013 19:19:44 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Mark Borgerson <mborgerson@comcast.net> wrote:
>> Hmmm.  If the process requiring minimal variation was the highest 
>> priority, it shouldn't have to worry about variations from 
>> tail-chaining.  Doesn't that only happen when an interrupt of lower 
>> or equal priority is triggered and its handler gets executed after
>> the handler of the higher priority interrupt is finished?
>
>A higher-priority interrupt can arrive during exception return. Quoting 
>section B1.5.12 of the ARMv7-M ARM:
>"The ARMv7-M architecture does not specify the point at which the 
>processor recognizes any asynchronous exception that arrives during an 
>exception. If the processor recognizes a new exception while it is 
>tail-chaining another exception, and the new exception has higher priority 
>than the exception being tail-chained, then the processor can, instead, 
>take the new exception, using late-arrival preemption. It is 
>IMPLEMENTATION DEFINED what conditions, if any, lead to late arrival 
>preemption."

Thanks, Anders. I can see that there is interesting reading
ahead should I decide to use this architecture for certain
applications. I don't mind nuance, so long as it is
predictable.

From your points and the above, if a particular
implementation of the core is chosen (a specific part from a
specific manufacturer) then would it be possible to establish
timer interrupts together with crafted software in order to
drive I/O pins with guaranteed known latencies?

("Implementation defined" connotes to me that it may actually
be defined for some specific implementation.)

To put the question in concrete terms, assume there is a
background task running but that I want to use a timer to
trigger an ADC sample and hold circuit, followed by another
triggering the ADC conversion start, where an exact number of
CPU cycles from one to the other is vital... and do this
WITHOUT the use of a timer counter output module designed in
hardware?

(That isn't a real example. I would normally use the output
module's features. But removing that possibility gets at the
question I'm asking better without having to describe the
real application in detail. So assume no hardware support
except for the timer interrupt event.)

Thanks by the way for what you've already added!

Jon

On Fri, 04 Jan 2013 12:40:59 -0800, Jon Kirwan
<jonk@infinitefactors.org> wrote:

><snip>

>>Mark B. wrote:
>>Doesn't the ability to rotate right by 1 to 32 bits in a single
>>cycle imply a barrel shifter?
>
>I suppose. The one in the ADSP-21xx requires much more logic.
>The ADSP-21xx barrel shifter can do both normalization and
>denormalization in a single cycle. Lane changes alone is, in
>my mind, only part of the job. Once you have the ability to
>do a 0-31 lane change, it's a shame to not add the gates for
>normalization.

See this:
 http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_2.pdf

It covers the 2100 Family barrel shifter unit, starting on
page 2-22 (section 2.4).

The overview says,

"The shifter provides a complete set of shifting functions
for 16-bit inputs, yielding a 32-bit output. These include
arithmetic shift, logical shift and normalization. The
shifter also performs derivation of exponent and derivation
of common exponent for an entire block of numbers. These
basic functions can be combined to efficiently implement any
degree of numerical format control, including full
floating-point representation."

My kind of barrel shifter module. Wouldn't mind a 32x64. But
this is quite tolerable.
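
To make "derivation of exponent" and "normalization" concrete, here is a
small C model of the two operations for a single 16-bit input. It is a
sketch for illustration only: the function names are invented, the exponent
sign convention (minus the number of redundant sign bits) is an assumption
for this example, and it models the arithmetic only, not the shifter's
32-bit output field or its single-cycle timing.

```c
#include <stdint.h>

/* Number of redundant sign bits in a 16-bit two's-complement value,
   i.e. how many bits below the sign bit merely repeat it (0..15). */
static int redundant_sign_bits16(int16_t x)
{
    uint16_t u = (uint16_t)x;
    int n = 0;

    if (x < 0)
        u = (uint16_t)~u;      /* fold negatives so we count zeros */

    for (uint16_t mask = 0x4000; mask != 0 && (u & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Shift x left until no redundant sign bits remain.  Stores the
   derived exponent (assumed convention: -shift) through *exponent. */
static int16_t normalize16(int16_t x, int *exponent)
{
    int n = redundant_sign_bits16(x);

    *exponent = -n;
    /* Shift as unsigned; converting back to int16_t assumes the usual
       two's-complement wraparound. */
    return (int16_t)(uint16_t)((uint16_t)x << n);
}
```

For example, 0x1234 has two redundant sign bits, so it normalizes to 0x48D0
with a derived exponent of -2 under the convention assumed here.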

For interrupt latency (and I was using the timer here and had
complete control over the memory system), see:
 http://www.lr.ttu.ee/~juliad/IRZ0070/21xxUM/Chap_3.pdf

In this case, section 3.4.3.1, page 3-19ff.

"For the timer interrupt on these processors, the latency
from when the interrupt occurs to when the first instruction
of the service routine is executed is only one cycle. This is
shown in Figure 3.3. The single cycle of latency is needed to
fetch the instruction stored at the interrupt vector
location."

My kind of interrupt latency variability.

Jon