On Fri, 4 Jan 2013 14:06:22 -0800, Mark Borgerson
<mborgerson@comcast.net> wrote:
>In article <v1eee8lqemj1eqn0gd4bg86r2914gvq7rv@4ax.com>,
>jonk@infinitefactors.org says...
>>
>> On Fri, 4 Jan 2013 08:26:23 -0800, Mark Borgerson
>> <mborgerson@comcast.net> wrote:
>>
>> >In article <e65de8t4vl1fh4srvt3ul0eb5lhs40bp70@4ax.com>,
>> >jonk@infinitefactors.org says...
>> >>
>> >> On Thu, 3 Jan 2013 19:10:05 -0800, Mark Borgerson
>> >> <mborgerson@comcast.net> wrote:
>> >>
>> >> >In article <gq3ce898jeru18r5ufgarts0tb7kfl88ri@4ax.com>,
>> >> >jonk@infinitefactors.org says...
><<SNIP>>
>> >> >
>> >> >I think all the Cortex M3 and M4s have single cycle barrel shifters and
>> >> >single-cycle multiply. Integer divides can take a few cycles.
>> >>
>> >> I don't think the M4 has a barrel shifter -- not one that is
>> >> available to the instruction set. The ADSP-21xx could find
>> >> the leading bit in a 16 bit word in 1 clock, in a 32-bit word
>> >> in two clocks (two separate instructions.) But during that
>> >> time, I could also do two memory moves per cycle, as well.
>> >
>> >Doesn't the ability to rotate right by 1 to 32 bits in a single
>> >cycle imply a barrel shifter?
>>
>> I suppose. The one in the ADSP-21xx requires much more logic.
>> The ADSP-21xx barrel shifter can do both normalization and
>> denormalization in a single cycle. Lane changing alone is, in
>> my mind, only part of the job. Once you have the ability to
>> do a 0-31 lane change, it's a shame to not add the gates for
>> normalization.
>>
>> >I think the Cortex M4 can find the leading bit in a 32-bit register
>> >with the CLZ (Count Leading Zeroes) instruction in a single cycle.
>>
>> If this is a processor with a floating point unit, it's not
>> something I care about. I'd be looking for integer units (as
>> I wouldn't want to waste power on clocking substantial die
>> space when not in use.)
>
>There are control bits that enable Cortex M4 FPU, but I don't know
>whether they control the FPU clocks or just access to the registers.
>
>The Cortex M3 chips also have the shift and CLZ instructions, but don't
>have the floating point unit. IIRC, they are code compatible (and some
>of the STM32s are pin-compatible). Peripheral registers and the
>memory map may be different---but the ARM cores are pretty similar.
>
>I replaced an MSP430 with a Cortex M3 in an instrument that measures the
>frequency output of a pressure sensor. I got about the same power
>dissipation but 8 times higher resolution due to the difference between
>measuring period with an 8MHz clock and a 60MHz clock. Battery drain
>was minimized by shutting down the CPU clock between the input capture
>interrupts and by shutting off all peripherals except the timers.
>When 64K of buffer RAM got filled, it was written to the SD card.
>The MSP430 had to write to SD much more often because it had only
>about 10K of RAM and a slower SPI-based SD card interface.
Interesting tidbit. Thanks, Mark.
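The resolution gain is easy to check on paper, too. A rough sketch in
C, assuming a single-period input capture and a made-up 30 kHz sensor
frequency (the actual sensor frequency wasn't given); the function
names are just for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Timer ticks counted across one period of the sensor signal. */
uint32_t ticks_per_period(uint32_t timer_hz, uint32_t signal_hz)
{
    assert(signal_hz != 0);
    return timer_hz / signal_hz;
}

/* Smallest frequency change, in millihertz, that moves the captured
 * count by one tick.  With N = f_timer / f, a one-tick change in N
 * corresponds to roughly f * f / f_timer. */
uint32_t freq_step_millihz(uint32_t timer_hz, uint32_t signal_hz)
{
    assert(timer_hz != 0);
    return (uint32_t)(((uint64_t)signal_hz * signal_hz * 1000u) / timer_hz);
}
```

With those assumed numbers, the 8MHz timer sees 266 ticks per period
(a step of about 112.5 Hz) and the 60MHz timer sees 2000 ticks (a step
of about 15 Hz). That is a factor of 7.5, which squares with the
"about 8 times" figure.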
>> A quick google tells me there is an M4 and an M4F, but then
>> looking at the web page below you point towards, I see that
>> there is a chapter (3.11) called "Floating-point
>> instructions" underneath the heading of "Cortex-M4 Devices
>> Generic User Guide"... so I don't know if all of them include
>> FP or if some do and some don't and which you may be
>> discussing here.
>
>I know there are (were) Kinetis M4 chips without the FPU, but I
>think all the STM32F4 chips have the FPU. The IAR compiler
>has flags that allow you to choose hardware or software floating
>point. I think the GCC compiler does the same. I don't know
>if you get lower power dissipation if the FPU is present
>but not used.
Understood. It remains to be determined.
>A lot of the web blurbs point out that you can make a choice
>between clocking the CPU for 10 microseconds or the CPU + FPU
>for 1 microsecond in applications where you can sleep
>between calculations. I haven't gotten to the point of
>calculating the power advantages either way in any of
>my apps. However, you can now get fancy JTAG debug modules
>that measure power almost on a cycle-by-cycle basis and
>compute the power stats for you.
I have a REALLY FANCY board that does that from Energy Micro.
Got a great price on it and was very much impressed with all
it offered. Never used such a board before.
>The ChibiOS RTOS I'm playing with has the option to
>turn off the CPU clock during the idle thread. I'll have
>to try that out once I get my hobby autonomous navigation
>app running. I think that app will spend a lot of time
>in the idle thread between 1Hz gps updates.
I use that all the time with the MSP430, of course. It's kind
of standard practice there, I suppose. I do that with an O/S
I wrote, but I've not had reason to port it to the MSP430
yet. One can just sit on a halt instruction, if that's
available, too.
>> >> So it normalized and denormalized in 1 to 2 clocks depending
>> >> on the word size I was using. The number of shifts required
>> >> (or used) was stored in another register.
>> >
>> >> If you know of the instructions on the M4 that do that,
>> >> please let me know.
>> >
>> >The ARM reference suggests a way to normalize a 32-bit word
>> >in 2 clocks using the CLZ and shift instructions:
>> >
>> >"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the
>> >resulting Rd value to normalize the value of register Rm. Use MOVS,
>> >rather than MOV, to flag the case where Rm is zero:
>> > CLZ r5, r9
>> > MOVS r9, r9, LSL r5"
>> >
>> >http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CIHJJEIH.html
>>
>> Thanks!
>>
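For anyone following along, here is roughly what that CLZ-plus-shift
pair computes, written as a portable C sketch. The loop in clz32() is
only a stand-in for the single CLZ instruction, and the names are
hypothetical. It also keeps the shift count around, the way the
ADSP-21xx leaves it in a register:

```c
#include <stdint.h>

/* Portable stand-in for CLZ: number of leading zero bits in x.
 * Returns 32 for x == 0, which is what the instruction reports. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Normalize x so that bit 31 is set, and report the shift used.
 * Zero stays zero, with the shift reported as 32. */
uint32_t normalize32(uint32_t x, unsigned *shift)
{
    unsigned n = clz32(x);
    *shift = n;
    /* A 32-bit shift by 32 is undefined in C, so handle it here. */
    return (n < 32) ? (x << n) : 0;
}
```

So normalize32(0xF0, &s) gives 0xF0000000 with s set to 24. The zero
case is the one the MOVS in the quoted sequence is there to flag.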
>> >Of course, if you have an FPU, you would generally let it handle
>> >normalization and denormalization. IIRC, the CM4 can
>> >convert a 32-bit integer to IEEE-754 floating point with a
>> >single instruction. You may have to set some global rounding
>> >and saturation flags before that.
>>
>> I do specialized floating point which permits me to optimize
>> for the application. Generic FP is great for generic work.
>> Not great for some things where, for example, dynamic range
>> can be traded for precision or vice versa or where I know, a
>> priori, that an entire vector will all share the same
>> exponent. Just as a few real world examples in actual
>> applications already fielded.
>>
>Hmmm, that's a neat idea. I could see that happening
>with a lot of oceanographic instruments where the data doesn't
>vary by more than a factor of two over the interval
>of an FIR filter. (I do the FIR on the raw ADC counts.
>After demeaning, there's a bit more dynamic range.)
I used the idea a number of times for good benefit.
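A minimal sketch of the shared-exponent case, in C, assuming 16-bit
signed samples. This is an illustration only, not the code from any of
the fielded applications, and block_normalize() is a made-up name. The
whole vector is scaled by one common shift so the largest magnitude
fills the available bits, and that shift is the single exponent kept
for the block:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale every element of v by the same power of two so that the
 * largest magnitude has bit 14 set (bit 15 is the sign bit).
 * Returns the left shift applied, i.e. the shared exponent. */
unsigned block_normalize(int16_t *v, size_t n)
{
    uint16_t peak = 0;
    unsigned shift = 0;
    size_t i;

    assert(v != NULL);

    /* Find the largest magnitude in the block. */
    for (i = 0; i < n; i++) {
        uint16_t mag = (v[i] < 0) ? (uint16_t)(-(int32_t)v[i])
                                  : (uint16_t)v[i];
        if (mag > peak)
            peak = mag;
    }
    if (peak == 0)
        return 0;               /* all zeros: nothing to scale */

    /* Count how far the peak can move up without overflowing. */
    while (peak < 0x4000u) {
        peak <<= 1;
        shift++;
    }

    /* Apply the one shared shift to every element. */
    for (i = 0; i < n; i++)
        v[i] = (int16_t)((int32_t)v[i] * (1 << shift));

    return shift;
}
```

For the vector {100, -50, 25} the shared shift comes out as 8, leaving
{25600, -12800, 6400}. One exponent for the whole block, and the inner
loop of the filter stays pure integer.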
>> I want the core tools, but I want to write my own microcode
>> (in effect.) And I want small die space (better yield, lower
>> cost, lower power consumption.) Just give me the basic lower
>> level components of FP.
>
>As you've no doubt discovered---the basic lower-level component
>in much lower volume may cost much more per unit. Those
>billions of cell phones and tablets have driven ARM SOC chip
>prices to levels I wouldn't have imagined 5 years ago.
>
>Have you considered FPGAs? You could certainly get the
>chip you want---but the learning curve might be higher
>than you'd like.
I used the Xilinx 4000 series, years ago. Wrote in VHDL (and
also verilog) to design a CPU, for example, and test it out.
I really enjoyed the experiences, a lot. But no, they are (or
were at the time) expensive, big, power hungry, never exactly
the right size, etc. It wouldn't have been competitive.
I enjoyed the learning curve, already. Probably the more
difficult part for me, anyway, was the floorplanning part of
it. Maybe some folks enjoy that a lot. The automatic floor
planner was horrible at the time and even an idiot neophyte
like me could do better, at the time anyway.
>I never got past fairly simple CPLDs, but an undergrad
>that soldered boards for me for a few months told me that
>they used FPGAs in the control systems for the Oregon State
>Baja racer built as an ME student project.
>
>A properly sized and laid out FPGA seems to be the tool
>of choice for some applications requiring speed,
>deterministic behavior, etc. etc. They used to be
>all over in TVs, cable boxes, DVD players, etc. etc.
>The newer and faster ARM chips may have displaced many
>of the FPGAs and ASICs in consumer apps outside the direct
>video processing path for smaller companies. For Samsung,
>Apple, and Sony, I suppose custom chips and ASICs are still
>the way to go.
>
>IIRC, the ARM cores in iPhones and iPads have FPUs---
>although I don't know how much need those devices have
>for floating point. The burden in milliWatts and pennies
>must not be too high since Apple wants to squeeze out
>every possible minute of battery life and production cost.
It's an informed assumption of mine, based upon some
knowledge and experience here, that an ALU with the basic
tool box for FP work.. but NOT the entire IEEE floating point
support... will take up less die space with better yields and
lower cost to the manufacturer and consume less power (given
the same FAB and design rules) for a crafted application.
Less is tied into the clock chain and, besides, I often do
NOT want the IEEE FP, anyway.
IEEE FP sucks (power and money.) But it is great for people
who have no clue and just want something they don't need to
think much about.
Just give me the functional units and let me decide how to
use them for the application.
Jon