EmbeddedRelated.com
Forums

Cortex-M3 vs PIC32 divide instruction

Started by Jon Kirwan September 6, 2011
On Wed, 7 Sep 2011 10:27:57 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

><snip>
>I also have wondered - just vaguely, though - how do they accelerate
>division on various architectures, e.g. the power core I use now
>needs only 14 (or was it 16?) cycles for a 32/32, older implementations
>of that core (the original 603e, that is) needed 30+, 37 IIRC.
>I have been moaning so many times of having to write yet another
>division - I think the only architecture which saved me that was the
>68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
>machines) that I use the chance to ask Jon to share his findings,
>I am also really curious about it.
>
>Dimiter
I am similarly curious and would like to know the theoretical details. If I do uncover the details in the Cortex-M3 case, I'll write a little something about it here. It's possible that there are some university docs I'll find that clue me in. I might even get lucky and someone at ARM may respond kindly. It may be that someone here knows, too, but just hasn't said as much yet. Chances are this isn't some deep dark secret; just that I haven't yet come across it, is all. I am remarkably ignorant.

If you are interested in the details regarding the M4K (PIC32) method, then I can write a lot on that unremarkable topic. That one is easy. I could design the hardware myself almost in my sleep.

Jon
On 07/09/2011 19:27, dp wrote:

> Last (only...:) ) time I used a TI DSP was apr. 10 years ago,
> the 5420. Their divide was straight forward, use "subtract
> conditionally" in a repeat (penalty free) loop.
> I also have wondered - just vaguely, though - how do they accelerate
> division on various architectures, e.g. the power core I use now
> needs only 14 (or was it 16?) cycles for a 32/32, older implementations
> of that core (the original 603e, that is) needed 30+, 37 IIRC.
The basic division algorithm is a "subtract conditionally" loop, which is approximately one cycle per loop iteration. You can double the speed by simply doing two bits at a time. That takes more hardware - you have to do three comparisons in parallel rather than just one, but it's not too expensive (multiplying by 0b00, 0b01, and 0b10 are all zero cost, and multiplying by 0b11 is not hard). That trick does not scale well, however - doing another bit in each cycle means much more hardware, and the depth of the combinational logic involved will mean slower clock speeds.

You can also save a clock cycle or two at the ends of the algorithm by careful setup - the cycles are usually still there in the latency, but get hidden within the rest of the instruction pipeline.

Early-exit testing can also be done - typically once the remaining numerator part has been reduced to 0 (or to less than the denominator), you can do a fast exit. Some implementations may also have tricks like barrel-shifting at the start to "cancel out" any factors of 2 in the figures.

Beyond that, faster division is usually done by computing the reciprocal, then multiplying. That is particularly useful for large bitwidths and floating point (i.e., hardware 64-bit floating point).

For integer work, it is usually best to leave that to the compiler's optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a suitable n. This generally only makes sense if the CPU can quickly multiply numbers of twice the bitlength of x.
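For readers who want to see the "subtract conditionally" loop concretely, here is a software model of the one-bit-per-cycle restoring divider in C (a generic sketch, not any particular core's hardware; assumes den != 0):

```c
#include <stdint.h>

/* One-bit-per-cycle restoring division: each loop iteration mirrors
   one hardware cycle of the "subtract conditionally" loop. */
uint32_t restoring_div(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0;
    uint64_t rem = 0;   /* wide enough that the shift cannot overflow */

    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1); /* shift in next numerator bit */
        if (rem >= den) {                    /* the conditional subtract */
            rem -= den;
            quot |= 1u << i;
        }
    }
    *rem_out = (uint32_t)rem;
    return quot;
}
```

The two-bits-per-cycle variant David describes would test the partial remainder against 1x, 2x, and 3x the denominator in parallel and retire two quotient bits per iteration.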
> I have been moaning so many times of having to write yet another
> division - I think the only architecture which saved me that was the
> 68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
> machines) that I use the chance to ask Jon to share his findings,
> I am also really curious about it.
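David's divide-by-constant trick ("x/3" as a multiply and shift) can be checked concretely. A sketch in C, using the standard unsigned magic constant for division by 3 (n = 33, so the multiplier is ceil(2^33 / 3) = 0xAAAAAAAB):

```c
#include <stdint.h>

/* Divide-by-constant via reciprocal multiply, roughly as a compiler
   might emit for x/3 on a core with a fast 32x32->64 multiplier.
   0xAAAAAAABu == ceil(2^33 / 3); the 64-bit product is shifted
   right by 33 to recover the exact unsigned quotient. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

This is exact for every 32-bit unsigned x, which is why compilers prefer it whenever the double-width multiply is cheaper than a hardware divide.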
On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> On 07/09/2011 19:27, dp wrote:
> <snip>
> The basic division algorithm is a "subtract conditionally" loop, which
> is approximately one cycle per loop. You can double the speed by simply
> doing two bits at a time.
> <snip>
> For integer work, it is usually best to leave that to the compiler's
> optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a
> suitable n. This generally only makes sense if the cpu can quickly
> multiply numbers of twice the bitlength of x.
Oh I know what can be done, Jon apparently does too. Like I said, I have written numerous divisions on various machines. I did not use the "count leading zeroes" on the power 64/32 division, though (was tempted, but my time was a lot more important than the CPU's back then). Then dropping out on a 0 dividend (and/or shifting in advance by the lesser count of leading zeroes) can reduce execution time in some (many) cases, but it does add cycles to the worst case, so I am not sure I would go for it anyway; it would be application specific, I guess, if it goes that narrow to the wire.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
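The count-leading-zeroes pre-shift Dimiter mentions can be sketched like this (a hypothetical C illustration using GCC's `__builtin_clz` in place of a hardware count-leading-zeros instruction; not the PowerPC routine he describes; assumes den != 0):

```c
#include <stdint.h>

/* Start the subtract-conditionally loop at the dividend's highest set
   bit instead of at bit 31, trading a leading-zeros count for fewer
   loop iterations. A zero dividend exits immediately. */
uint32_t div_clz(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0;
    uint64_t rem = 0;
    int start = (num == 0) ? -1 : 31 - __builtin_clz(num);

    for (int i = start; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1);
        if (rem >= den) {
            rem -= den;
            quot |= 1u << i;
        }
    }
    *rem_out = (uint32_t)rem;
    return quot;
}
```

As Dimiter notes, the test and shift cost something up front, so the worst case gets slightly slower even though typical cases speed up.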
On Thu, 8 Sep 2011 03:40:42 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

>On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
>wrote:
>> On 07/09/2011 19:27, dp wrote:
>> <snip>
>> The basic division algorithm is a "subtract conditionally" loop, which
>> is approximately one cycle per loop.
>> <snip>
>
>Oh I know what can be done, Jon apparently does too.
Yup.
>Like I said I have written numerous divisions on various machines.
And David especially knows I have done so. We've talked about it. And he's aware of both my ignorance and skill. So I think he knows my boundaries.
>I did not use the "count leading zeroes" on the power 64/32 division
>though (was tempted but my time was a lot more important than the CPUs
>back then). Then dropping out on a 0 dividend (and/or shifting in
>advance by the least of leading zeroes) can reduce execution time in
>some (many) cases, but it does add cycles to the worst case so I am
>not sure I would go for it anyway, would be application specific
>I guess if it goes that narrow to the wire.
One doc I saw wrote "11 to 32" clock cycles for the division, but that was more of a marketing piece. Another, in a datasheet, I find a maximum of 35 cycles. That one I believe, as it accounts for leading byte checks and result posting. I've looked a little bit at the 5-stage pipe docs and it's pretty fancy. It includes two separate result bypass paths to accelerate results for following instructions, to avoid waiting for the register posting to take place. Nice.

Interestingly, although the PIC32 achieves a fair pace (up to 80 MHz), it isn't up to the MIPS synthesizable core, claimed to be over 400 MHz capable at 90nm and over 200 MHz capable at 130nm. Anyway, I'd love to delve into the details. Too bad the .v RTL modules aren't public. (Unless some kind soul can point me to them? ;)

Per other suggestions, I already plan to spend about $1000 or so getting various tools set up for the PIC32 (swapping in my older Pro Mate II tools, buying a REAL ICE and ICD3, and then some 'stuff' to update my toolset.) I already have the tools purchased for the midrange Energy Micro EFM32, which is a Cortex-M3, and have started testing development there. It was some work to get unlimited, free tools up and working -- thanks very much to CodeSourcery for that, by the way, as I am now in debt to them. In the next week or two, I will start up on the PIC32 as well.

The PIC32's MDU operates "autonomously." So it continues on a division _and_ following instructions, so long as an IU pipeline stall isn't triggered by the use of an MDU op. I am curious about how it functions in the presence of interrupts. But there is a LOT OF DOC to read, yet. So I am behind on that score. In any case, a cursory glance over the IEMAW pipeline doesn't seem to require a stall by itself. So I remain unsure.

I haven't read up on the Cortex-M3 UDIV and SDIV. Doc seems to be in several different places, too. But the Cortex-M3 may not be autonomous.
Looking at DDI0337G, page 1-12, Figure 1-2 seems to suggest it isn't autonomous and is squarely in the "Ex" pathway. So I'd guess a Cortex-M3 must wait for it. Either way, it's always a two-edged sword, so I don't think one is necessarily better than the other.

Noting from that same figure, register write-backs must complete in Ex -- no need for the PIC32 bypass routes. But that is probably more the reason why a Cortex-M3 would cycle slower on the same feature size/process, too.

I am curious about the Cortex-M3 UDIV and SDIV implementation in hardware. I'd love to see the details. But it is looking as though the CPU waits for it. So if you have a 12-cycle SDIV in progress, that figure seems to suggest to me that there will be a series of Ex pipeline stalls while the division completes and posts its results to registers.

Anyway, this is all on my 'off hours' for hobby work and will be some joy ahead.

Jon
On Sep 9, 2:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
> ...
> Noting from that same figure, register write-backs must
> complete in Ex -- no need for the PIC32 bypass routes. But
> that is probably more the reason why Cortex-M3 would cycle
> slower on the same feature size/process, too.
Division not stalling the pipeline may be pretty rare; at least I have not seen it on the power parts I use (5200B, a really nice one). But division just takes up everything.

As a side note, I found out the hard way the depth of the pipeline. I needed a MAC - FMADD, as they have it, FP multiply and accumulate. Naively, I did it in a loop as with a DSP and expected to get the specified 2 cycles per 64*64 FMADD. Got > 10. Ouch, this was close to ruining the entire design effort. So I spent a day or two and eventually wrestled down the data dependencies which were causing this; it took using 24 of the 32 FP regs to do so, though. Well, as a side benefit I saved some loads (once I had 8 samples and 8 coefficients in the regs I did not have to waste them and load again). So eventually I got 5.5 ns (2.5 ns per cycle) doing the loop (this includes everything, perhaps loading from DDRAM to cache at times etc.).

Well, this got way aside. I won't delete it though, not so many chances to talk work :-).

Dimiter
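The data-dependency fix Dimiter describes -- keeping several independent partial sums live so consecutive multiply-accumulates don't stall on each other -- looks roughly like this in C (a generic sketch with four accumulators; his PowerPC version used 24 of the 32 FP registers; assumes n is a multiple of 4):

```c
/* MAC loop with four independent accumulators: consecutive fused
   multiply-adds target different sums, so each one need not wait
   for the previous result to clear the FPU pipeline. */
double dot(const double *x, const double *c, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i]     * c[i];      /* these four lines have no   */
        s1 += x[i + 1] * c[i + 1];  /* dependencies on each other */
        s2 += x[i + 2] * c[i + 2];
        s3 += x[i + 3] * c[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The naive single-accumulator loop serializes on `s0` every iteration, which is exactly the >10-cycles-per-FMADD behavior he hit; splitting the sum lets the pipeline fill.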
On Thu, 8 Sep 2011 17:11:15 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

>On Sep 9, 2:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>> ...
>> Noting from that same figure, register write-backs must
>> complete in Ex -- no need for the PIC32 bypass routes. But
>> that is probably more the reason why Cortex-M3 would cycle
>> slower on the same feature size/process, too.
>
>Division not stalling the pipeline may be pretty rare, at
>least I have not seen it on the power parts I use (5200B,
>a really nice one). But division just takes up everything.
Just as an aside, Bipolar Integrated Technologies (BIT) made some FP hardware that included a fully combinatorial FP division chip. Remarkable piece of work. I don't think ANYONE else has done it. Of course, I think they are long gone now. I worked with an engineer who was part of the design process, though.

Getting back to the MIPS M4K/PIC32, their "blurbs" all use the term "autonomous," and the detailed pipeline descriptions I've laid eyes on appear to support that. I would post the details here, but I'm sure it would bore most folks. It would run a few pages and cover a lot of detail about each of the IEMAW pipeline stages.
>As a side note I found out the hard way the depth of the
>pipeline. Needed MAC - FMADD, as they have it, FP add and
>accumulate. Naively, I did it in a loop as with a DSP
>and expected to get the specified 2 cycles per 64*64 FMADD.
>Got > 10.
Ouch, as you say.
>Ouch, this was close to ruining the entire
>design effort. So I spent a day or two and eventually
>wrestled down the data dependencies which were causing this;
>it took using 24 of the 32 FP regs to do so, though.
>Well, as a side benefit I saved some loads (once I
>have 8 samples and 8 coefficients in the regs I did not
>have to waste them and load again). So eventually I got
>5.5 nS (2.5 nS per cycle) doing the loop (this includes
>everything, perhaps loading from DDRAM to cache at times
>etc.).
This implies a GHz processor of some kind, doesn't it? And here all I'm talking about is 30-80 MHz clocks. I won't be getting any 2.5 ns per cycle.
>Well, this got way aside. I won;t delete it though, not >so many chances to talk work :-).
I love hand-crafting code to make the most of a processor that isn't on the bleeding edge of technology. Part, not all, of that includes meticulous attention to detail -- perhaps the kind of thing that appeals to those who do ships in a bottle? I like it when some part of the project requires that kind of thing. It helps fill out another fun dimension to a project that is bounded on all sides by many worries and considerations, which all must be weighed.

By the way, I've taken note of the barrel shifter in the Cortex-M3. There appears to be a 'lane changer' in the A stage of the M4K. But I am not seeing a barrel shifter. So more to look at and compare, I guess.

Jon
On Sep 9, 4:24 am, Jon Kirwan <j...@infinitefactors.org> wrote:
> >.... So eventually I got
> >5.5 nS (2.5 nS per cycle) doing the loop (this includes
> >everything, perhaps loading from DDRAM to cache at times
> >etc.).
>
> This implies a GHz processor of some kind, doesn't it? And
> here all I'm talking about is 30-80MHz clocks. I won't be
> getting any 2.5ns per cycle.
Well, the 5200B is a 400 MHz part. It would be nice to have it in the GHz range, but I don't see that coming (it has an unbelievably convenient/flexible/you-name-it DMA engine which Freescale have abandoned; it has probably been perceived as "too complex" for "most" users by the top brass or something).
> ...
> By the way, I've taken note of the barrel shifter in the
> Cortex-M3. There appears to be a 'lane changer' in the A
> state of the M4k. But I am not seeing a barrel shifter. So
> more to look at and compare, I guess.
Not sure how either of these (lane changer/barrel shifter) is designed, I never designed any. But MIPS surely does have single-cycle shifts, "barrel"-like? Probably something like the rlwnm and rlwimi on power (rotate left then AND with mask, rotate left then insert under mask control). I have never used MIPS, but it looks way closer to what power is. The greatest ARM disadvantage I see is that it has only 16 registers; this can be rather limiting on a load/store machine.

Dimiter
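The rotate-then-mask semantics Dimiter refers to (Power's rlwnm) can be modeled in C -- a sketch of what the operation computes, not of either core's shifter hardware:

```c
#include <stdint.h>

/* Rotate left, then AND with a mask: the C-level picture of what a
   single-cycle barrel shifter plus mask unit (e.g. Power's rlwnm)
   produces in one instruction. */
uint32_t rotl_and(uint32_t x, unsigned n, uint32_t mask)
{
    n &= 31;  /* rotate amount is taken mod 32 */
    uint32_t rotated = (n == 0) ? x : ((x << n) | (x >> (32 - n)));
    return rotated & mask;
}
```

rlwimi additionally merges the masked result into a destination register, i.e. `(rotated & mask) | (dest & ~mask)`, which is what makes it handy for bitfield insertion.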
On Thu, 8 Sep 2011 20:07:48 -0700 (PDT), Dimiter wrote:

>On Sep 9, 4:24 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>> >.... So eventually I got
>> >5.5 nS (2.5 nS per cycle) doing the loop (this includes
>> >everything, perhaps loading from DDRAM to cache at times
>> >etc.).
>>
>> This implies a GHz processor of some kind, doesn't it? And
>> here all I'm talking about is 30-80MHz clocks. I won't be
>> getting any 2.5ns per cycle.
>
>Well the 5200B is a 400MHz part,
Yeah, that 2.5ns figure. I just meant a lot closer to GHz than I will likely see soon.
>would be nice to have it
>in GHz range but I don't see it coming (has an unbelievably
>convenient/flexible/you name it DMA engine which Freescale
>have abandoned, probably has been perceived as "too complex"
>for "most" users by the top brass or something).
Interesting you bring up DMA. I was using a SiLabs part to gain access to an internal, 1 MHz, 16-bit SAR ADC (which, if external, would have cost me as much as the chip or more), and it required DMA -- given that this is an 8051-style CPU, that won't be surprising to anyone. Turns out, the documentation on the DMA section is poor, and if you really want to know exactly what you are doing when you copy someone else's supposedly working code, then the doc is not adequate to the task.

So I sent off my questions, worked with the local disty, they pressed, I pressed, we all pressed. But SiLabs (US, anyway) didn't know. Thing is, they had to track down the DMA section designer, who apparently now is off-shore; I think in Singapore or something. Took them months. He'd designed it 8 years before that time. But I got my answer, at least. Took maybe three months to do?

Thing is, no one else had ever asked the questions I asked. Yet they were the kinds of "off by 1" questions that I think people who did ANYTHING OTHER THAN USE BOILERPLATE CODE would have asked. So it is clear to me that those using the chip either used some canned library or else just copied and pasted sample code. I must have been the only person actually caring about understanding how to write my own stuff for the DMA. Or else the answer would have been much easier to get.
>> ...
>> By the way, I've taken note of the barrel shifter in the
>> Cortex-M3. There appears to be a 'lane changer' in the A
>> state of the M4k. But I am not seeing a barrel shifter. So
>> more to look at and compare, I guess.
>
>Not sure how either of these (lane changer/barrel shifter)
>is designed, I never designed any. But MIPS sure do have
>single cycle shifts "barrel" like? Probably something
>like the rlwnm and rlwimi on power (rotate left then
>and with mask, rotate left then insert under mask control).
>I have never used MIPS but it looks way closer to what
>power is. The greatest ARM disadvantage I see is that
>it has only 16 registers, this can be rather limiting on a
>load/store machine.
There's a point. I need to think about all this in the context of a few specific algorithms to see how it all pans out. I am enjoying this. Jon
On Thu, 08 Sep 2011 18:24:35 -0700, Jon Kirwan
<jonk@infinitefactors.org> wrote:

>
>Just as an aside, Bipolar Integrated Technologies (BIT) made
>some FP hardware that included a fully combinatorial FP
>division chip. Remarkable piece of work. I don't think
>ANYONE else has done it. Of course, I think they are long
>gone now. I worked with an engineer who was part of the
>design process, though.
Unless you're using "fully combinatorial" in a way I'm not expecting, that seems improbable, unless the FP numbers were very short. It's certainly theoretically possible, but the number of terms that you'd get would be astronomical. Assuming 24-bit mantissas and (incorrectly) assuming you didn't need to look at anything else, you'd be looking at 48 input bits affecting each output bit, and probably on the order of 2**44 to 2**45 input terms (each one an AND gate averaging about 45 inputs) for each of the output bits.

Now if they implemented a full hardware divider, that's certainly possible, although if they went through that much trouble I hope they'd pipeline it for throughput.
On Thu, 08 Sep 2011 23:32:47 -0500, Robert Wessel
<robertwessel2@yahoo.com> wrote:

>On Thu, 08 Sep 2011 18:24:35 -0700, Jon Kirwan
><jonk@infinitefactors.org> wrote:
>
>>Just as an aside, Bipolar Integrated Technologies (BIT) made
>>some FP hardware that included a fully combinatorial FP
>>division chip. Remarkable piece of work. I don't think
>>ANYONE else has done it. Of course, I think they are long
>>gone now. I worked with an engineer who was part of the
>>design process, though.
>
>Unless you're using "fully combinatorial" in a way I'm not expecting,
>that seems improbable, unless the FP numbers were very short.
I'm dredging my memory and I'm glad you questioned this. I'm now thinking I must have been wrong about the combinatorial nature -- it almost must have been some kind of internal free-running oscillator detail that escaped my notice then, and my memory converted that into 'fully combinatorial' when it shouldn't have. I recall that one "set inputs, waited out a path delay, and got the result." It's possible that I interpreted that poorly through the fog of memory and that there was some kind of freely running internal clock mechanism. If so, I would not have been savvy enough at the time to have asked that question, and today I would have lost other details that might have corrected my memory there. (These were part of a set of ECL FP chips around the late 1980's and up through about 1990 or so. And very fast.)
>It's certainly theoretically possible, but the number of terms that
>you'd get would be astronomical. Assuming 24 bit mantissas and
>(incorrectly) assuming you didn't need to look at anything else, you'd
>be looking at 48 input bits affecting each output bit, and probably on
>the order of 2**44 to 2**45 input terms (each one an and gate
>averaging about 45 inputs) for each of the output bits.
I am with you on this. You make a good point.
>Now if they implemented a full hardware divider, that's certainly
>possible, although if they went through that much trouble I hope
>they'd pipeline it for throughput.
I now also remember something about a triple-diffused poly-Si bipolar process with some kind of special "self-aligning" nature that allowed for very small feature size at the time. It was impressive for its time, though. Very much so.

Now I'm mad at myself for not verifying my memory before writing. Your point is good. I'm very curious about exactly what they did and didn't do. So I'm going to pay the price and look right now and see if there are any details on the web....

Hmm... Okay, there is US Patent 5153848 (and others, too.) That one is titled "Floating Point Processor with Internal Free-Running Clock," and says in the Abstract, "The multiplier and divider are pipelined internally, driven by a fast, two-phase internal clock that is transparent to the user." So I think I know where my confusion came from.

Thanks, Robert.

Jon