EmbeddedRelated.com
Forums

Cortex-M3 vs PIC32 divide instruction

Started by Jon Kirwan September 6, 2011
On Wed, 7 Sep 2011 10:27:57 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

><snip>
>I also have wondered - just vaguely, though - how do they accelerate
>division on various architectures, e.g. the power core I use now
>needs only 14 (or was it 16?) cycles for a 32/32, older implementations
>of that core (the original 603e, that is) needed 30+, 37 IIRC.
>I have been moaning so many times of having to write yet another
>division - I think the only architecture which saved me that was the
>68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
>machines) that I use the chance to ask Jon to share his findings,
>I am also really curious about it.
>
>Dimiter
I am similarly curious and would like to know the theoretical details. If I do uncover the details in the Cortex-M3 case, I'll write a little something about it here. It's possible that there are some university docs I'll find that clue me in. I might even get lucky and someone at ARM may respond kindly. It may be that someone here knows, too, but just hasn't said as much yet. Chances are this isn't some deep dark secret; just that I haven't yet come across it, is all. I am remarkably ignorant.

If you are interested in the details regarding the M4K (PIC32) method, then I can write a lot on that unremarkable topic. That one is easy. I could design the hardware myself almost in my sleep.

Jon
On 07/09/2011 19:27, dp wrote:

> Last (only...:) ) time I used a TI DSP was apr. 10 years ago,
> the 5420. Their divide was straight forward, use "subtract
> conditionally" in a repeat (penalty free) loop.
> I also have wondered - just vaguely, though - how do they accelerate
> division on various architectures, e.g. the power core I use now
> needs only 14 (or was it 16?) cycles for a 32/32, older implementations
> of that core (the original 603e, that is) needed 30+, 37 IIRC.
The basic division algorithm is a "subtract conditionally" loop, which is approximately one cycle per loop iteration. You can double the speed by simply doing two bits at a time. That takes more hardware - you have to do three comparisons in parallel rather than just one, but it's not too expensive (multiplying by 0b00, 0b01, and 0b10 are all zero cost, and multiplying by 0b11 is not hard). That trick does not scale well, however - doing another bit in each cycle means much more hardware, and the depth of the combinational logic involved will mean slower clock speeds.

You can also save a clock cycle or two at the ends of the algorithm by careful setup - the cycles are usually still there in the latency, but get hidden within the rest of the instruction pipeline.

Early-exit testing can also be done - typically once the remaining numerator part has been reduced to 0 (or to less than the denominator), you can do a fast exit. Some implementations may also have tricks like barrel-shifting at the start to "cancel out" any factors of 2 in the figures.

Beyond that, faster division is usually done by computing the reciprocal, then multiplying. That is particularly useful for large bitwidths and floating point (i.e., hardware 64-bit floating point).

For integer work, it is usually best to leave that to the compiler's optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a suitable n. This generally only makes sense if the CPU can quickly multiply numbers of twice the bitlength of x.
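For readers who want to see the "subtract conditionally" loop concretely, here is a software model of the one-bit-per-cycle restoring divider in C (a generic sketch, not any particular core's hardware; assumes den != 0):

```c
#include <stdint.h>

/* One-bit-per-cycle restoring division: each loop iteration mirrors
   one hardware cycle of the "subtract conditionally" loop. */
uint32_t restoring_div(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0;
    uint64_t rem = 0;   /* wide enough that the shift cannot overflow */

    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1); /* shift in next numerator bit */
        if (rem >= den) {                    /* the conditional subtract */
            rem -= den;
            quot |= 1u << i;
        }
    }
    *rem_out = (uint32_t)rem;
    return quot;
}
```

The two-bits-per-cycle variant David describes would test the partial remainder against 1x, 2x, and 3x the denominator in parallel and retire two quotient bits per iteration.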
> I have been moaning so many times of having to write yet another
> division - I think the only architecture which saved me that was the
> 68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
> machines) that I use the chance to ask Jon to share his findings,
> I am also really curious about it.
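David's divide-by-constant trick ("x/3" as a multiply and shift) can be checked concretely. A sketch in C, using the standard unsigned magic constant for division by 3 (n = 33, so the multiplier is ceil(2^33 / 3) = 0xAAAAAAAB):

```c
#include <stdint.h>

/* Divide-by-constant via reciprocal multiply, roughly as a compiler
   might emit for x/3 on a core with a fast 32x32->64 multiplier.
   0xAAAAAAABu == ceil(2^33 / 3); the 64-bit product is shifted
   right by 33 to recover the exact unsigned quotient. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

This is exact for every 32-bit unsigned x, which is why compilers prefer it whenever the double-width multiply is cheaper than a hardware divide.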
On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> On 07/09/2011 19:27, dp wrote:
> <snip>
> The basic division algorithm is a "subtract conditionally" loop, which
> is approximately one cycle per loop. You can double the speed by simply
> doing two bits at a time.
> <snip>
> For integer work, it is usually best to leave that to the compiler's
> optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a
> suitable n. This generally only makes sense if the cpu can quickly
> multiply numbers of twice the bitlength of x.
Oh I know what can be done, Jon apparently does too. Like I said, I have written numerous divisions on various machines. I did not use the "count leading zeroes" on the power 64/32 division, though (was tempted, but my time was a lot more important than the CPU's back then). Then dropping out on a 0 dividend (and/or shifting in advance by the lesser count of leading zeroes) can reduce execution time in some (many) cases, but it does add cycles to the worst case, so I am not sure I would go for it anyway; it would be application specific, I guess, if it goes that narrow to the wire.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
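The count-leading-zeroes pre-shift Dimiter mentions can be sketched like this (a hypothetical C illustration using GCC's `__builtin_clz` in place of a hardware count-leading-zeros instruction; not the PowerPC routine he describes; assumes den != 0):

```c
#include <stdint.h>

/* Start the subtract-conditionally loop at the dividend's highest set
   bit instead of at bit 31, trading a leading-zeros count for fewer
   loop iterations. A zero dividend exits immediately. */
uint32_t div_clz(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0;
    uint64_t rem = 0;
    int start = (num == 0) ? -1 : 31 - __builtin_clz(num);

    for (int i = start; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1);
        if (rem >= den) {
            rem -= den;
            quot |= 1u << i;
        }
    }
    *rem_out = (uint32_t)rem;
    return quot;
}
```

As Dimiter notes, the test and shift cost something up front, so the worst case gets slightly slower even though typical cases speed up.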
On Thu, 8 Sep 2011 03:40:42 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

>On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
>wrote:
>> On 07/09/2011 19:27, dp wrote:
>> <snip>
>> The basic division algorithm is a "subtract conditionally" loop, which
>> is approximately one cycle per loop.
>> <snip>
>
>Oh I know what can be done, Jon apparently does too.
Yup.
>Like I said I have written numerous divisions on various machines.
And David especially knows I have done so. We've talked about it. And he's aware of both my ignorance and skill. So I think he knows my boundaries.
>I did not use the "count leading zeroes" on the power 64/32 division
>though (was tempted but my time was a lot more important than the CPUs
>back then). Then dropping out on a 0 dividend (and/or shifting in
>advance by the least of leading zeroes) can reduce execution time in
>some (many) cases, but it does add cycles to the worst case so I am
>not sure I would go for it anyway, would be application specific
>I guess if it goes that narrow to the wire.
One doc I saw wrote "11 to 32" clock cycles for the division, but that was more of a marketing piece. Another, in a datasheet, I find a maximum of 35 cycles. That one I believe, as it accounts for leading byte checks and result posting. I've looked a little bit at the 5-stage pipe docs and it's pretty fancy. It includes two separate result bypass paths to accelerate results for following instructions, to avoid waiting for the register posting to take place. Nice.

Interestingly, although the PIC32 achieves a fair pace (up to 80 MHz), it isn't up to the MIPS synthesizable core, claimed to be over 400 MHz capable at 90nm and over 200 MHz capable at 130nm. Anyway, I'd love to delve into the details. Too bad the .v RTL modules aren't public. (Unless some kind soul can point me to them? ;)

Per other suggestions, I already plan to spend about $1000 or so getting various tools set up for the PIC32 (swapping in my older Pro Mate II tools, buying a REAL ICE and ICD3, and then some 'stuff' to update my toolset.) I already have the tools purchased for the midrange Energy Micro EFM32, which is a Cortex-M3, and have started testing development there. It was some work to get unlimited, free tools up and working -- thanks very much to CodeSourcery for that, by the way, as I am now in debt to them. In the next week or two, I will start up on the PIC32 as well.

The PIC32's MDU operates "autonomously." So it continues on a division _and_ following instructions, so long as an IU pipeline stall isn't triggered by the use of an MDU op. I am curious about how it functions in the presence of interrupts. But there is a LOT OF DOC to read, yet. So I am behind on that score. In any case, a cursory glance over the IEMAW pipeline doesn't seem to require a stall by itself. So I remain unsure.

I haven't read up on the Cortex-M3 UDIV and SDIV. Doc seems to be in several different places, too. But the Cortex-M3 may not be autonomous.
Looking at DDI0337G, page 1-12, Figure 1-2 seems to suggest it isn't autonomous and is squarely in the "Ex" pathway. So I'd guess a Cortex-M3 must wait for it. Either way, it's always a two-edged sword, so I don't think one is necessarily better than the other.

Noting from that same figure, register write-backs must complete in Ex -- no need for the PIC32 bypass routes. But that is probably more the reason why a Cortex-M3 would cycle slower on the same feature size/process, too.

I am curious about the Cortex-M3 UDIV and SDIV implementation in hardware. I'd love to see the details. But it is looking as though the CPU waits for it. So if you have a 12-cycle SDIV in progress, that figure seems to suggest to me that there will be a series of Ex pipeline stalls while the division completes and posts its results to registers.

Anyway, this is all on my 'off hours' for hobby work and will be some joy ahead.

Jon
On Sep 9, 2:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
> ...
> Noting from that same figure, register write-backs must
> complete in Ex -- no need for the PIC32 bypass routes. But
> that is probably more the reason why Cortex-M3 would cycle
> slower on the same feature size/process, too.
Division not stalling the pipeline may be pretty rare; at least I have not seen it on the power parts I use (5200B, a really nice one). But division just takes up everything.

As a side note, I found out the hard way the depth of the pipeline. I needed a MAC - FMADD, as they have it, FP multiply and accumulate. Naively, I did it in a loop as with a DSP and expected to get the specified 2 cycles per 64*64 FMADD. Got > 10. Ouch, this was close to ruining the entire design effort. So I spent a day or two and eventually wrestled down the data dependencies which were causing this; it took using 24 of the 32 FP regs to do so, though. Well, as a side benefit I saved some loads (once I had 8 samples and 8 coefficients in the regs I did not have to waste them and load again). So eventually I got 5.5 ns (2.5 ns per cycle) doing the loop (this includes everything, perhaps loading from DDRAM to cache at times etc.).

Well, this got way aside. I won't delete it though, not so many chances to talk work :-).

Dimiter
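The data-dependency fix Dimiter describes -- keeping several independent partial sums live so consecutive multiply-accumulates don't stall on each other -- looks roughly like this in C (a generic sketch with four accumulators; his PowerPC version used 24 of the 32 FP registers; assumes n is a multiple of 4):

```c
/* MAC loop with four independent accumulators: consecutive fused
   multiply-adds target different sums, so each one need not wait
   for the previous result to clear the FPU pipeline. */
double dot(const double *x, const double *c, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i]     * c[i];      /* these four lines have no   */
        s1 += x[i + 1] * c[i + 1];  /* dependencies on each other */
        s2 += x[i + 2] * c[i + 2];
        s3 += x[i + 3] * c[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The naive single-accumulator loop serializes on `s0` every iteration, which is exactly the >10-cycles-per-FMADD behavior he hit; splitting the sum lets the pipeline fill.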
On Thu, 8 Sep 2011 17:11:15 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

>On Sep 9, 2:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>> ...
>> Noting from that same figure, register write-backs must
>> complete in Ex -- no need for the PIC32 bypass routes. But
>> that is probably more the reason why Cortex-M3 would cycle
>> slower on the same feature size/process, too.
>
>Division not stalling the pipeline may be pretty rare, at
>least I have not seen it on the power parts I use (5200B,
>a really nice one). But division just takes up everything.
Just as an aside, Bipolar Integrated Technologies (BIT) made some FP hardware that included a fully combinatorial FP division chip. Remarkable piece of work. I don't think ANYONE else has done it. Of course, I think they are long gone now. I worked with an engineer who was part of the design process, though.

Getting back to the MIPS M4K/PIC32, their "blurbs" all use the term "autonomous," and the detailed pipeline descriptions I've laid eyes on appear to support that. I would post the details here, but I'm sure it would bore most folks. It would run a few pages and cover a lot of detail about each of the IEMAW pipeline stages.
>As a side note I found out the hard way the depth of the
>pipeline. Needed MAC - FMADD, as they have it, FP add and
>accumulate. Naively, I did it in a loop as with a DSP
>and expected to get the specified 2 cycles per 64*64 FMADD.
>Got > 10.
Ouch, as you say.
>Ouch, this was close to ruining the entire
>design effort. So I spent a day or two and eventually
>wrestled down the data dependencies which were causing this;
>it took using 24 of the 32 FP regs to do so, though.
>Well, as a side benefit I saved some loads (once I
>have 8 samples and 8 coefficients in the regs I did not
>have to waste them and load again). So eventually I got
>5.5 nS (2.5 nS per cycle) doing the loop (this includes
>everything, perhaps loading from DDRAM to cache at times
>etc.).
This implies a GHz processor of some kind, doesn't it? And here all I'm talking about is 30-80 MHz clocks. I won't be getting any 2.5 ns per cycle.
>Well, this got way aside. I won;t delete it though, not >so many chances to talk work :-).
I love hand-crafting code to make the most of a processor that isn't on the bleeding edge of technology. Part, not all, of that includes meticulous attention to detail -- perhaps the kind of thing that appeals to those who do ships in a bottle? I like it when some part of the project requires that kind of thing. It helps fill out another fun dimension to a project that is bounded on all sides by many worries and considerations, which all must be weighed.

By the way, I've taken note of the barrel shifter in the Cortex-M3. There appears to be a 'lane changer' in the A stage of the M4K. But I am not seeing a barrel shifter. So more to look at and compare, I guess.

Jon
On Sep 9, 4:24 am, Jon Kirwan <j...@infinitefactors.org> wrote:
> >.... So eventually I got
> >5.5 nS (2.5 nS per cycle) doing the loop (this includes
> >everything, perhaps loading from DDRAM to cache at times
> >etc.).
>
> This implies a GHz processor of some kind, doesn't it? And
> here all I'm talking about is 30-80MHz clocks. I won't be
> getting any 2.5ns per cycle.
Well, the 5200B is a 400 MHz part. It would be nice to have it in the GHz range, but I don't see that coming (it has an unbelievably convenient/flexible/you-name-it DMA engine which Freescale have abandoned; it has probably been perceived as "too complex" for "most" users by the top brass or something).
> ...
> By the way, I've taken note of the barrel shifter in the
> Cortex-M3. There appears to be a 'lane changer' in the A
> state of the M4k. But I am not seeing a barrel shifter. So
> more to look at and compare, I guess.
Not sure how either of these (lane changer/barrel shifter) is designed, I never designed any. But MIPS surely does have single-cycle shifts, "barrel"-like? Probably something like the rlwnm and rlwimi on power (rotate left then AND with mask, rotate left then insert under mask control). I have never used MIPS, but it looks way closer to what power is. The greatest ARM disadvantage I see is that it has only 16 registers; this can be rather limiting on a load/store machine.

Dimiter
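The rotate-then-mask semantics Dimiter refers to (Power's rlwnm) can be modeled in C -- a sketch of what the operation computes, not of either core's shifter hardware:

```c
#include <stdint.h>

/* Rotate left, then AND with a mask: the C-level picture of what a
   single-cycle barrel shifter plus mask unit (e.g. Power's rlwnm)
   produces in one instruction. */
uint32_t rotl_and(uint32_t x, unsigned n, uint32_t mask)
{
    n &= 31;  /* rotate amount is taken mod 32 */
    uint32_t rotated = (n == 0) ? x : ((x << n) | (x >> (32 - n)));
    return rotated & mask;
}
```

rlwimi additionally merges the masked result into a destination register, i.e. `(rotated & mask) | (dest & ~mask)`, which is what makes it handy for bitfield insertion.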
On Thu, 8 Sep 2011 20:07:48 -0700 (PDT), Dimiter wrote:

>On Sep 9, 4:24 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>> >.... So eventually I got
>> >5.5 nS (2.5 nS per cycle) doing the loop (this includes
>> >everything, perhaps loading from DDRAM to cache at times
>> >etc.).
>>
>> This implies a GHz processor of some kind, doesn't it? And
>> here all I'm talking about is 30-80MHz clocks. I won't be
>> getting any 2.5ns per cycle.
>
>Well the 5200B is a 400MHz part,
Yeah, that 2.5ns figure. I just meant a lot closer to GHz than I will likely see soon.
>would be nice to have it
>in GHz range but I don't see it coming (has an unbelievably
>convenient/flexible/you name it DMA engine which Freescale
>have abandoned, probably has been perceived as "too complex"
>for "most" users by the top brass or something).
Interesting you bring up DMA. I was using a SiLabs part to gain access to an internal, 1 MHz, 16-bit SAR ADC (which, if external, would have cost me as much as the chip or more), and it required DMA -- given that this is an 8051-style CPU, that won't be surprising to anyone. Turns out, the documentation on the DMA section is poor, and if you really want to know exactly what you are doing when you copy someone else's supposedly working code, then the doc is not adequate to the task.

So I sent off my questions, worked with the local disty, they pressed, I pressed, we all pressed. But SiLabs (US, anyway) didn't know. Thing is, they had to track down the DMA section designer, who apparently now is off-shore; I think in Singapore or something. Took them months. He'd designed it 8 years before that time. But I got my answer, at least. Took maybe three months to do?

Thing is, no one else had ever asked the questions I asked. Yet they were the kinds of "off by 1" questions that I think people who did ANYTHING OTHER THAN USE BOILERPLATE CODE would have asked. So it is clear to me that those using the chip either used some canned library or else just copied and pasted sample code. I must have been the only person actually caring about understanding how to write my own stuff for the DMA. Or else the answer would have been much easier to get.
>> ...
>> By the way, I've taken note of the barrel shifter in the
>> Cortex-M3. There appears to be a 'lane changer' in the A
>> state of the M4k. But I am not seeing a barrel shifter. So
>> more to look at and compare, I guess.
>
>Not sure how either of these (lane changer/barrel shifter)
>is designed, I never designed any. But MIPS sure do have
>single cycle shifts "barrel" like? Probably something
>like the rlwnm and rlwimi on power (rotate left then
>and with mask, rotate left then insert under mask control).
>I have never used MIPS but it looks way closer to what
>power is. The greatest ARM disadvantage I see is that
>it has only 16 registers, this can be rather limiting on a
>load/store machine.
There's a point. I need to think about all this in the context of a few specific algorithms to see how it all pans out. I am enjoying this. Jon
On Thu, 08 Sep 2011 18:24:35 -0700, Jon Kirwan
<jonk@infinitefactors.org> wrote:

>
>Just as an aside, Bipolar Integrated Technologies (BIT) made
>some FP hardware that included a fully combinatorial FP
>division chip. Remarkable piece of work. I don't think
>ANYONE else has done it. Of course, I think they are long
>gone now. I worked with an engineer who was part of the
>design process, though.
Unless you're using "fully combinatorial" in a way I'm not expecting, that seems improbable, unless the FP numbers were very short. It's certainly theoretically possible, but the number of terms that you'd get would be astronomical. Assuming 24-bit mantissas and (incorrectly) assuming you didn't need to look at anything else, you'd be looking at 48 input bits affecting each output bit, and probably on the order of 2**44 to 2**45 input terms (each one an AND gate averaging about 45 inputs) for each of the output bits.

Now if they implemented a full hardware divider, that's certainly possible, although if they went through that much trouble I hope they'd pipeline it for throughput.
On Thu, 08 Sep 2011 23:32:47 -0500, Robert Wessel
<robertwessel2@yahoo.com> wrote:

>On Thu, 08 Sep 2011 18:24:35 -0700, Jon Kirwan
><jonk@infinitefactors.org> wrote:
>
>>Just as an aside, Bipolar Integrated Technologies (BIT) made
>>some FP hardware that included a fully combinatorial FP
>>division chip. Remarkable piece of work. I don't think
>>ANYONE else has done it. Of course, I think they are long
>>gone now. I worked with an engineer who was part of the
>>design process, though.
>
>Unless you're using "fully combinatorial" in a way I'm not expecting,
>that seems improbable, unless the FP numbers were very short.
I'm dredging my memory and I'm glad you questioned this. I'm now thinking I must have been wrong about the combinatorial nature -- it almost must have been some kind of internal free-running oscillator detail that escaped my notice then, and my memory converted that into 'fully combinatorial' when it shouldn't have. I recall that one "set inputs, waited out a path delay, and got the result." It's possible that I interpreted that poorly through the fog of memory and that there was some kind of freely running internal clock mechanism. If so, I would not have been savvy enough at the time to have asked that question, and today I would have lost other details that might have corrected my memory there. (These were part of a set of ECL FP chips around the late 1980's and up through about 1990 or so. And very fast.)
>It's certainly theoretically possible, but the number of terms that
>you'd get would be astronomical. Assuming 24 bit mantissas and
>(incorrectly) assuming you didn't need to look at anything else, you'd
>be looking at 48 input bits affecting each output bit, and probably on
>the order of 2**44 to 2**45 input terms (each one an and gate
>averaging about 45 inputs) for each of the output bits.
I am with you on this. You make a good point.
>Now if they implemented a full hardware divider, that's certainly
>possible, although if they went through that much trouble I hope
>they'd pipeline it for throughput.
I now also remember something about a triple-diffused poly-Si bipolar process with some kind of special "self-aligning" nature that allowed for very small feature size at the time. It was impressive for its time, though. Very much so.

Now I'm mad at myself for not verifying my memory before writing. Your point is good. I'm very curious about exactly what they did and didn't do. So I'm going to pay the price and look right now and see if there are any details on the web....

Hmm... Okay, there is US Patent 5153848 (and others, too.) That one is titled "Floating Point Processor with Internal Free-Running Clock," and says in the Abstract, "The multiplier and divider are pipelined internally, driven by a fast, two-phase internal clock that is transparent to the user." So I think I know where my confusion came from.

Thanks, Robert.

Jon