Reply by Jon Kirwan February 14, 2009
On Sat, 14 Feb 2009 12:33:13 -0800 (PST), rickman <gnuarm@gmail.com>
wrote:

>Actually, this discussion has become more than academic. I am
>estimating a job that will be using floating point in an FPGA. I have
>found three open source floating point packages. Over the next few
>days/weeks I will be digging into this.
I haven't been following this side-thread, and if I'm speaking out of
turn, I apologize. But I remember that Bipolar Integrated Technology
(Beaverton, Oregon) had developed a combinatorial floating point
divider. A designer I worked with on a project, in fact, used the
parts for a board he designed then, and that's how he described the
chip to me. I can't speak for support on the QNaNs and so on, though.

Jon
Reply by rickman February 14, 2009
On Feb 13, 4:28 pm, rickman <gnu...@gmail.com> wrote:
> On Feb 13, 3:29 pm, Vladimir Vassilevsky <antispam_bo...@hotmail.com>
> wrote:
> > Everett M. Greene wrote:
> > > Vladimir Vassilevsky <antispam_bo...@hotmail.com> writes:
> > > > rickman wrote:
> > > > > Any idea how they are performing the divide that it runs as fast
> > > > > as the multiply?
> > > > 1. There are the fast hardware dividers which essentially perform
> > > > several steps of the trivial division algorithm at once.
> > > What are these "trivial" steps?
> > Elementary school algorithm. Each step consists of compare/subtract
> > and shift.
>
> I'm a little fuzzy on this. How exactly do you perform multiple steps
> in parallel if each step requires the result of the previous step to
> evaluate the comparison? By the fourth step there are eight
> conditions that have to be evaluated. For a 24 bit mantissa how many
> bits are evaluated at once typically? How does this run as fast as a
> multiply?
>
> Maybe the topic changed, but the original discussion was about
> software floating point. Grant Edwards said that his measurements of
> a software library gave equal timing results for multiply and divide.
> I was simply describing the algorithm used in hardware to explain why
> division was much slower in that case. I don't see how any of these
> methods would be any faster, and in fact I expect them to be slower
> than Newton-Raphson iteration.
>
> > > > 2. The division can be computed as the multiplication by 1/x,
> > > > where 1/x is computed as the Taylor series. The whole series can
> > > > be computed in parallel if the hardware allows for that.
> > > The Taylor series for 1/x is 1/x.
> > :))))))))
> >
> >      1
> >    ----- = 1 + x + x^2 + x^3 ....
> >    1 - x
> >
> > > How does that help?
> > A lot. The whole series can be computed in parallel in one step.
>
> Yes, this can, in theory, be computed in one step. When working with
> 32 bit floating point, just how many elements of this series does it
> take to get a reasonable accuracy and how much logic does it take to
> perform all those calculations?
>
> Here is an interesting paper I found.
>
> http://www.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-enenkel.pdf
>
> Rick
Actually, this discussion has become more than academic. I am
estimating a job that will be using floating point in an FPGA. I have
found three open source floating point packages. Over the next few
days/weeks I will be digging into this.

Rick
Reply by Everett M. Greene February 14, 2009
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> Everett M. Greene wrote:
> > Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> > > rickman wrote:
> > > > Any idea how they are performing the divide that it runs as fast
> > > > as the multiply?
> > > 1. There are the fast hardware dividers which essentially perform
> > > several steps of the trivial division algorithm at once.
> > What are these "trivial" steps?
>
> Elementary school algorithm. Each step consists of compare/subtract
> and shift.
Correct, but you said several can be computed at once. I took that to mean that several bits of the division could be done at once. Perhaps you meant that all the operations for one bit position can be performed at one time.
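To make "several bits at once" concrete, here is a minimal C sketch of
a radix-4 restoring divider (the function and variable names are made
up for illustration). Each step retires two quotient bits; the three
compares against d, 2d, and 3d are written sequentially here, but
hardware evaluates them side by side against precomputed multiples of
the divisor:

    #include <stdint.h>

    /* Radix-4 restoring division: two quotient bits per step.
       Assumes d != 0.  In hardware the three comparisons below
       are evaluated in parallel. */
    uint32_t div_radix4(uint32_t n, uint32_t d)
    {
        uint64_t rem = 0;
        uint32_t q = 0;
        for (int i = 30; i >= 0; i -= 2) {
            rem = (rem << 2) | ((n >> i) & 3);  /* next 2 dividend bits */
            int digit = 0;
            if      (rem >= 3ULL * d) digit = 3;
            else if (rem >= 2ULL * d) digit = 2;
            else if (rem >= d)        digit = 1;
            rem -= (uint64_t)digit * d;         /* restoring subtract */
            q = (q << 2) | digit;
        }
        return q;                               /* remainder is left in rem */
    }

Higher radices trade more parallel comparators for fewer steps, which
is how a hardware divider can approach multiplier speed.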
> > > 2. The division can be computed as the multiplication by 1/x,
> > > where 1/x is computed as the Taylor series. The whole series can
> > > be computed in parallel if the hardware allows for that.
> >
> > The Taylor series for 1/x is 1/x.
>
> :))))))))
>
>      1
>    ----- = 1 + x + x^2 + x^3 ....
>    1 - x
Only if 0 <= x < 1 (the series converges only for |x| < 1), and you
put some minus signs in the expression.
> > How does that help?
>
> A lot. The whole series can be computed in parallel in one step.
Reply by rickman February 13, 2009
On Feb 13, 3:29 pm, Vladimir Vassilevsky <antispam_bo...@hotmail.com>
wrote:
> Everett M. Greene wrote:
> > Vladimir Vassilevsky <antispam_bo...@hotmail.com> writes:
> > > rickman wrote:
> > > > Any idea how they are performing the divide that it runs as fast
> > > > as the multiply?
> > > 1. There are the fast hardware dividers which essentially perform
> > > several steps of the trivial division algorithm at once.
> > What are these "trivial" steps?
>
> Elementary school algorithm. Each step consists of compare/subtract
> and shift.
I'm a little fuzzy on this. How exactly do you perform multiple steps
in parallel if each step requires the result of the previous step to
evaluate the comparison? By the fourth step there are eight
conditions that have to be evaluated. For a 24 bit mantissa, how many
bits are evaluated at once typically? How does this run as fast as a
multiply?

Maybe the topic changed, but the original discussion was about
software floating point. Grant Edwards said that his measurements of
a software library gave equal timing results for multiply and divide.
I was simply describing the algorithm used in hardware to explain why
division was much slower in that case. I don't see how any of these
methods would be any faster, and in fact I expect them to be slower
than Newton-Raphson iteration.
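For comparison, a minimal sketch of the Newton-Raphson reciprocal in C
(the seed constant here is an assumption; real hardware seeds from a
small lookup table). Each pass roughly doubles the number of correct
bits, so a coarse seed plus two or three passes covers a 24-bit
mantissa:

    #include <stdio.h>

    /* Newton-Raphson reciprocal: x' = x * (2 - d*x).  The error is
       squared on every pass, so the count of good bits doubles. */
    static double recip_nr(double d)
    {
        double x = 2.0 / 3.0;           /* crude seed for d in [1,2) */
        for (int i = 0; i < 5; i++)     /* 5 passes ~ full double precision */
            x = x * (2.0 - d * x);
        return x;
    }

    int main(void)
    {
        double d = 1.732;               /* a normalized mantissa in [1,2) */
        printf("%.17g vs %.17g\n", recip_nr(d), 1.0 / d);
        return 0;
    }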
> > > 2. The division can be computed as the multiplication by 1/x,
> > > where 1/x is computed as the Taylor series. The whole series can
> > > be computed in parallel if the hardware allows for that.
> >
> > The Taylor series for 1/x is 1/x.
>
> :))))))))
>
>      1
>    ----- = 1 + x + x^2 + x^3 ....
>    1 - x
>
> > How does that help?
>
> A lot. The whole series can be computed in parallel in one step.
Yes, this can, in theory, be computed in one step. When working with
32 bit floating point, just how many elements of this series does it
take to get a reasonable accuracy, and how much logic does it take to
perform all those calculations?

Here is an interesting paper I found:

http://www.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-enenkel.pdf

Rick
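As a rough answer to the term-count question above: normalize d into
[0.5, 1) and set y = 1 - d, so 0 < y <= 0.5 and term n of the series
contributes at most 2^-n; about 25 terms reach single-precision
accuracy. A minimal sketch in C (the function name is made up):

    /* Reciprocal via the geometric series 1/(1-y) = 1 + y + y^2 + ...
       For d normalized into [0.5, 1), y = 1 - d is at most 0.5, so
       ~25 terms give 24-bit accuracy.  Hardware with enough area
       could form the powers of y and sum them in parallel. */
    static double recip_series(double d, int terms)
    {
        double y = 1.0 - d;
        double sum = 0.0, p = 1.0;
        for (int n = 0; n < terms; n++) {
            sum += p;                   /* add y^n */
            p *= y;
        }
        return sum;
    }

That is a lot of multipliers and adders for one reciprocal, which is
presumably why real designs lean on lookup tables plus Newton-Raphson
or Goldschmidt iteration instead.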
Reply by Vladimir Vassilevsky February 13, 2009

Everett M. Greene wrote:
> Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> > rickman wrote:
> > > Any idea how they are performing the divide that it runs as fast
> > > as the multiply?
> > 1. There are the fast hardware dividers which essentially perform
> > several steps of the trivial division algorithm at once.
>
> What are these "trivial" steps?
Elementary school algorithm. Each step consists of compare/subtract and shift.
> > 2. The division can be computed as the multiplication by 1/x, where
> > 1/x is computed as the Taylor series. The whole series can be
> > computed in parallel if the hardware allows for that.
>
> The Taylor series for 1/x is 1/x.
:))))))))

     1
   ----- = 1 + x + x^2 + x^3 ....
   1 - x
> How does that help?
A lot. The whole series can be computed in parallel in one step.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by Everett M. Greene February 13, 2009
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> rickman wrote:
> > Any idea how they are performing the divide that it runs as fast as
> > the multiply?
>
> 1. There are the fast hardware dividers which essentially perform
> several steps of the trivial division algorithm at once.
What are these "trivial" steps?
> 2. The division can be computed as the multiplication by 1/x, where
> 1/x is computed as the Taylor series. The whole series can be
> computed in parallel if the hardware allows for that.
The Taylor series for 1/x is 1/x. How does that help?
> 3. LUT and approximation.
Reply by CBFalconer February 12, 2009
Paul Keinanen wrote:
> Grant Edwards <grante@visi.com> wrote:
> > CBFalconer <cbfalconer@yahoo.com> wrote:
> > > In general, when using software floating point, you will find that
> > > addition (or subtraction) is the slowest basic operation, due to
> > > the need to find a common 'size' to inflict on both operands.
> > > Division is the next slowest, and multiplication the fastest.
> >
> > I've not found that to be true on any of the platforms I've
> > benchmarked. For example, I timed the four operations on a
> > 6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times).
>
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
>
> However, if the instruction set contains a single-cycle, reasonably
> wide unsigned integer multiply instruction, the float add/mul
> execution times would be much closer to each other.
For a complete example of an 8080 system, designed for speed and
accuracy, including trig, log, and exponential functions, see:
"Falconer Floating Point Arithmetic" by Charles Falconer, in DDJ,
March 1979, p.4 and April 1979, p.16. There were later improvements,
basically minor, which improved the multiply and divide times.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Reply by CBFalconer February 12, 2009
Grant Edwards wrote:
> CBFalconer <cbfalconer@yahoo.com> wrote:
> > In general, when using software floating point, you will find that
> > addition (or subtraction) is the slowest basic operation, due to
> > the need to find a common 'size' to inflict on both operands.
> > Division is the next slowest, and multiplication the fastest.
>
> I've not found that to be true on any of the platforms I've
> benchmarked. For example, I timed the four operations on a
> 6800, and add/sub was about 1ms, and mult/div was about 4ms.
Try adding two values with magnitudes differing by (roughly) the
register size, meaning the size of the register that holds the
integral part of the FP value.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
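To make that worst case concrete, a minimal sketch of the pre-add
alignment (illustrative names; assumes normalized operands with
e_big >= e_small). Without a barrel shifter the alignment loop runs
one instruction per bit of exponent difference, so operands a
register-width apart cost a full word of shifts before the add even
starts:

    #include <stdint.h>

    /* Pre-add mantissa alignment: shift the smaller operand right by
       the exponent difference.  With no barrel shifter, each trip
       around the loop is a real shift instruction. */
    static uint32_t aligned_add(uint32_t m_big, int e_big,
                                uint32_t m_small, int e_small)
    {
        int shift = e_big - e_small;    /* assumed >= 0 */
        while (shift-- > 0)
            m_small >>= 1;              /* one instruction per bit */
        return m_big + m_small;         /* sum may still need renormalizing */
    }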
Reply by Grant Edwards February 12, 2009
On 2009-02-12, Paul Keinanen <keinanen@sci.fi> wrote:

> > I've not found that to be true on any of the platforms I've
> > benchmarked. For example, I timed the four operations on a
> > 6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times).
>
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
>
> However, if the instruction set contains a single-cycle, reasonably
> wide unsigned integer multiply instruction, the float add/mul
> execution times would be much closer to each other.
Good point. The platforms I'm remembering didn't have hw multiply
(or if they did, it was pretty narrow). Oddly, the platforms where
I've used floating point were all slow (and often 8-bit). I've used
the ARM7 quite a bit, which has a barrel shifter and hw multiply, but
I never did floating point on that platform.

--
Grant
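For reference, the shift-and-add mantissa multiply Paul describes
comes out roughly like this in C (a sketch with made-up names): one
conditional add and one shift per bit, i.e. 24 trips around the loop
on a part with no multiply instruction.

    #include <stdint.h>

    /* 24x24 -> 48 bit mantissa multiply by repeated shift-and-add. */
    static uint64_t mul24(uint32_t a, uint32_t b)
    {
        uint64_t acc = 0;
        for (int i = 0; i < 24; i++) {  /* one pass per mantissa bit */
            if (b & 1)
                acc += (uint64_t)a << i;
            b >>= 1;
        }
        return acc;                     /* full 48-bit product */
    }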
Reply by Jon Kirwan February 12, 2009
On Thu, 12 Feb 2009 12:43:51 -0500, Walter Banks
<walter@bytecraft.com> wrote:

>As several people have pointed out, the biggest time issue is
>normalization on processors that don't have a barrel shifter.
Just to add a little. For software implementations I've done for
division, for example, the two inputs are already presumed to be
normalized, and the iterative division algorithm takes up the hog's
share of the cycle count. Re-normalizing is usually hardly more than
a few instructions to cover a few conditions.

Where normalization has bit me is when first packing previously
non-normalized (fixed format) values prior to an integer (or FP)
division in order to maximize useful bits in the result, and with
addition and subtraction where de-normalizing of one operand or the
other is required. Often, I'll choose to instead perform the addition
entirely in fixed point, jacking up the numerators so that a common
divisor is assumed, and then performing the normalization and final
division in a last step.

Combinatorial barrel shifters are a big plus, often neglected in ALU
designs for integer processors. They take space, though, and end-use
designers often don't look for them, so I suppose they don't score
well on the must-do list for manufacturers.

Something else that probably doesn't rank high on the must-do list,
as many aren't even aware of the possibility and don't look for it,
is a simple, single-bit-producing instruction for integer division
that can be used as part of a sequence to achieve fuller divisions.
The gates required are close to nil (a trivial addition to ALU die
space and no change to the longest combinatorial path, I think). The
larger cost may be pressure on the instruction space and having to
write more documentation.

Jon
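A minimal sketch of the kind of single-bit divide step described
above (illustrative C with made-up names; assumes d != 0):

    #include <stdint.h>

    /* One "divide step": shift the next dividend bit into the partial
       remainder, then compare/subtract the divisor and report the
       quotient bit.  In hardware this is little more than the ALU's
       existing subtractor plus a mux. */
    static inline uint32_t div_step(uint64_t *rem, uint32_t bit, uint32_t d)
    {
        *rem = (*rem << 1) | bit;
        if (*rem >= d) {
            *rem -= d;
            return 1;
        }
        return 0;
    }

    /* Iterating the step once per bit gives a full unsigned divide. */
    uint32_t divide(uint32_t n, uint32_t d)
    {
        uint64_t rem = 0;
        uint32_t q = 0;
        for (int i = 31; i >= 0; i--)
            q = (q << 1) | div_step(&rem, (n >> i) & 1, d);
        return q;                       /* remainder is left in rem */
    }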