Reply by Jon Kirwan February 14, 2009
On Sat, 14 Feb 2009 12:33:13 -0800 (PST), rickman <gnuarm@gmail.com>
wrote:

>Actually, this discussion has become more than academic. I am
>estimating a job that will be using floating point in an FPGA. I have
>found three open source floating point packages. Over the next few
>days/weeks I will be digging into this.
I haven't been following this side-thread, and if I'm speaking out of
turn, I apologize. But I remember that Bipolar Integrated Technology
(Beaverton, Oregon) had developed a combinatorial floating point
divider. A designer I worked with on a project, in fact, used the
parts for a board he designed then, and that's how he described the
chip to me. I can't speak for support on the QNaNs and so on, though.

Jon
Reply by rickman February 14, 2009
On Feb 13, 4:28 pm, rickman <gnu...@gmail.com> wrote:
> On Feb 13, 3:29 pm, Vladimir Vassilevsky <antispam_bo...@hotmail.com>
> wrote:
> > Everett M. Greene wrote:
> > > Vladimir Vassilevsky <antispam_bo...@hotmail.com> writes:
> > > > rickman wrote:
> > > > > Any idea how they are performing the divide that it runs as fast
> > > > > as the multiply?
> > > > 1. There are the fast hardware dividers which essentially perform
> > > > several steps of the trivial division algorithm at once.
> > > What are these "trivial" steps?
> > Elementary school algorithm. Each step consists of compare/subtract
> > and shift.
>
> I'm a little fuzzy on this. How exactly do you perform multiple steps
> in parallel if each step requires the result of the previous step to
> evaluate the comparison? By the fourth step there are eight
> conditions that have to be evaluated. For a 24 bit mantissa how many
> bits are evaluated at once typically? How does this run as fast as a
> multiply?
>
> Maybe the topic changed, but the original discussion was about
> software floating point. Grant Edwards said that his measurements of
> a software library gave equal timing results for multiply and divide.
> I was simply describing the algorithm used in hardware to explain why
> division was much slower in that case. I don't see how any of these
> methods would be any faster, and in fact I expect them to be slower
> than Newton-Raphson iteration.
>
> > > > 2. The division can be computed as the multiplication by 1/x,
> > > > where 1/x is computed as the Taylor series. The whole series can
> > > > be computed in parallel if the hardware allows for that.
> > > The Taylor series for 1/x is 1/x.
> > :))))))))
> >
> >      1
> >    ----- = 1 + x + x^2 + x^3 ....
> >    1 - x
> >
> > > How does that help?
> > A lot. The whole series can be computed in parallel in one step.
>
> Yes, this can, in theory, be computed in one step. When working with
> 32 bit floating point, just how many elements of this series does it
> take to get a reasonable accuracy and how much logic does it take to
> perform all those calculations?
>
> Here is an interesting paper I found.
>
> http://www.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-enenkel.pdf
>
> Rick
Actually, this discussion has become more than academic. I am
estimating a job that will be using floating point in an FPGA. I have
found three open source floating point packages. Over the next few
days/weeks I will be digging into this.

Rick
Reply by Everett M. Greene February 14, 2009
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> Everett M. Greene wrote:
> > Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> > > rickman wrote:
> > > > Any idea how they are performing the divide that it runs as fast
> > > > as the multiply?
> > > 1. There are the fast hardware dividers which essentially perform
> > > several steps of the trivial division algorithm at once.
> > What are these "trivial" steps?
>
> Elementary school algorithm. Each step consists of compare/subtract
> and shift.
Correct, but you said several can be computed at once. I took that to mean that several bits of the division could be done at once. Perhaps you meant that all the operations for one bit position can be performed at one time.
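To make "several bits at once" concrete, here is a minimal C sketch of
a radix-4 restoring divider (the function and variable names are made
up for illustration). Each step retires two quotient bits; the three
compares against d, 2d, and 3d are written sequentially here, but
hardware evaluates them side by side against precomputed multiples of
the divisor:

    #include <stdint.h>

    /* Radix-4 restoring division: two quotient bits per step.
       Assumes d != 0.  In hardware the three comparisons below
       are evaluated in parallel. */
    uint32_t div_radix4(uint32_t n, uint32_t d)
    {
        uint64_t rem = 0;
        uint32_t q = 0;
        for (int i = 30; i >= 0; i -= 2) {
            rem = (rem << 2) | ((n >> i) & 3);  /* next 2 dividend bits */
            int digit = 0;
            if      (rem >= 3ULL * d) digit = 3;
            else if (rem >= 2ULL * d) digit = 2;
            else if (rem >= d)        digit = 1;
            rem -= (uint64_t)digit * d;         /* restoring subtract */
            q = (q << 2) | digit;
        }
        return q;                               /* remainder is left in rem */
    }

Higher radices trade more parallel comparators for fewer steps, which
is how a hardware divider can approach multiplier speed.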
> > > 2. The division can be computed as the multiplication by 1/x,
> > > where 1/x is computed as the Taylor series. The whole series can
> > > be computed in parallel if the hardware allows for that.
> >
> > The Taylor series for 1/x is 1/x.
>
> :))))))))
>
>      1
>    ----- = 1 + x + x^2 + x^3 ....
>    1 - x
Only if 0 <= x < 1 (the series converges only for |x| < 1), and you
put some minus signs in the expression.
> > How does that help?
>
> A lot. The whole series can be computed in parallel in one step.
Reply by rickman February 13, 2009
On Feb 13, 3:29 pm, Vladimir Vassilevsky <antispam_bo...@hotmail.com>
wrote:
> Everett M. Greene wrote:
> > Vladimir Vassilevsky <antispam_bo...@hotmail.com> writes:
> > > rickman wrote:
> > > > Any idea how they are performing the divide that it runs as fast
> > > > as the multiply?
> > > 1. There are the fast hardware dividers which essentially perform
> > > several steps of the trivial division algorithm at once.
> > What are these "trivial" steps?
>
> Elementary school algorithm. Each step consists of compare/subtract
> and shift.
I'm a little fuzzy on this. How exactly do you perform multiple steps
in parallel if each step requires the result of the previous step to
evaluate the comparison? By the fourth step there are eight
conditions that have to be evaluated. For a 24 bit mantissa, how many
bits are evaluated at once typically? How does this run as fast as a
multiply?

Maybe the topic changed, but the original discussion was about
software floating point. Grant Edwards said that his measurements of
a software library gave equal timing results for multiply and divide.
I was simply describing the algorithm used in hardware to explain why
division was much slower in that case. I don't see how any of these
methods would be any faster, and in fact I expect them to be slower
than Newton-Raphson iteration.
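For comparison, a minimal sketch of the Newton-Raphson reciprocal in C
(the seed constant here is an assumption; real hardware seeds from a
small lookup table). Each pass roughly doubles the number of correct
bits, so a coarse seed plus two or three passes covers a 24-bit
mantissa:

    #include <stdio.h>

    /* Newton-Raphson reciprocal: x' = x * (2 - d*x).  The error is
       squared on every pass, so the count of good bits doubles. */
    static double recip_nr(double d)
    {
        double x = 2.0 / 3.0;           /* crude seed for d in [1,2) */
        for (int i = 0; i < 5; i++)     /* 5 passes ~ full double precision */
            x = x * (2.0 - d * x);
        return x;
    }

    int main(void)
    {
        double d = 1.732;               /* a normalized mantissa in [1,2) */
        printf("%.17g vs %.17g\n", recip_nr(d), 1.0 / d);
        return 0;
    }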
> > > 2. The division can be computed as the multiplication by 1/x,
> > > where 1/x is computed as the Taylor series. The whole series can
> > > be computed in parallel if the hardware allows for that.
> >
> > The Taylor series for 1/x is 1/x.
>
> :))))))))
>
>      1
>    ----- = 1 + x + x^2 + x^3 ....
>    1 - x
>
> > How does that help?
>
> A lot. The whole series can be computed in parallel in one step.
Yes, this can, in theory, be computed in one step. When working with
32 bit floating point, just how many elements of this series does it
take to get a reasonable accuracy, and how much logic does it take to
perform all those calculations?

Here is an interesting paper I found:

http://www.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-enenkel.pdf

Rick
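As a rough answer to the term-count question above: normalize d into
[0.5, 1) and set y = 1 - d, so 0 < y <= 0.5 and term n of the series
contributes at most 2^-n; about 25 terms reach single-precision
accuracy. A minimal sketch in C (the function name is made up):

    /* Reciprocal via the geometric series 1/(1-y) = 1 + y + y^2 + ...
       For d normalized into [0.5, 1), y = 1 - d is at most 0.5, so
       ~25 terms give 24-bit accuracy.  Hardware with enough area
       could form the powers of y and sum them in parallel. */
    static double recip_series(double d, int terms)
    {
        double y = 1.0 - d;
        double sum = 0.0, p = 1.0;
        for (int n = 0; n < terms; n++) {
            sum += p;                   /* add y^n */
            p *= y;
        }
        return sum;
    }

That is a lot of multipliers and adders for one reciprocal, which is
presumably why real designs lean on lookup tables plus Newton-Raphson
or Goldschmidt iteration instead.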
Reply by Vladimir Vassilevsky February 13, 2009

Everett M. Greene wrote:
> Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> > rickman wrote:
> > > Any idea how they are performing the divide that it runs as fast
> > > as the multiply?
> > 1. There are the fast hardware dividers which essentially perform
> > several steps of the trivial division algorithm at once.
>
> What are these "trivial" steps?
Elementary school algorithm. Each step consists of compare/subtract and shift.
> > 2. The division can be computed as the multiplication by 1/x, where
> > 1/x is computed as the Taylor series. The whole series can be
> > computed in parallel if the hardware allows for that.
>
> The Taylor series for 1/x is 1/x.
:))))))))

     1
   ----- = 1 + x + x^2 + x^3 ....
   1 - x
> How does that help?
A lot. The whole series can be computed in parallel in one step.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by Everett M. Greene February 13, 2009
Vladimir Vassilevsky <antispam_bogus@hotmail.com> writes:
> rickman wrote:
> > Any idea how they are performing the divide that it runs as fast as
> > the multiply?
>
> 1. There are the fast hardware dividers which essentially perform
> several steps of the trivial division algorithm at once.
What are these "trivial" steps?
> 2. The division can be computed as the multiplication by 1/x, where
> 1/x is computed as the Taylor series. The whole series can be
> computed in parallel if the hardware allows for that.
The Taylor series for 1/x is 1/x. How does that help?
> 3. LUT and approximation.
Reply by CBFalconer February 12, 2009
Paul Keinanen wrote:
> Grant Edwards <grante@visi.com> wrote:
> > CBFalconer <cbfalconer@yahoo.com> wrote:
> > > In general, when using software floating point, you will find that
> > > addition (or subtraction) is the slowest basic operation, due to
> > > the need to find a common 'size' to inflict on both operands.
> > > Division is the next slowest, and multiplication the fastest.
> >
> > I've not found that to be true on any of the platforms I've
> > benchmarked. For example, I timed the four operations on a
> > 6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times).
>
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
>
> However, if the instruction set contains a single-cycle, reasonably
> wide unsigned integer multiply instruction, the float add/mul
> execution times would be much closer to each other.
For a complete example of an 8080 system, designed for speed and
accuracy, including trig, log, and exponential functions, see:
"Falconer Floating Point Arithmetic" by Charles Falconer, in DDJ,
March 1979, p.4 and April 1979, p.16. There were later improvements,
basically minor, which improved the multiply and divide times.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Reply by CBFalconer February 12, 2009
Grant Edwards wrote:
> CBFalconer <cbfalconer@yahoo.com> wrote:
> > In general, when using software floating point, you will find that
> > addition (or subtraction) is the slowest basic operation, due to
> > the need to find a common 'size' to inflict on both operands.
> > Division is the next slowest, and multiplication the fastest.
>
> I've not found that to be true on any of the platforms I've
> benchmarked. For example, I timed the four operations on a
> 6800, and add/sub was about 1ms, and mult/div was about 4ms.
Try adding two values with magnitudes differing by (roughly) the
register size, meaning the size of the register that holds the
integral part of the FP value.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
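To make that worst case concrete, a minimal sketch of the pre-add
alignment (illustrative names; assumes normalized operands with
e_big >= e_small). Without a barrel shifter the alignment loop runs
one instruction per bit of exponent difference, so operands a
register-width apart cost a full word of shifts before the add even
starts:

    #include <stdint.h>

    /* Pre-add mantissa alignment: shift the smaller operand right by
       the exponent difference.  With no barrel shifter, each trip
       around the loop is a real shift instruction. */
    static uint32_t aligned_add(uint32_t m_big, int e_big,
                                uint32_t m_small, int e_small)
    {
        int shift = e_big - e_small;    /* assumed >= 0 */
        while (shift-- > 0)
            m_small >>= 1;              /* one instruction per bit */
        return m_big + m_small;         /* sum may still need renormalizing */
    }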
Reply by Grant Edwards February 12, 2009
On 2009-02-12, Paul Keinanen <keinanen@sci.fi> wrote:

> > I've not found that to be true on any of the platforms I've
> > benchmarked. For example, I timed the four operations on a
> > 6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times).
>
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
>
> However, if the instruction set contains a single-cycle, reasonably
> wide unsigned integer multiply instruction, the float add/mul
> execution times would be much closer to each other.
Good point. The platforms I'm remembering didn't have hw multiply
(or if they did, it was pretty narrow). Oddly, the platforms where
I've used floating point were all slow (and often 8-bit). I've used
the ARM7 quite a bit, which has a barrel shifter and hw multiply, but
I never did floating point on that platform.

--
Grant
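For reference, the shift-and-add mantissa multiply Paul describes
comes out roughly like this in C (a sketch with made-up names): one
conditional add and one shift per bit, i.e. 24 trips around the loop
on a part with no multiply instruction.

    #include <stdint.h>

    /* 24x24 -> 48 bit mantissa multiply by repeated shift-and-add. */
    static uint64_t mul24(uint32_t a, uint32_t b)
    {
        uint64_t acc = 0;
        for (int i = 0; i < 24; i++) {  /* one pass per mantissa bit */
            if (b & 1)
                acc += (uint64_t)a << i;
            b >>= 1;
        }
        return acc;                     /* full 48-bit product */
    }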
Reply by Jon Kirwan February 12, 2009
On Thu, 12 Feb 2009 12:43:51 -0500, Walter Banks
<walter@bytecraft.com> wrote:

>As several people have pointed out, the biggest time issue is
>normalization on processors that don't have a barrel shifter.
Just to add a little. For software implementations I've done for
division, for example, the two inputs are already presumed to be
normalized, and the iterative division algorithm takes up the hog's
share of the cycle count. Re-normalizing is usually hardly more than
a few instructions to cover a few conditions.

Where normalization has bit me is when first packing previously
non-normalized (fixed format) values prior to an integer (or FP)
division in order to maximize useful bits in the result, and with
addition and subtraction where de-normalizing of one operand or the
other is required. Often, I'll choose to instead perform the addition
entirely in fixed point, jacking up the numerators so that a common
divisor is assumed, and then performing the normalization and final
division in a last step.

Combinatorial barrel shifters are a big plus, often neglected in ALU
designs for integer processors. They take space, though, and end-use
designers often don't look for them, so I suppose they don't score
well on the must-do list for manufacturers.

Something else that probably doesn't rank high on the must-do list,
as many aren't even aware of the possibility and don't look for it,
is a simple, single-bit-producing instruction for integer division
that can be used as part of a sequence to achieve fuller divisions.
The gates required are close to nil (a trivial addition to ALU die
space and no change to the longest combinatorial path, I think). The
larger cost may be pressure on the instruction space and having to
write more documentation.

Jon
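A minimal sketch of the kind of single-bit divide step described
above (illustrative C with made-up names; assumes d != 0):

    #include <stdint.h>

    /* One "divide step": shift the next dividend bit into the partial
       remainder, then compare/subtract the divisor and report the
       quotient bit.  In hardware this is little more than the ALU's
       existing subtractor plus a mux. */
    static inline uint32_t div_step(uint64_t *rem, uint32_t bit, uint32_t d)
    {
        *rem = (*rem << 1) | bit;
        if (*rem >= d) {
            *rem -= d;
            return 1;
        }
        return 0;
    }

    /* Iterating the step once per bit gives a full unsigned divide. */
    uint32_t divide(uint32_t n, uint32_t d)
    {
        uint64_t rem = 0;
        uint32_t q = 0;
        for (int i = 31; i >= 0; i--)
            q = (q << 1) | div_step(&rem, (n >> i) & 1, d);
        return q;                       /* remainder is left in rem */
    }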