floating point calculations

Reply by Walter Banks ● February 12, 2009


> compiler will turn x / 4 into (x >> 2),
> assuming that it is used).

This is safe to do for unsigned numbers and numbers (x) that can be determined
to be positive at compile time.
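
A minimal sketch of why the sign matters, assuming a two's-complement target whose compiler implements `>>` on signed values as an arithmetic shift (C leaves this implementation-defined). The function names are made up for illustration. Division truncates toward zero, while the shift rounds toward negative infinity, so the two disagree for negative values that are not multiples of 4.

```c
#include <assert.h>
#include <stdint.h>

/* Signed division by 4: C99 and later truncate toward zero. */
int32_t div4(int32_t x) { return x / 4; }

/* Plain shift: on the assumed arithmetic-shift target this rounds toward
   negative infinity, so it differs from div4() for e.g. x = -7. */
int32_t shr2(int32_t x) { return x >> 2; }

/* Shift is a safe replacement when x is known to be non-negative. */
int32_t div4_nonneg(int32_t x)
{
    assert(x >= 0);
    return x >> 2;
}

/* One common correction for the signed case: add a bias of 3 to
   negative x before shifting, so the result truncates toward zero. */
int32_t div4_by_shift(int32_t x)
{
    int32_t bias = (x >> 31) & 3;   /* 3 if x < 0, else 0 (assumed arithmetic shift) */
    return (x + bias) >> 2;
}

/* Unsigned values need no correction at all. */
uint32_t udiv4(uint32_t x) { return x / 4u; }
uint32_t ushr2(uint32_t x) { return x >> 2; }
```

With these definitions, `div4(-7)` is -1 while `shr2(-7)` is -2 on such a target; `div4_by_shift` agrees with `div4` at the cost of two extra operations.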

Regards,

--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com

Reply by rickman ● February 12, 2009

On Feb 12, 10:30 am, Grant Edwards <gra...@visi.com> wrote:
> On 2009-02-12, CBFalconer <cbfalco...@yahoo.com> wrote:
>
> > In general, when using software floating point, you will find that
> > addition (or subtraction) is the slowest basic operation, due to
> > the need to find a common 'size' to inflict on both operands.
> > Division is the next slowest, and multiplication the fastest.
>
> I've not found that to be true on any of the platforms I've
> benchmarked.  For example, I timed the four operations on a
> 6800, and add/sub was about 1ms, and mult/div was about 4ms.

I am surprised at this result.  I worked on array processors many
years ago and division does not have a direct method of calculation.
Instead they used an iterative approximation method using multiplies
to get the estimate.  I believe that for 32 bit floating point (not
IEEE, it was before that) they used 7 iterations which got very
close.  On a 100 MFLOPS machine running at 25 MHz (ECL!) they did
adds and multiplies in the same time.  Of course, this is not
software, but the same complexity applies to software operations.  In
general the adds require a denormalization, the add and a
renormalization while the multiply only requires the multiply and
normalization steps.  However, the multiply can take more than one
operation compared to the add.
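
To illustrate what an iterative, multiply-based divide can look like, here is a minimal sketch using Newton-Raphson reciprocal refinement. This is one well-known scheme, not necessarily the one those array processors used, and the function name, the restriction of the divisor to [0.5, 1) and the use of hardware `double` (to keep the arithmetic readable) are all assumptions for illustration.

```c
#include <assert.h>

/* Approximate 1/d for a divisor already scaled into [0.5, 1), using
   only multiplies and subtracts.  Each pass roughly squares the
   relative error of the estimate. */
double recip_newton(double d, int iterations)
{
    assert(d >= 0.5 && d < 1.0);

    /* Standard linear first guess for this interval; its relative
       error is at most 1/17. */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;

    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);      /* Newton step for f(x) = 1/x - d */

    return x;
}
```

A quotient n/d is then n multiplied by `recip_newton(d, k)`. Because the error squares on each pass, the number of passes needed depends heavily on how good the first guess is.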

I have read that to implement the full IEEE spec requires a lot of
extra steps for error checking which will slow down all of it.

Any idea how they are performing the divide that it runs as fast as
the multiply?

Rick

Reply by Vladimir Vassilevsky ● February 12, 2009

rickman wrote:

> Any idea how they are performing the divide that it runs as fast as
> the multiply?

1. There are fast hardware dividers which essentially perform
several steps of the trivial division algorithm at once.

2. The division can be computed as the multiplication by 1/x, where 1/x 
is computed as the Taylor series. The whole series can be computed in 
parallel if the hardware allows for that.

3. LUT and approximation.
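
A rough sketch of option 2, assuming x has already been scaled into (0.5, 1]. Writing x = 1 - e gives 1/x = 1 + e + e^2 + ..., and since the powers of e do not depend on the running sum, hardware could form them side by side. The function name and the use of `double` are illustrative assumptions; the loop below is sequential and only shows the arithmetic.

```c
#include <assert.h>

/* Sum the first `terms` terms of 1/(1 - e) = 1 + e + e^2 + ...
   with e = 1 - x.  For x in (0.5, 1], e lies in [0, 0.5), so each
   extra term gains at least one bit of accuracy. */
double recip_series(double x, int terms)
{
    assert(x > 0.5 && x <= 1.0);

    double e = 1.0 - x;
    double sum = 0.0;
    double power = 1.0;             /* e^0 */

    for (int i = 0; i < terms; i++) {
        sum += power;
        power *= e;                 /* next power of e */
    }
    return sum;
}
```

The truncation error after n terms is e^n / (1 - e), so convergence is fastest when x is close to 1; a lookup table (option 3) is one way to get a starting point that makes e small.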

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Reply by Jon Kirwan ● February 12, 2009

On Thu, 12 Feb 2009 09:56:57 -0800 (PST), rickman <gnuarm@gmail.com>
wrote:

><snip>
>Any idea how they are performing the divide that it runs as fast as
>the multiply?

Or how they are performing a multiply that is as slow as their divide?

Jon

Reply by Grant Edwards ● February 12, 2009

On 2009-02-12, rickman <gnuarm@gmail.com> wrote:
> On Feb 12, 10:30 am, Grant Edwards <gra...@visi.com> wrote:
>> On 2009-02-12, CBFalconer <cbfalco...@yahoo.com> wrote:
>>
>> > In general, when using software floating point, you will find that
>> > addition (or subtraction) is the slowest basic operation, due to
>> > the need to find a common 'size' to inflict on both operands.
>> > Division is the next slowest, and multiplication the fastest.
>>
>> I've not found that to be true on any of the platforms I've
>> benchmarked.  For example, I timed the four operations on a
>> 6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> I am surprised at this result. I worked on array processors
> many years ago and division does not have a direct method of
> calculation. Instead they used an iterative approximation
> method using multiplies to get the estimate.  I believe that
> for 32 bit floating point (not IEEE, it was before that) they
> used 7 iterations which got very close.  On a 100 MFLOPS
> machine running at 25 MHz (ECL!) they did adds and
> multiplies in the same time.  Of course, this is not software,
> but the same complexity applies to software operations.  In
> general the adds require a denormalization, the add and a
> renormalization while the multiply only requires the multiply
> and normalization steps.  However, the multiply can take more
> than one operation compared to the add.
>
> I have read that to implement the full IEEE spec requires a
> lot of extra steps for error checking which will slow down all
> of it.
>
> Any idea how they are performing the divide that it runs as
> fast as the multiply?

That was years ago, so I may be mis-remembering something, and I no
longer have access to the floating point library in question
(IIRC, it was from US Software).  IIRC, it was probably a
68HC11 rather than a 6800 as I originally stated.

-- 
Grant Edwards                   grante             Yow! Did YOU find a
                                  at               DIGITAL WATCH in YOUR box
                               visi.com            of VELVEETA?

Reply by Paul Keinanen ● February 12, 2009

On Thu, 12 Feb 2009 09:30:50 -0600, Grant Edwards <grante@visi.com>
wrote:

>On 2009-02-12, CBFalconer <cbfalconer@yahoo.com> wrote:
>
>> In general, when using software floating point, you will find that
>> addition (or subtraction) is the slowest basic operation, due to
>> the need to find a common 'size' to inflict on both operands. 
>> Division is the next slowest, and multiplication the fastest.
>
>I've not found that to be true on any of the platforms I've
>benchmarked.  For example, I timed the four operations on a
>6800, and add/sub was about 1ms, and mult/div was about 4ms.

The 6800 did not have a multiply instruction and even the 6809
multiply was _slow_, thus you had to perform the 24x24 bit mantissa
multiplication by repeated shifts and adds (24 times). 
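
For reference, the shift-and-add loop looks roughly like this in C. It is a sketch with a made-up function name, and it uses 64-bit host integers for brevity; on an 8-bit CPU every add and shift below expands into a multi-byte sequence, which is where the time goes.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 24-bit mantissas with 24 shift/add steps, the way a
   CPU without a multiply instruction has to. */
uint64_t mul24_shift_add(uint32_t a, uint32_t b)
{
    assert(a <= 0xFFFFFFu && b <= 0xFFFFFFu);

    uint64_t product = 0;
    uint64_t addend = a;            /* a shifted left by the current bit position */

    for (int i = 0; i < 24; i++) {
        if (b & 1u)                 /* current multiplier bit set? */
            product += addend;
        addend <<= 1;
        b >>= 1;
    }
    return product;                 /* up to 48 bits wide */
}
```

The loop always runs all 24 steps, regardless of the operand values.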

In float add/sub the denormalisation+normalisation phases typically
required only a few bit shifts, seldom the full 24 bit shifts,
requiring a considerably smaller number of (8 bit) instructions than
the 24x24 bit multiply.
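
A sketch of the add path under simplifying assumptions: a toy positive-only format (the `soft_float` type and `sf_add` name are hypothetical), truncation instead of rounding, and no handling of zero, signs or exponent overflow. It shows that the variable cost is the alignment shift, whose length depends on the exponent difference.

```c
#include <assert.h>
#include <stdint.h>

/* Toy format: value = mant * 2^exp, mant normalised so bit 23 is set. */
typedef struct {
    uint32_t mant;
    int exp;
} soft_float;

soft_float sf_add(soft_float a, soft_float b)
{
    assert((a.mant >> 23) == 1u && (b.mant >> 23) == 1u);

    if (a.exp < b.exp) {            /* make a the operand with the larger exponent */
        soft_float t = a;
        a = b;
        b = t;
    }

    /* Denormalise: shift the smaller operand right to the common exponent.
       A difference of 24 or more shifts every bit out. */
    int shift = a.exp - b.exp;
    uint32_t aligned = (shift < 24) ? (b.mant >> shift) : 0u;

    soft_float r = { a.mant + aligned, a.exp };

    /* Renormalise: adding two positive values carries out by at most one bit. */
    if (r.mant & 0x1000000u) {
        r.mant >>= 1;
        r.exp += 1;
    }
    return r;
}
```

When the exponents are close, `shift` is small and the whole routine is a handful of operations; without a barrel shifter the cost grows with the exponent difference.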

However, if the instruction set contains a single-cycle, reasonably wide
unsigned integer multiply instruction, the float add/mul execution
times would be much closer to each other.

Paul

Reply by Jon Kirwan ● February 12, 2009

On Thu, 12 Feb 2009 12:43:51 -0500, Walter Banks
<walter@bytecraft.com> wrote:

>As several people have pointed out the biggest time issue is
>normalization on processors that don't have a barrel shifter.

Just to add a little.  For software implementations I've done for
division, for example, the two inputs are already presumed to be
normalized and the iterative division algorithm takes up the hog's
share of the cycle count.  Re-normalizing is usually hardly more than
a few instructions to cover a few conditions.  Where normalization has
bit me is when first packing previously non-normalized (fixed format)
values prior to an integer (or FP) division in order to maximize
useful bits in the result and with addition and subtraction where
de-normalizing of one or the other is required.  Often, I'll choose to
instead perform the addition entirely in fixed point, jacking up the
numerators so that a common divisor is assumed, and then performing
the normalization and final division in a last step.

Combinatorial barrel shifters are a big plus, often neglected in ALU
designs for integer processors.  It takes space though and end-use
designers often don't look for it so I suppose it doesn't score well
on the must-do list for manufacturers.

Something else that probably doesn't rank high on the must-do list, as
many aren't even aware of the possibility and don't look for it, is a
simple, single-bit producing instruction for integer division that can
be used as part of a sequence to achieve fuller divisions.  The gates
required are close to nil (trivial addition to ALU die space and no
change to the longest combinatorial path, I think.)  The larger cost
may be pressure on the instruction space and having to write more
documentation.
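
To make the idea concrete, here is a C model of what such a divide-step could do, written as one step of ordinary restoring division. The names (`div_state`, `div_step`, `udiv32`) are hypothetical and do not describe any particular processor's instruction. Each step shifts the remainder:quotient pair left by one bit, tries a subtract, and shifts the resulting quotient bit in; repeating it 32 times gives a full 32-bit unsigned divide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t rem;                   /* running remainder */
    uint32_t quo;                   /* dividend bits shift out, quotient bits shift in */
} div_state;

/* One quotient bit per call. */
static div_state div_step(div_state s, uint32_t divisor)
{
    uint32_t carry = s.rem >> 31;   /* bit lost when rem is shifted left */
    s.rem = (s.rem << 1) | (s.quo >> 31);
    s.quo <<= 1;
    if (carry || s.rem >= divisor) {
        s.rem -= divisor;           /* wraps correctly when carry is set */
        s.quo |= 1u;
    }
    return s;
}

/* Full unsigned divide built from 32 steps; rem_out may be NULL. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem_out)
{
    assert(d != 0);
    div_state s = { 0u, n };
    for (int i = 0; i < 32; i++)
        s = div_step(s, d);
    if (rem_out != NULL)
        *rem_out = s.rem;
    return s.quo;
}
```

In hardware the step is a shift, a subtract and a conditional select, which the ALU already has; in software the loop can be unrolled or cut short when the operands are known to be narrow.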

Jon

Reply by Grant Edwards ● February 12, 2009

On 2009-02-12, Paul Keinanen <keinanen@sci.fi> wrote:

>>I've not found that to be true on any of the platforms I've
>>benchmarked.  For example, I timed the four operations on a
>>6800, and add/sub was about 1ms, and mult/div was about 4ms.
>
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times). 
>
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
>
> However, if the instruction set contains single cycle reasonably wide
> unsigned integer multiply instruction, the float add/mul execution
> times would be much closer to  each other.

Good point.  The platforms I'm remembering didn't have hw
multiply (or if they did, it was pretty narrow).  Oddly, the
platforms where I've used floating point were all slow (and
often 8-bit).  I've used ARM7 quite a bit which has a
barrel-shifter and hw multiply, but never did floating point on
that platform.

-- 
Grant

Reply by CBFalconer ● February 12, 2009

Grant Edwards wrote:
> CBFalconer <cbfalconer@yahoo.com> wrote:
> 
>> In general, when using software floating point, you will find that
>> addition (or subtraction) is the slowest basic operation, due to
>> the need to find a common 'size' to inflict on both operands.
>> Division is the next slowest, and multiplication the fastest.
> 
> I've not found that to be true on any of the platforms I've
> benchmarked.  For example, I timed the four operations on a
> 6800, and add/sub was about 1ms, and mult/div was about 4ms.

Try adding two values whose magnitudes differ by (roughly) the register
size.  By register size I mean the width of whatever holds the integral
part of the FP value.

-- 
 [mail]: Chuck F (cbfalconer at maineline dot net) 
 [page]: <http://cbfalconer.home.att.net>
            Try the download section.

Reply by CBFalconer ● February 12, 2009

Paul Keinanen wrote:
> Grant Edwards <grante@visi.com> wrote:
>> CBFalconer <cbfalconer@yahoo.com> wrote:
>>
>>> In general, when using software floating point, you will find that
>>> addition (or subtraction) is the slowest basic operation, due to
>>> the need to find a common 'size' to inflict on both operands.
>>> Division is the next slowest, and multiplication the fastest.
>>
>> I've not found that to be true on any of the platforms I've
>> benchmarked.  For example, I timed the four operations on a
>> 6800, and add/sub was about 1ms, and mult/div was about 4ms.
> 
> The 6800 did not have a multiply instruction and even the 6809
> multiply was _slow_, thus you had to perform the 24x24 bit mantissa
> multiplication by repeated shifts and adds (24 times).
> 
> In float add/sub the denormalisation+normalisation phases typically
> required only a few bit shifts, seldom the full 24 bit shifts,
> requiring a considerably smaller number of (8 bit) instructions than
> the 24x24 bit multiply.
> 
> However, if the instruction set contains single cycle reasonably wide
> unsigned integer multiply instruction, the float add/mul execution
> times would be much closer to  each other.

For a complete example of an 8080 system, designed for speed and
accuracy, including trig, log, exponential functions, see:

   "Falconer Floating Point Arithmetic" by Charles Falconer,

in DDJ, March 1979, p.4 and April 1979, p.16.  There were later
improvements, basically minor, which improved the multiply and
divide times.

-- 
 [mail]: Chuck F (cbfalconer at maineline dot net) 
 [page]: <http://cbfalconer.home.att.net>
            Try the download section.