FFT Speeds
Started by ●March 30, 2020
Reply by ●April 1, 2020
Reply by ●April 1, 2020
On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:
> On 01/04/2020 04:53, Rick C wrote:
>
> <snipped>
>>
>> Personally I much prefer coding FPGAs in HDL than mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
>
> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
> could one use multiple shift and add multipliers instead of one fast
> hardware multiplier? Roughly how many shift-add multipliers could you
> fit in the same space as a fast multiplier?
>
> I use a 128 point FFT for a comms system, and a dsPIC handles this
> quickly enough with its two hardware multipliers and ancillary logic,
> but I've wondered if an FPGA with lots of parallel shift-add multipliers
> might be quicker.
>
> In my application, I can't use FPGAs with inbuilt processors or multipliers.

Ok, sometimes I don't completely follow what people have posted. By "shift and add multiplier" I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?

I can't think why you would use such a multiplier if you need many of them to do the job. I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact. The logic is a bit involved; someone once explained to me how it uses fewer computations, but I've completely forgotten the details.

It's almost certain that multiple multipliers of any type could be faster... for some situations. A 128 point FFT (complex or real? how many bits? fixed or float?) is not very large, so not a huge amount of computation. If you split the work, you need to consider the time to offload the data and return the results. A friend many years ago told me about evaluating an attached array processor on his PDP-11/70 in grad school. The calcs were fast, but the I/O was slow enough that the job ran faster on the CPU alone.

I expect this is real-time work, so you might want to stream the data into the FPGA and do more processing there. I would approach the problem with the work done on the FPGA by default, offloading to a CPU only the parts that are harder to do in an FPGA, like network connectivity. Many of my designs have no CPU.

Take a look at the iCE5LP series from Lattice: 2 or 4 DSP blocks, with 1100, 2048 or 3520 logic cells (LUT4s + FFs) and 8 or 16 kB of block RAM, in a 0.5 mm pitch, 7 x 7 mm, 48-pin QFN. A nice part for many small designs.

Can you explain why you can't use the built-in multipliers? I get not wanting to use a CPU, but a multiplier is a multiplier whether you construct it yourself or it's already in the FPGA.

--
Rick C.

-- Get 1,000 miles of free Supercharging
-- Tesla referral code - https://ts.la/richard11209
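The serial shift-and-add multiplier under discussion (one adder, reused over N clocks) can be modeled in a few lines of Python. This is an illustrative sketch added for clarity, not code from the thread, and the function name is made up:

```python
def shift_add_mult(a, b, nbits=16):
    """Model of a serial shift-and-add multiplier: one adder, reused
    once per clock for nbits clocks. Unsigned operands only, which is
    why signed use needs a 2's-complement step before and after."""
    product = 0
    for _ in range(nbits):          # one iteration per clock
        if b & 1:                   # LSB of the multiplier selects an add
            product += a            # the single shared adder
        a <<= 1                     # shift the multiplicand left each clock
        b >>= 1                     # expose the next multiplier bit
    return product

print(shift_add_mult(13, 11))  # 143
```

A 16-bit product thus takes 16 add/shift cycles through one adder; a fully parallel multiplier tree produces one product per clock at roughly 16 times the adder hardware, which is the trade-off the thread is weighing.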
Reply by ●April 1, 2020
On 2020-03-31 21:07, Clifford Heath wrote:
> On 1/4/20 1:14 am, dalai lamah wrote:
>> One fine day, Rick C wrote:
>>
>>> I'm not building anything, I'm just curious. I worked on a machine in
>>> the 80's called an attached array processor capable of 100 MFLOPS, the
>>> ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device
>>> would be required to do this today?
>>
>> A beefed-up microcontroller should be enough:
>>
>> http://www.ti.com/lit/an/spry288a/spry288a.pdf
>>
>> The C28x family runs up to 200 MHz and does a 1k complex FFT in 53219 cycles,
>> i.e. 0.27 ms. Of course there are caveats: you must run everything in RAM
>> and very tight.
>>
>> Some models are also dual core.
>
> All other things being equal, doubling the length of an FFT adds one
> complex multiply per sample. So performance is O(log2(N))
>
> Clifford Heath

But it also doubles the number of samples, so it's N log2(N).

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
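Phil Hobbs' correction (N log2(N) total, not log2(N)) is easy to tally: a radix-2 FFT has log2(N) stages of N/2 butterflies, one complex multiply each. A small Python check of that count, illustrative only and using the usual radix-2 accounting (trivial twiddle factors are not deducted):

```python
import math

def radix2_complex_mults(n):
    """Complex multiplies in an n-point radix-2 FFT:
    log2(n) stages, n/2 butterflies per stage, one multiply each."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return (n // 2) * int(math.log2(n))

# Total work grows as n*log2(n): each doubling of n adds one stage.
print(radix2_complex_mults(128))   # 448 (the 128-point case in the thread)
print(radix2_complex_mults(1024))  # 5120 (the 1K FFT discussed earlier)
```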
Reply by ●April 1, 2020
On 2/4/20 4:43 am, Paul Rubin wrote:
> Clifford Heath <no.spam@please.net> writes:
>> Ultimately I want an RPi-like module with a GPU that supports
>> CUDA...
>
> CUDA is basically Nvidia-specific. You're more likely to find OpenCL on
> those sorts of boards. E.g. a web search on "opencl odroid" gets some
> hits, though it's not clear it uses the odroid's gpu.

Many thanks for the pointer. It looks like the ODroid-XU3 and -XU4 (which have the Mali T628MP6 GPU) already have OpenCL support. I might need to reassign my ODroid-C2 to a different task and get an XU4.

Clifford Heath.
Reply by ●April 2, 2020
On 01/04/2020 20:20, Rick C wrote:
> On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:
>> On 01/04/2020 04:53, Rick C wrote:
>>
>> <snipped>
>>>
>>> Personally I much prefer coding FPGAs in HDL than mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
>>
>> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
>> could one use multiple shift and add multipliers instead of one fast
>> hardware multiplier? Roughly how many shift-add multipliers could you
>> fit in the same space as a fast multiplier?
>>
>> I use a 128 point FFT for a comms system, and a dsPIC handles this
>> quickly enough with its two hardware multipliers and ancillary logic,
>> but I've wondered if an FPGA with lots of parallel shift-add multipliers
>> might be quicker.
>>
>> In my application, I can't use FPGAs with inbuilt processors or multipliers.
>
> Ok, sometimes I don't completely follow what people have posted.

That's because it's not my field and I don't know the correct form of words.

> By shift and add multiplier I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?

Yes.

> I can't think why you would use such a multiplier if you need many of them to do the job. I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact. The logic is a bit involved but someone explained to me how it uses fewer computations, but I completely forget the details.

The last time I used a hardware multiplier outside a CPU, it was made with TTL chips and wire-wrapped. A Wallace tree, IIRC, taken from a big orange Texas databook. My experience is somewhat limited.

> It's almost certain that multiple multipliers of any type could be faster... for some situations. A 128 point FFT is not very large, so not a huge amount of computation. If you split the work you need to consider the time to offload the data and return the results. A friend many years ago told me about evaluating an attached array processor on his PDP-11/70 in grad school. The calcs were fast, but the I/O was slow enough it was faster on the CPU.
>
> I expect this is real-time work so you might want to stream the data into the FPGA and do more processing there. I would approach the problem with the work done on the FPGA by default, offloading to a CPU the parts that are harder to do in an FPGA, like network connectivity. Many of my designs have no CPU.
>
> Take a look at the iCE5LP series from Lattice. 2 or 4 DSP blocks with 1100, 2048 or 3520 logic cells (LUT4s + FFs) and 8 or 16 kB block RAM in a 0.5 mm pitch, 7 x 7 mm, 48-pin QFN. A nice part for many small designs.
>
> Can you explain why you can't use the built-in multipliers? I get not wanting to use a CPU, but a multiplier is a multiplier whether you construct it or it's already in the FPGA.

There is a possible future requirement to run at high temperatures and in a very high-vibration, space-constrained environment. This limits the size of the FPGA and the package type, in practice to something with, say, 20k gates - the actual part numbers known to work are confidential, but they won't be the latest and greatest.

It had occurred to me that, as the FFT needs lots of complex multiplications, these could be done using lots of smaller, simple shift-add multipliers instead of one large 'normal' multiplier - and for all I know, that's the way an expert would do it. But only if the speed/size trade-off worked.

--
Cheers

Clive
Reply by ●April 2, 2020
On Thursday, April 2, 2020 at 4:40:42 AM UTC-4, Clive Arthur wrote:
> On 01/04/2020 20:20, Rick C wrote:
>> On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:
>>> On 01/04/2020 04:53, Rick C wrote:
>>>
>>> <snipped>
>>>>
>>>> Personally I much prefer coding FPGAs in HDL than mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
>>>
>>> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
>>> could one use multiple shift and add multipliers instead of one fast
>>> hardware multiplier? Roughly how many shift-add multipliers could you
>>> fit in the same space as a fast multiplier?
>>>
>>> I use a 128 point FFT for a comms system, and a dsPIC handles this
>>> quickly enough with its two hardware multipliers and ancillary logic,
>>> but I've wondered if an FPGA with lots of parallel shift-add multipliers
>>> might be quicker.
>>>
>>> In my application, I can't use FPGAs with inbuilt processors or multipliers.
>>
>> Ok, sometimes I don't completely follow what people have posted.
>
> That's because it's not my field and I don't know the correct form of words.

No, I think your terminology is good; I just don't recall exactly what things mean from time to time. I've reached a point where I often know a topic (well, the parts I remember) but not the terminology.

>> By shift and add multiplier I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?
>
> Yes.
>
>> I can't think why you would use such a multiplier if you need many of them to do the job. I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact. The logic is a bit involved but someone explained to me how it uses fewer computations, but I completely forget the details.
>
> The last time I used a hardware multiplier outside a CPU it was made
> with TTL chips and wire wrapped. A Wallace tree IIRC taken from a big
> orange Texas databook. My experience is somewhat limited.

I've heard of Wallace trees but never used them. I think they are more useful with the sort of function blocks you have in TTL/ECL chips. When designing with gates, I believe the modified Booth's algorithm is faster, but I'm not sure. It certainly fits an FPGA better, because FPGAs all have fast adder chains in the fabric logic, so Wallace tree logic would be less effective there.

>> It's almost certain that multiple multipliers of any type could be faster... for some situations. A 128 point FFT is not very large, so not a huge amount of computation. If you split the work you need to consider the time to offload the data and return the results.
>>
>> Take a look at the iCE5LP series from Lattice. 2 or 4 DSP blocks with 1100, 2048 or 3520 logic cells (LUT4s + FFs) and 8 or 16 kB block RAM in a 0.5 mm pitch, 7 x 7 mm, 48-pin QFN. A nice part for many small designs.
>>
>> Can you explain why you can't use the built-in multipliers? I get not wanting to use a CPU, but a multiplier is a multiplier whether you construct it or it's already in the FPGA.
>
> There is a possible future requirement to run at high temperatures and
> in a very high-vibration, space-constrained environment. This limits the
> size of the FPGA and the package type, in practice to something with, say,
> 20k gates - the actual part numbers known to work are confidential, but
> they won't be the latest and greatest.

Ok, that makes sense. The qualified part doesn't have multiplier blocks.

> It had occurred to me that, as the FFT needs lots of complex
> multiplications, these could be done using lots of smaller, simple
> shift-add multipliers instead of one large 'normal' multiplier - and for
> all I know, that's the way an expert would do it. But only if the
> speed/size trade-off worked.

A shift-add multiplier is one layer of a pipelined multiplier, so you can have N shift-add multipliers of M bits, or N/M Booth's multipliers. The amount of logic will be about the same, other than whatever optimization the Booth recoding offers - I'm not sure it is less logic, but it's faster for sure. Anyway, you should be able to put a lot of them into 20 kLUTs. No, wait, you said "gates". I think that tells me which FPGA family you are working with; no one uses "gates" anymore, and these guys do a lot of "qualified" business. That explains the lack of multipliers. 20 kgates isn't really much logic either. I'm guessing around 3,000 logic elements, and these can be either logic or FF but not both? I wish I could remember more, but these days you can put A <= B * C in your code and you will get a library function block. Let the tools work it out.

--
Rick C.

-+ Get 1,000 miles of free Supercharging
-+ Tesla referral code - https://ts.la/richard11209
Reply by ●April 2, 2020
On Thu, 2 Apr 2020 09:40:34 +0100, Clive Arthur <cliveta@nowaytoday.co.uk> wrote:
> On 01/04/2020 20:20, Rick C wrote:
>> On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:
>>> On 01/04/2020 04:53, Rick C wrote:
>>>
>>> <snipped>
>>>>
>>>> Personally I much prefer coding FPGAs in HDL than mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
>>>
>>> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
>>> could one use multiple shift and add multipliers instead of one fast
>>> hardware multiplier? Roughly how many shift-add multipliers could you
>>> fit in the same space as a fast multiplier?
>>>
>>> I use a 128 point FFT for a comms system, and a dsPIC handles this
>>> quickly enough with its two hardware multipliers and ancillary logic,
>>> but I've wondered if an FPGA with lots of parallel shift-add multipliers
>>> might be quicker.
>>>
>>> In my application, I can't use FPGAs with inbuilt processors or multipliers.
>>
>> Ok, sometimes I don't completely follow what people have posted.
>
> That's because it's not my field and I don't know the correct form of words.
>
>> By shift and add multiplier I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?
>
> Yes.
>
>> I can't think why you would use such a multiplier if you need many of them to do the job. I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact.
>
> The last time I used a hardware multiplier outside a CPU it was made
> with TTL chips and wire wrapped. A Wallace tree IIRC taken from a big
> orange Texas databook. My experience is somewhat limited.

Implementing e.g. a 32 x 32 Wallace tree multiplier with MSI chips is nasty, due to the huge number of interconnections. Forming the partial products with two-input AND gates produces 1024 partial-product bits, and full adders (preferably with carry look-ahead) are then required to sum the "1" bits in each column. This takes multiple stages, with an adder propagation delay in each stage. That is not a problem in e.g. FFT calculations, since the multiplies can be pipelined, producing one product every adder propagation delay, even though the total latency of a single multiply can be 10 or more adder propagation delays.
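The numbers above are easy to check: 32 x 32 = 1024 partial-product bits, and the multi-stage depth comes from the carry-save reduction. A small Python tally, illustrative only, counting 3:2 compressor layers (one of several ways the tree can be built):

```python
def wallace_stats(nbits):
    """Partial-product bits and 3:2 (carry-save) reduction stages for an
    nbits x nbits Wallace-style multiplier. Each stage reduces the
    partial-product height h to ceil(2*h/3) until two rows remain for
    the final carry-propagate adder."""
    pp_bits = nbits * nbits           # one 2-input AND gate per bit
    height, stages = nbits, 0
    while height > 2:
        height = -(-2 * height // 3)  # ceil(2*height/3)
        stages += 1
    return pp_bits, stages

print(wallace_stats(32))  # (1024, 8): 1024 AND gates, 8 compressor stages
```

Eight compressor stages plus the final carry-propagate adder is consistent with the rough "10 or more adder delays" of latency mentioned above.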
Reply by ●April 2, 2020
On 4/1/20 12:32 PM, Clive Arthur wrote:
> On 01/04/2020 04:53, Rick C wrote:
>
> <snipped>
>>
>> Personally I much prefer coding FPGAs in HDL than mucking about with
>> the complexities of trying to get a sequentially executing CPU to
>> appear to be processing in parallel.
>
> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
> could one use multiple shift and add multipliers instead of one fast
> hardware multiplier? Roughly how many shift-add multipliers could you
> fit in the same space as a fast multiplier?
>
> I use a 128 point FFT for a comms system, and a dsPIC handles this
> quickly enough with its two hardware multipliers and ancillary logic,
> but I've wondered if an FPGA with lots of parallel shift-add multipliers
> might be quicker.
>
> In my application, I can't use FPGAs with inbuilt processors or
> multipliers.

Going back to this basic question: I presume you are asking not about the fairly limited number of hard multiplier blocks that many FPGAs include, but about implementation in the FPGA fabric, and that you are comparing a design with a full multiplier tree implemented in parallel, which can sustain a throughput of one multiply per cycle (it may take several cycles to get a given result, but a new multiply can be started every cycle), versus an implementation that uses reduced hardware but serially iterates the answer over N cycles. My first impression is that the additional control mechanism needed to iterate probably takes more logic than the fully parallel solution, and thus costs space, and thus gives a lower throughput for a given number of logic cells dedicated to multipliers. Someone would really need to design this to be sure. Maybe the fact that one sequencer could drive many serial multipliers would make a difference.
Reply by ●April 2, 2020
On Thursday, April 2, 2020 at 8:02:57 AM UTC-4, Richard Damon wrote:
> On 4/1/20 12:32 PM, Clive Arthur wrote:
>
> <snipped>
>
>> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
>> could one use multiple shift and add multipliers instead of one fast
>> hardware multiplier? Roughly how many shift-add multipliers could you
>> fit in the same space as a fast multiplier?
>>
>> I use a 128 point FFT for a comms system, and a dsPIC handles this
>> quickly enough with its two hardware multipliers and ancillary logic,
>> but I've wondered if an FPGA with lots of parallel shift-add multipliers
>> might be quicker.
>>
>> In my application, I can't use FPGAs with inbuilt processors or
>> multipliers.
>
> Going back to this basic question: I presume you are comparing a design
> with a full multiplier tree implemented in parallel, which can sustain a
> throughput of one multiply per cycle, versus an implementation that uses
> reduced hardware but serially iterates the answer over N cycles. My first
> impression is that the additional control mechanism needed to iterate
> probably takes more logic than the fully parallel solution, and thus gives
> a lower throughput for a given number of logic cells dedicated to
> multipliers. Someone would really need to design this to be sure. Maybe
> the fact that one sequencer could drive many serial multipliers would
> make a difference.

I have designed shift/add multipliers, and the control logic is not much: a log2(N)-bit counter and minimal other logic. The shift/add multiplier only works with an unsigned multiplier (or is it the multiplicand?), so there needs to be a 2's complement before and after the multiplier. I'm pretty sure Booth's algorithm handles the sign automagically, but I won't swear to it; I did this over a decade ago. I'm pretty sure that if you are building a bunch of multipliers, the shift/add uses more real estate for the same throughput, but I don't recall the details of modified Booth's, so I can't say for sure.

--
Rick C.

+- Get 1,000 miles of free Supercharging
+- Tesla referral code - https://ts.la/richard11209
Reply by ●April 2, 2020
On 2.4.20 21:16, Rick C wrote:
> On Thursday, April 2, 2020 at 8:02:57 AM UTC-4, Richard Damon wrote:
>> On 4/1/20 12:32 PM, Clive Arthur wrote:
>
> <snipped>
>
>>> Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT,
>>> could one use multiple shift and add multipliers instead of one fast
>>> hardware multiplier? Roughly how many shift-add multipliers could you
>>> fit in the same space as a fast multiplier?
>>>
>>> I use a 128 point FFT for a comms system, and a dsPIC handles this
>>> quickly enough with its two hardware multipliers and ancillary logic,
>>> but I've wondered if an FPGA with lots of parallel shift-add multipliers
>>> might be quicker.
>>>
>>> In my application, I can't use FPGAs with inbuilt processors or
>>> multipliers.
>>
>> Going back to this basic question: my first impression is that the
>> additional control mechanism needed to iterate a serial multiplier
>> probably takes more logic than the fully parallel solution, and thus
>> gives a lower throughput for a given number of logic cells dedicated
>> to multipliers. Someone would really need to design this to be sure.
>
> I have designed shift/add multipliers, and the control logic is not much:
> a log2(N)-bit counter and minimal other logic. The shift/add multiplier
> only works with an unsigned multiplier (or is it the multiplicand?), so
> there needs to be a 2's complement before and after the multiplier. I'm
> pretty sure Booth's algorithm handles the sign automagically, but I won't
> swear to it; I did this over a decade ago.

Booth's algorithm does handle signed operands in two's-complement notation.

--
-TV
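That signed behavior can be demonstrated directly: radix-4 (modified) Booth recoding multiplies two's-complement operands with no separate sign fix-up, because the recoded digits {-2, -1, 0, 1, 2} carry the sign. A Python model, illustrative only (the digit table is the standard radix-4 recoding; the function name is made up):

```python
def booth_radix4(a, b, nbits=16):
    """Multiply a*b using radix-4 (modified) Booth recoding. b is taken
    as an nbits-bit two's-complement value; signs need no special handling."""
    assert nbits % 2 == 0
    # Overlapping triplet b[i+1] b[i] b[i-1] -> digit -2*b[i+1] + b[i] + b[i-1]
    DIGIT = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    bits = b & ((1 << nbits) - 1)       # two's-complement bit pattern of b
    product, prev = 0, 0
    for i in range(0, nbits, 2):        # one digit per bit pair: nbits/2 steps
        triplet = (((bits >> i) & 3) << 1) | prev
        product += (DIGIT[triplet] * a) << i
        prev = (bits >> (i + 1)) & 1
    return product

print(booth_radix4(-7, -9))  # 63: both operands negative, no fix-up step
```

Note it takes only nbits/2 add/subtract steps versus nbits for plain shift-and-add, which is the speed/compactness advantage of modified Booth mentioned earlier in the thread.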







