
FFT Speeds

Started by Rick C March 30, 2020
I'm not building anything, I'm just curious. I worked on a machine in the 80's called an attached array processor capable of 100 MFLOPS, the ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device would be required to do this today? A desktop would far exceed this performance. Low-end single-chip processors like an ARM CM3 would probably not reach this performance level. But I bet the processor in an rPi would exceed it.

Can anyone nail this down a bit more? Are there ARM CM4 processors capable of a sub-ms 1K complex FFT in 32-bit floating point? Looking at the Cypress PSoC 61, it says "with single-cycle multiply (Floating Point and Memory Protection Unit)", but I think those are three separate things: a single-cycle integer multiply, a floating-point unit, and an MPU.

Anyone know how fast a CPU like this can crank a 1k FFT?  

Of course, the rack cabinet machine I'm talking about was fully capable of moving data around to keep up with the FFT, so the computations were the bottleneck. I expect in many MCU applications the FFT time is not the bottleneck, at least not at the 1K size; rather, getting the data and moving it around is nearly as slow, and the rest of the application is running on the same CPU. The ST-100 was the equivalent of a floating-point coprocessor, so it only handled the math.

-- 

  Rick C.

  - Get 1,000 miles of free Supercharging
  - Tesla referral code - https://ts.la/richard11209
On 30/03/2020 21:11, Rick C wrote:
> I'm not building anything, I'm just curious. I worked on a machine
> in the 80's called an attached array processor capable of 100 MFLOPS,
> the ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of
> device would be required to do this today? A desktop would far
> exceed this performance. Low-end single-chip processors like an ARM
> CM3 would probably not reach this performance level. But I bet the
> processor in an rPi would exceed it.
>
> Can anyone nail this down a bit more? Are there ARM CM4 processors
> capable of a sub-ms 1K complex FFT in 32-bit floating point? Looking
> at the Cypress PSoC 61, it says "with single-cycle multiply (Floating
> Point and Memory Protection Unit)", but I think those are three
> separate things: a single-cycle integer multiply, a floating-point
> unit, and an MPU.
"MPU" is not an operation and does not need instruction cycles. The Cortex-M4 has a single-cycle multiply-accumulate integer instruction (unlike the M3), and AFAIK the M4F has a single-cycle single precision floating point MAC. But as you know, the MAC instruction is only part of the deal - you need other instructions for moving data into and out of registers. I haven't timed FFT's myself. But I found this reference: <http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html> The 180 MHz M4F chip does 2.1.K 512 length single-precision FFT's per second, which means it would take a bit over 1 ms for a 1K FFT. (I couldn't see if that was for complex FFT.) The chip will do single-cycle floating point operations, but feeding the data in and out will be the bottleneck. If you want faster, there are some Cortex-M7 devices at up to 528 MHz from NXP (the i.mx RT "crossover" family). The M7 does up to two instructions per cycle, and the chips have plenty of single-cycle tightly-coupled ram. You might peak at 6 to 8 times the performance of the K66F here, which would be well within your target. If you really want top-speed FFT, you need a DSP chip. These have several advantages - they have instructions that combine the loads, MAC and stores in one instruction, multiple ram blocks with separate buses for parallel access, and special counters and addressing modes designed specifically for circular buffers and FFT algorithms. Some of TI's top models get ridiculous speeds from their super-scaler multi-core devices. On the other hand, they are a real pain for anything /but/ DSP algorithms!
> > Anyone know how fast a CPU like this can crank a 1k FFT?
The Cortex-M devices are not fast CPUs - they are nice, solid microcontrollers. The Cortex-A devices are fast CPUs with vector processing, superscalar execution, etc., and will do FFTs a lot faster than any Cortex-M.
> Of course, the rack cabinet machine I'm talking about was fully
> capable of moving data around to keep up with the FFT, so the
> computations were the bottleneck. I expect in many MCU applications
> the FFT time is not the bottleneck, at least not at the 1K size;
> rather, getting the data and moving it around is nearly as slow, and
> the rest of the application is running on the same CPU. The ST-100
> was the equivalent of a floating-point coprocessor, so it only
> handled the math.
On 2020-03-30 David Brown wrote in comp.arch.embedded:
> I haven't timed FFTs myself, but I found this reference:
>
> <http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html>
>
> The 180 MHz M4F chip does 2.1K 512-point single-precision FFTs per
> second, which means it would take a bit over 1 ms for a 1K FFT. (I
> couldn't see if that was for a complex FFT.) The chip will do
> single-cycle floating-point operations, but feeding the data in and
> out will be the bottleneck.
>
> If you want faster, there are some Cortex-M7 devices at up to 528 MHz
> from NXP (the i.MX RT "crossover" family). The M7 does up to two
> instructions per cycle, and the chips have plenty of single-cycle
> tightly-coupled RAM. You might peak at 6 to 8 times the performance
> of the K66F here, which would be well within your target.
A while back I timed a 250k single-precision complex FFT on a 400 MHz ST M7, using the FFTW lib with a pre-planned FFT. This took 330 ms. It may have been limited by memory speed, as the data did not fit in internal or tightly-coupled RAM and was in external SDRAM. While the FFT was running, new data for two channels was coming in at 250 kSPS (IRQ/DMA), which probably also had an impact on the (memory) performance.

--
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

Maryann's Law: You can always find what you're not looking for.
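For reference, the FFTW usage pattern for a pre-planned single-precision complex FFT looks roughly like this - a minimal sketch along the lines of the test above, with error handling omitted (link with -lfftw3f):

#include <fftw3.h>

#define N 250000  /* 250k complex points, as in the test above */

int main(void)
{
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * N);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * N);

    /* Plan once, up front; planning can take far longer than the
       transform itself, so it must not be counted in the timing. */
    fftwf_plan p = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD,
                                     FFTW_MEASURE);

    /* ... fill 'in' with samples, then time this call ... */
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}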
One fine day Rick C typed:

> I'm not building anything, I'm just curious. I worked on a machine in
> the 80's called an attached array processor capable of 100 MFLOPS, the
> ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device
> would be required to do this today?
A beefed-up microcontroller should be enough:

http://www.ti.com/lit/an/spry288a/spry288a.pdf

The C28x family runs up to 200 MHz and does a 1k complex FFT in 53219 cycles, i.e. 0.27 ms. Of course there are caveats: you must run everything in RAM and very tight.

Some models are also dual core.

--
I flex my muscles and I am in the void.
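For a sense of what those 53219 cycles are buying, here is a plain, textbook iterative radix-2 complex FFT in C - an illustrative reference implementation only, nothing like the hand-tuned TI library code, and it assumes the length is a power of two:

#include <complex.h>
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* In-place iterative radix-2 decimation-in-time FFT.
   n must be a power of two. */
void fft(float complex *x, size_t n)
{
    /* Bit-reversal permutation */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }

    /* log2(n) passes; each pass performs n/2 butterflies (one complex
       multiply and two adds each), hence the O(n log n) total. */
    for (size_t len = 2; len <= n; len <<= 1) {
        float complex wlen = cexpf(-2.0f * (float)M_PI * I / (float)len);
        for (size_t i = 0; i < n; i += len) {
            float complex w = 1.0f;
            for (size_t k = 0; k < len / 2; k++) {
                float complex u = x[i + k];
                float complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

For N = 1024 that is 10 passes of 512 butterflies, about 5120 complex multiply-adds, so 53219 cycles works out to roughly 10 cycles per butterfly including loads and stores - a respectable figure.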
On 1/4/20 1:14 am, dalai lamah wrote:
> One fine day Rick C typed:
>
>> I'm not building anything, I'm just curious. I worked on a machine in
>> the 80's called an attached array processor capable of 100 MFLOPS, the
>> ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device
>> would be required to do this today?
>
> A beefed-up microcontroller should be enough:
>
> http://www.ti.com/lit/an/spry288a/spry288a.pdf
>
> The C28x family runs up to 200 MHz and does a 1k complex FFT in 53219
> cycles, i.e. 0.27 ms. Of course there are caveats: you must run
> everything in RAM and very tight.
>
> Some models are also dual core.
All other things being equal, doubling the length of an FFT adds one complex multiply per sample. So performance is O(log2(N)).

Clifford Heath
Clifford Heath <no.spam@please.net> writes:
> All other things being equal, doubling the length of an FFT adds one > complex multiply per sample. So performance is O(log2(N))
I think you mean N log N.

Usually we think of complex multiplication as 4 real multiplications plus 2 additions. I wonder if (on machines with 1-cycle MAC) that makes a complex FFT 4x slower than a real FFT, or if they have some interesting optimizations, parallelism, etc.
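One classic optimization along those lines is the 3-multiplication complex product (often credited to Gauss), which trades one real multiply for three extra additions - a win where multipliers are scarce, though on single-cycle-MAC machines the naive form maps directly onto MUL/MAC pairs and is usually preferred. A sketch in C (function names cmul4/cmul3 are mine, for illustration):

#include <complex.h>

/* Naive complex multiply: 4 real multiplications, 2 additions. */
static float complex cmul4(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    return (a * c - b * d) + (a * d + b * c) * I;
}

/* 3-multiplication variant: 3 real multiplications, 5 additions.
   k1 = c*(a + b), k2 = a*(d - c), k3 = b*(c + d)
   re = k1 - k3 = ac - bd,  im = k1 + k2 = ad + bc          */
static float complex cmul3(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    float k1 = c * (a + b);
    float k2 = a * (d - c);
    float k3 = b * (c + d);
    return (k1 - k3) + (k1 + k2) * I;
}

As for real vs. complex: a length-N real FFT is normally computed as a length-N/2 complex FFT plus a cheap post-processing pass, so in practice a complex FFT costs roughly twice a real one rather than four times.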
On 1/4/20 12:41 pm, Paul Rubin wrote:
> Clifford Heath <no.spam@please.net> writes:
>> All other things being equal, doubling the length of an FFT adds one
>> complex multiply per sample. So performance is O(log2(N))
>
> I think you mean N log N.
Ahh yes, sorry. Mis-translated by omitting "per sample" :)
> Usually we think of complex multiplication as 4 real multiplications
> plus 2 additions. I wonder if (on machines with 1-cycle MAC) that
> makes a complex FFT 4x slower than a real FFT, or if they have some
> interesting optimizations, parallelism, etc.
Memory/cache behaviour is often the limiting factor. Bailey's approach to ordering the computation can radically assist, though:

<https://www.davidhbailey.com/dhbpapers/fftq.pdf>

Ultimately I want an RPi-like module with a GPU that supports CUDA... for a lower price than the Jetson Nano :). It's a shame to have to jump to an FPGA or funky DSP chip when a GPU is perfect for the job.

Clifford Heath
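For the curious, Bailey's four-step method rests on the standard Cooley-Tukey index split (a sketch of the identity in LaTeX notation; indexing conventions vary between presentations). With N = N1*N2, n = N2*n1 + n2 and k = k1 + N1*k2:

X[k_1 + N_1 k_2]
  = \sum_{n_2=0}^{N_2-1} \left[ \omega_N^{n_2 k_1}
      \sum_{n_1=0}^{N_1-1} x[N_2 n_1 + n_2]\, \omega_{N_1}^{n_1 k_1}
    \right] \omega_{N_2}^{n_2 k_2},
  \qquad \omega_M = e^{-2\pi i / M}

So a length-N transform becomes N2 FFTs of length N1 (the inner sums), one pointwise twiddle multiplication, and N1 FFTs of length N2, with a transpose between steps. For N1 and N2 near sqrt(N), each sub-FFT fits in cache, which is where the speedup comes from.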
On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:
> On 1/4/20 12:41 pm, Paul Rubin wrote:
>> Clifford Heath <no.spam@please.net> writes:
>>> All other things being equal, doubling the length of an FFT adds one
>>> complex multiply per sample. So performance is O(log2(N))
>>
>> I think you mean N log N.
>
> Ahh yes, sorry. Mis-translated by omitting "per sample" :)
>
>> Usually we think of complex multiplication as 4 real multiplications
>> plus 2 additions. I wonder if (on machines with 1-cycle MAC) that
>> makes a complex FFT 4x slower than a real FFT, or if they have some
>> interesting optimizations, parallelism, etc.
>
> Memory/cache behaviour is often the limiting factor. Bailey's approach
> to ordering the computation can radically assist, though:
>
> <https://www.davidhbailey.com/dhbpapers/fftq.pdf>
>
> Ultimately I want an RPi-like module with a GPU that supports CUDA...
> for a lower price than the Jetson Nano :). It's a shame to have to
> jump to an FPGA or funky DSP chip when a GPU is perfect for the job.
I assume CUDA is a development language for GPUs? The rPi has a not fully documented GPU, so it may or may not support CUDA. But an FPGA is easy enough to find on a daughtercard for the rPi. Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?

Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel. A GPU might be a bit fun though. I've worked with Transputers and programmed an attached array processor (4 ALUs in parallel). I expect a GPU would be a step up from either of those. Too bad the rPi GPU is not accessible. That could make for an interesting target.

BTW, thanks to those who pointed out FFT benchmarks. Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago. It doesn't warm the room as well though.

--

  Rick C.

  + Get 1,000 miles of free Supercharging
  + Tesla referral code - https://ts.la/richard11209
On 1/4/20 2:53 pm, Rick C wrote:
> On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:
>> Ultimately I want an RPi-like module with a GPU that supports CUDA...
>> for a lower price than the Jetson Nano :). It's a shame to have to
>> jump to an FPGA or funky DSP chip when a GPU is perfect for the job.
>
> I assume CUDA is a development language for GPUs?
Yes.
> The rPi has a not fully documented GPU, so it may or may not support CUDA.
Correct. Same for ODroid.
> But an FPGA is easy enough to find on a daughtercard for the rPi. Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?
Yes, that. I want the array processor already defined for me; I just want to tell it what to do.
> Personally I much prefer coding FPGAs in HDL
I would too, if I had already learnt to do it.
> trying to get a sequentially executing CPU to appear to be processing in parallel. A GPU might be a bit fun though.
Yes, exactly.
> BTW, thanks to those who pointed out FFT benchmarks. Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago. It doesn't warm the room as well though.
In 1986 I was at HP, and we took delivery of one of their 3D Turbo SRX graphics rendering boxes... with hardware radiosity for rendering surfaces, etc... minimally configured, as we only needed it for configuration testing. But it was my desktop computer... That thing would still dim the lights when you turned it on!

The performance was dwarfed by single-chip PC graphics cards in less than ten years.

Clifford Heath
On 01/04/2020 04:53, Rick C wrote:

<snipped>
> Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT, could one use multiple shift-and-add multipliers instead of one fast hardware multiplier? Roughly how many shift-add multipliers could you fit in the same space as a fast multiplier?

I use a 128-point FFT for a comms system, and a dsPIC handles this quickly enough with its two hardware multipliers and ancillary logic, but I've wondered if an FPGA with lots of parallel shift-add multipliers might be quicker. In my application, I can't use FPGAs with inbuilt processors or multipliers.

--
Cheers

Clive
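For anyone unfamiliar with the technique Clive describes, here is a C model of a bit-serial shift-and-add multiplier (the function name shift_add_mul and the 16-bit width are illustrative assumptions; in an FPGA each loop iteration would be one clock of a small adder plus two shift registers, so a 16x16 multiply takes 16 cycles per unit, and the area question is how many such units fit in the space of one fast parallel multiplier):

#include <stdint.h>

/* Bit-serial shift-and-add multiply: models a small adder plus two
   shift registers. One loop iteration = one clock cycle in hardware. */
uint32_t shift_add_mul(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;        /* accumulator register          */
    uint32_t addend = a;     /* left-shifting multiplicand    */

    for (int i = 0; i < 16; i++) {
        if (b & 1u)          /* LSB of multiplier selects...  */
            acc += addend;   /* ...whether to add this round  */
        addend <<= 1;        /* shift multiplicand left       */
        b >>= 1;             /* shift multiplier right        */
    }
    return acc;
}

Quick sanity check: shift_add_mul(1234, 5678) returns 1234 * 5678 = 7006652.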