
FFT Speeds

Started by Rick C March 30, 2020
I'm not building anything, I'm just curious. I worked on a machine in the 80's called an attached array processor capable of 100 MFLOPS, the ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device would be required to do this today? A desktop would far exceed this performance. Low-end single-chip processors like an ARM CM3 would probably not reach this performance level. But I bet the processor in an rPi would exceed it.

Can anyone nail this down a bit more? Are there ARM CM4 processors capable of a sub-ms 1K complex FFT in 32-bit floating point? Looking at the Cypress PSoC 61, it says "with single-cycle multiply (Floating Point and Memory Protection Unit)", but I think those are three separate things: a single-cycle integer multiply, a floating-point unit, and an MPU.

Anyone know how fast a CPU like this can crank a 1k FFT?  

Of course, the rack cabinet machine I'm talking about was fully capable of moving data around to keep up with the FFT, so the computations were the bottleneck. I expect in many MCU applications the FFT time is not the bottleneck, at least not at the 1K size; rather, getting the data and moving it around is nearly as slow, and the rest of the application is running on the same CPU. The ST-100 was the equivalent of a floating-point coprocessor, so it only handled the math.

-- 

  Rick C.

  - Get 1,000 miles of free Supercharging
  - Tesla referral code - https://ts.la/richard11209
On 30/03/2020 21:11, Rick C wrote:
> I'm not building anything, I'm just curious. I worked on a machine
> in the 80's called an attached array processor capable of 100 MFLOPS,
> the ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of
> device would be required to do this today? A desktop would far
> exceed this performance. Low-end single-chip processors like an ARM
> CM3 would probably not reach this performance level. But I bet the
> processor in an rPi would exceed it.
>
> Can anyone nail this down a bit more? Are there ARM CM4 processors
> capable of a sub-ms 1K complex FFT in 32-bit floating point? Looking
> at the Cypress PSoC 61, it says "with single-cycle multiply (Floating
> Point and Memory Protection Unit)", but I think those are three
> separate things: a single-cycle integer multiply, a floating-point
> unit, and an MPU.
"MPU" is not an operation and does not need instruction cycles. The Cortex-M4 has a single-cycle multiply-accumulate integer instruction (unlike the M3), and AFAIK the M4F has a single-cycle single precision floating point MAC. But as you know, the MAC instruction is only part of the deal - you need other instructions for moving data into and out of registers. I haven't timed FFT's myself. But I found this reference: <http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html> The 180 MHz M4F chip does 2.1.K 512 length single-precision FFT's per second, which means it would take a bit over 1 ms for a 1K FFT. (I couldn't see if that was for complex FFT.) The chip will do single-cycle floating point operations, but feeding the data in and out will be the bottleneck. If you want faster, there are some Cortex-M7 devices at up to 528 MHz from NXP (the i.mx RT "crossover" family). The M7 does up to two instructions per cycle, and the chips have plenty of single-cycle tightly-coupled ram. You might peak at 6 to 8 times the performance of the K66F here, which would be well within your target. If you really want top-speed FFT, you need a DSP chip. These have several advantages - they have instructions that combine the loads, MAC and stores in one instruction, multiple ram blocks with separate buses for parallel access, and special counters and addressing modes designed specifically for circular buffers and FFT algorithms. Some of TI's top models get ridiculous speeds from their super-scaler multi-core devices. On the other hand, they are a real pain for anything /but/ DSP algorithms!
> > Anyone know how fast a CPU like this can crank a 1k FFT?
The Cortex-M devices are not fast CPUs - they are nice, solid microcontrollers. The Cortex-A devices are fast CPUs with vector processing, superscalar execution, etc., and will do FFTs a lot faster than any Cortex-M.
> Of course, the rack cabinet machine I'm talking about was fully
> capable of moving data around to keep up with the FFT, so the
> computations were the bottleneck. I expect in many MCU applications
> the FFT time is not the bottleneck, at least not at the 1K size;
> rather, getting the data and moving it around is nearly as slow, and
> the rest of the application is running on the same CPU. The ST-100
> was the equivalent of a floating-point coprocessor, so it only
> handled the math.
On 2020-03-30 David Brown wrote in comp.arch.embedded:
> I haven't timed FFTs myself, but I found this reference:
>
> <http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html>
>
> The 180 MHz M4F chip does 2.1K 512-point single-precision FFTs per
> second, which means it would take a bit over 1 ms for a 1K FFT. (I
> couldn't see if that was for a complex FFT.) The chip will do
> single-cycle floating-point operations, but feeding the data in and
> out will be the bottleneck.
>
> If you want faster, there are some Cortex-M7 devices at up to 528 MHz
> from NXP (the i.MX RT "crossover" family). The M7 does up to two
> instructions per cycle, and the chips have plenty of single-cycle
> tightly-coupled RAM. You might peak at 6 to 8 times the performance
> of the K66F here, which would be well within your target.
A while back I timed a 250k single-precision complex FFT on a 400 MHz ST M7, using the FFTW lib with a pre-planned FFT. This took 330 ms. It may have been limited by memory speed, as the data did not fit in internal or tightly-coupled RAM and was in external SDRAM. While the FFT was running, new data for two channels was coming in at 250 kSPS (IRQ/DMA), which probably also had an impact on the (memory) performance.

--
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

Maryann's Law: You can always find what you're not looking for.
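For reference, the FFTW usage pattern for a pre-planned single-precision complex FFT looks roughly like this - a minimal sketch along the lines of the test above, with error handling omitted (link with -lfftw3f):

#include <fftw3.h>

#define N 250000  /* 250k complex points, as in the test above */

int main(void)
{
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * N);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * N);

    /* Plan once, up front; planning can take far longer than the
       transform itself, so it must not be counted in the timing. */
    fftwf_plan p = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD,
                                     FFTW_MEASURE);

    /* ... fill 'in' with samples, then time this call ... */
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}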
One fine day Rick C typed:

> I'm not building anything, I'm just curious. I worked on a machine in
> the 80's called an attached array processor capable of 100 MFLOPS, the
> ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device
> would be required to do this today?
A beefed-up microcontroller should be enough:

http://www.ti.com/lit/an/spry288a/spry288a.pdf

The C28x family runs up to 200 MHz and does a 1k complex FFT in 53219 cycles, i.e. 0.27 ms. Of course there are caveats: you must run everything in RAM and very tight.

Some models are also dual core.

--
I flex my muscles and I am in the void.
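For a sense of what those 53219 cycles are buying, here is a plain, textbook iterative radix-2 complex FFT in C - an illustrative reference implementation only, nothing like the hand-tuned TI library code, and it assumes the length is a power of two:

#include <complex.h>
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* In-place iterative radix-2 decimation-in-time FFT.
   n must be a power of two. */
void fft(float complex *x, size_t n)
{
    /* Bit-reversal permutation */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }

    /* log2(n) passes; each pass performs n/2 butterflies (one complex
       multiply and two adds each), hence the O(n log n) total. */
    for (size_t len = 2; len <= n; len <<= 1) {
        float complex wlen = cexpf(-2.0f * (float)M_PI * I / (float)len);
        for (size_t i = 0; i < n; i += len) {
            float complex w = 1.0f;
            for (size_t k = 0; k < len / 2; k++) {
                float complex u = x[i + k];
                float complex v = x[i + k + len / 2] * w;
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

For N = 1024 that is 10 passes of 512 butterflies, about 5120 complex multiply-adds, so 53219 cycles works out to roughly 10 cycles per butterfly including loads and stores - a respectable figure.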
On 1/4/20 1:14 am, dalai lamah wrote:
> One fine day Rick C typed:
>
>> I'm not building anything, I'm just curious. I worked on a machine in
>> the 80's called an attached array processor capable of 100 MFLOPS, the
>> ST-100. It could do a 1K complex FFT in 0.86 msec. What sort of device
>> would be required to do this today?
>
> A beefed-up microcontroller should be enough:
>
> http://www.ti.com/lit/an/spry288a/spry288a.pdf
>
> The C28x family runs up to 200 MHz and does a 1k complex FFT in 53219
> cycles, i.e. 0.27 ms. Of course there are caveats: you must run
> everything in RAM and very tight.
>
> Some models are also dual core.
All other things being equal, doubling the length of an FFT adds one complex multiply per sample. So performance is O(log2(N)).

Clifford Heath
Clifford Heath <no.spam@please.net> writes:
> All other things being equal, doubling the length of an FFT adds one > complex multiply per sample. So performance is O(log2(N))
I think you mean N log N.

Usually we think of complex multiplication as 4 real multiplications plus 2 additions. I wonder if (on machines with 1-cycle MAC) that makes a complex FFT 4x slower than a real FFT, or if they have some interesting optimizations, parallelism, etc.
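One classic optimization along those lines is the 3-multiplication complex product (often credited to Gauss), which trades one real multiply for three extra additions - a win where multipliers are scarce, though on single-cycle-MAC machines the naive form maps directly onto MUL/MAC pairs and is usually preferred. A sketch in C (function names cmul4/cmul3 are mine, for illustration):

#include <complex.h>

/* Naive complex multiply: 4 real multiplications, 2 additions. */
static float complex cmul4(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    return (a * c - b * d) + (a * d + b * c) * I;
}

/* 3-multiplication variant: 3 real multiplications, 5 additions.
   k1 = c*(a + b), k2 = a*(d - c), k3 = b*(c + d)
   re = k1 - k3 = ac - bd,  im = k1 + k2 = ad + bc          */
static float complex cmul3(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    float k1 = c * (a + b);
    float k2 = a * (d - c);
    float k3 = b * (c + d);
    return (k1 - k3) + (k1 + k2) * I;
}

As for real vs. complex: a length-N real FFT is normally computed as a length-N/2 complex FFT plus a cheap post-processing pass, so in practice a complex FFT costs roughly twice a real one rather than four times.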
On 1/4/20 12:41 pm, Paul Rubin wrote:
> Clifford Heath <no.spam@please.net> writes:
>> All other things being equal, doubling the length of an FFT adds one
>> complex multiply per sample. So performance is O(log2(N))
>
> I think you mean N log N.
Ahh yes, sorry. Mis-translated by omitting "per sample" :)
> Usually we think of complex multiplication as 4 real multiplications
> plus 2 additions. I wonder if (on machines with 1-cycle MAC) that
> makes a complex FFT 4x slower than a real FFT, or if they have some
> interesting optimizations, parallelism, etc.
Memory/cache behaviour is often the limiting factor. Bailey's approach to ordering the computation can radically assist, though:

<https://www.davidhbailey.com/dhbpapers/fftq.pdf>

Ultimately I want an RPi-like module with a GPU that supports CUDA... for a lower price than the Jetson Nano :). It's a shame to have to jump to an FPGA or funky DSP chip when a GPU is perfect for the job.

Clifford Heath
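For the curious, Bailey's four-step method rests on the standard Cooley-Tukey index split (a sketch of the identity in LaTeX notation; indexing conventions vary between presentations). With N = N1*N2, n = N2*n1 + n2 and k = k1 + N1*k2:

X[k_1 + N_1 k_2]
  = \sum_{n_2=0}^{N_2-1} \left[ \omega_N^{n_2 k_1}
      \sum_{n_1=0}^{N_1-1} x[N_2 n_1 + n_2]\, \omega_{N_1}^{n_1 k_1}
    \right] \omega_{N_2}^{n_2 k_2},
  \qquad \omega_M = e^{-2\pi i / M}

So a length-N transform becomes N2 FFTs of length N1 (the inner sums), one pointwise twiddle multiplication, and N1 FFTs of length N2, with a transpose between steps. For N1 and N2 near sqrt(N), each sub-FFT fits in cache, which is where the speedup comes from.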
On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:
> On 1/4/20 12:41 pm, Paul Rubin wrote:
>> Clifford Heath <no.spam@please.net> writes:
>>> All other things being equal, doubling the length of an FFT adds one
>>> complex multiply per sample. So performance is O(log2(N))
>>
>> I think you mean N log N.
>
> Ahh yes, sorry. Mis-translated by omitting "per sample" :)
>
>> Usually we think of complex multiplication as 4 real multiplications
>> plus 2 additions. I wonder if (on machines with 1-cycle MAC) that
>> makes a complex FFT 4x slower than a real FFT, or if they have some
>> interesting optimizations, parallelism, etc.
>
> Memory/cache behaviour is often the limiting factor. Bailey's approach
> to ordering the computation can radically assist, though:
>
> <https://www.davidhbailey.com/dhbpapers/fftq.pdf>
>
> Ultimately I want an RPi-like module with a GPU that supports CUDA...
> for a lower price than the Jetson Nano :). It's a shame to have to
> jump to an FPGA or funky DSP chip when a GPU is perfect for the job.
I assume CUDA is a development language for GPUs? The rPi has a not fully documented GPU, so it may or may not support CUDA. But an FPGA is easy enough to find on a daughtercard for the rPi. Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?

Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel. A GPU might be a bit fun though. I've worked with Transputers and programmed an attached array processor (4 ALUs in parallel). I expect a GPU would be a step up from either of those. Too bad the rPi GPU is not accessible. That could make for an interesting target.

BTW, thanks to those who pointed out FFT benchmarks. Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago. It doesn't warm the room as well though.

--

  Rick C.

  + Get 1,000 miles of free Supercharging
  + Tesla referral code - https://ts.la/richard11209
On 1/4/20 2:53 pm, Rick C wrote:
> On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:
>> Ultimately I want an RPi-like module with a GPU that supports CUDA...
>> for a lower price than the Jetson Nano :). It's a shame to have to
>> jump to an FPGA or funky DSP chip when a GPU is perfect for the job.
>
> I assume CUDA is a development language for GPUs?
Yes.
> The rPi has a not fully documented GPU, so it may or may not support CUDA.
Correct. Same for ODroid.
> But an FPGA is easy enough to find on a daughtercard for the rPi. Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?
Yes, that. I want the array processor already defined for me; I just want to tell it what to do.
> Personally I much prefer coding FPGAs in HDL
I would too, if I had already learnt to do it.
> trying to get a sequentially executing CPU to appear to be processing in parallel. A GPU might be a bit fun though.
Yes, exactly.
> BTW, thanks to those who pointed out FFT benchmarks. Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago. It doesn't warm the room as well though.
In 1986 I was at HP, and we took delivery of one of their 3D Turbo SRX graphics rendering boxes... with hardware radiosity for rendering surfaces, etc... minimally configured, as we only needed it for configuration testing. But it was my desktop computer... That thing would still dim the lights when you turned it on!

The performance was dwarfed by single-chip PC graphics cards in less than ten years.

Clifford Heath
On 01/04/2020 04:53, Rick C wrote:

<snipped>
> Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.
Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT, could one use multiple shift-and-add multipliers instead of one fast hardware multiplier? Roughly how many shift-add multipliers could you fit in the same space as a fast multiplier?

I use a 128-point FFT for a comms system, and a dsPIC handles this quickly enough with its two hardware multipliers and ancillary logic, but I've wondered if an FPGA with lots of parallel shift-add multipliers might be quicker. In my application, I can't use FPGAs with inbuilt processors or multipliers.

--
Cheers

Clive
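For anyone unfamiliar with the technique Clive describes, here is a C model of a bit-serial shift-and-add multiplier (the function name shift_add_mul and the 16-bit width are illustrative assumptions; in an FPGA each loop iteration would be one clock of a small adder plus two shift registers, so a 16x16 multiply takes 16 cycles per unit, and the area question is how many such units fit in the space of one fast parallel multiplier):

#include <stdint.h>

/* Bit-serial shift-and-add multiply: models a small adder plus two
   shift registers. One loop iteration = one clock cycle in hardware. */
uint32_t shift_add_mul(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;        /* accumulator register          */
    uint32_t addend = a;     /* left-shifting multiplicand    */

    for (int i = 0; i < 16; i++) {
        if (b & 1u)          /* LSB of multiplier selects...  */
            acc += addend;   /* ...whether to add this round  */
        addend <<= 1;        /* shift multiplicand left       */
        b >>= 1;             /* shift multiplier right        */
    }
    return acc;
}

Quick sanity check: shift_add_mul(1234, 5678) returns 1234 * 5678 = 7006652.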