EFM32 Instruction Execution Times

Reply by Randy Yates ● November 7, 2015

rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 3:57 PM, Randy Yates wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>>
>>> On 07/11/15 17:45, rickman wrote:
>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>>> instruction in the reference manual but nowhere are cycle times
>>>>>> mentioned.
>>>>>>
>>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>>> found.
>>>>>>
>>>>>
>>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>>> DSP like that.
>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>> the few registers they have.
>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>> every multiply-add needs the result of the previous one to be able
>>>>> to add to.
>>>>
>>>> You might want to rethink that.  The accumulate operation (add) is
>>>> typically one clock cycle while the multiply is sometimes multiple
>>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>>> the multiply spits out a result on every clock which is then added to
>>>> the accumulator on each clock producing a MAC result on each clock.
>>>>
>>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>>> per clock with various optimizations.
>>>>
>>>
>>> The Cortex M4 has a range of additional instructions aimed precisely
>>> at DSP instructions such as MAC.  That is the main difference between
>>> the M3 and the M4.
>>>
>>> The M3 and M4 have a 3 stage pipeline.
>>>
>>> A very quick google
>>
>> Hi David,
>>
>> I don't want to sound ungrateful, but why in the hell must I resort
>> to Google to get this deeply domain-specific information? It belongs
>> in a reference manual.
>>
>> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>>
>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>>
>>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>>> in parallel.
>>>
>>> With its very short pipelines, the M4 has enough registers to keep up
>>> a good throughput at MAC operations - significantly better than on an
>>> M3 in many circumstances.
>>>
>>> Even an M4 is not going to compete with a dedicated DSP on MAC
>>> throughput per clock cycle - but it is /vastly/ easier to work with.
>>
>>> The real question is what the OP actually wants to do, and if his M3
>>> (or a replacement M4) is good enough - there is no point in going for
>>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>>> enough for the application.
>>
>> The goal is to implement a high performance filter in few enough cycles
>> to get back to low-power mode and meet a specific battery life goal. Is
>> the CM3 "good enough?" TBD. There are a lot of choices (processing
>> architecture, filter specifications, etc.) that will decide.
>
> Should I assume that I can't talk you into an FPGA design in a low
> power device?

That would require a board respin. Not good!
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by rickman ● November 7, 2015

On 11/7/2015 4:49 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes:
>
>> On 11/7/2015 3:57 PM, Randy Yates wrote:
>>> David Brown <david.brown@hesbynett.no> writes:
>>>
>>>> On 07/11/15 17:45, rickman wrote:
>>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>>>> instruction in the reference manual but nowhere are cycle times
>>>>>>> mentioned.
>>>>>>>
>>>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>>>> found.
>>>>>>>
>>>>>>
>>>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>>>> DSP like that.
>>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>>> the few registers they have.
>>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>>> every multiply-add needs the result of the previous one to be able
>>>>>> to add to.
>>>>>
>>>>> You might want to rethink that.  The accumulate operation (add) is
>>>>> typically one clock cycle while the multiply is sometimes multiple
>>>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>>>> the multiply spits out a result on every clock which is then added to
>>>>> the accumulator on each clock producing a MAC result on each clock.
>>>>>
>>>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>>>> per clock with various optimizations.
>>>>>
>>>>
>>>> The Cortex M4 has a range of additional instructions aimed precisely
>>>> at DSP instructions such as MAC.  That is the main difference between
>>>> the M3 and the M4.
>>>>
>>>> The M3 and M4 have a 3 stage pipeline.
>>>>
>>>> A very quick google
>>>
>>> Hi David,
>>>
>>> I don't want to sound ungrateful, but why in the hell must I resort
>>> to Google to get this deeply domain-specific information? It belongs
>>> in a reference manual.
>>>
>>> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>>>
>>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>>>
>>>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>>>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>>>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>>>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>>>> in parallel.
>>>>
>>>> With its very short pipelines, the M4 has enough registers to keep up
>>>> a good throughput at MAC operations - significantly better than on an
>>>> M3 in many circumstances.
>>>>
>>>> Even an M4 is not going to compete with a dedicated DSP on MAC
>>>> throughput per clock cycle - but it is /vastly/ easier to work with.
>>>
>>>> The real question is what the OP actually wants to do, and if his M3
>>>> (or a replacement M4) is good enough - there is no point in going for
>>>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>>>> enough for the application.
>>>
>>> The goal is to implement a high performance filter in few enough cycles
>>> to get back to low-power mode and meet a specific battery life goal. Is
>>> the CM3 "good enough?" TBD. There are a lot of choices (processing
>>> architecture, filter specifications, etc.) that will decide.
>>
>> Should I assume that I can't talk you into an FPGA design in a low
>> power device?
>
> That would require a board respin. Not good!

Yes, of course.  I didn't quite grasp what you were saying.  You want
to duty-cycle: run the signal processing at full power, then idle at
low power.  That is exactly how to do it with most CPUs.

-- 

Rick

Reply by Dimiter_Popoff ● November 7, 2015

On 07.11.2015 г. 18:45, rickman wrote:
> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>> Hi,
>>>
>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>> Cortex M3 processor instruction execution times, namely, the
>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>> instruction in the reference manual but nowhere are cycle times
>>> mentioned.
>>>
>>> This is surreal. Every assembly language reference manual I've ever used
>>> includes cycle counts for each instruction. Here they're nowhere to be
>>> found.
>>>
>>
>> Be prepared for surprises with the MAC instruction on a non-specialized
>> DSP like that.
>> Even if they specify 1 cycle throughput this can be unrealistic given
>> the few registers they have.
>> If they have a 6 stage pipeline it takes at least 18 registers for
>> MAC intermediate results only to bypass the data dependencies.
>> IOW, if you just write a loop with a counter you will need at least 6
>> cycles (plus perhaps some additional time for the mul) simply because
>> every multiply-add needs the result of the previous one to be able
>> to add to.
>
> You might want to rethink that.  The accumulate operation (add) is
> typically one clock cycle while the multiply is sometimes multiple
> cycles.  I don't know what the multiply time is in the CM3, I thought it
> was one cycle as well, but perhaps that is a pipelined time. Regardless,
> the multiply spits out a result on every clock which is then added to
> the accumulator on each clock producing a MAC result on each clock.
>
> I remember that in the CM4 they claimed to be able to get close to 1 MAC
> per clock with various optimizations.
>

Hah, it appears I am the only one - not only in this group - to have
really gone through this.

The multiply does spit out, say, a result every cycle, OK. But this is at
the end of the pipeline, so each multiply started 6 cycles (to stay with
my 6-stage example) before its result comes out. Now since
we accumulate the result in one register - and it is also at the input
of the pipeline for the multiply-add opcode - a new instruction cannot
begin going through the pipeline before the previous one has finished,
not without some additional, DSP-ish trickery - which "normal" processors
do not have, or if they do they talk about some "DSP engine" or the like.

I had to do this the hard way on the e300 power core; in a
straightforward loop, the FMADD (FP multiply-add, 64 bit) would take
something like 20-30 ns (at a 2.5 ns clock period). The latency
specified for the FMADD is just 2 cycles though; I had to bypass the
data dependencies by using at least 6 (I did 6, 7 and 8) sets of
3 registers, so the loop would go through all the sets, which all
had different destination registers and thus would have enough
time for the pipeline every cycle. At the end of the loop all
6 (or 7 or 8) destination registers are simply added to get the
final result.
I have posted it before; hopefully this explanation is better
than my previous ones. Here is the source of how this works:

http://tgi-sci.com/misc/mac8.sa

Notice that it also saves load/store by a factor of 6 (or 8
in the example, I think it is the 8 sets case); the measured
performance with this was 5.5 ns per FMADD (the theoretical best,
with no load/store involved, would have been 5 ns).
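The multiple-accumulator idea described above can be sketched in portable C.
This is only an illustration of the technique, not the linked PowerPC source:
the names dot_single, dot_multi and NACC are made up here, and the choice of
4 accumulators is arbitrary. Whether it helps on a given core depends on that
core's pipeline and register count.

```c
#include <stddef.h>

/* One accumulator: each multiply-add needs the result of the previous
 * one, so consecutive operations form a single dependency chain. */
double dot_single(const double *x, const double *h, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * h[i];
    return acc;
}

/* NACC is a hypothetical tuning constant: the number of independent
 * partial sums kept in flight at once. */
#define NACC 4

/* Several accumulators: the NACC multiply-adds in one pass of the outer
 * loop do not depend on each other, only on their own previous value. */
double dot_multi(const double *x, const double *h, size_t n)
{
    double acc[NACC] = {0.0};
    size_t i = 0;

    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; k++)
            acc[k] += x[i + k] * h[i + k];

    /* Combine the partial sums once, at the end of the loop. */
    double sum = 0.0;
    for (size_t k = 0; k < NACC; k++)
        sum += acc[k];

    /* Handle the leftover elements when n is not a multiple of NACC. */
    for (; i < n; i++)
        sum += x[i] * h[i];

    return sum;
}
```

Note that with floating point the two versions add the products in a
different order, so the results can differ in the last bits.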

Now David said the ARM in question has only 3 stages in its pipeline,
so by the same reckoning (3 registers per stage) it would take 9
registers to bypass its data dependencies; that might even be doable
with the few registers they have.
[In fact the above is a good example of why ARM try to keep their
pipelines short; their architecture just does not have the registers
it takes to keep a longer pipeline full/productive, which is a major
architectural limitation for a load/store machine.]

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by David Brown ● November 7, 2015

On 07/11/15 23:18, Dimiter_Popoff wrote:
> On 07.11.2015 г. 18:45, rickman wrote:
>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>> Hi,
>>>>
>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>> Cortex M3 processor instruction execution times, namely, the
>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>> instruction in the reference manual but nowhere are cycle times
>>>> mentioned.
>>>>
>>>> This is surreal. Every assembly language reference manual I've ever
>>>> used
>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>> found.
>>>>
>>>
>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>> DSP like that.
>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>> the few registers they have.
>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>> MAC intermediate results only to bypass the data dependencies.
>>> IOW, if you just write a loop with a counter you will need at least 6
>>> cycles (plus perhaps some additional time for the mul) simply because
>>> every multiply-add needs the result of the previous one to be able
>>> to add to.
>>
>> You might want to rethink that.  The accumulate operation (add) is
>> typically one clock cycle while the multiply is sometimes multiple
>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>> the multiply spits out a result on every clock which is then added to
>> the accumulator on each clock producing a MAC result on each clock.
>>
>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>> per clock with various optimizations.
>>
>
> Hah, it appears I am the only one - not only in this group - to have
> really gone through this.
>
> The multiply does spit say a result every cycle, OK. But this is at the
> end of the pipeline; so each multiply has started 6 (to stay with my 6
> stages example) cycles earlier than the result it spits. Now since
> we accumulate the result in one register - and it is also at the input
> of the pipeline for the multiply-add opcode - a new instruction cannot
> begin going through the pipeline before one is finished, not without
> some additional, DSP-ish trickery - which "normal" processor do not
> have or if they do they talk about some "DSP engine" or sort of.
>
> I had to do this the hard way on the e300 power core; in a simple loop,
> the FMADD (FP multiply-add 64 bit) would take something like 20-30nS
> in a straight forward loop (at 2.5nS clock period). The latency
> specified for the FMADD is just 2 cycles though; I had to bypass the
> data dependencies by using at least 6 (I did 6, 7 and 8) sets of
> 3 registers so the loop would go through all sets which all
> had different destination registers and thus would have enough
> time for the pipeline every cycle. At the end of the loop all
> 6 (or 7 or 8) destination registers are simply added to get the
> final result.
> I have posted it before, hopefully this explanation is better
> than my previous ones. Here is the source of how this works:
>
> http://tgi-sci.com/misc/mac8.sa
>
> Notice that it also saves load/store by a factor of 6 (or 8
> in the example, I think it is the 8 sets case); the measured
> performance with this was 5.5 nS per FMADD (theoretical best,
> no load/store involved would have been 5 nS).
>
> Now David said the ARM in question has only 3 stages in its pipeline,
> it would take 9 registers to bypass its data dependencies;
> might even be doable with the few registers they have.
> [In fact the above is a good example of why ARM try to keep their
> pipelines short; their architecture just does not have the registers
> it takes to maintain a longer pipeline full/productive, it is a major
> architecture limitation for a load/store machine).
>

No, that is most certainly /not/ why ARM wants to keep these pipelines 
short.  I am not disagreeing with your calculations regarding 
throughput, latency, and registers on hardware that is not DSP-dedicated 
(and I don't know what DSP features an M4 really has, as I haven't 
needed them myself).  You've pointed out these issues before, and I 
think they are often misunderstood - people see the "MAC instruction 
timing 1 cycle" and think they can get 72 MMACs from a 72 MHz M4.  So it 
is good that you raise awareness here.

But these are primarily control-oriented microcontroller cores - short
pipelines mean low latencies, consistent timings, short branch delays,
minimal interrupt latency jitter, small core die area, and low power.
Being able to improve throughput of long MAC chains is merely a bonus.

Remember, the M4 core is not in the same class as the e300 - you would 
be better to compare the e300 to a Cortex A device with NEON SIMD 
instructions and see how that compares in DSP throughput. 
(Alternatively, you could compare MACs/s per $, or per mW, to get a 
fairer match.)

Reply by David Brown ● November 7, 2015

On 07/11/15 21:57, Randy Yates wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
>> On 07/11/15 17:45, rickman wrote:
>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>> instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>> found.
>>>>>
>>>>
>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>> DSP like that.
>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>> the few registers they have.
>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>> MAC intermediate results only to bypass the data dependencies.
>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>> every multiply-add needs the result of the previous one to be able
>>>> to add to.
>>>
>>> You might want to rethink that.  The accumulate operation (add) is
>>> typically one clock cycle while the multiply is sometimes multiple
>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>> the multiply spits out a result on every clock which is then added to
>>> the accumulator on each clock producing a MAC result on each clock.
>>>
>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>> per clock with various optimizations.
>>>
>>
>> The Cortex M4 has a range of additional instructions aimed precisely
>> at DSP instructions such as MAC.  That is the main difference between
>> the M3 and the M4.
>>
>> The M3 and M4 have a 3 stage pipeline.
>>
>> A very quick google
>
> Hi David,
>
> I don't want to sound ungrateful, but why in the hell must I resort
> to Google to get this deeply domain-specific information? It belongs
> in a reference manual.
>
> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>

And guess which link turns up at the top of a google search for "Cortex 
M3 MLA timing"?

You can argue that Silicon Labs should have information in their own 
datasheets, or at least pointers to the ARM documents.  They probably 
/do/ have that information there somewhere, if you dig deep enough.  But 
sometimes googling is a lot faster, easier and less stressful than 
looking in the "right" places.

>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>> in parallel.
>>
>> With its very short pipelines, the M4 has enough registers to keep up
>> a good throughput at MAC operations - significantly better than on an
>> M3 in many circumstances.
>>
>> Even an M4 is not going to compete with a dedicated DSP on MAC
>> throughput per clock cycle - but it is /vastly/ easier to work with.
>
>> The real question is what the OP actually wants to do, and if his M3
>> (or a replacement M4) is good enough - there is no point in going for
>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>> enough for the application.
>
> The goal is to implement a high performance filter in few enough cycles
> to get back to low-power mode and meet a specific battery life goal. Is
> the CM3 "good enough?" TBD. There are a lot of choices (processing
> architecture, filter specifications, etc.) that will decide.
>

Without knowing a good deal more, it is impossible to guess.  But
certainly "run as fast as possible for a short time, then sleep" is
the right way to minimise power.  Have plenty of capacitors on the board
to reduce power spikes to the battery.
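As a rough worked example of why the cycle count matters for battery life,
the average current of a "run, then sleep" scheme can be estimated as
below. This is a simplified model under stated assumptions: the function
names and parameters are made up for illustration, wake-up overhead and
battery derating are ignored, and no figures from any EFM32 datasheet are
implied.

```c
/* Average current (in microamps) of a load that is active for active_s
 * seconds out of every period_s seconds and asleep for the rest.
 * Hypothetical helper; ignores wake-up and clock start-up costs. */
double average_current_ua(double active_ua, double sleep_ua,
                          double active_s, double period_s)
{
    double duty = active_s / period_s;  /* fraction of time spent active */
    return duty * active_ua + (1.0 - duty) * sleep_ua;
}

/* Idealised battery life in hours for a given capacity (mAh) and
 * average current (uA); assumes the full capacity is usable. */
double battery_life_hours(double capacity_mah, double avg_ua)
{
    return capacity_mah * 1000.0 / avg_ua;
}
```

The active time is the filter's cycle count divided by the clock
frequency, so halving the cycles per sample roughly halves the active
term of the average.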

Reply by rickman ● November 7, 2015

On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
> On 07.11.2015 г. 18:45, rickman wrote:
>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>> Hi,
>>>>
>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>> Cortex M3 processor instruction execution times, namely, the
>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>> instruction in the reference manual but nowhere are cycle times
>>>> mentioned.
>>>>
>>>> This is surreal. Every assembly language reference manual I've ever
>>>> used
>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>> found.
>>>>
>>>
>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>> DSP like that.
>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>> the few registers they have.
>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>> MAC intermediate results only to bypass the data dependencies.
>>> IOW, if you just write a loop with a counter you will need at least 6
>>> cycles (plus perhaps some additional time for the mul) simply because
>>> every multiply-add needs the result of the previous one to be able
>>> to add to.
>>
>> You might want to rethink that.  The accumulate operation (add) is
>> typically one clock cycle while the multiply is sometimes multiple
>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>> the multiply spits out a result on every clock which is then added to
>> the accumulator on each clock producing a MAC result on each clock.
>>
>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>> per clock with various optimizations.
>>
>
> Hah, it appears I am the only one - not only in this group - to have
> really gone through this.
>
> The multiply does spit say a result every cycle, OK. But this is at the
> end of the pipeline; so each multiply has started 6 (to stay with my 6
> stages example) cycles earlier than the result it spits. Now since
> we accumulate the result in one register - and it is also at the input
> of the pipeline for the multiply-add opcode - a new instruction cannot
> begin going through the pipeline before one is finished, not without
> some additional, DSP-ish trickery - which "normal" processor do not
> have or if they do they talk about some "DSP engine" or sort of.

I don't quite understand what you are saying.  You seem to be saying the
pipeline is 6 clock cycles long, while that does not seem to be supported
by the facts.  Then you propose the inputs to the instruction have to be
available at the *start* of the instruction (not sure what that even
means really, as instructions are fetched, decoded and executed; which
one is the "start"?), which is not necessarily true.  I don't know that
pipelining the MAC instruction requires anything special from the CPU
other than the various controls required for pipelining.

I don't know how many clocks it takes to do pipelined MAC instructions
on the CM3.  I do know they specifically added all the required logic to
do fully pipelined MACs on the CM4; the real limitation seems to be
memory accesses.  Perhaps it was a 16 bit mode, where two coefficients
were fetched in one memory operation and two data values were
fetched in another, but they were able to reach 1 MAC per
clock as long as nothing got in the way.
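The packed 16-bit idea mentioned above can be modelled in portable C as
follows. This is a sketch of the arithmetic only, with made-up names
(dual_mac16, dot_q15_packed); the Cortex-M4 DSP extension has an
instruction of this shape (SMLAD), but the code below does not use it and
says nothing about actual cycle counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a dual 16-bit multiply-accumulate: x and h each hold two
 * signed 16-bit values packed into one 32-bit word, so one word load
 * feeds two multiplies. No overflow handling: the caller must keep
 * the running sum within int32_t range. */
int32_t dual_mac16(uint32_t x, uint32_t h, int32_t acc)
{
    /* Conversion to int16_t assumes the usual two's complement
     * wrap-around, as on common compilers. */
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t h_lo = (int16_t)(uint16_t)(h & 0xFFFFu);
    int16_t h_hi = (int16_t)(uint16_t)(h >> 16);

    return acc + (int32_t)x_lo * h_lo + (int32_t)x_hi * h_hi;
}

/* Dot product over n_pairs packed words, i.e. 2 * n_pairs samples. */
int32_t dot_q15_packed(const uint32_t *x, const uint32_t *h, size_t n_pairs)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n_pairs; i++)
        acc = dual_mac16(x[i], h[i], acc);
    return acc;
}
```

Each loop iteration here needs two word loads for two MACs, which is the
halving of memory traffic being described.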


> I had to do this the hard way on the e300 power core; in a simple loop,
> the FMADD (FP multiply-add 64 bit) would take something like 20-30nS
> in a straight forward loop (at 2.5nS clock period). The latency
> specified for the FMADD is just 2 cycles though; I had to bypass the
> data dependencies by using at least 6 (I did 6, 7 and 8) sets of
> 3 registers so the loop would go through all sets which all
> had different destination registers and thus would have enough
> time for the pipeline every cycle. At the end of the loop all
> 6 (or 7 or 8) destination registers are simply added to get the
> final result.
> I have posted it before, hopefully this explanation is better
> than my previous ones. Here is the source of how this works:
>
> http://tgi-sci.com/misc/mac8.sa

There is no reason why one processor would be the same as another in 
this regard.  This link seems to be something other than ARM CM3 code. 
I'm guessing this is your e300 power core.


> Notice that it also saves load/store by a factor of 6 (or 8
> in the example, I think it is the 8 sets case); the measured
> performance with this was 5.5 nS per FMADD (theoretical best,
> no load/store involved would have been 5 nS).
>
> Now David said the ARM in question has only 3 stages in its pipeline,
> it would take 9 registers to bypass its data dependencies;
> might even be doable with the few registers they have.
> [In fact the above is a good example of why ARM try to keep their
> pipelines short; their architecture just does not have the registers
> it takes to maintain a longer pipeline full/productive, it is a major
> architecture limitation for a load/store machine).

I'm not in a position to debate this since I am not so familiar with the 
ARM instruction set, but I don't see any reason to use more registers 
for a simple instruction like the MAC than are actually required.  I 
have never seen a problem with overlapping register usage in pipelined 
instruction sets.  As long as the register is updated by the time it is 
used, it all works.  Otherwise, what is the point of pipelining?

-- 

Rick

Reply by Tim Wescott ● November 7, 2015

On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:

> Hi,
> 
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
> 
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.

In addition to everything else that's mentioned, with today's processors 
you're highly constrained by pipelining & whatnot.

Most of the parts that I've worked with need lots of wait states to run
out of flash -- I wouldn't be surprised if the processor spends most of
its time twiddling its thumbs waiting on memory.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Reply by Dimiter_Popoff ● November 7, 2015

On 08.11.2015 г. 00:33, David Brown wrote:
> On 07/11/15 23:18, Dimiter_Popoff wrote:
>> On 07.11.2015 г. 18:45, rickman wrote:
>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the
>>>>> instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used
>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>> found.
>>>>>
>>>>
>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>> DSP like that.
>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>> the few registers they have.
>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>> MAC intermediate results only to bypass the data dependencies.
>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>> every multiply-add needs the result of the previous one to be able
>>>> to add to.
>>>
>>> You might want to rethink that.  The accumulate operation (add) is
>>> typically one clock cycle while the multiply is sometimes multiple
>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>> the multiply spits out a result on every clock which is then added to
>>> the accumulator on each clock producing a MAC result on each clock.
>>>
>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>> per clock with various optimizations.
>>>
>>
>> Hah, it appears I am the only one - not only in this group - to have
>> really gone through this.
>>
>> The multiply does spit say a result every cycle, OK. But this is at the
>> end of the pipeline; so each multiply has started 6 (to stay with my 6
>> stages example) cycles earlier than the result it spits. Now since
>> we accumulate the result in one register - and it is also at the input
>> of the pipeline for the multiply-add opcode - a new instruction cannot
>> begin going through the pipeline before one is finished, not without
>> some additional, DSP-ish trickery - which "normal" processor do not
>> have or if they do they talk about some "DSP engine" or sort of.
>>
>> I had to do this the hard way on the e300 power core; in a simple loop,
>> the FMADD (FP multiply-add 64 bit) would take something like 20-30nS
>> in a straight forward loop (at 2.5nS clock period). The latency
>> specified for the FMADD is just 2 cycles though; I had to bypass the
>> data dependencies by using at least 6 (I did 6, 7 and 8) sets of
>> 3 registers so the loop would go through all sets which all
>> had different destination registers and thus would have enough
>> time for the pipeline every cycle. At the end of the loop all
>> 6 (or 7 or 8) destination registers are simply added to get the
>> final result.
>> I have posted it before, hopefully this explanation is better
>> than my previous ones. Here is the source of how this works:
>>
>> http://tgi-sci.com/misc/mac8.sa
>>
>> Notice that it also saves load/store by a factor of 6 (or 8
>> in the example, I think it is the 8 sets case); the measured
>> performance with this was 5.5 ns per FMADD (theoretical best,
>> no load/store involved, would have been 5 ns).
>>
>> Now David said the ARM in question has only 3 stages in its pipeline,
>> it would take 9 registers to bypass its data dependencies;
>> might even be doable with the few registers they have.
>> [In fact the above is a good example of why ARM try to keep their
>> pipelines short; their architecture just does not have the registers
>> it takes to keep a longer pipeline full/productive, it is a major
>> architecture limitation for a load/store machine.]
>>
>
> No, that is most certainly /not/ why ARM wants to keep these pipelines
> short.  I am not disagreeing with your calculations regarding
> throughput, latency, and registers on hardware that is not DSP-dedicated
> (and I don't know what DSP features an M4 really has, as I haven't
> needed them myself).  You've pointed out these issues before, and I
> think they are often misunderstood - people see the "MAC instruction
> timing 1 cycle" and think they can get 72 MMACs from a 72 MHz M4.  So it
> is good that you raise awareness here.
>
> But these are primarily control-oriented microcontroller cores - short
> pipelines mean low latencies, consistent timings, short branch delays,
> minimal interrupt latency jitter, small core die area, and low power.
> Being able to improve throughput of long MAC chains is merely a bonus.
>
> Remember, the M4 core is not in the same class as the e300 - you would
> be better to compare the e300 to a Cortex A device with NEON SIMD
> instructions and see how that compares in DSP throughput.
> (Alternatively, you could compare MACs/s per $, or per mW, to get a
> fairer match.)
>
>

Well I cannot have your certainty about the motivation ARM have,
but I strongly suspect they _do_ know about the data dependencies
and they do take them into account when designing.

What I am pointing out is the architectural limitation; the MAC
loop is only one good example of how it takes pipeline depth
times 3 registers, plus address pointers and counters etc., to be
able to keep the pipeline productive.
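
As a rough illustration of that count (a sketch only, assuming one sample, one coefficient and one accumulator register per MAC in flight; the function name and the `overhead` parameter are hypothetical and not taken from any vendor document):

```python
def mac_registers_needed(pipeline_depth, regs_per_mac=3, overhead=0):
    """Registers tied up when every pipeline stage holds an independent MAC.

    regs_per_mac: sample, coefficient and accumulator (S, C, A) per MAC in flight.
    overhead: extra registers for pointers, loop counter etc. (hypothetical value,
              depends entirely on how the loop is written).
    """
    return pipeline_depth * regs_per_mac + overhead
```

With `pipeline_depth=6` this gives the 18 registers mentioned for the 6 stage example, and with `pipeline_depth=3` the 9 registers mentioned for a 3 stage pipeline; any `overhead` comes on top of that.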

Of course, like you say, most applications do not need all the
resources, and there are architectures much worse than ARM
doing commercially fine etc.; I am not interested in such a
discussion at all.

My point is about the number of registers a load/store machine needs
in order to make use of a given pipeline depth. ARM is fundamentally
limited in that by having too few registers and being a load/store
machine at the same time, there is nothing one can do against these
figures.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by Dimiter_Popoff ●November 7, 2015

On 08.11.2015 г. 00:46, rickman wrote:
> On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
>> On 07.11.2015 г. 18:45, rickman wrote:
>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the
>>>>> instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used
>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>> found.
>>>>>
>>>>
>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>> DSP like that.
>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>> the few registers they have.
>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>> MAC intermediate results only to bypass the data dependencies.
>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>> every multiply-add needs the result of the previous one to be able
>>>> to add to.
>>>
>>> You might want to rethink that.  The accumulate operation (add) is
>>> typically one clock cycle while the multiply is sometimes multiple
>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>> the multiply spits out a result on every clock which is then added to
>>> the accumulator on each clock producing a MAC result on each clock.
>>>
>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>> per clock with various optimizations.
>>>
>>
>> Hah, it appears I am the only one - not only in this group - to have
>> really gone through this.
>>
>> The multiply does spit out, say, a result every cycle, OK. But this is at the
>> end of the pipeline; so each multiply has started 6 (to stay with my 6
>> stages example) cycles earlier than the result it spits. Now since
>> we accumulate the result in one register - and it is also at the input
>> of the pipeline for the multiply-add opcode - a new instruction cannot
>> begin going through the pipeline before one is finished, not without
>> some additional, DSP-ish trickery - which "normal" processors do not
>> have, or if they do they talk about some "DSP engine" or the like.
>
> I don't quite understand what you are saying.  You seem to be saying the
> pipeline is 6 clock cycles long while that does not seem to be supported
> by the facts.

I just stick to the same example from the beginning for clarity.

>  Then you propose the inputs to the instruction have to be
> available at the *start* of the instruction (not sure what that even
> means really as instructions are fetched, decoded and executed, which
> one is the "start") which is not necessarily true.  I don't know that
> pipelining the MAC instruction requires anything special from the CPU
> other than the various controls required for pipelining.

Well I know on the surface this is easy to overlook, as I had not
thought about it until I had to deal with it. But it is a general
issue.
Operands enter the pipeline at its input; if one of these operands
needs to be the output of the pipeline guess what, you will have
to wait for the entire pipeline length to be walked through before
you have all operands to do the next operation.
Let us try the MAC example: at the pipeline input you
need a sample, a coefficient and the accumulated value,
S, C and A.
Assume that to calculate S*C+A takes as many steps as the pipeline
is deep, say 6 cycles.
Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1.
But A1 will not be available for another 6 cycles, not before
S0*C0+A0 makes it to the end of the pipeline.
It is called a data dependency.
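
A toy cycle-count model of this stall, assuming an in-order pipeline with no result forwarding, at most one instruction issued per cycle, and a result that only becomes usable `depth` cycles after issue. The function `mac_cycles` and its parameters are made up for illustration; a real core may forward results and do better than this.

```python
def mac_cycles(n_macs, depth, n_accumulators=1):
    """Cycle at which the last MAC result is ready in a simple in-order model."""
    ready = [0] * n_accumulators   # cycle at which each accumulator is usable again
    issue = 0                      # earliest cycle the next instruction may issue
    for i in range(n_macs):
        acc = i % n_accumulators           # rotate through the accumulators
        start = max(issue, ready[acc])     # stall until this accumulator is ready
        ready[acc] = start + depth         # result leaves the pipeline here
        issue = start + 1                  # next instruction one cycle later at best
    return max(ready)
```

In this model, `mac_cycles(6, 6)` (six MACs into one accumulator, 6 stages) comes out at 36 cycles, while `mac_cycles(6, 6, n_accumulators=6)` comes out at 11. The final addition of the partial sums is not counted.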

> I'm not in a position to debate this since I am not so familiar with the
> ARM instruction set, but I don't see any reason to use more registers
> for a simple instruction like the MAC than are actually required.  I
> have never seen a problem with overlapping register usage in pipelined
> instruction sets.  As long as the register is updated by the time it is
> used, it all works.  Otherwise, what is the point of pipelining?
>

Well I hope I did explain it well enough this time :-). Pipelining is
powerful but like anything else it has its limitations; the above
example summarizes it quite well.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by Dimiter_Popoff ●November 7, 2015

On 08.11.2015 г. 01:07, Dimiter_Popoff wrote:
> On 08.11.2015 г. 00:46, rickman wrote:
>> On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
>>> On 07.11.2015 г. 18:45, rickman wrote:
>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>>> the
>>>>>> instruction in the reference manual but nowhere are cycle times
>>>>>> mentioned.
>>>>>>
>>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>>> used
>>>>>> includes cycle counts for each instruction. Here they're nowhere
>>>>>> to be
>>>>>> found.
>>>>>>
>>>>>
>>>>> Be prepared for surprises with the MAC instruction on a
>>>>> non-specialized
>>>>> DSP like that.
>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>> the few registers they have.
>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>> every multiply-add needs the result of the previous one to be able
>>>>> to add to.
>>>>
>>>> You might want to rethink that.  The accumulate operation (add) is
>>>> typically one clock cycle while the multiply is sometimes multiple
>>>> cycles.  I don't know what the multiply time is in the CM3, I
>>>> thought it
>>>> was one cycle as well, but perhaps that is a pipelined time.
>>>> Regardless,
>>>> the multiply spits out a result on every clock which is then added to
>>>> the accumulator on each clock producing a MAC result on each clock.
>>>>
>>>> I remember that in the CM4 they claimed to be able to get close to 1
>>>> MAC
>>>> per clock with various optimizations.
>>>>
>>>
>>> Hah, it appears I am the only one - not only in this group - to have
>>> really gone through this.
>>>
>>> The multiply does spit out, say, a result every cycle, OK. But this is at the
>>> end of the pipeline; so each multiply has started 6 (to stay with my 6
>>> stages example) cycles earlier than the result it spits. Now since
>>> we accumulate the result in one register - and it is also at the input
>>> of the pipeline for the multiply-add opcode - a new instruction cannot
>>> begin going through the pipeline before one is finished, not without
>>> some additional, DSP-ish trickery - which "normal" processors do not
>>> have, or if they do they talk about some "DSP engine" or the like.
>>
>> I don't quite understand what you are saying.  You seem to be saying the
>> pipeline is 6 clock cycles long while that does not seem to be supported
>> by the facts.
>
> I just stick to the same example from the beginning for clarity.
>
>>  Then you propose the inputs to the instruction have to be
>> available at the *start* of the instruction (not sure what that even
>> means really as instructions are fetched, decoded and executed, which
>> one is the "start") which is not necessarily true.  I don't know that
>> pipelining the MAC instruction requires anything special from the CPU
>> other than the various controls required for pipelining.
>
> Well I know on the surface this is easy to overlook, as I had not
> thought about it until I had to deal with it. But it is a general
> issue.
> Operands enter the pipeline at its input; if one of these operands
> needs to be the output of the pipeline guess what, you will have
> to wait for the entire pipeline length to be walked through before
> you have all operands to do the next operation.
> Let us try the MAC example: at the pipeline input you
> need a sample, a coefficient and the accumulated value,
> S, C and A.
> Assume that to calculate S*C+A takes as many steps as the pipeline
> is deep, say 6 cycles.
> Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1.
> But A1 will not be available for another 6 cycles, not before
> S0*C0+A0 makes it to the end of the pipeline.
> It is called a data dependency.
>
>> I'm not in a position to debate this since I am not so familiar with the
>> ARM instruction set, but I don't see any reason to use more registers
>> for a simple instruction like the MAC than are actually required.  I
>> have never seen a problem with overlapping register usage in pipelined
>> instruction sets.  As long as the register is updated by the time it is
>> used, it all works.  Otherwise, what is the point of pipelining?
>>
>
> Well I hope I did explain it well enough this time :-). Pipelining is
> powerful but like anything else it has its limitations; the above
> example summarizes it quite well.
>
> Dimiter
>
> ------------------------------------------------------
> Dimiter Popoff, TGI             http://www.tgi-sci.com
> ------------------------------------------------------
> http://www.flickr.com/photos/didi_tgi/
>
>

 > Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1.

my mistake - obviously this should read

"Now we start with S0*C0+A0=A1; next cycle we do S1*C1+A1."

Dimiter
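
The corrected recurrence, and the multiple-accumulator workaround described earlier in the thread, can be sketched as follows. This is plain Python to show the data flow only, not the e300 assembly from mac8.sa; the function names and the default of 6 accumulators are assumptions for illustration.

```python
def dot_single(samples, coeffs):
    """One accumulator: every step needs the previous A before it can start."""
    acc = 0
    for s, c in zip(samples, coeffs):
        acc = s * c + acc          # A(n+1) = S(n)*C(n) + A(n)
    return acc


def dot_interleaved(samples, coeffs, n_acc=6):
    """Several accumulators: consecutive MACs do not depend on each other."""
    accs = [0] * n_acc
    for i, (s, c) in enumerate(zip(samples, coeffs)):
        accs[i % n_acc] += s * c   # MAC number i goes to accumulator i mod n_acc
    return sum(accs)               # partial sums are added once, after the loop
```

For integer data both functions return the same value. For floating point the order of the additions differs, so the last bits of the result may differ between the two.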
