EFM32 Instruction Execution Times

Hi,

I'm trying to find information on the Silicon Labs/Energy Micro EFM32
Cortex M3 processor instruction execution times, namely, the
MLA/Multiply-Accumulate instruction, but others as well. I've found the
instruction in the reference manual but nowhere are cycle times
mentioned.

This is surreal. Every assembly language reference manual I've ever used
includes cycle counts for each instruction. Here they're nowhere to be
found.
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by rickman ● November 7, 2015

On 11/7/2015 10:47 AM, Randy Yates wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
>
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.

I'm not 100% certain, but I think details like this are the same for all 
CM3 processors since all makers of the chips license the same code for 
the processor.  They can optimize various aspects like cache size, 
memory and peripherals, but ARM has been moving to standardizing more 
and more of the core CPU design so that there is a great deal of 
consistency across all the instantiations of their design.

Check at the ARM web site for docs on the CM3 core.

I'm curious why you are working with this particular part.  I have 
looked at their devices and not found a lot that makes them stand out in 
the crowd of CM3s.  Their big deal is supposed to be low power, but I 
didn't find them to be much lower power than the many other CM3s available.

-- 

Rick

Reply by Dimiter_Popoff ● November 7, 2015

On 07.11.2015 г. 17:47, Randy Yates wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
>
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
>

Be prepared for surprises with the MAC instruction on a non-specialized
DSP like that.
Even if they specify 1 cycle throughput this can be unrealistic given
the few registers they have.
If they have a 6 stage pipeline it takes at least 18 registers for
MAC intermediate results only to bypass the data dependencies.
IOW, if you just write a loop with a counter you will need at least 6
cycles (plus perhaps some additional time for the mul) simply because
every multiply-add needs the result of the previous one to be able
to add to.
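
To make the dependency argument concrete, here is a minimal sketch in C of the two loop shapes being contrasted: one accumulator, where every multiply-add waits on the previous sum, and four independent accumulators that are only combined at the end. The function names and the unroll factor of 4 are illustrative assumptions, not anything from a vendor library, and whether the second form is actually faster on a given core depends on its pipeline depth and multiplier latency, so it is something to measure rather than assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: each multiply-accumulate needs the previous sum. */
int32_t dot_single(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}

/* Four independent accumulators (hypothetical unroll factor):
 * the four multiply-adds in one iteration do not depend on each other.
 * Assumes n is a multiple of 4 and that the sums do not overflow int32_t. */
int32_t dot_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    assert(n % 4 == 0);
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    /* Combine the partial sums once, outside the loop. */
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value for the same input as long as nothing overflows; the only difference is how many results are "in flight" at once, which is also why the unrolled form costs extra registers.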

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by rickman ● November 7, 2015

On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
> On 07.11.2015 г. 17:47, Randy Yates wrote:
>> Hi,
>>
>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>> Cortex M3 processor instruction execution times, namely, the
>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>> instruction in the reference manual but nowhere are cycle times
>> mentioned.
>>
>> This is surreal. Every assembly language reference manual I've ever used
>> includes cycle counts for each instruction. Here they're nowhere to be
>> found.
>>
>
> Be prepared for surprises with the MAC instruction on a non-specialized
> DSP like that.
> Even if they specify 1 cycle throughput this can be unrealistic given
> the few registers they have.
> If they have a 6 stage pipeline it takes at least 18 registers for
> MAC intermediate results only to bypass the data dependencies.
> IOW, if you just write a loop with a counter you will need at least 6
> cycles (plus perhaps some additional time for the mul) simply because
> every multiply-add needs the result of the previous one to be able
> to add to.

You might want to rethink that.  The accumulate operation (add) is 
typically one clock cycle while the multiply is sometimes multiple 
cycles.  I don't know what the multiply time is in the CM3, I thought it 
was one cycle as well, but perhaps that is a pipelined time. 
Regardless, the multiply spits out a result on every clock which is then 
added to the accumulator on each clock producing a MAC result on each 
clock.

I remember that in the CM4 they claimed to be able to get close to 1 MAC 
per clock with various optimizations.

-- 

Rick

Reply by David Brown ● November 7, 2015

On 07/11/15 17:45, rickman wrote:
> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>> Hi,
>>>
>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>> Cortex M3 processor instruction execution times, namely, the
>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>> instruction in the reference manual but nowhere are cycle times
>>> mentioned.
>>>
>>> This is surreal. Every assembly language reference manual I've ever used
>>> includes cycle counts for each instruction. Here they're nowhere to be
>>> found.
>>>
>>
>> Be prepared for surprises with the MAC instruction on a non-specialized
>> DSP like that.
>> Even if they specify 1 cycle throughput this can be unrealistic given
>> the few registers they have.
>> If they have a 6 stage pipeline it takes at least 18 registers for
>> MAC intermediate results only to bypass the data dependencies.
>> IOW, if you just write a loop with a counter you will need at least 6
>> cycles (plus perhaps some additional time for the mul) simply because
>> every multiply-add needs the result of the previous one to be able
>> to add to.
>
> You might want to rethink that.  The accumulate operation (add) is
> typically one clock cycle while the multiply is sometimes multiple
> cycles.  I don't know what the multiply time is in the CM3, I thought it
> was one cycle as well, but perhaps that is a pipelined time. Regardless,
> the multiply spits out a result on every clock which is then added to
> the accumulator on each clock producing a MAC result on each clock.
>
> I remember that in the CM4 they claimed to be able to get close to 1 MAC
> per clock with various optimizations.
>

The Cortex M4 has a range of additional instructions aimed precisely at 
DSP operations such as MAC.  That is the main difference between the 
M3 and the M4.

The M3 and M4 have a 3 stage pipeline.

A very quick Google search shows that MLA (32x32 -> 32) takes 2 cycles 
on the M3 and 1 cycle on the M4.  If you want 64-bit results and 
accumulates, it is 4 to 7 cycles on the M3 and 1 on the M4.  The M4 also 
has a variety of other DSP-style instructions, including SIMD codes for 
16-bit or 8-bit MACs in parallel.

With its very short pipelines, the M4 has enough registers to keep up a 
good throughput at MAC operations - significantly better than on an M3 
in many circumstances.
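
As a rough back-of-envelope illustration of what those per-instruction figures mean for a filter, the sketch below multiplies out a cycle estimate per output sample. Only the 2-cycle and 1-cycle MLA figures come from the numbers quoted above; the function name `fir_cycles_per_sample` and the `overhead_per_tap` parameter (loads, pointer updates, loop branch) are hypothetical, and the real overhead has to come from the TRM or from measurement.

```c
#include <stdint.h>

/* MLA (32x32 -> 32) cycle counts as quoted in this post. */
enum { M3_MLA_CYCLES = 2, M4_MLA_CYCLES = 1 };

/* Estimated cycles to produce one FIR output sample.
 * overhead_per_tap is a placeholder for everything that is not the
 * multiply-accumulate itself; it is an assumption, not a datasheet value. */
uint32_t fir_cycles_per_sample(uint32_t taps,
                               uint32_t mac_cycles,
                               uint32_t overhead_per_tap)
{
    return taps * (mac_cycles + overhead_per_tap);
}
```

For example, with a hypothetical 64-tap filter and an assumed 4 cycles of overhead per tap, `fir_cycles_per_sample(64, M3_MLA_CYCLES, 4)` gives 384 and `fir_cycles_per_sample(64, M4_MLA_CYCLES, 4)` gives 320, which shows how quickly the load and loop overhead can dominate the MAC time itself.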

Even an M4 is not going to compete with a dedicated DSP on MAC 
throughput per clock cycle - but it is /vastly/ easier to work with. 
The real question is what the OP actually wants to do, and if his M3 (or 
a replacement M4) is good enough - there is no point in going for a 
hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than 
enough for the application.

Reply by Randy Yates ● November 7, 2015

rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 10:47 AM, Randy Yates wrote:
>> Hi,
>>
>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>> Cortex M3 processor instruction execution times, namely, the
>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>> instruction in the reference manual but nowhere are cycle times
>> mentioned.
>>
>> This is surreal. Every assembly language reference manual I've ever used
>> includes cycle counts for each instruction. Here they're nowhere to be
>> found.
>
> I'm not 100% certain, but I think details like this are the same for
> all CM3 processors since all makers of the chips license the same code
> for the processor.  They can optimize various aspects like cache size,
> memory and peripherals, but ARM has been moving to standardizing more
> and more of the core CPU design so that there is a great deal of
> consistency across all the instantiations of their design.
>
> Check at the ARM web site for docs on the CM3 core.

Hi Rick,

Thanks, I will.

> I'm curious why you are working with this particular part.  I have
> looked at their devices and not found a lot that makes them stand out
> in the crowd of CM3s.  Their big deal is supposed to be low power, but
> I didn't find them to be much lower power than the many other CM3s
> available.

If I were choosing the processor from scratch I would almost certainly
have chosen the M4, assuming it made sense from a power POV.
However, I'm coming in on the tail end of someone else's project, so the
choice wasn't mine and was made a while back.
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by Randy Yates ● November 7, 2015

Randy Yates <yates@digitalsignallabs.com> writes:

> rickman <gnuarm@gmail.com> writes:
> [...]
>> I'm curious why you are working with this particular part.  I have
>> looked at their devices and not found a lot that makes them stand out
>> in the crowd of CM3s.  Their big deal is supposed to be low power, but
>> I didn't find them to be much lower power than the many other CM3s
>> available.
>
> If I were choosing the processor from scratch I would almost certainly
> have chosen the M4, assuming it made sense from a power POV.
> However, I'm coming in on the tail end of someone else's project, so the
> choice wasn't mine and was made a while back.

Rick, I realize you were asking why this particular CM3. Of course the
answer is still that I didn't make the choice.

To pick your brain, why not the EFM32? Is there something to detract
from this SiLabs choice?
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by Randy Yates ● November 7, 2015

David Brown <david.brown@hesbynett.no> writes:

> On 07/11/15 17:45, rickman wrote:
>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>> Hi,
>>>>
>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>> Cortex M3 processor instruction execution times, namely, the
>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>> instruction in the reference manual but nowhere are cycle times
>>>> mentioned.
>>>>
>>>> This is surreal. Every assembly language reference manual I've ever used
>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>> found.
>>>>
>>>
>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>> DSP like that.
>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>> the few registers they have.
>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>> MAC intermediate results only to bypass the data dependencies.
>>> IOW, if you just write a loop with a counter you will need at least 6
>>> cycles (plus perhaps some additional time for the mul) simply because
>>> every multiply-add needs the result of the previous one to be able
>>> to add to.
>>
>> You might want to rethink that.  The accumulate operation (add) is
>> typically one clock cycle while the multiply is sometimes multiple
>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>> the multiply spits out a result on every clock which is then added to
>> the accumulator on each clock producing a MAC result on each clock.
>>
>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>> per clock with various optimizations.
>>
>
> The Cortex M4 has a range of additional instructions aimed precisely
> at DSP instructions such as MAC.  That is the main difference between
> the M3 and the M4.
>
> The M3 and M4 have a 3 stage pipeline.
>
> A very quick google 

Hi David,

I don't want to sound ungrateful, but why in the hell must I resort
to Google to get this deeply domain-specific information? It belongs
in a reference manual. 

Turns out Rick was right - it's in the ARM Cortex M3 TRM:

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf

> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
> the M4. If you want 64-bit results and accumulates, it is 4 to 7
> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
> in parallel.
>
> With its very short pipelines, the M4 has enough registers to keep up
> a good throughput at MAC operations - significantly better than on an
> M3 in many circumstances.
>
> Even an M4 is not going to compete with a dedicated DSP on MAC
> throughput per clock cycle - but it is /vastly/ easier to work with.

> The real question is what the OP actually wants to do, and if his M3
> (or a replacement M4) is good enough - there is no point in going for
> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
> enough for the application.

The goal is to implement a high performance filter in few enough cycles
to get back to low-power mode and meet a specific battery life goal. Is
the CM3 "good enough?" TBD. There are a lot of choices (processing
architecture, filter specifications, etc.) that will decide.
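
To show how the cycle count feeds into the battery-life question, here is a minimal sketch of the arithmetic in C. Everything in it is an assumption for illustration: the function names are hypothetical, the model ignores wake-up time and peripheral current, and any currents, clock rates, or capacities plugged in would have to come from the actual datasheet and battery.

```c
/* Fraction of time the CPU must be awake to run the filter. */
double active_duty(double cycles_per_sample, double sample_rate_hz,
                   double cpu_hz)
{
    return cycles_per_sample * sample_rate_hz / cpu_hz;
}

/* Average current for a simple two-state (run / sleep) model. */
double average_current_ma(double duty, double active_ma, double sleep_ma)
{
    return duty * active_ma + (1.0 - duty) * sleep_ma;
}

/* Idealized battery life; ignores self-discharge and derating. */
double battery_life_hours(double capacity_mah, double avg_ma)
{
    return capacity_mah / avg_ma;
}
```

With made-up numbers (1000 cycles per sample, 8 kHz sample rate, 32 MHz clock), `active_duty(1000.0, 8000.0, 32.0e6)` is 0.25; halving the cycles per sample halves the duty, and the run-mode share of the average current drops with it.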
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by rickman ● November 7, 2015

On 11/7/2015 3:57 PM, Randy Yates wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
>> On 07/11/15 17:45, rickman wrote:
>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>> instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>> found.
>>>>>
>>>>
>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>> DSP like that.
>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>> the few registers they have.
>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>> MAC intermediate results only to bypass the data dependencies.
>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>> every multiply-add needs the result of the previous one to be able
>>>> to add to.
>>>
>>> You might want to rethink that.  The accumulate operation (add) is
>>> typically one clock cycle while the multiply is sometimes multiple
>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>> the multiply spits out a result on every clock which is then added to
>>> the accumulator on each clock producing a MAC result on each clock.
>>>
>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>> per clock with various optimizations.
>>>
>>
>> The Cortex M4 has a range of additional instructions aimed precisely
>> at DSP instructions such as MAC.  That is the main difference between
>> the M3 and the M4.
>>
>> The M3 and M4 have a 3 stage pipeline.
>>
>> A very quick google
>
> Hi David,
>
> I don't want to sound ungrateful, but why in the hell must I resort
> to Google to get this deeply domain-specific information? It belongs
> in a reference manual.
>
> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>
>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>> in parallel.
>>
>> With its very short pipelines, the M4 has enough registers to keep up
>> a good throughput at MAC operations - significantly better than on an
>> M3 in many circumstances.
>>
>> Even an M4 is not going to compete with a dedicated DSP on MAC
>> throughput per clock cycle - but it is /vastly/ easier to work with.
>
>> The real question is what the OP actually wants to do, and if his M3
>> (or a replacement M4) is good enough - there is no point in going for
>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>> enough for the application.
>
> The goal is to implement a high performance filter in few enough cycles
> to get back to low-power mode and meet a specific battery life goal. Is
> the CM3 "good enough?" TBD. There are a lot of choices (processing
> architecture, filter specifications, etc.) that will decide.

Should I assume that I can't talk you into an FPGA design in a low power 
device?

-- 

Rick

Reply by Randy Yates ● November 7, 2015

rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 3:57 PM, Randy Yates wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>>
>>> On 07/11/15 17:45, rickman wrote:
>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>>> instruction in the reference manual but nowhere are cycle times
>>>>>> mentioned.
>>>>>>
>>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>>> found.
>>>>>>
>>>>>
>>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>>> DSP like that.
>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>> the few registers they have.
>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>> every multiply-add needs the result of the previous one to be able
>>>>> to add to.
>>>>
>>>> You might want to rethink that.  The accumulate operation (add) is
>>>> typically one clock cycle while the multiply is sometimes multiple
>>>> cycles.  I don't know what the multiply time is in the CM3, I thought it
>>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>>> the multiply spits out a result on every clock which is then added to
>>>> the accumulator on each clock producing a MAC result on each clock.
>>>>
>>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>>> per clock with various optimizations.
>>>>
>>>
>>> The Cortex M4 has a range of additional instructions aimed precisely
>>> at DSP instructions such as MAC.  That is the main difference between
>>> the M3 and the M4.
>>>
>>> The M3 and M4 have a 3 stage pipeline.
>>>
>>> A very quick google
>>
>> Hi David,
>>
>> I don't want to sound ungrateful, but why in the hell must I resort
>> to Google to get this deeply domain-specific information? It belongs
>> in a reference manual.
>>
>> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>>
>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>>
>>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>>> in parallel.
>>>
>>> With its very short pipelines, the M4 has enough registers to keep up
>>> a good throughput at MAC operations - significantly better than on an
>>> M3 in many circumstances.
>>>
>>> Even an M4 is not going to compete with a dedicated DSP on MAC
>>> throughput per clock cycle - but it is /vastly/ easier to work with.
>>
>>> The real question is what the OP actually wants to do, and if his M3
>>> (or a replacement M4) is good enough - there is no point in going for
>>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>>> enough for the application.
>>
>> The goal is to implement a high performance filter in few enough cycles
>> to get back to low-power mode and meet a specific battery life goal. Is
>> the CM3 "good enough?" TBD. There are a lot of choices (processing
>> architecture, filter specifications, etc.) that will decide.
>
> Should I assume that I can't talk you into an FPGA design in a low
> power device?

That would require a board respin. Not good!
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
