EmbeddedRelated.com
Forums

EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
Hi,

I'm trying to find information on the Silicon Labs/Energy Micro EFM32
Cortex M3 processor instruction execution times, namely, the
MLA/Multiply-Accumulate instruction, but others as well. I've found the
instruction in the reference manual but nowhere are cycle times
mentioned.

This is surreal. Every assembly language reference manual I've ever used
includes cycle counts for each instruction. Here they're nowhere to be
found.
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On 11/7/2015 10:47 AM, Randy Yates wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
>
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
I'm not 100% certain, but I think details like this are the same for
all CM3 processors since all makers of the chips license the same code
for the processor. They can optimize various aspects like cache size,
memory and peripherals, but ARM has been moving to standardizing more
and more of the core CPU design so that there is a great deal of
consistency across all the instantiations of their design.

Check at the ARM web site for docs on the CM3 core.

I'm curious why you are working with this particular part. I have
looked at their devices and not found a lot that makes them stand out
in the crowd of CM3s. Their big deal is supposed to be low power, but
I didn't find them to be much lower power than the many other CM3s
available.
-- 
Rick
On 07.11.2015 г. 17:47, Randy Yates wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
>
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
Be prepared for surprises with the MAC instruction on a non-specialized
DSP like that.
Even if they specify 1 cycle throughput this can be unrealistic given
the few registers they have.
If they have a 6 stage pipeline it takes at least 18 registers for
MAC intermediate results only to bypass the data dependencies.
IOW, if you just write a loop with a counter you will need at least 6
cycles (plus perhaps some additional time for the mul) simply because
every multiply-add needs the result of the previous one to be able
to add to.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
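The data dependency described in the post above can be illustrated in C. This is only a sketch: `dot_single` and `dot_unrolled4` are hypothetical names that do not come from the thread, and whether splitting the sum across several accumulators actually saves cycles depends on the pipeline depth of the core and on the code the compiler generates. It also assumes inputs small enough that the 32-bit sums do not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain MAC loop: each multiply-add needs the previous value of acc,
 * so all n operations form one serial dependency chain. */
int32_t dot_single(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}

/* Same sum split over four independent accumulators. Consecutive
 * multiply-adds no longer depend on each other, at the cost of three
 * extra registers for intermediate results. */
int32_t dot_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    /* Handle the 0..3 leftover elements. */
    for (; i < n; i++) {
        a0 += (int32_t)x[i] * h[i];
    }
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value as long as nothing overflows; the only difference is how many independent partial sums are in flight at once.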
On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
> On 07.11.2015 г. 17:47, Randy Yates wrote:
>> [...]
>
> Be prepared for surprises with the MAC instruction on a non-specialized
> DSP like that.
> Even if they specify 1 cycle throughput this can be unrealistic given
> the few registers they have.
> If they have a 6 stage pipeline it takes at least 18 registers for
> MAC intermediate results only to bypass the data dependencies.
> IOW, if you just write a loop with a counter you will need at least 6
> cycles (plus perhaps some additional time for the mul) simply because
> every multiply-add needs the result of the previous one to be able
> to add to.
You might want to rethink that. The accumulate operation (add) is
typically one clock cycle while the multiply is sometimes multiple
cycles. I don't know what the multiply time is in the CM3, I thought it
was one cycle as well, but perhaps that is a pipelined time. Regardless,
the multiply spits out a result on every clock which is then added to
the accumulator on each clock producing a MAC result on each clock.

I remember that in the CM4 they claimed to be able to get close to 1 MAC
per clock with various optimizations.
-- 
Rick
On 07/11/15 17:45, rickman wrote:
> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>> [...]
>
> You might want to rethink that. The accumulate operation (add) is
> typically one clock cycle while the multiply is sometimes multiple
> cycles. I don't know what the multiply time is in the CM3, I thought it
> was one cycle as well, but perhaps that is a pipelined time. Regardless,
> the multiply spits out a result on every clock which is then added to
> the accumulator on each clock producing a MAC result on each clock.
>
> I remember that in the CM4 they claimed to be able to get close to 1 MAC
> per clock with various optimizations.
The Cortex M4 has a range of additional instructions aimed precisely
at DSP instructions such as MAC. That is the main difference between
the M3 and the M4.

The M3 and M4 have a 3 stage pipeline.

A very quick google shows that MLA (32x32 -> 32) on the M3 is 2 cycles.
It is 1 cycle on the M4. If you want 64-bit results and accumulates,
it is 4 to 7 cycles on the M3 and 1 on the M4. The M4 also has a
variety of other DSP-style instructions, including SIMD codes for
16-bit or 8-bit MACs in parallel.

With its very short pipelines, the M4 has enough registers to keep up
a good throughput at MAC operations - significantly better than on an
M3 in many circumstances.

Even an M4 is not going to compete with a dedicated DSP on MAC
throughput per clock cycle - but it is /vastly/ easier to work with.

The real question is what the OP actually wants to do, and if his M3
(or a replacement M4) is good enough - there is no point in going for
a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
enough for the application.
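To put the cycle counts quoted above in perspective, here is a rough C sketch of a lower bound on the MAC cost of one FIR output sample. The names `enum core`, `mla_cycles` and `fir_mac_cycles` are made up for illustration, and the figures are simply the 2-cycle (M3) and 1-cycle (M4) MLA numbers from the post. The estimate deliberately ignores operand loads, loop overhead and flash wait states, so real code will take longer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical selector for the two cores discussed in the thread. */
enum core { CORE_M3, CORE_M4 };

/* Cycles per 32x32->32 MLA, as quoted in the post above:
 * 2 on the Cortex-M3, 1 on the Cortex-M4. */
static uint32_t mla_cycles(enum core c)
{
    return (c == CORE_M3) ? 2u : 1u;
}

/* Lower bound on cycles spent in MLA instructions for one output
 * sample of a FIR filter with `taps` coefficients. Loads, pointer
 * updates and branches are NOT counted. */
uint32_t fir_mac_cycles(enum core c, uint32_t taps)
{
    uint32_t per_mac = mla_cycles(c);
    assert(taps <= UINT32_MAX / per_mac); /* guard against overflow */
    return taps * per_mac;
}
```

For example, a 64-tap filter needs at least 128 MLA cycles per output sample on the M3 and at least 64 on the M4 under these assumptions.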
rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 10:47 AM, Randy Yates wrote:
>> [...]
>
> I'm not 100% certain, but I think details like this are the same for
> all CM3 processors since all makers of the chips license the same code
> for the processor. They can optimize various aspects like cache size,
> memory and peripherals, but ARM has been moving to standardizing more
> and more of the core CPU design so that there is a great deal of
> consistency across all the instantiations of their design.
>
> Check at the ARM web site for docs on the CM3 core.
Hi Rick,

Thanks, I will.

> I'm curious why you are working with this particular part. I have
> looked at their devices and not found a lot that makes them stand out
> in the crowd of CM3s. Their big deal is supposed to be low power, but
> I didn't find them to be much lower power than the many other CM3s
> available.
If I were choosing the processor from scratch I would almost certainly
have chosen the M4, assuming it made sense from a power POV. However,
I'm coming in on the tail-end of someone else's project, so the choice
wasn't mine and was already made a while back.
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates <yates@digitalsignallabs.com> writes:

> rickman <gnuarm@gmail.com> writes:
> [...]
>> I'm curious why you are working with this particular part. I have
>> looked at their devices and not found a lot that makes them stand out
>> in the crowd of CM3s. Their big deal is supposed to be low power, but
>> I didn't find them to be much lower power than the many other CM3s
>> available.
>
> If I were choosing the processor from scratch I would almost certainly
> have chosen the M4, assuming it made sense from a power POV. However,
> I'm coming in on the tail-end of someone else's project, so the choice
> wasn't mine and was already made a while back.
Rick,

I realized you are asking why this particular CM3. Of course the answer
is still that I didn't make the choice.

To pick your brain, why not the EFM32? Is there something to detract
from this SiLabs choice?
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
David Brown <david.brown@hesbynett.no> writes:

> On 07/11/15 17:45, rickman wrote:
>> [...]
>
> The Cortex M4 has a range of additional instructions aimed precisely
> at DSP instructions such as MAC. That is the main difference between
> the M3 and the M4.
>
> The M3 and M4 have a 3 stage pipeline.
>
> A very quick google
Hi David,

I don't want to sound ungrateful, but why in the hell must I resort
to Google to get this deeply domain-specific information? It belongs
in a reference manual.

Turns out Rick was right - it's in the ARM Cortex M3 TRM:

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf

> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
> the M4. If you want 64-bit results and accumulates, it is 4 to 7
> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
> in parallel.
>
> With its very short pipelines, the M4 has enough registers to keep up
> a good throughput at MAC operations - significantly better than on an
> M3 in many circumstances.
>
> Even an M4 is not going to compete with a dedicated DSP on MAC
> throughput per clock cycle - but it is /vastly/ easier to work with.

> The real question is what the OP actually wants to do, and if his M3
> (or a replacement M4) is good enough - there is no point in going for
> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
> enough for the application.
The goal is to implement a high performance filter in few enough cycles
to get back to low-power mode and meet a specific battery life goal. Is
the CM3 "good enough?" TBD. There are a lot of choices (processing
architecture, filter specifications, etc.) that will decide.
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
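The trade-off between filter cycles and battery life can be sketched as a simple duty-cycle calculation. Everything here is an assumption for illustration: the function names are invented, and the model only considers an active current and a sleep current, ignoring wake-up time, peripherals, regulator losses and battery derating. It is a first-order estimate, not a figure for the EFM32.

```c
#include <assert.h>

/* Fraction of time the core must be awake to process every sample.
 * cycles_per_sample: total CPU cycles spent per input sample
 * sample_rate_hz:    input sample rate
 * cpu_hz:            core clock while awake */
double active_fraction(double cycles_per_sample, double sample_rate_hz,
                       double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return cycles_per_sample * sample_rate_hz / cpu_hz;
}

/* Time-weighted average of active and sleep current (both in mA). */
double average_current_ma(double duty, double active_ma, double sleep_ma)
{
    assert(duty >= 0.0 && duty <= 1.0);
    return duty * active_ma + (1.0 - duty) * sleep_ma;
}

/* Idealised battery life: capacity divided by average current. */
double battery_life_hours(double capacity_mah, double avg_ma)
{
    assert(avg_ma > 0.0);
    return capacity_mah / avg_ma;
}
```

With hypothetical numbers, 1000 cycles per sample at 8 kHz on a 32 MHz clock gives an active fraction of 0.25, so halving the cycle count moves the average current roughly halfway toward the sleep current when the active current dominates.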
On 11/7/2015 3:57 PM, Randy Yates wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> [...]
>
> The goal is to implement a high performance filter in few enough cycles
> to get back to low-power mode and meet a specific battery life goal. Is
> the CM3 "good enough?" TBD. There are a lot of choices (processing
> architecture, filter specifications, etc.) that will decide.
Should I assume that I can't talk you into an FPGA design in a low
power device?
-- 
Rick
rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 3:57 PM, Randy Yates wrote:
>> [...]
>
> Should I assume that I can't talk you into an FPGA design in a low
> power device?
That would require a board respin. Not good!
-- 
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com