EmbeddedRelated.com
Forums

EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
On 09/11/15 08:17, rickman wrote:
> On 11/9/2015 1:19 AM, Tim Wescott wrote:
>> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>>
>>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used includes cycle counts for each instruction. Here they're nowhere
>>>>> to be found.
>>>>
>>>> In addition to everything else that's mentioned, with today's
>>>> processors you're highly constrained by pipelining & whatnot.
>>>>
>>>> Most of the parts that I've worked with need lots of wait states to run
>>>> out of flash -- I wouldn't be surprised if the processor spends most of
>>>> its time twiddling its thumbs waiting on memory.
>>>
>>> If you want to run fast, you either put your code in RAM, or you let the
>>> processor use cache that is available on all but low end processors.
>>
>> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
>> do not sport caches.
>
> According to wikipedia (not always reliable) CM3 has no cache. There's
> still RAM speedup and I forgot that most CM3 devices use a prefetch to
> make sure the CPU has instructions when they are needed.
I think it is /possible/ to have cache on a CM3, but it is certainly not common.
> Think about it. Why would they keep speeding up the CPU clock speed if
> performance was limited by the Flash alone?
There are three points here:

Cortex M devices use the Thumb2 instruction set - the aim of this is that a solid majority of instructions are 16-bit. Since these cpus are single-issue, that means you can run your cpu at an average of about 50-70% higher clock speed than the flash, assuming a 32-bit bus.

There are ways to get processors going faster than flash even without a processor instruction cache. In particular, it is common for the flash units in faster microcontrollers to have a small buffer/cache in the flash module. If this is combined with wide-access flash, say 64-bit wide, you can easily get streams of instructions at cpu speed (but with a penalty for branches and calls).

And here is the main point - manufacturers /don't/ keep speeding up CPU clock speed on the CM3. Most serious manufacturers who make fast CM microcontrollers have moved to the CM4 - some never bothered with the CM3 in the first place. They put caches (and single-precision floating point) on their faster devices.

So yes, CM3 devices /are/ low end - they are now found either on older, legacy parts (in this field, that means more than a couple of years old), or as microcontrollers in integrated chips where the cpu plays a minor role (such as a high-end ADC that happens to have a cpu integrated).
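The "put your code in RAM" option mentioned above can be sketched in C. This is a minimal, hedged example assuming a GCC-style toolchain: the section name ".ramfunc" is an assumption - the actual section name, and whether the startup code copies it from flash to RAM, depend entirely on the vendor's linker script.

```c
/* Sketch: placing a hot loop in RAM so it is not subject to flash
 * wait states.  The ".ramfunc" section name is an ASSUMPTION -- check
 * your toolchain's linker script for the real name and for whether
 * startup code copies the section from flash to RAM. */
#include <stdint.h>

__attribute__((section(".ramfunc"), noinline))
int32_t mac_loop(const int16_t *s, const int16_t *c, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        /* a good compiler emits an MLA (multiply-accumulate) here */
        acc += (int32_t)s[i] * c[i];
    }
    return acc;
}
```

Whether this actually helps depends on the part: on devices where the flash prefetch buffer already keeps up with the core, moving code to RAM may gain nothing.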
>> At least on the ST parts, not all of the RAM is connected to the
>> processor's instruction bus, so you don't get as much speedup as you'd
>> think. Some do have a magic memory address range that's dual-ported to
>> both buses.
On Sun, 8 Nov 2015 02:19:53 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>>> So where else can data enter the pipeline except at its input?
Data can be inserted into the pipeline at the input of any functional unit. It can also be extracted at the output of any functional unit, even for units in the middle of the pipe.

The general method is called "bypassing" or "forwarding", and many modern CPUs contain large bypass/forward networks to route data internally "just in time" to where it needs to be.

This bypass/forward method is what Rick was referring to previously when he said:
>> ... A pipeline does multiple steps, you don't need an input until the step
>> that uses it. The adder for the accumulation only needs the result of the
>> accumulation on the next clock when it starts the next add. Why would it
>> need the result of the add at the same time as the inputs to the multiply?
A MAC unit is a combination of a multiplier and an adder. A very low end MAC may be serial, but most are pipelined. A pipelined unit will have a *zero* cycle forwarding path from the adder's output back into the adder's input, so the result can be used on the very next cycle.
>>> How can you have the result of a 6 cycle operation in less than >>> 6 cycles?
You can't. However, the point of the pipeline is to parallelize operations by overlapping them.

In your example:

  S0 * C0 + A0 = A1
  S1 * C1 + A1 = A2
   :
  Sn * Cn + An = A(n+1)

The output of the MAC can be fed back directly into its adder to be available on the next cycle. So consider a typical low end 4 cycle pipelined MAC, combining a 3 cycle multiplier with a 1 cycle adder, that can *start* a new operation on every cycle:

  cycle   operation(s)
    1     S0 * C0 -> T0
    2     S1 * C1 -> T1
    3     S2 * C2 -> T2
    4     S3 * C3 -> T3 , T0 + A0 -> A1
    5     S4 * C4 -> T4 , T1 + A1 -> A2
    6     S5 * C5 -> T5 , T2 + A2 -> A3
    :

The 1st result takes 4 cycles - the length of the pipeline - but once the pipe is primed, it produces a new result on every succeeding cycle.

In general, a pipe that can start a new operation every N cycles can produce a new result every N cycles. How often a pipeline can start a new operation is referred to as the "major cycle" of the pipe. The major cycle of the pipeline may be *far* less than its total length. Reducing the major cycle is the whole point of pipelining an operation.

George
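The timing table above can be modeled cycle by cycle. Here is a toy host-side C model of the same 4-cycle MAC (3-cycle multiplier pipeline feeding a 1-cycle adder, one new operation issued per clock); it is an illustration of the timing argument, not real hardware.

```c
/* Toy cycle-accurate model of a 4-cycle pipelined MAC: a 3-stage
 * multiplier pipe feeding a 1-cycle adder whose output is forwarded
 * straight back to its own input.  Shows the first result appearing
 * at cycle 4, then one result per cycle thereafter. */
#define MUL_STAGES 3

int pipelined_mac(const int *S, const int *C, int n, int a0,
                  int *first_result_cycle)
{
    int mul_pipe[MUL_STAGES] = {0};   /* products in flight */
    int valid[MUL_STAGES]    = {0};
    int acc = a0;
    int produced = 0;

    for (int cycle = 1; produced < n; cycle++) {
        /* adder: consumes the product leaving the multiplier pipe;
         * its result is available to itself on the very next clock */
        if (valid[MUL_STAGES - 1]) {
            acc += mul_pipe[MUL_STAGES - 1];
            produced++;
            if (produced == 1 && first_result_cycle)
                *first_result_cycle = cycle;
        }
        /* advance the multiplier pipeline one stage */
        for (int s = MUL_STAGES - 1; s > 0; s--) {
            mul_pipe[s] = mul_pipe[s - 1];
            valid[s]    = valid[s - 1];
        }
        /* issue a new multiply every cycle while operands remain */
        int i = cycle - 1;
        if (i < n) { mul_pipe[0] = S[i] * C[i]; valid[0] = 1; }
        else        valid[0] = 0;
    }
    return acc;
}
```

Running it with six operand pairs reproduces the table: the first accumulate completes at cycle 4, and each subsequent one a single cycle later.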
On 09.11.2015 19:52, George Neuner wrote:
> On Sun, 8 Nov 2015 02:19:53 +0200, Dimiter_Popoff <dp@tgi-sci.com>
> wrote:
>
>>>> So where else can data enter the pipeline except at its input?
>
> Data can be inserted into the pipeline at the input of any functional
> unit. It can also be extracted at the output of any functional unit
> even for units in the middle of the pipe.
>
> The general method is called "bypassing" or "forwarding" and many
> modern CPUs contain large bypass/forward networks to route data
> internally "just in time" to where it needs to be.
>
> This bypass/forward method is what Rick was referring to previously
> when he said:
>>> ... A pipeline does multiple steps, you don't need an input until the step
>>> that uses it. The adder for the accumulation only needs the result of the
>>> accumulation on the next clock when it starts the next add. Why would it
>>> need the result of the add at the same time as the inputs to the multiply?
>
> A MAC unit is a combination of a multiplier and an adder. A very low
> end MAC may be serial, but most are pipelined. A pipelined unit will
> have a *zero* cycle forwarding path from the adder's output back into
> the adder's input, so the result can be used on the very next cycle.
>
>>>> How can you have the result of a 6 cycle operation in less than
>>>> 6 cycles?
>
> You can't. However, the point of the pipeline is to parallelize
> operations by overlapping them.
>
> In your example:
>   S0 * C0 + A0 = A1
>   S1 * C1 + A1 = A2
>    :
>   Sn * Cn + An = A(n+1)
>
> The output of the MAC can be fed back directly into its adder to be
> available on the next cycle.
>
> So consider a typical low end 4 cycle pipelined MAC, combining a 3
> cycle multiplier with a 1 cycle adder, that can *start* a new
> operation on every cycle
>
>   cycle   operation(s)
>     1     S0 * C0 -> T0
>     2     S1 * C1 -> T1
>     3     S2 * C2 -> T2
>     4     S3 * C3 -> T3 , T0 + A0 -> A1
>     5     S4 * C4 -> T4 , T1 + A1 -> A2
>     6     S5 * C5 -> T5 , T2 + A2 -> A3
>     :
>
> The 1st result takes 4 cycles - the length of the pipeline - but once
> the pipe is primed, it begins to produce a new result on every
> succeeding cycle.
>
> In general, a pipe that can start a new operation every N cycles can
> produce a new result every N cycles.
>
> How often a pipeline can start a new operation is referred to as the
> "major cycle" of the pipe. The major cycle of the pipeline may be
> *far* less than its total length. Reducing the major cycle is the
> whole point of pipelining an operation.
>
> George
I know what a pipeline is and how it works, and what its whole point is.

Why are there two operations per pipeline stage in your example?

If you want to better understand what data dependencies I refer to, try to draw your scheme for computation of, say, a factorial.

Dimiter
On Mon, 9 Nov 2015 21:01:36 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>I know what a pipeline is and how it works, and what its whole point is.
But you don't seem to know what a bypass/forward network is or why your comments about data dependencies and pipelined operations are only *partly* correct.
>Why are there two operations per pipeline stage in your example?
Read more carefully: those aren't pipeline stages, they are clocks. The multiplier and adder operate simultaneously, but the adder is not enabled until both inputs are available. Once the pipeline is primed, the adder has inputs available on every clock and so there are 2 results produced per clock - the temporary output from the multiplier and the final accumulated output from the adder.
>If you want to better understand what data dependencies I refer to,
>try to draw your scheme for computation of say a factorial.
You are conflating the CPU's instruction decode/execute pipeline with the _separate_ functional unit pipeline of the MAC.

I am very aware of data dependencies. _You_ need to do some reading about modern CPU architectures. As this thread in particular was about MAC, you need to learn more about how a MAC unit actually is implemented. With a pipelined MAC you do not have to wait for one operation to complete before starting a dependent operation that uses the result.

Bypass/forward networks exist to mitigate pipeline stalls due to data dependencies by delivering data directly from the producer to the consumer *without* passing through the register file. These networks operate inside pipelines and sometimes even between different pipelines.

George
On 09.11.2015 22:19, George Neuner wrote:
> On Mon, 9 Nov 2015 21:01:36 +0200, Dimiter_Popoff <dp@tgi-sci.com>
> wrote:
>
>> I know what a pipeline is and how it works, and what its whole point is.
>
> But you don't seem to know what a bypass/forward network is or why
> your comments about data dependencies and pipelined operations are
> only *partly* correct.
I know forwarding may be done for some opcodes - you don't seem to know that it is neither universally applicable to every opcode, nor applied to every opcode where it would be applicable, out of practical considerations.

In the MAC case I spoke of, it has not been done simply because there is a way to do it in software by taking advantage of the sufficient number of registers, thus saving silicon area and making the entire operation _more_ efficient (by saving on the number of needed load/store operations).
>> Why are there two operations per pipeline stage in your example?
>
> Read more carefully: those aren't pipeline stages, they are clocks.
>
> The multiplier and adder operate simultaneously, but the adder is not
> enabled until both inputs are available. Once the pipeline is primed,
> the adder has inputs available on every clock and so there are 2
> results produced per clock - the temporary output from the multiplier
> and the final accumulated output from the adder.
Can you please detail your scheme with the names of the user-visible registers where S, C and A live? You have yet to understand that it is wrong.
>> If you want to better understand what data dependencies I refer to,
>> try to draw your scheme for computation of say a factorial.
>
> You are conflating the CPU's instruction decode/execute pipeline with
> the _separate_ functional unit pipeline of the MAC.
If there is a separate MAC unit, this is a DSP, to which my comments do not apply - I made that exception at the very start. Please read more carefully. Now try to draw your scheme for factorial computation using a single pipeline.
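The factorial challenge above makes the dependency-chain point concrete: every multiply needs the result of the previous one, so a multiplier with an L-cycle latency costs L cycles per step no matter how deeply it is pipelined - there is no independent work to overlap. A small sketch:

```c
/* Factorial as a serial dependency chain: each iteration's multiply
 * consumes the previous iteration's result, so pipelining the
 * multiplier improves throughput of INDEPENDENT multiplies but does
 * nothing for this loop -- each step still pays the full latency. */
#include <stdint.h>

uint64_t factorial(unsigned n)
{
    uint64_t r = 1;
    for (unsigned i = 2; i <= n; i++)
        r *= i;   /* r depends on the previous r: no overlap possible */
    return r;
}
```

This is the contrast with the MAC example earlier in the thread, where independent products could be overlapped while only the final accumulation chain was serial.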
> I am very aware of data dependencies.
Evidently not.
> ... _You_ need to do some reading
> about modern CPU architectures.
I think it is the other way around. I know what forwarding in this context is, and I know - as you seem not to - that it is the exception, not the norm. If everything could be forwarded, the pipeline would be unnecessary. (Your written scheme demonstrates that you actually think of the pipeline as some sort of FIFO, which it is not; it only bears some resemblance to one.)
> As this thread in particular was about MAC, you need to learn more
> about how a MAC unit actually is implemented. With a pipelined MAC
> you do not have to wait for one operation to complete before starting
> a dependent operation that uses the result.
I first used a pipelined MAC unit on a DSP some 15 years ago; it did 1 MAC per cycle. What makes you think I do not know this is being done all the time? I wrote many times in my previous posts that my MAC comments did NOT apply to specialized DSPs.

Many pipelined processors do not have a specialized DSP unit, yet some still have a MAC instruction - the Power architecture is a major example. I know of at least one reasonably modern core for which things work exactly as I explained they do; you need to take advantage of the multiple registers the programming model gives you to achieve the specified 2 clocks per 64-bit MAC.

And I do know one DSP core which does 1 MAC/cycle in complete detail (complete enough to have written the assembler for it, too). How many cores do _you_ know in such detail (know as in "know")?

And please, before trying to teach me lessons, make sure you know what you are talking about.

Dimiter
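The software technique alluded to above - using the register file to hide MAC latency instead of a hardware forwarding path - can be sketched in C. This is an illustrative version (function name and the choice of four accumulators are the author of this sketch's, not from the thread): several independent accumulators mean no accumulate waits on the previous one's result, and the partial sums are combined once at the end.

```c
/* Sketch: hiding MAC latency in software with multiple independent
 * accumulators.  Four separate dependency chains let a pipelined MAC
 * with up to 4 cycles of latency still retire one operation per
 * cycle; a single accumulator would stall on every iteration. */
#include <stdint.h>

int64_t dot4(const int32_t *s, const int32_t *c, int n)
{
    int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        a0 += (int64_t)s[i]     * c[i];       /* chain 0 */
        a1 += (int64_t)s[i + 1] * c[i + 1];   /* chain 1 */
        a2 += (int64_t)s[i + 2] * c[i + 2];   /* chain 2 */
        a3 += (int64_t)s[i + 3] * c[i + 3];   /* chain 3 */
    }
    for (; i < n; i++)          /* leftover elements */
        a0 += (int64_t)s[i] * c[i];
    return a0 + a1 + a2 + a3;   /* combine partial sums once */
}
```

The cost is four live accumulator registers instead of one, which is exactly the "sufficient number of registers" trade-off described earlier in the thread.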
Tim Wescott <seemywebsite@myfooter.really> writes:

> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>
>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>> Cortex M3 processor instruction execution times, namely, the
>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>> the instruction in the reference manual but nowhere are cycle times
>>>> mentioned.
>>>>
>>>> This is surreal. Every assembly language reference manual I've ever
>>>> used includes cycle counts for each instruction. Here they're nowhere
>>>> to be found.
>>>
>>> In addition to everything else that's mentioned, with today's
>>> processors you're highly constrained by pipelining & whatnot.
>>>
>>> Most of the parts that I've worked with need lots of wait states to run
>>> out of flash -- I wouldn't be surprised if the processor spends most of
>>> its time twiddling its thumbs waiting on memory.
>>
>> If you want to run fast, you either put your code in RAM, or you let the
>> processor use cache that is available on all but low end processors.
>
> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
> do not sport caches.
I was wondering about that too. Also, is RAM 0-wait and FLASH not?

A one-line instruction cache helps, but I was also wondering whether coefficients (constants) would need to be in RAM for best performance.

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On Tue, 10 Nov 2015 09:22:08 -0500, Randy Yates wrote:

> Tim Wescott <seemywebsite@myfooter.really> writes:
>
>> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>>
>>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro
>>>>> EFM32 Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used includes cycle counts for each instruction. Here they're
>>>>> nowhere to be found.
>>>>
>>>> In addition to everything else that's mentioned, with today's
>>>> processors you're highly constrained by pipelining & whatnot.
>>>>
>>>> Most of the parts that I've worked with need lots of wait states to
>>>> run out of flash -- I wouldn't be surprised if the processor spends
>>>> most of its time twiddling its thumbs waiting on memory.
>>>
>>> If you want to run fast, you either put your code in RAM, or you let
>>> the processor use cache that is available on all but low end
>>> processors.
>>
>> Unless I'm severely mistaken, most Cortex M3 processors are "low end"
>> and do not sport caches.
>
> I was wondering about that too. Also, is RAM 0-wait and FLASH not?
>
> A one-line instruction cache helps, but I was also wondering whether
> coefficients (constants) would need to be in RAM for best performance.
Flash generally needs wait states (if you're running the processor above minimum speed), and RAM can be 0-wait (if it's directly connected to the processor's instruction bus and does not have to go through the bridge).

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On Sat, 07 Nov 2015 16:47:27 +0100, Randy Yates
<yates@digitalsignallabs.com> wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
Which reference manual?
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
ARM is a bit different. What an instruction does is basically the same across the entire architecture, in this case ARMv7-M. This is documented in an Architecture Reference Manual (ARM). How long an instruction takes depends on the implementation, in this case Cortex-M3. This is documented in a Technical Reference Manual (TRM).

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
"Boudewijn Dijkstra" <sp4mtr4p.boudewijn@indes.com> writes:

> On Sat, 07 Nov 2015 16:47:27 +0100, Randy Yates
> <yates@digitalsignallabs.com> wrote:
>> Hi,
>>
>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>> Cortex M3 processor instruction execution times, namely, the
>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>> instruction in the reference manual but nowhere are cycle times
>> mentioned.
>
> Which reference manual?
I was referring to the one distributed by Silicon Labs.
>> This is surreal. Every assembly language reference manual I've ever used
>> includes cycle counts for each instruction. Here they're nowhere to be
>> found.
>
> ARM is a bit different. What an instruction does, is basically the
> same across the entire architecture, in this case ARMv7-M. This is
> documented in an Architecture Reference Manual (ARM). How long an
> instruction takes, depends on the implementation, in this case
> Cortex-M3. This is documented in a Technical Reference Manual (TRM).
Thanks for the information, Boudewijn. I was not aware of the Architecture Reference Manual.

In my opinion it would have been better for Silicon Labs to have omitted all instruction information from their manual and referred people to the ARM Technical Reference Manual, rather than listing some pieces there and other pieces in the ARM TRM.

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com