EmbeddedRelated.com
Forums

EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
On 09/11/15 08:17, rickman wrote:
> On 11/9/2015 1:19 AM, Tim Wescott wrote:
>> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>>
>>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used includes cycle counts for each instruction. Here they're nowhere
>>>>> to be found.
>>>>
>>>> In addition to everything else that's mentioned, with today's
>>>> processors you're highly constrained by pipelining & whatnot.
>>>>
>>>> Most of the parts that I've worked with need lots of wait states to run
>>>> out of flash -- I wouldn't be surprised if the processor spends most of
>>>> its time twiddling its thumbs waiting on memory.
>>>
>>> If you want to run fast, you either put your code in RAM, or you let the
>>> processor use cache that is available on all but low end processors.
>>
>> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
>> do not sport caches.
>
> According to wikipedia (not always reliable) CM3 has no cache. There's
> still RAM speedup and I forgot that most CM3 devices use a prefetch to
> make sure the CPU has instructions when they are needed.
I think it is /possible/ to have cache on a CM3, but it is certainly not common.
> Think about it. Why would they keep speeding up the CPU clock speed if
> performance was limited by the Flash alone?
There are three points here:

Cortex M devices use the Thumb2 instruction set - the aim of this is that a solid majority of instructions are 16-bit. Since these cpus are single-issue, that means you can run your cpu at an average of about 50-70% higher clock speed than the flash, assuming a 32-bit bus.

There are ways to get processors going faster than flash even without a processor instruction cache. In particular, it is common for the flash units in faster microcontrollers to have a small buffer/cache in the flash module. If this is combined with wide-access flash, say 64-bit wide, you can easily get streams of instructions at cpu speed (but with a penalty for branches and calls).

And here is the main point - manufacturers /don't/ keep speeding up CPU clock speed on the CM3. Most serious manufacturers who make fast CM microcontrollers have moved to the CM4 - some never bothered with the CM3 in the first place. They put caches (and single-precision floating point) on their faster devices.

So yes, CM3 devices /are/ low end - they are now found either on older, legacy parts (in this field, that means more than a couple of years old), or as microcontrollers in integrated chips where the cpu plays a minor role (such as a high-end ADC that happens to have a cpu integrated).
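The "put your code in RAM" option mentioned above can be sketched in C. This is a minimal, hedged example assuming a GCC-style toolchain: the section name ".ramfunc" is an assumption - the actual section name, and whether the startup code copies it from flash to RAM, depend entirely on the vendor's linker script.

```c
/* Sketch: placing a hot loop in RAM so it is not subject to flash
 * wait states.  The ".ramfunc" section name is an ASSUMPTION -- check
 * your toolchain's linker script for the real name and for whether
 * startup code copies the section from flash to RAM. */
#include <stdint.h>

__attribute__((section(".ramfunc"), noinline))
int32_t mac_loop(const int16_t *s, const int16_t *c, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        /* a good compiler emits an MLA (multiply-accumulate) here */
        acc += (int32_t)s[i] * c[i];
    }
    return acc;
}
```

Whether this actually helps depends on the part: on devices where the flash prefetch buffer already keeps up with the core, moving code to RAM may gain nothing.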
>> At least on the ST parts, not all of the RAM is connected to the
>> processor's instruction bus, so you don't get as much speedup as you'd
>> think. Some do have a magic memory address range that's dual-ported to
>> both buses.
On Sun, 8 Nov 2015 02:19:53 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>>> So where else can data enter the pipeline except at its input?
Data can be inserted into the pipeline at the input of any functional unit. It can also be extracted at the output of any functional unit, even for units in the middle of the pipe.

The general method is called "bypassing" or "forwarding", and many modern CPUs contain large bypass/forward networks to route data internally "just in time" to where it needs to be.

This bypass/forward method is what Rick was referring to previously when he said:
>> ... A pipeline does multiple steps, you don't need an input until the step
>> that uses it. The adder for the accumulation only needs the result of the
>> accumulation on the next clock when it starts the next add. Why would it
>> need the result of the add at the same time as the inputs to the multiply?
A MAC unit is a combination of a multiplier and an adder. A very low end MAC may be serial, but most are pipelined. A pipelined unit will have a *zero* cycle forwarding path from the adder's output back into the adder's input, so the result can be used on the very next cycle.
>>> How can you have the result of a 6 cycle operation in less than >>> 6 cycles?
You can't. However, the point of the pipeline is to parallelize operations by overlapping them.

In your example:

  S0 * C0 + A0 = A1
  S1 * C1 + A1 = A2
   :
  Sn * Cn + An = A(n+1)

The output of the MAC can be fed back directly into its adder to be available on the next cycle. So consider a typical low end 4 cycle pipelined MAC, combining a 3 cycle multiplier with a 1 cycle adder, that can *start* a new operation on every cycle:

  cycle   operation(s)
    1     S0 * C0 -> T0
    2     S1 * C1 -> T1
    3     S2 * C2 -> T2
    4     S3 * C3 -> T3 , T0 + A0 -> A1
    5     S4 * C4 -> T4 , T1 + A1 -> A2
    6     S5 * C5 -> T5 , T2 + A2 -> A3
    :

The 1st result takes 4 cycles - the length of the pipeline - but once the pipe is primed, it produces a new result on every succeeding cycle.

In general, a pipe that can start a new operation every N cycles can produce a new result every N cycles. How often a pipeline can start a new operation is referred to as the "major cycle" of the pipe. The major cycle of the pipeline may be *far* less than its total length. Reducing the major cycle is the whole point of pipelining an operation.

George
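The timing table above can be modeled cycle by cycle. Here is a toy host-side C model of the same 4-cycle MAC (3-cycle multiplier pipeline feeding a 1-cycle adder, one new operation issued per clock); it is an illustration of the timing argument, not real hardware.

```c
/* Toy cycle-accurate model of a 4-cycle pipelined MAC: a 3-stage
 * multiplier pipe feeding a 1-cycle adder whose output is forwarded
 * straight back to its own input.  Shows the first result appearing
 * at cycle 4, then one result per cycle thereafter. */
#define MUL_STAGES 3

int pipelined_mac(const int *S, const int *C, int n, int a0,
                  int *first_result_cycle)
{
    int mul_pipe[MUL_STAGES] = {0};   /* products in flight */
    int valid[MUL_STAGES]    = {0};
    int acc = a0;
    int produced = 0;

    for (int cycle = 1; produced < n; cycle++) {
        /* adder: consumes the product leaving the multiplier pipe;
         * its result is available to itself on the very next clock */
        if (valid[MUL_STAGES - 1]) {
            acc += mul_pipe[MUL_STAGES - 1];
            produced++;
            if (produced == 1 && first_result_cycle)
                *first_result_cycle = cycle;
        }
        /* advance the multiplier pipeline one stage */
        for (int s = MUL_STAGES - 1; s > 0; s--) {
            mul_pipe[s] = mul_pipe[s - 1];
            valid[s]    = valid[s - 1];
        }
        /* issue a new multiply every cycle while operands remain */
        int i = cycle - 1;
        if (i < n) { mul_pipe[0] = S[i] * C[i]; valid[0] = 1; }
        else        valid[0] = 0;
    }
    return acc;
}
```

Running it with six operand pairs reproduces the table: the first accumulate completes at cycle 4, and each subsequent one a single cycle later.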
On 09.11.2015 19:52, George Neuner wrote:
> On Sun, 8 Nov 2015 02:19:53 +0200, Dimiter_Popoff <dp@tgi-sci.com>
> wrote:
>
>>>> So where else can data enter the pipeline except at its input?
>
> Data can be inserted into the pipeline at the input of any functional
> unit. It can also be extracted at the output of any functional unit
> even for units in the middle of the pipe.
>
> The general method is called "bypassing" or "forwarding" and many
> modern CPUs contain large bypass/forward networks to route data
> internally "just in time" to where it needs to be.
>
> This bypass/forward method is what Rick was referring to previously
> when he said:
>>> ... A pipeline does multiple steps, you don't need an input until the step
>>> that uses it. The adder for the accumulation only needs the result of the
>>> accumulation on the next clock when it starts the next add. Why would it
>>> need the result of the add at the same time as the inputs to the multiply?
>
> A MAC unit is a combination of a multiplier and an adder. A very low
> end MAC may be serial, but most are pipelined. A pipelined unit will
> have a *zero* cycle forwarding path from the adder's output back into
> the adder's input, so the result can be used on the very next cycle.
>
>>>> How can you have the result of a 6 cycle operation in less than
>>>> 6 cycles?
>
> You can't. However, the point of the pipeline is to parallelize
> operations by overlapping them.
>
> In your example:
>   S0 * C0 + A0 = A1
>   S1 * C1 + A1 = A2
>    :
>   Sn * Cn + An = A(n+1)
>
> The output of the MAC can be fed back directly into its adder to be
> available on the next cycle.
>
> So consider a typical low end 4 cycle pipelined MAC, combining a 3
> cycle multiplier with a 1 cycle adder, that can *start* a new
> operation on every cycle
>
>   cycle   operation(s)
>     1     S0 * C0 -> T0
>     2     S1 * C1 -> T1
>     3     S2 * C2 -> T2
>     4     S3 * C3 -> T3 , T0 + A0 -> A1
>     5     S4 * C4 -> T4 , T1 + A1 -> A2
>     6     S5 * C5 -> T5 , T2 + A2 -> A3
>     :
>
> The 1st result takes 4 cycles - the length of the pipeline - but once
> the pipe is primed, it begins to produce a new result on every
> succeeding cycle.
>
> In general, a pipe that can start a new operation every N cycles can
> produce a new result every N cycles.
>
> How often a pipeline can start a new operation is referred to as the
> "major cycle" of the pipe. The major cycle of the pipeline may be
> *far* less than its total length. Reducing the major cycle is the
> whole point of pipelining an operation.
>
> George
I know what a pipeline is and how it works, and what its whole point is.

Why are there two operations per pipeline stage in your example?

If you want to better understand what data dependencies I refer to, try to draw your scheme for computation of, say, a factorial.

Dimiter
On Mon, 9 Nov 2015 21:01:36 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>I know what a pipeline is and how it works, and what its whole point is.
But you don't seem to know what a bypass/forward network is or why your comments about data dependencies and pipelined operations are only *partly* correct.
>Why are there two operations per pipeline stage in your example?
Read more carefully: those aren't pipeline stages, they are clocks. The multiplier and adder operate simultaneously, but the adder is not enabled until both inputs are available. Once the pipeline is primed, the adder has inputs available on every clock and so there are 2 results produced per clock - the temporary output from the multiplier and the final accumulated output from the adder.
>If you want to better understand what data dependencies I refer to,
>try to draw your scheme for computation of say a factorial.
You are conflating the CPU's instruction decode/execute pipeline with the _separate_ functional unit pipeline of the MAC.

I am very aware of data dependencies. _You_ need to do some reading about modern CPU architectures. As this thread in particular was about MAC, you need to learn more about how a MAC unit actually is implemented. With a pipelined MAC you do not have to wait for one operation to complete before starting a dependent operation that uses the result.

Bypass/forward networks exist to mitigate pipeline stalls due to data dependencies by delivering data directly from the producer to the consumer *without* passing through the register file. These networks operate inside pipelines and sometimes even between different pipelines.

George
On 09.11.2015 22:19, George Neuner wrote:
> On Mon, 9 Nov 2015 21:01:36 +0200, Dimiter_Popoff <dp@tgi-sci.com>
> wrote:
>
>> I know what a pipeline is and how it works, and what its whole point is.
>
> But you don't seem to know what a bypass/forward network is or why
> your comments about data dependencies and pipelined operations are
> only *partly* correct.
I know forwarding may be done for some opcodes - you don't seem to know that it is neither universally applicable to every opcode, nor applied to every opcode where it would be applicable, out of practical considerations.

In the MAC case I spoke of, it has not been done simply because there is a way to do it in software by taking advantage of the sufficient number of registers, thus saving silicon area and making the entire operation _more_ efficient (by saving on the number of needed load/store operations).
>> Why are there two operations per pipeline stage in your example?
>
> Read more carefully: those aren't pipeline stages, they are clocks.
>
> The multiplier and adder operate simultaneously, but the adder is not
> enabled until both inputs are available. Once the pipeline is primed,
> the adder has inputs available on every clock and so there are 2
> results produced per clock - the temporary output from the multiplier
> and the final accumulated output from the adder.
Can you please detail your scheme with the names of the user-visible registers where S, C and A live? You have yet to understand that it is wrong.
>> If you want to better understand what data dependencies I refer to,
>> try to draw your scheme for computation of say a factorial.
>
> You are conflating the CPU's instruction decode/execute pipeline with
> the _separate_ functional unit pipeline of the MAC.
If there is a separate MAC unit, this is a DSP, to which my comments do not apply - I made that exception at the very start. Please read more carefully. Now try to draw your scheme for factorial computation using a single pipeline.
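The factorial challenge above makes the dependency-chain point concrete: every multiply needs the result of the previous one, so a multiplier with an L-cycle latency costs L cycles per step no matter how deeply it is pipelined - there is no independent work to overlap. A small sketch:

```c
/* Factorial as a serial dependency chain: each iteration's multiply
 * consumes the previous iteration's result, so pipelining the
 * multiplier improves throughput of INDEPENDENT multiplies but does
 * nothing for this loop -- each step still pays the full latency. */
#include <stdint.h>

uint64_t factorial(unsigned n)
{
    uint64_t r = 1;
    for (unsigned i = 2; i <= n; i++)
        r *= i;   /* r depends on the previous r: no overlap possible */
    return r;
}
```

This is the contrast with the MAC example earlier in the thread, where independent products could be overlapped while only the final accumulation chain was serial.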
> I am very aware of data dependencies.
Evidently not.
> ... _You_ need to do some reading
> about modern CPU architectures.
I think it is the other way around. I know what forwarding in this context is, and I know - as you seem not to - that it is the exception, not the norm. If everything could be forwarded, the pipeline would be unnecessary. (Your written scheme demonstrates that you actually think of the pipeline as some sort of FIFO, which it is not; it only bears some resemblance to one.)
> As this thread in particular was about MAC, you need to learn more
> about how a MAC unit actually is implemented. With a pipelined MAC
> you do not have to wait for one operation to complete before starting
> a dependent operation that uses the result.
I first used a pipelined MAC unit on a DSP some 15 years ago; it did 1 MAC per cycle. What makes you think I do not know this is being done all the time? I wrote many times in my previous posts that my MAC comments did NOT apply to specialized DSPs.

Many pipelined processors do not have a specialized DSP unit, yet some still have a MAC instruction - the Power architecture is a major example. I know of at least one reasonably modern core for which things work exactly as I explained they do; you need to take advantage of the multiple registers the programming model gives you to achieve the specified 2 clocks per 64-bit MAC.

And I do know one DSP core which does 1 MAC/cycle in complete detail (complete enough to have written the assembler for it, too). How many cores do _you_ know in such detail (know as in "know")?

And please, before trying to teach me lessons, make sure you know what you are talking about.

Dimiter
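The software technique alluded to above - using the register file to hide MAC latency instead of a hardware forwarding path - can be sketched in C. This is an illustrative version (function name and the choice of four accumulators are the author of this sketch's, not from the thread): several independent accumulators mean no accumulate waits on the previous one's result, and the partial sums are combined once at the end.

```c
/* Sketch: hiding MAC latency in software with multiple independent
 * accumulators.  Four separate dependency chains let a pipelined MAC
 * with up to 4 cycles of latency still retire one operation per
 * cycle; a single accumulator would stall on every iteration. */
#include <stdint.h>

int64_t dot4(const int32_t *s, const int32_t *c, int n)
{
    int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        a0 += (int64_t)s[i]     * c[i];       /* chain 0 */
        a1 += (int64_t)s[i + 1] * c[i + 1];   /* chain 1 */
        a2 += (int64_t)s[i + 2] * c[i + 2];   /* chain 2 */
        a3 += (int64_t)s[i + 3] * c[i + 3];   /* chain 3 */
    }
    for (; i < n; i++)          /* leftover elements */
        a0 += (int64_t)s[i] * c[i];
    return a0 + a1 + a2 + a3;   /* combine partial sums once */
}
```

The cost is four live accumulator registers instead of one, which is exactly the "sufficient number of registers" trade-off described earlier in the thread.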
Tim Wescott <seemywebsite@myfooter.really> writes:

> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>
>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>> Cortex M3 processor instruction execution times, namely, the
>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>> the instruction in the reference manual but nowhere are cycle times
>>>> mentioned.
>>>>
>>>> This is surreal. Every assembly language reference manual I've ever
>>>> used includes cycle counts for each instruction. Here they're nowhere
>>>> to be found.
>>>
>>> In addition to everything else that's mentioned, with today's
>>> processors you're highly constrained by pipelining & whatnot.
>>>
>>> Most of the parts that I've worked with need lots of wait states to run
>>> out of flash -- I wouldn't be surprised if the processor spends most of
>>> its time twiddling its thumbs waiting on memory.
>>
>> If you want to run fast, you either put your code in RAM, or you let the
>> processor use cache that is available on all but low end processors.
>
> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
> do not sport caches.
I was wondering about that too. Also, is RAM 0-wait and FLASH not?

A one-line instruction cache helps, but I was also wondering whether coefficients (constants) would need to be in RAM for best performance.

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On Tue, 10 Nov 2015 09:22:08 -0500, Randy Yates wrote:

> Tim Wescott <seemywebsite@myfooter.really> writes:
>
>> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
>>
>>> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>>>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro
>>>>> EFM32 Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>> the instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>> used includes cycle counts for each instruction. Here they're
>>>>> nowhere to be found.
>>>>
>>>> In addition to everything else that's mentioned, with today's
>>>> processors you're highly constrained by pipelining & whatnot.
>>>>
>>>> Most of the parts that I've worked with need lots of wait states to
>>>> run out of flash -- I wouldn't be surprised if the processor spends
>>>> most of its time twiddling its thumbs waiting on memory.
>>>
>>> If you want to run fast, you either put your code in RAM, or you let
>>> the processor use cache that is available on all but low end
>>> processors.
>>
>> Unless I'm severely mistaken, most Cortex M3 processors are "low end"
>> and do not sport caches.
>
> I was wondering about that too. Also, is RAM 0-wait and FLASH not?
>
> A one-line instruction cache helps, but I was also wondering whether
> coefficients (constants) would need to be in RAM for best performance.
Flash generally needs wait states (if you're running the processor above minimum speed), and RAM can be 0-wait (if it's directly connected to the processor's instruction bus and does not have to go through the bridge).

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On Sat, 07 Nov 2015 16:47:27 +0100, Randy Yates
<yates@digitalsignallabs.com> wrote:
> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
Which reference manual?
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
ARM is a bit different. What an instruction does is basically the same across the entire architecture, in this case ARMv7-M. This is documented in an Architecture Reference Manual (ARM). How long an instruction takes depends on the implementation, in this case Cortex-M3. This is documented in a Technical Reference Manual (TRM).

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
"Boudewijn Dijkstra" <sp4mtr4p.boudewijn@indes.com> writes:

> On Sat, 07 Nov 2015 16:47:27 +0100, Randy Yates
> <yates@digitalsignallabs.com> wrote:
>> Hi,
>>
>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>> Cortex M3 processor instruction execution times, namely, the
>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>> instruction in the reference manual but nowhere are cycle times
>> mentioned.
>
> Which reference manual?
I was referring to the one distributed by Silicon Labs.
>> This is surreal. Every assembly language reference manual I've ever used
>> includes cycle counts for each instruction. Here they're nowhere to be
>> found.
>
> ARM is a bit different. What an instruction does, is basically the
> same across the entire architecture, in this case ARMv7-M. This is
> documented in an Architecture Reference Manual (ARM). How long an
> instruction takes, depends on the implementation, in this case
> Cortex-M3. This is documented in a Technical Reference Manual (TRM).
Thanks for the information, Boudewijn. I was not aware of the Architecture Reference Manual.

In my opinion it would have been better for Silicon Labs to have omitted all instruction information from their manual and referred people to the ARM Technical Reference Manual, rather than listing some pieces there and other pieces in the ARM TRM.

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com