EmbeddedRelated.com
Forums

EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
On 11/8/15 6:12 AM, Dimiter_Popoff wrote:
> Well, I'll try to explain it one more time for you. These
> things may be well understood but you are clearly not among
> those who have understood them so here we go again.
>
> To repeat the example I already gave in another post,
> to do a MAC we need to multiply a sample (S) by a coefficient (C)
> and add the product to the accumulated result (A).
> So we have
> S0*C0+A0=A1,
> S1*C1+A1=A2 and so on.
>
> Let us say we have a 6 stage pipeline (just to persist with the
> 6 figure I started my examples with).
> At the first line we have all 3 inputs - S0, C0 and A0 readily
> available and they enter the pipeline.
> Next clock cycle we need S1, C1 - which let us say we have or
> can fetch fast enough and... oops, A1. But A1 will be at the OUTPUT
> of the pipeline 6 cycles later so we will have to wait until then.
>
> I don't think this can be explained any clearer. Obviously the
> MAC operation can be substituted by any other one which needs the
> output of the pipeline as an input.
>
> This is plain arithmetic and is valid for any pipelined processor.
> Those which manage 1 or more MACs per cycle have special hardware
> to do so - doing what my source example does this or that way,
> the registers they use would simply be hidden from the programming
> model.
>
> Dimiter
The issue is the assumption that the MAC instruction can't be scheduled to start until all the data it needs anywhere during its execution is available as completed.

The scheduler knows that the addend isn't needed until cycle 5, and can see that it will be available then, so it can start the MAC now and get that data later.

Yes, this makes the scheduler more complicated, but it can be (and is) done to keep things running fast. (It might not be done in all cases, but would commonly be done for this sort of case by any processor designed for efficient DSP.)
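[The late-operand argument can be put in the same toy-model terms: if the accumulator is not read until a late stage and the previous result is forwarded there as soon as it is produced, a dependent MAC can issue every cycle. The stage numbering below is a hedged sketch, not any particular core's design.]

```c
#include <assert.h>

/* If the accumulator operand is read at stage `read_stage` (1-based)
 * of a `depth`-stage pipeline, and the previous result is forwarded
 * to that stage as soon as it is produced, a dependent MAC stalls
 * only when the result is not yet ready as it reaches read_stage.
 * With one-cycle issue spacing, the follower reaches read_stage one
 * cycle behind the producer's final stage. */
static long mac_cycles_late_read(long n_macs, long depth, long read_stage)
{
    long stall = (depth > read_stage + 1) ? depth - (read_stage + 1) : 0;
    return depth + (n_macs - 1) * (1 + stall);
}
```

With a read at stage 5 of 6, 1000 dependent MACs cost about 1005 cycles instead of 6000; with a naive read at stage 1 the model degrades back toward the stalled case.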
On 08.11.2015 г. 15:18, Richard Damon wrote:
> [...]
>
> The issue is the assumption that the MAC instruction can't be scheduled
> to start until all the data it needs anywhere during the execution is
> available as completed.
>
> The scheduler knows that the addend isn't needed until cycle 5, and can
> see that it will be available then, so it can start the MAC now and get
> that data later.
>
> Yes, this makes the scheduler more complicated, but it can be (and is)
> done to keep things running fast.
Of course, like I said, DSPs are dealing with this - have been for decades. But it takes some specific opcode(s); for MAC between registers (like on "normal" load/store processors) it is impractical to track all the opcodes currently in the pipeline in order to make this sort of decision - which is the main reason why load/store machines are designed with more registers. The idea behind RISC is to leave more work to be done by software and thus save silicon area.

ARM apparently started as a cheaper/lower-power tradeoff; it has worked quite well for them, is working at the moment really. The too-few-registers impediment comes into effect only when horsepower begins to matter - and I think they have addressed this in their 64-bit model (I am not familiar with it though).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
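[The standard software-side answer on a register-rich load/store machine is exactly this: split the sum across several independent accumulators so consecutive MACs never depend on each other, then combine at the end. A hedged sketch in plain C; `dot4` is an illustrative name, and actual register allocation is up to the compiler.]

```c
#include <assert.h>
#include <stddef.h>

/* Four independent accumulators: consecutive multiply-accumulates no
 * longer form one long dependency chain, so a pipelined MAC unit can
 * be kept busy without any hardware forwarding of the accumulator. */
static long dot4(const int *s, const int *c, size_t n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (long)s[i]     * c[i];
        a1 += (long)s[i + 1] * c[i + 1];
        a2 += (long)s[i + 2] * c[i + 2];
        a3 += (long)s[i + 3] * c[i + 3];
    }
    for (; i < n; i++)              /* remaining tail elements */
        a0 += (long)s[i] * c[i];
    return a0 + a1 + a2 + a3;
}
```

The cost is exactly what is described above: four live accumulators instead of one, which is why register count matters.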
Den söndag 8 november 2015 kl. 14:36:22 UTC+1 skrev dp:
> [...]
>
> Of course like I said DSPs are dealing with this - have been for
> decades.
> But it takes some specific opcode(s), for MAC between registers
> (like on "normal" load/store processors) it is impractical
> to track all the opcodes currently in the pipeline in order to
> make this sort of decisions - which is the main reason why
> load/store machines are designed with more registers, the idea
> behind RISC is to leave more work to be done by software and thus
> save silicon area.
> [...]
When designing a pipelined MAC, making it such that the accumulation operand has to be internally delayed a few cycles before it is used is... strange. Normally, it would be designed in such a way that back-to-back MACs can be issued, and the accumulation operand is just forwarded to the next instruction internally. Creating logic to catch the most recent contents of a "register" would be trivial.

I have a hard time seeing how anyone would skip such an easy optimization, with a big payoff, especially for a register-constrained architecture. But there might still be such implementations, of course.

BR
Jakob
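[The "trivial logic" described here is essentially a bypass mux: when an instruction reads a register whose newest value is still in flight, the mux selects the in-flight result instead of the stale register-file entry. A minimal behavioral sketch of that selection (illustrative only; real cores track one such entry per pipeline stage):]

```c
#include <assert.h>

/* Register-file read with a single forwarding path: if the most
 * recent uncommitted result targets the register being read, return
 * that result instead of the stale committed value. */
typedef struct {
    int regfile[16];
    int inflight_reg;   /* destination of newest in-flight op, -1 if none */
    int inflight_val;
} core_t;

static int read_operand(const core_t *c, int reg)
{
    if (c->inflight_reg == reg)
        return c->inflight_val;     /* bypass mux selects in-flight result */
    return c->regfile[reg];         /* otherwise the committed value */
}
```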
On Sun, 08 Nov 2015 02:19:53 +0200, Dimiter_Popoff wrote:

> Under DSP trickeries I mean doing extra hardware to hide the effect of
> the pipeline length for MAC instructions which I explained above from
> the end user.
"A complete conspiracy is a law of nature." -- Henri Poincaré (or words, probably in French, to that effect.)
On 11/8/2015 8:50 AM, jakbru@gmail.com wrote:
> [...]
>
> When designing a pipelined MAC, making it such that the accumulation
> operand has to be internally delayed a few cycles before it is used
> is... strange. Normally, it would be designed in such a way that
> back-to-back MACs can be issued, and the accumulation operand is just
> forwarded to the next instruction internally. Creating logic to catch
> the most recent contents for a "register" would be trivial. I have a
> hard time seeing how anyone would skip such an easy optimization, with
> a big payoff, especially for a register constrained architecture. But,
> there might still be such implementations of course.
We all know this, but Dimiter seems to want to hold onto the idea that the CM3 is constructed the way he imagines it, without considering the possibility that it is different, and refuses to do any work to verify the facts.

At this point I consider his posts on the topic to be trollish and without value.

--
Rick
On 08.11.2015 г. 22:15, rickman wrote:
> [...]
>
> We all know this, but Dimiter seems to want to hold onto the idea that
> the CM3 is constructed the way he imagines it without considering the
> possibility that it is different and refuses to do any work to verify
> the facts.
>
> At this point I consider his posts on the topic to be trollish and
> without value.
You suggesting to look something up for days is of huge value, sure. So what did you look up?

BTW, did you eventually understand my explanation? I would have expected someone doing logic designs to be a lot quicker in doing so.

Dimiter
On 11/8/2015 3:52 PM, Dimiter_Popoff wrote:
> [...]
>
> You suggesting to look something up for days is of huge value, sure.
> So what did you look up.
>
> BTW, did you eventually understand my explanation? I would have expected
> someone doing logic designs to be a lot quicker in doing so.
I've already told you I understand your explanation. The issue is not the logic of your idea; the issue is whether your ideas apply to the CM3 or not. As many here have pointed out, the limitations you impose are in no way an inherent part of a pipelined processor. I have already said that pipelined designs typically have exactly the logic needed for a pipeline to be fully useful, which you seem to feel is DSP "trickery". Rather, this is just intelligent design.

I think I'm done with this conversation. It is just going in circles and getting nowhere.

--
Rick
On 09.11.2015 г. 00:15, rickman wrote:
> [...]
>
> I've already told you I understand your explanation. The issue is not
> the logic of your idea, the issue is whether your ideas apply to the CM3
> or not. As many here have pointed out, the limitations you impose are
> in no way an inherent part of a pipelined processor. I have already
> said that pipelined designs typically have exactly the logic needed for
> a pipeline to be fully useful which you seem to feel is DSP "trickery".
> Rather this is just intelligent design.
>
> I think I'm done with this conversation. It is just going in circles
> and getting nowhere.
OK, no point continuing indeed. You have your generic assumptions against my experience plus my numeric explanation - you are free to stick to your beliefs of course.
On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:

> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>
>>> Hi,
>>>
>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>> Cortex M3 processor instruction execution times, namely, the
>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>> the instruction in the reference manual but nowhere are cycle times
>>> mentioned.
>>>
>>> This is surreal. Every assembly language reference manual I've ever
>>> used includes cycle counts for each instruction. Here they're nowhere
>>> to be found.
>>
>> In addition to everything else that's mentioned, with today's
>> processors you're highly constrained by pipelining & whatnot.
>>
>> Most of the parts that I've worked with need lots of wait states to run
>> out of flash -- I wouldn't be surprised if the processor spends most of
>> it's time twiddling it's thumbs waiting on memory.
>
> If you want to run fast, you either put your code in RAM, or you let the
> processor use cache that is available on all but low end processors.
Unless I'm severely mistaken, most Cortex M3 processors are "low end" and do not sport caches.

At least on the ST parts, not all of the RAM is connected to the processor's instruction bus, so you don't get as much speedup as you'd think. Some do have a magic memory address range that's dual-ported to both buses.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 11/9/2015 1:19 AM, Tim Wescott wrote:
> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
> [...]
>> If you want to run fast, you either put your code in RAM, or you let the
>> processor use cache that is available on all but low end processors.
>
> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
> do not sport caches.
According to Wikipedia (not always reliable), the CM3 has no cache. There's still the RAM speedup, and I forgot that most CM3 devices use a prefetch to make sure the CPU has instructions when they are needed.

Think about it. Why would they keep speeding up the CPU clock if performance were limited by the Flash alone?
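[The prefetch argument can be quantified with a back-of-the-envelope model: with W flash wait states and a prefetch buffer that hides them for sequential fetches, the stall cost is paid only on the fraction of fetches that break the sequence (branches). The numbers below are illustrative, not measurements of any particular part.]

```c
#include <assert.h>

/* Average cycles per instruction fetch, scaled by 1000 to stay in
 * integer arithmetic.  Sequential fetches hit the prefetch buffer
 * (no stall); non-sequential fetches pay `wait_states` stall cycles.
 * branch_per_mille = non-sequential fetches per 1000 fetches. */
static long fetch_cpi_x1000(long wait_states, long branch_per_mille)
{
    return 1000 + branch_per_mille * wait_states;
}
```

With 3 wait states and 10% non-sequential fetches the model gives about 1.3 cycles per fetch; with no prefetch at all (every fetch paying the penalty) it gives 4.0, which is why the flash alone does not cap the useful clock rate.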
-- Rick