EmbeddedRelated.com
Forums

EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
On 11/8/15 6:12 AM, Dimiter_Popoff wrote:
> Well, I'll try to explain it one more time for you. These
> things may be well understood but you are clearly not among
> those who have understood them so here we go again.
>
> To repeat the example I already gave in another post,
> to do a MAC we need to multiply a sample (S) by a coefficient (C)
> and add the product to the accumulated result (A).
> So we have
> S0*C0+A0=A1,
> S1*C1+A1=A2 and so on.
>
> Let us say we have a 6 stage pipeline (just to persist with the
> 6 figure I started my examples with).
> At the first line we have all 3 inputs - S0, C0 and A0 readily
> available and they enter the pipeline.
> Next clock cycle we need S1, C1 - which let us say we have or
> can fetch fast enough and... oops, A1. But A1 will be at the OUTPUT
> of the pipeline 6 cycles later so we will have to wait until then.
>
> I don't think this can be explained any clearer. Obviously the
> MAC operation can be substituted by any other one which needs the
> output of the pipeline as an input.
>
> This is plain arithmetic and is valid for any pipelined processor.
> Those which manage 1 or more MACs per cycle have special hardware
> to do so - doing what my source example does this or that way,
> the registers they use would simply be hidden from the programming
> model.
>
> Dimiter
The issue is the assumption that the MAC instruction can't be scheduled to start until all the data it needs anywhere during its execution is available as completed.

The scheduler knows that the addend isn't needed until cycle 5, and can see that it will be available then, so it can start the MAC now and get that data later.

Yes, this makes the scheduler more complicated, but it can be (and is) done to keep things running fast. (It might not be done in all cases, but would commonly be done for this sort of case by any processor designed for efficient DSP.)
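[The late-operand argument can be put in the same toy-model terms: if the accumulator is not read until a late stage and the previous result is forwarded there as soon as it is produced, a dependent MAC can issue every cycle. The stage numbering below is a hedged sketch, not any particular core's design.]

```c
#include <assert.h>

/* If the accumulator operand is read at stage `read_stage` (1-based)
 * of a `depth`-stage pipeline, and the previous result is forwarded
 * to that stage as soon as it is produced, a dependent MAC stalls
 * only when the result is not yet ready as it reaches read_stage.
 * With one-cycle issue spacing, the follower reaches read_stage one
 * cycle behind the producer's final stage. */
static long mac_cycles_late_read(long n_macs, long depth, long read_stage)
{
    long stall = (depth > read_stage + 1) ? depth - (read_stage + 1) : 0;
    return depth + (n_macs - 1) * (1 + stall);
}
```

With a read at stage 5 of 6, 1000 dependent MACs cost about 1005 cycles instead of 6000; with a naive read at stage 1 the model degrades back toward the stalled case.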
On 08.11.2015 г. 15:18, Richard Damon wrote:
> [...]
>
> The issue is the assumption that the MAC instruction can't be scheduled
> to start until all the data it needs anywhere during the execution is
> available as completed.
>
> The scheduler knows that the addend isn't needed until cycle 5, and can
> see that it will be available then, so it can start the MAC now and get
> that data later.
>
> Yes, this makes the scheduler more complicated, but it can be (and is)
> done to keep things running fast.
Of course, like I said, DSPs are dealing with this - have been for decades. But it takes some specific opcode(s); for MAC between registers (like on "normal" load/store processors) it is impractical to track all the opcodes currently in the pipeline in order to make this sort of decision - which is the main reason why load/store machines are designed with more registers. The idea behind RISC is to leave more work to be done by software and thus save silicon area.

ARM apparently started as a cheaper/lower-power tradeoff; it has worked quite well for them, is working at the moment really. The too-few-registers impediment comes into effect only when horsepower begins to matter - and I think they have addressed this in their 64-bit model (I am not familiar with it though).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
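[The standard software-side answer on a register-rich load/store machine is exactly this: split the sum across several independent accumulators so consecutive MACs never depend on each other, then combine at the end. A hedged sketch in plain C; `dot4` is an illustrative name, and actual register allocation is up to the compiler.]

```c
#include <assert.h>
#include <stddef.h>

/* Four independent accumulators: consecutive multiply-accumulates no
 * longer form one long dependency chain, so a pipelined MAC unit can
 * be kept busy without any hardware forwarding of the accumulator. */
static long dot4(const int *s, const int *c, size_t n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (long)s[i]     * c[i];
        a1 += (long)s[i + 1] * c[i + 1];
        a2 += (long)s[i + 2] * c[i + 2];
        a3 += (long)s[i + 3] * c[i + 3];
    }
    for (; i < n; i++)              /* remaining tail elements */
        a0 += (long)s[i] * c[i];
    return a0 + a1 + a2 + a3;
}
```

The cost is exactly what is described above: four live accumulators instead of one, which is why register count matters.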
Den söndag 8 november 2015 kl. 14:36:22 UTC+1 skrev dp:
> [...]
>
> Of course like I said DSPs are dealing with this - have been for
> decades.
> But it takes some specific opcode(s), for MAC between registers
> (like on "normal" load/store processors) it is impractical
> to track all the opcodes currently in the pipeline in order to
> make this sort of decisions - which is the main reason why
> load/store machines are designed with more registers, the idea
> behind RISC is to leave more work to be done by software and thus
> save silicon area.
> [...]
When designing a pipelined MAC, making it such that the accumulation operand has to be internally delayed a few cycles before it is used is... strange. Normally, it would be designed in such a way that back-to-back MACs can be issued, and the accumulation operand is just forwarded to the next instruction internally. Creating logic to catch the most recent contents of a "register" would be trivial.

I have a hard time seeing how anyone would skip such an easy optimization, with a big payoff, especially for a register-constrained architecture. But there might still be such implementations, of course.

BR
Jakob
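[The "trivial logic" described here is essentially a bypass mux: when an instruction reads a register whose newest value is still in flight, the mux selects the in-flight result instead of the stale register-file entry. A minimal behavioral sketch of that selection (illustrative only; real cores track one such entry per pipeline stage):]

```c
#include <assert.h>

/* Register-file read with a single forwarding path: if the most
 * recent uncommitted result targets the register being read, return
 * that result instead of the stale committed value. */
typedef struct {
    int regfile[16];
    int inflight_reg;   /* destination of newest in-flight op, -1 if none */
    int inflight_val;
} core_t;

static int read_operand(const core_t *c, int reg)
{
    if (c->inflight_reg == reg)
        return c->inflight_val;     /* bypass mux selects in-flight result */
    return c->regfile[reg];         /* otherwise the committed value */
}
```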
On Sun, 08 Nov 2015 02:19:53 +0200, Dimiter_Popoff wrote:

> Under DSP trickeries I mean doing extra hardware to hide the effect of
> the pipeline length for MAC instructions which I explained above from
> the end user.
"A complete conspiracy is a law of nature." -- Henri Poincaré (or words, probably in French, to that effect.)
On 11/8/2015 8:50 AM, jakbru@gmail.com wrote:
> [...]
>
> When designing a pipelined MAC, making it such that the accumulation
> operand has to be internally delayed a few cycles before it is used
> is... strange. Normally, it would be designed in such a way that
> back-to-back MACs can be issued, and the accumulation operand is just
> forwarded to the next instruction internally. Creating logic to catch
> the most recent contents for a "register" would be trivial. I have a
> hard time seeing how anyone would skip such an easy optimization, with
> a big payoff, especially for a register constrained architecture. But,
> there might still be such implementations of course.
We all know this, but Dimiter seems to want to hold onto the idea that the CM3 is constructed the way he imagines it, without considering the possibility that it is different, and refuses to do any work to verify the facts.

At this point I consider his posts on the topic to be trollish and without value.

--
Rick
On 08.11.2015 г. 22:15, rickman wrote:
> [...]
>
> We all know this, but Dimiter seems to want to hold onto the idea that
> the CM3 is constructed the way he imagines it without considering the
> possibility that it is different and refuses to do any work to verify
> the facts.
>
> At this point I consider his posts on the topic to be trollish and
> without value.
You suggesting to look something up for days is of huge value, sure. So what did you look up?

BTW, did you eventually understand my explanation? I would have expected someone doing logic designs to be a lot quicker in doing so.

Dimiter
On 11/8/2015 3:52 PM, Dimiter_Popoff wrote:
> [...]
>
> You suggesting to look something up for days is of huge value, sure.
> So what did you look up.
>
> BTW, did you eventually understand my explanation? I would have expected
> someone doing logic designs to be a lot quicker in doing so.
I've already told you I understand your explanation. The issue is not the logic of your idea; the issue is whether your ideas apply to the CM3 or not. As many here have pointed out, the limitations you impose are in no way an inherent part of a pipelined processor. I have already said that pipelined designs typically have exactly the logic needed for a pipeline to be fully useful, which you seem to feel is DSP "trickery". Rather, this is just intelligent design.

I think I'm done with this conversation. It is just going in circles and getting nowhere.

--
Rick
On 09.11.2015 г. 00:15, rickman wrote:
> [...]
>
> I've already told you I understand your explanation. The issue is not
> the logic of your idea, the issue is whether your ideas apply to the CM3
> or not. As many here have pointed out, the limitations you impose are
> in no way an inherent part of a pipelined processor. I have already
> said that pipelined designs typically have exactly the logic needed for
> a pipeline to be fully useful which you seem to feel is DSP "trickery".
> Rather this is just intelligent design.
>
> I think I'm done with this conversation. It is just going in circles
> and getting nowhere.
OK, no point continuing indeed. You have your generic assumptions against my experience plus my numeric explanation - you are free to stick to your beliefs of course.
On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:

> On 11/7/2015 5:47 PM, Tim Wescott wrote:
>> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>>
>>> Hi,
>>>
>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>> Cortex M3 processor instruction execution times, namely, the
>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>> the instruction in the reference manual but nowhere are cycle times
>>> mentioned.
>>>
>>> This is surreal. Every assembly language reference manual I've ever
>>> used includes cycle counts for each instruction. Here they're nowhere
>>> to be found.
>>
>> In addition to everything else that's mentioned, with today's
>> processors you're highly constrained by pipelining & whatnot.
>>
>> Most of the parts that I've worked with need lots of wait states to run
>> out of flash -- I wouldn't be surprised if the processor spends most of
>> it's time twiddling it's thumbs waiting on memory.
>
> If you want to run fast, you either put your code in RAM, or you let the
> processor use cache that is available on all but low end processors.
Unless I'm severely mistaken, most Cortex M3 processors are "low end" and do not sport caches.

At least on the ST parts, not all of the RAM is connected to the processor's instruction bus, so you don't get as much speedup as you'd think. Some do have a magic memory address range that's dual-ported to both buses.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 11/9/2015 1:19 AM, Tim Wescott wrote:
> On Sat, 07 Nov 2015 18:27:15 -0500, rickman wrote:
> [...]
>> If you want to run fast, you either put your code in RAM, or you let the
>> processor use cache that is available on all but low end processors.
>
> Unless I'm severely mistaken, most Cortex M3 processors are "low end" and
> do not sport caches.
According to Wikipedia (not always reliable), the CM3 has no cache. There's still the RAM speedup, and I forgot that most CM3 devices use a prefetch to make sure the CPU has instructions when they are needed.

Think about it. Why would they keep speeding up the CPU clock if performance were limited by the Flash alone?
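[The prefetch argument can be quantified with a back-of-the-envelope model: with W flash wait states and a prefetch buffer that hides them for sequential fetches, the stall cost is paid only on the fraction of fetches that break the sequence (branches). The numbers below are illustrative, not measurements of any particular part.]

```c
#include <assert.h>

/* Average cycles per instruction fetch, scaled by 1000 to stay in
 * integer arithmetic.  Sequential fetches hit the prefetch buffer
 * (no stall); non-sequential fetches pay `wait_states` stall cycles.
 * branch_per_mille = non-sequential fetches per 1000 fetches. */
static long fetch_cpi_x1000(long wait_states, long branch_per_mille)
{
    return 1000 + branch_per_mille * wait_states;
}
```

With 3 wait states and 10% non-sequential fetches the model gives about 1.3 cycles per fetch; with no prefetch at all (every fetch paying the penalty) it gives 4.0, which is why the flash alone does not cap the useful clock rate.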
-- Rick