
EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
On 11/7/2015 6:07 PM, Dimiter_Popoff wrote:
> On 08.11.2015 г. 00:46, rickman wrote:
>> On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
>>> On 07.11.2015 г. 18:45, rickman wrote:
>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found
>>>>>> the instruction in the reference manual but nowhere are cycle times
>>>>>> mentioned.
>>>>>>
>>>>>> This is surreal. Every assembly language reference manual I've ever
>>>>>> used includes cycle counts for each instruction. Here they're
>>>>>> nowhere to be found.
>>>>>
>>>>> Be prepared for surprises with the MAC instruction on a
>>>>> non-specialized DSP like that.
>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>> the few registers they have.
>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>> every multiply-add needs the result of the previous one to be able
>>>>> to add to.
>>>>
>>>> You might want to rethink that. The accumulate operation (add) is
>>>> typically one clock cycle while the multiply is sometimes multiple
>>>> cycles. I don't know what the multiply time is in the CM3, I thought
>>>> it was one cycle as well, but perhaps that is a pipelined time.
>>>> Regardless, the multiply spits out a result on every clock which is
>>>> then added to the accumulator on each clock, producing a MAC result
>>>> on each clock.
>>>>
>>>> I remember that in the CM4 they claimed to be able to get close to
>>>> 1 MAC per clock with various optimizations.
>>>
>>> Hah, it appears I am the only one - not only in this group - to have
>>> really gone through this.
>>>
>>> The multiply does spit, say, a result every cycle, OK. But this is at
>>> the end of the pipeline; so each multiply has started 6 (to stay with
>>> my 6 stages example) cycles earlier than the result it spits. Now since
>>> we accumulate the result in one register - and it is also at the input
>>> of the pipeline for the multiply-add opcode - a new instruction cannot
>>> begin going through the pipeline before one is finished, not without
>>> some additional, DSP-ish trickery - which "normal" processors do not
>>> have, or if they do they talk about some "DSP engine" or the sort.
>>
>> I don't quite understand what you are saying. You seem to be saying the
>> pipeline is 6 clock cycles long while that does not seem to be supported
>> by the facts.
>
> I just stick to the same example from the beginning for clarity.
>
>> Then you propose the inputs to the instruction have to be
>> available at the *start* of the instruction (not sure what that even
>> means really as instructions are fetched, decoded and executed, which
>> one is the "start") which is not necessarily true. I don't know that
>> pipelining the MAC instruction requires anything special from the CPU
>> other than the various controls required for pipelining.
>
> Well I know on the surface this is easy to overlook, as I had not
> thought about it until I had to deal with it. But it is a general issue.
> Operands enter the pipeline at its input; if one of these operands
> needs to be the output of the pipeline, guess what, you will have
> to wait for the entire pipeline length to be walked through before
> you have all operands to do the next operation.
> Let us try the MAC example: at the pipeline input you need a sample,
> a coefficient and the accumulated value: S, C and A.
> Assume that to calculate S*C+A takes as many steps as the pipeline
> is deep, say 6 cycles.
> Now we start with S0*C0+A0=A1; next cycle we do S1*C1+A1.
> But A1 will not be available for another 6 cycles, not before
> S0*C0+A0 makes it to the end of the pipeline.
> It is called a data dependency.
>
>> I'm not in a position to debate this since I am not so familiar with the
>> ARM instruction set, but I don't see any reason to use more registers
>> for a simple instruction like the MAC than are actually required. I
>> have never seen a problem with overlapping register usage in pipelined
>> instruction sets. As long as the register is updated by the time it is
>> used, it all works. Otherwise, what is the point of pipelining?
>
> Well I hope I did explain it well enough this time :-). Pipelining is
> powerful but like anything else it has its limitations; the above
> example summarizes it quite well.
Your explanation has always been clear, but I am not certain that your facts are straight. I have worked with DSP chips and designed pipelined processors for FPGAs. Your supposition that the data from a register must "enter the pipeline at its input" is an assumption from what I can see. I would have to consult the ARM CM3 architecture reference manual to see for sure. I don't see where you have done this. That is my point. Just as you assumed an invalid value for the pipeline length you may well be making a wrong assumption about how the pipeline works. -- Rick
On 11/7/2015 5:47 PM, Tim Wescott wrote:
> On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:
>> [...]
>>
>> This is surreal. Every assembly language reference manual I've ever
>> used includes cycle counts for each instruction. Here they're nowhere
>> to be found.
>
> In addition to everything else that's mentioned, with today's processors
> you're highly constrained by pipelining & whatnot.
>
> Most of the parts that I've worked with need lots of wait states to run
> out of flash -- I wouldn't be surprised if the processor spends most of
> its time twiddling its thumbs waiting on memory.
If you want to run fast, you either put your code in RAM, or you let the processor use cache that is available on all but low end processors. -- Rick
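A rough sketch of the "put your code in RAM" option on a GCC-based Cortex-M
toolchain. The ".ramfunc" section name and the copy-to-RAM startup support
are assumptions for illustration; check the linker script and vendor SDK for
the exact mechanism (many SDKs, including Silicon Labs', provide a ready-made
RAM-function macro for this).

    /* Illustrative only: one common way to run a hot routine from RAM with
     * GCC on a Cortex-M part.  ".ramfunc" is an assumption - it must match a
     * section the linker script places in RAM and the startup code copies
     * from flash.
     */
    #include <stdint.h>
    #include <stddef.h>

    __attribute__((section(".ramfunc"), noinline))
    int32_t mac_block_ram(const int16_t *x, const int16_t *h, size_t n)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += (int32_t)x[i] * h[i];   /* inner loop runs from RAM, no flash wait states */
        return acc;
    }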
On 08.11.2015 г. 01:25, rickman wrote:
> On 11/7/2015 6:07 PM, Dimiter_Popoff wrote:
>> [...]
>
> Your explanation has always been clear, but I am not certain that your
> facts are straight. I have worked with DSP chips and designed pipelined
> processors for FPGAs. Your supposition that the data from a register
> must "enter the pipeline at its input" is an assumption from what I can
> see. I would have to consult the ARM CM3 architecture reference manual
> to see for sure. I don't see where you have done this. That is my
> point. Just as you assumed an invalid value for the pipeline length you
> may well be making a wrong assumption about how the pipeline works.
So where else can data enter the pipeline except at its input?
How can you have the result of a 6 cycle operation in less than 6 cycles?
That is on a general ALU; I made the exception for DSP trickeries in my
first post.

Dimiter
On 11/7/2015 6:34 PM, Dimiter_Popoff wrote:
> On 08.11.2015 г. 01:25, rickman wrote:
>> [...]
>
> So where else can data enter the pipeline except at its input?
> How can you have the result of a 6 cycle operation in less than 6 cycles?
> That is on a general ALU; I made the exception for DSP trickeries in my
> first post.
I don't know what "DSP trickeries" means. A pipeline does multiple steps;
you don't need an input until the step that uses it. The adder for the
accumulation only needs the result of the accumulation on the next clock,
when it starts the next add. Why would it need the result of the add at the
same time as the inputs to the multiply? Rather than making assumptions
about how the ARM instruction set works, why not look it up? -- Rick
On 08.11.2015 г. 02:05, rickman wrote:
> On 11/7/2015 6:34 PM, Dimiter_Popoff wrote:
>> [...]
>
> I don't know what "DSP trickeries" means. A pipeline does multiple steps;
> you don't need an input until the step that uses it. The adder for the
> accumulation only needs the result of the accumulation on the next clock,
> when it starts the next add. Why would it need the result of the add at
> the same time as the inputs to the multiply?
>
> Rather than making assumptions about how the ARM instruction set works,
> why not look it up?
It is not an assumption, any more than the result of adding 1 to 1 being 2
is an assumption. A 6 cycle operation takes 6 cycles; it boils down to
that. I made it clear enough, please consult my previous posts.

Under DSP trickeries I mean adding extra hardware to hide the effect of the
pipeline length for MAC instructions, which I explained above, from the end
user.

As for looking things up, everyone is free to look up whatever they want;
there is no need for me to look up things for other people.

Dimiter
On Saturday, November 7, 2015 at 9:57:35 PM UTC+1, Randy Yates wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
> > On 07/11/15 17:45, rickman wrote:
> >> [...]
> >
> > The Cortex M4 has a range of additional instructions aimed precisely
> > at DSP operations such as MAC. That is the main difference between
> > the M3 and the M4.
> >
> > The M3 and M4 have a 3 stage pipeline.
> >
> > A very quick google
>
> Hi David,
>
> I don't want to sound ungrateful, but why in the hell must I resort
> to Google to get this deeply domain-specific information? It belongs
> in a reference manual.
>
> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>
> > shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
> > the M4. If you want 64-bit results and accumulates, it is 4 to 7
> > cycles on the M3 and 1 on the M4. The M4 also has a variety of other
> > DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
> > in parallel.
> >
> > With its very short pipelines, the M4 has enough registers to keep up
> > a good throughput at MAC operations - significantly better than on an
> > M3 in many circumstances.
> >
> > Even an M4 is not going to compete with a dedicated DSP on MAC
> > throughput per clock cycle - but it is /vastly/ easier to work with.
> >
> > The real question is what the OP actually wants to do, and if his M3
> > (or a replacement M4) is good enough - there is no point in going for
> > a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
> > enough for the application.
>
> The goal is to implement a high performance filter in few enough cycles
> to get back to low-power mode and meet a specific battery life goal. Is
> the CM3 "good enough?" TBD. There are a lot of choices (processing
> architecture, filter specifications, etc.) that will decide.
I believe there is a CMSIS-DSP library that implements a large number of
optimised DSP functions such as filters and FFTs.

-Lasse
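For what Lasse mentions, a minimal sketch of driving a CMSIS-DSP FIR,
assuming the library's arm_fir_* API; NUM_TAPS, BLOCK_SIZE, the buffers and
the wrapper functions are made-up names for illustration. On a Cortex-M3
without an FPU the fixed-point variants (arm_fir_q15/arm_fir_q31) would
normally be the practical choice; the f32 flavour is shown only because its
API is the simplest.

    /* Hedged sketch of a CMSIS-DSP FIR; sizes and names are placeholders. */
    #include "arm_math.h"

    #define NUM_TAPS    32u
    #define BLOCK_SIZE  64u

    static float32_t fir_coeffs[NUM_TAPS];                  /* filter design goes here */
    static float32_t fir_state[NUM_TAPS + BLOCK_SIZE - 1u]; /* scratch buffer required by the API */
    static arm_fir_instance_f32 fir;

    void filter_init(void)
    {
        arm_fir_init_f32(&fir, NUM_TAPS, fir_coeffs, fir_state, BLOCK_SIZE);
    }

    void filter_block(const float32_t *in, float32_t *out)
    {
        arm_fir_f32(&fir, in, out, BLOCK_SIZE);   /* one block of BLOCK_SIZE samples */
    }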
On 11/7/2015 7:19 PM, Dimiter_Popoff wrote:
> On 08.11.2015 г. 02:05, rickman wrote:
>> [...]
>
> It is not an assumption, any more than the result of adding 1 to 1 being 2
> is an assumption. A 6 cycle operation takes 6 cycles; it boils down to
> that. I made it clear enough, please consult my previous posts.
>
> Under DSP trickeries I mean adding extra hardware to hide the effect of
> the pipeline length for MAC instructions, which I explained above, from
> the end user.
>
> As for looking things up, everyone is free to look up whatever they want;
> there is no need for me to look up things for other people.
Lol, ok, if you want to believe stuff you acknowledge you made up, then so be it. I was talking about the ARM processors. You seem to be talking about an imaginary processor that none of the rest of us know anything about. Enjoy. -- Rick
rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 4:49 PM, Randy Yates wrote:
>> rickman <gnuarm@gmail.com> writes:
>>> [...]
>>>
>>> Should I assume that I can't talk you into an FPGA design in a low
>>> power device?
>>
>> That would require a board respin. Not good!
>
> Yes, of course. I didn't quite grasp what you were saying. You want
> to duty cycle running the signal processing at full power with idling
> at low power. Exactly how to do it with most CPUs.
Yes.

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
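A minimal sketch of the duty cycling being discussed, assuming a Silicon
Labs EFM32 with emlib (em_emu.h); process_block(), samples_ready and the
choice of energy mode are illustrative assumptions, not project code.

    /* Run the filter at full clock, then drop to a low-energy mode until
     * the next block of samples is ready.  EMU_EnterEM1()/EMU_EnterEM2()
     * are from Silicon Labs' emlib; the application names are hypothetical.
     */
    #include <stdbool.h>
    #include "em_emu.h"

    extern volatile bool samples_ready;   /* set from ADC/DMA interrupt */
    extern void process_block(void);      /* the FIR/MAC work */

    void main_loop(void)
    {
        for (;;) {
            while (!samples_ready) {
                EMU_EnterEM1();           /* sleep: CPU clock gated, peripherals keep running */
            }
            samples_ready = false;
            process_block();              /* do the DSP as fast as possible... */
            /* ...then fall back to sleep; deeper modes (e.g. EM2) save more
             * energy if the high-frequency clocks are not needed between blocks. */
        }
    }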
Dimiter_Popoff <dp@tgi-sci.com> wrote:
>
> Well I cannot have your certainty about the motivation ARM have,
> but I strongly suspect they _do_ know about the data dependencies
> and they do take them into account when designing.
>
> What I am pointing out is the architectural limitation; the MAC
> loop is only one good example how it takes pipeline depth
> times 3 registers plus address pointers and counters etc. to be
> able to keep it productive.
>
> Of course like you say most applications do not need all the
> resources, then there are architectures much worse than ARM
> doing commercially fine etc.; I am not interested in such a
> discussion at all.
>
> My point is about the number of registers a load/store machine needs
> in order to make use of a given pipeline depth. ARM is fundamentally
> limited in that by having too few registers and being a load/store
> machine at the same time; there is nothing one can do against these
> figures.
Those issues are well-understood, but you have badly mixed things up.

First, pipeline depth matters for jumps and interrupt latency, but is
irrelevant for most other operations. What matters is the latency of a
given operation. In particular, most modern machines manage to have 1 cycle
latency for simple integer operations regardless of pipeline length. For
example, both Cortex M3 and PC class processors have 1 cycle integer add
latency, despite the 3 stage pipeline in the M3 and pipelines more than 10
stages deep in PCs. Now, the difference between throughput and latency
comes from pipelined execution units, but speaking about "pipeline depth"
without any qualification is misleading.

Second, your PPC example may be valid, but it is quite unusual to need to
keep inputs valid during execution of an instruction. For example, when
computing a dot product on a PC I had to take into account floating point
add and multiply latencies (IIRC both were 4 cycles on my machine). Since
at that time my machine had no MAC instruction I had to use separate
multiply and add. I had to keep 4 accumulators to hide the add latency.
But in the case of multiply I could immediately reuse input registers for
another multiply. Of course, I took advantage of out-of-order execution to
reuse logical output registers from multiply. But even on an in-order
machine I would just have to keep the outputs of the multiplies in separate
registers and could still reuse the input registers. On a machine with a
MAC instruction I would just keep enough accumulators and reuse inputs.

Third, most of the above is irrelevant to the Cortex M3. Namely, the M3
executes at most 1 instruction per cycle. To have MAC doing useful work one
needs to feed enough data to it. On the M3 fetching an argument is a
separate instruction, so assuming both arguments come from memory (as will
be the case for a long filter) we get at most 1 MAC per 3 cycles (or rather
per 4 cycles, assuming 2 cycles for the MAC). Also, for most M3
instructions latency is the same as throughput. MAC and multiply may be
special, but the time to fetch arguments is likely to hide any extra
latency.

--
Waldek Hebisch
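A small C sketch of the multiple-accumulator technique Waldek describes:
several independent accumulators break the loop-carried dependency so that
consecutive multiply-adds do not wait on each other. The unroll factor of 4
and the names are illustrative assumptions; the right factor depends on the
actual add/MAC latency of the core.

    #include <stddef.h>

    float dot_product(const float *x, const float *h, size_t n)
    {
        /* Four independent dependency chains: each iteration only waits on
         * the accumulator from its own lane, 4 iterations back. */
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        size_t i;

        for (i = 0; i + 4 <= n; i += 4) {
            a0 += x[i]     * h[i];
            a1 += x[i + 1] * h[i + 1];
            a2 += x[i + 2] * h[i + 2];
            a3 += x[i + 3] * h[i + 3];
        }
        for (; i < n; i++)      /* remaining tail samples */
            a0 += x[i] * h[i];

        return (a0 + a1) + (a2 + a3);
    }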
On 08.11.2015 г. 06:47, Waldek Hebisch wrote:
> Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> [...]
>
> Those issues are well-understood, but you have badly mixed things up.
>
> First, pipeline depth matters for jumps and interrupt latency, but is
> irrelevant for most other operations. What matters is the latency of a
> given operation. [...]
Well, I'll try to explain it one more time for you. These things may be
well understood, but you are clearly not among those who have understood
them, so here we go again.

To repeat the example I already gave in another post: to do a MAC we need
to multiply a sample (S) by a coefficient (C) and add the product to the
accumulated result (A). So we have S0*C0+A0=A1, S1*C1+A1=A2 and so on.

Let us say we have a 6 stage pipeline (just to persist with the 6 figure I
started my examples with). At the first line we have all 3 inputs - S0, C0
and A0 - readily available and they enter the pipeline. Next clock cycle we
need S1 and C1 - which let us say we have or can fetch fast enough - and...
oops, A1. But A1 will only be at the OUTPUT of the pipeline 6 cycles later,
so we will have to wait until then.

I don't think this can be explained any clearer. Obviously the MAC
operation can be substituted by any other one which needs the output of the
pipeline as an input. This is plain arithmetic and is valid for any
pipelined processor. Those which manage 1 or more MACs per cycle have
special hardware to do so - doing what my source example does one way or
another; the registers they use would simply be hidden from the programming
model.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
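For reference, the loop shape the whole argument is about, written naively
in C. Each iteration's accumulate uses the accumulator produced by the
previous iteration - the loop-carried dependency Dimiter describes; whether
that actually costs extra cycles depends on how the core forwards results
between pipeline stages, which is rickman's point. Names and types are
illustrative only.

    #include <stdint.h>
    #include <stddef.h>

    int32_t naive_mac(const int16_t *s, const int16_t *c, size_t n)
    {
        int32_t a = 0;   /* single accumulator = single dependency chain */
        for (size_t i = 0; i < n; i++) {
            /* a(i+1) = s(i)*c(i) + a(i): the next multiply-add cannot
             * complete before this one has produced a. */
            a += (int32_t)s[i] * c[i];
        }
        return a;
    }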
