EFM32 Instruction Execution Times

Started by Randy Yates November 7, 2015
rickman <gnuarm@gmail.com> writes:

> On 11/7/2015 3:57 PM, Randy Yates wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>>
>>> On 07/11/15 17:45, rickman wrote:
>>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>>> instruction in the reference manual but nowhere are cycle times
>>>>>> mentioned.
>>>>>>
>>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>>> found.
>>>>>>
>>>>>
>>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>>> DSP like that.
>>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>>> the few registers they have.
>>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>>> MAC intermediate results only to bypass the data dependencies.
>>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>>> every multiply-add needs the result of the previous one to be able
>>>>> to add to.
>>>>
>>>> You might want to rethink that. The accumulate operation (add) is
>>>> typically one clock cycle while the multiply is sometimes multiple
>>>> cycles. I don't know what the multiply time is in the CM3, I thought it
>>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>>> the multiply spits out a result on every clock which is then added to
>>>> the accumulator on each clock producing a MAC result on each clock.
>>>>
>>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>>> per clock with various optimizations.
>>>>
>>>
>>> The Cortex M4 has a range of additional instructions aimed precisely
>>> at DSP instructions such as MAC. That is the main difference between
>>> the M3 and the M4.
>>>
>>> The M3 and M4 have a 3 stage pipeline.
>>>
>>> A very quick google
>>
>> Hi David,
>>
>> I don't want to sound ungrateful, but why in the hell must I resort
>> to Google to get this deeply domain-specific information? It belongs
>> in a reference manual.
>>
>> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>>
>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>>
>>> shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on
>>> the M4. If you want 64-bit results and accumulates, it is 4 to 7
>>> cycles on the M3 and 1 on the M4. The M4 also has a variety of other
>>> DSP-style instructions, including SIMD codes for 16-bit or 8-bit MACs
>>> in parallel.
>>>
>>> With its very short pipelines, the M4 has enough registers to keep up
>>> a good throughput at MAC operations - significantly better than on an
>>> M3 in many circumstances.
>>>
>>> Even an M4 is not going to compete with a dedicated DSP on MAC
>>> throughput per clock cycle - but it is /vastly/ easier to work with.
>>
>>> The real question is what the OP actually wants to do, and if his M3
>>> (or a replacement M4) is good enough - there is no point in going for
>>> a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than
>>> enough for the application.
>>
>> The goal is to implement a high performance filter in few enough cycles
>> to get back to low-power mode and meet a specific battery life goal. Is
>> the CM3 "good enough?" TBD. There are a lot of choices (processing
>> architecture, filter specifications, etc.) that will decide.
>
> Should I assume that I can't talk you into an FPGA design in a low
> power device?
That would require a board respin. Not good!

--
Randy Yates
Digital Signal Labs
http://www.digitalsignallabs.com
On 11/7/2015 4:49 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes:
>
>> Should I assume that I can't talk you into an FPGA design in a low
>> power device?
>
> That would require a board respin. Not good!
Yes, of course. I didn't quite grasp what you were saying. You want to duty-cycle between running the signal processing at full power and idling at low power. That is exactly how to do it with most CPUs.

--
Rick
On 07.11.2015 г. 18:45, rickman wrote:
> You might want to rethink that. The accumulate operation (add) is
> typically one clock cycle while the multiply is sometimes multiple
> cycles. I don't know what the multiply time is in the CM3, I thought it
> was one cycle as well, but perhaps that is a pipelined time. Regardless,
> the multiply spits out a result on every clock which is then added to
> the accumulator on each clock producing a MAC result on each clock.
>
> I remember that in the CM4 they claimed to be able to get close to 1 MAC
> per clock with various optimizations.
Hah, it appears I am the only one - not only in this group - to have really gone through this.

The multiply does spit out, say, a result every cycle, OK. But this is at the end of the pipeline; so each multiply has started 6 (to stay with my 6 stages example) cycles earlier than the result it spits out. Now since we accumulate the result in one register - and it is also at the input of the pipeline for the multiply-add opcode - a new instruction cannot begin going through the pipeline before the previous one is finished, not without some additional, DSP-ish trickery - which "normal" processors do not have, or if they do, they talk about some "DSP engine" or the like.

I had to do this the hard way on the e300 power core; in a straightforward loop, the FMADD (FP multiply-add, 64 bit) would take something like 20-30 ns (at a 2.5 ns clock period). The latency specified for the FMADD is just 2 cycles though; I had to bypass the data dependencies by using at least 6 (I did 6, 7 and 8) sets of 3 registers so the loop would go through all sets, which all had different destination registers and thus would have enough time for the pipeline every cycle. At the end of the loop all 6 (or 7 or 8) destination registers are simply added to get the final result.

I have posted it before; hopefully this explanation is better than my previous ones. Here is the source of how this works:

http://tgi-sci.com/misc/mac8.sa

Notice that it also saves load/store by a factor of 6 (or 8 in the example, I think it is the 8 sets case); the measured performance with this was 5.5 ns per FMADD (theoretical best, with no load/store involved, would have been 5 ns).

Now David said the ARM in question has only 3 stages in its pipeline, so it would take 9 registers to bypass its data dependencies; that might even be doable with the few registers they have.

[In fact the above is a good example of why ARM try to keep their pipelines short; their architecture just does not have the registers it takes to keep a longer pipeline full/productive. It is a major architecture limitation for a load/store machine.]

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
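The register-set trick described in this post can be sketched in C. This is only an illustration of the idea, not the code at the link above: the function names, the choice of four accumulators, and the use of wrapping 32-bit integer arithmetic (standing in for a 32x32 -> 32 multiply-accumulate) are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward MAC loop: every iteration needs the previous value of
 * acc, so a multiply-add with multi-cycle latency cannot overlap with
 * the next one. */
uint32_t dot_single(const uint32_t *x, const uint32_t *h, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * h[i];
    return acc;
}

/* Same sum with four independent accumulators (hypothetical count).
 * Consecutive multiply-adds write different destinations, so none of
 * them has to wait for the one issued just before it. */
uint32_t dot_unrolled4(const uint32_t *x, const uint32_t *h, size_t n)
{
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 += x[i]     * h[i];
        a1 += x[i + 1] * h[i + 1];
        a2 += x[i + 2] * h[i + 2];
        a3 += x[i + 3] * h[i + 3];
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        a0 += x[i] * h[i];

    /* The partial sums are combined once, after the loop. */
    return a0 + a1 + a2 + a3;
}
```

Whether a compiler keeps the four accumulators in separate registers, and whether that helps on a given core, would have to be checked in the generated code and by measurement. With floating point the reordering can also change rounding, which unsigned integer wraparound avoids here.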
On 07/11/15 23:18, Dimiter_Popoff wrote:
> Now David said the ARM in question has only 3 stages in its pipeline,
> it would take 9 registers to bypass its data dependencies;
> might even be doable with the few registers they have.
> [In fact the above is a good example of why ARM try to keep their
> pipelines short; their architecture just does not have the registers
> it takes to maintain a longer pipeline full/productive, it is a major
> architecture limitation for a load/store machine].
No, that is most certainly /not/ why ARM wants to keep these pipelines short. I am not disagreeing with your calculations regarding throughput, latency, and registers on hardware that is not DSP-dedicated (and I don't know what DSP features an M4 really has, as I haven't needed them myself). You've pointed out these issues before, and I think they are often misunderstood - people see the "MAC instruction timing 1 cycle" and think they can get 72 MMACs from a 72 MHz M4. So it is good that you raise awareness here.

But these are primarily control-oriented microcontroller cores - short pipelines mean low latencies, consistent timings, short branch delays, minimal interrupt latency jitter, small core die area, and low power. Being able to improve throughput of long MAC chains is merely a bonus.

Remember, the M4 core is not in the same class as the e300 - you would be better to compare the e300 to a Cortex A device with NEON SIMD instructions and see how that compares in DSP throughput. (Alternatively, you could compare MACs/s per $, or per mW, to get a fairer match.)
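The gap between "1 cycle per MAC" and the sustained rate can be put into numbers with a small C helper. It is a rough sketch: the function name is hypothetical, and the overhead figure (loads, pointer updates, loop control per MAC) is an assumed input that would have to come from the generated code or from measurement.

```c
#include <stdint.h>

/* Sustained MACs per second when each MAC costs mac_cycles for the
 * multiply-accumulate itself plus overhead_cycles for everything else
 * the loop does per MAC. Both counts are inputs, not values taken from
 * any datasheet. */
double sustained_macs_per_sec(double cpu_hz, uint32_t mac_cycles,
                              uint32_t overhead_cycles)
{
    uint32_t total = mac_cycles + overhead_cycles;
    if (total == 0)
        return 0.0;  /* avoid dividing by zero on nonsensical input */
    return cpu_hz / (double)total;
}
```

For example, `sustained_macs_per_sec(72e6, 1, 0)` gives the 72 MMAC/s headline figure, while an assumed two cycles of overhead per MAC, `sustained_macs_per_sec(72e6, 1, 2)`, gives 24 MMAC/s.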
On 07/11/15 21:57, Randy Yates wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
>> A very quick google
>
> Hi David,
>
> I don't want to sound ungrateful, but why in the hell must I resort
> to Google to get this deeply domain-specific information? It belongs
> in a reference manual.
>
> Turns out Rick was right - it's in the ARM Cortex M3 TRM:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf
>
And guess which link turns up at the top of a google search for "Cortex M3 MLA timing"? You can argue that Silicon Labs should have information in their own datasheets, or at least pointers to the ARM documents. They probably /do/ have that information there somewhere, if you dig deep enough. But sometimes googling is a lot faster, easier and less stressful than looking in the "right" places.
> The goal is to implement a high performance filter in few enough cycles
> to get back to low-power mode and meet a specific battery life goal. Is
> the CM3 "good enough?" TBD. There are a lot of choices (processing
> architecture, filter specifications, etc.) that will decide.
>
Without knowing a good deal more, it is impossible to guess. But certainly the "run as fast as possible for a short time, then sleep" is the right way to minimise power. Have plenty of capacitors on the board to reduce power spikes to the battery.
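The "run fast, then sleep" approach can be turned into a rough budget in C. This is a back-of-envelope sketch only: the function names are hypothetical, every number fed into them is an assumption to be replaced with datasheet or measured values, and wake-up time and energy are ignored.

```c
/* Fraction of time the core must be awake to process every sample. */
double active_duty(double cycles_per_sample, double sample_rate_hz,
                   double cpu_hz)
{
    return cycles_per_sample * sample_rate_hz / cpu_hz;
}

/* Time-weighted average supply current in mA for a given duty cycle. */
double average_current_ma(double duty, double active_ma, double sleep_ma)
{
    return duty * active_ma + (1.0 - duty) * sleep_ma;
}

/* Idealised battery life in hours, ignoring self-discharge and derating. */
double battery_life_hours(double capacity_mah, double avg_ma)
{
    return capacity_mah / avg_ma;
}
```

As an illustration with made-up figures: 320 cycles per sample at 8 kHz on a 32 MHz clock gives a duty cycle of about 0.08; with 10 mA active and 0.002 mA asleep, the average current comes out near 0.8 mA. Fewer cycles per sample lower the average almost proportionally as long as the sleep current is small by comparison.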
On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
> The multiply does spit say a result every cycle, OK. But this is at the
> end of the pipeline; so each multiply has started 6 (to stay with my 6
> stages example) cycles earlier than the result it spits. Now since
> we accumulate the result in one register - and it is also at the input
> of the pipeline for the multiply-add opcode - a new instruction cannot
> begin going through the pipeline before one is finished, not without
> some additional, DSP-ish trickery - which "normal" processor do not
> have or if they do they talk about some "DSP engine" or sort of.
I don't quite understand what you are saying. You seem to be saying the pipeline is 6 clock cycles long, which does not seem to be supported by the facts. Then you propose that the inputs to the instruction have to be available at the *start* of the instruction (I'm not sure what that even means, really, as instructions are fetched, decoded and executed; which one is the "start"?), which is not necessarily true.

I don't know that pipelining the MAC instruction requires anything special from the CPU other than the various controls required for pipelining. I don't know how many clocks it takes to do pipelined MAC instructions on the CM3. I do know they specifically added all the required logic to do fully pipelined MACs on the CM4; the real limitation seems to be memory accesses. Perhaps it was a 16 bit mode where two coefficients would be fetched in one memory operation and two data values were fetched in one memory operation, but they were able to reach 1 MAC per clock as long as nothing got in the way.

> I had to do this the hard way on the e300 power core; in a simple loop,
> the FMADD (FP multiply-add 64 bit) would take something like 20-30nS
> in a straight forward loop (at 2.5nS clock period). The latency
> specified for the FMADD is just 2 cycles though; I had to bypass the
> data dependencies by using at least 6 (I did 6, 7 and 8) sets of
> 3 registers so the loop would go through all sets which all
> had different destination registers and thus would have enough
> time for the pipeline every cycle. At the end of the loop all
> 6 (or 7 or 8) destination registers are simply added to get the
> final result.
> I have posted it before, hopefully this explanation is better
> than my previous ones. Here is the source of how this works:
>
> http://tgi-sci.com/misc/mac8.sa

There is no reason why one processor would be the same as another in this regard. This link seems to be something other than ARM CM3 code. I'm guessing this is your e300 power core.

> Notice that it also saves load/store by a factor of 6 (or 8
> in the example, I think it is the 8 sets case); the measured
> performance with this was 5.5 nS per FMADD (theoretical best,
> no load/store involved would have been 5 nS).
>
> Now David said the ARM in question has only 3 stages in its pipeline,
> it would take 9 registers to bypass its data dependencies;
> might even be doable with the few registers they have.
> [In fact the above is a good example of why ARM try to keep their
> pipelines short; their architecture just does not have the registers
> it takes to maintain a longer pipeline full/productive, it is a major
> architecture limitation for a load/store machine].

I'm not in a position to debate this since I am not so familiar with the ARM instruction set, but I don't see any reason to use more registers for a simple instruction like the MAC than are actually required. I have never seen a problem with overlapping register usage in pipelined instruction sets. As long as the register is updated by the time it is used, it all works. Otherwise, what is the point of pipelining?

--
Rick
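The 16-bit mode recalled in this post, where one 32-bit memory access brings in two coefficients or two data values, can be modelled in portable C. The helper names below are hypothetical, and the code only models the arithmetic of a dual 16x16 multiply-accumulate (the Cortex-M4 DSP extension has instructions of this kind, such as SMLAD); it says nothing about cycle counts, and it uses a 64-bit accumulator purely to keep the model free of overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word:
 * lo in bits 0..15, hi in bits 16..31. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Arithmetic model of a dual MAC: acc + lo(a)*lo(b) + hi(a)*hi(b).
 * The casts back to int16_t assume the usual two's complement
 * conversion, which is implementation-defined in older C standards. */
static int64_t dual_mac16(uint32_t a, uint32_t b, int64_t acc)
{
    int16_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(uint16_t)(a >> 16);
    int16_t b_lo = (int16_t)(uint16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(uint16_t)(b >> 16);

    return acc + (int64_t)a_lo * b_lo + (int64_t)a_hi * b_hi;
}

/* Dot product over packed data: each loop pass reads one word of
 * samples and one word of coefficients and performs two MACs. */
static int64_t dot16_packed(const uint32_t *x, const uint32_t *h,
                            size_t n_words)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n_words; i++)
        acc = dual_mac16(x[i], h[i], acc);
    return acc;
}
```

The point of the packing is that the number of memory accesses per MAC is halved, which matters if memory access really is the limiting factor as suggested above.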
On Sat, 07 Nov 2015 10:47:27 -0500, Randy Yates wrote:

> Hi,
>
> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
> Cortex M3 processor instruction execution times, namely, the
> MLA/Multiply-Accumulate instruction, but others as well. I've found the
> instruction in the reference manual but nowhere are cycle times
> mentioned.
>
> This is surreal. Every assembly language reference manual I've ever used
> includes cycle counts for each instruction. Here they're nowhere to be
> found.
In addition to everything else that's mentioned, with today's processors you're highly constrained by pipelining & whatnot. Most of the parts that I've worked with need lots of wait states to run out of flash -- I wouldn't be surprised if the processor spends most of its time twiddling its thumbs waiting on memory.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 08.11.2015 г. 00:33, David Brown wrote:
> No, that is most certainly /not/ why ARM wants to keep these pipelines
> short. I am not disagreeing with your calculations regarding
> throughput, latency, and registers on hardware that is not DSP-dedicated
> (and I don't know what DSP features an M4 really has, as I haven't
> needed them myself). You've pointed out these issues before, and I
> think they are often misunderstood - people see the "MAC instruction
> timing 1 cycle" and think they can get 72 MMACs from a 72 MHz M4. So it
> is good that you raise awareness here.
>
> But these are primarily control-oriented microcontroller cores - short
> pipelines means low latencies, consistent timings, short branch delays,
> minimal interrupt latency jitter, small core die area, and low power.
> Being able to improve throughput of long MAC chains is merely a bonus.
>
> Remember, the M4 core is not in the same class as the e300 - you would
> be better to compare the e300 to a Cortex A device with NEON SIMD
> instructions and see how that compares in DSP throughput.
> (Alternatively, you could compare MACs/s per $, or per mW, to get a
> fairer match.)
>
Well, I cannot share your certainty about ARM's motivation, but I
strongly suspect they _do_ know about the data dependencies and do take
them into account when designing.

What I am pointing out is the architectural limitation; the MAC loop is
just one good example of how it takes pipeline depth times 3 registers,
plus address pointers, counters etc., to keep the pipeline productive.

Of course, as you say, most applications do not need all the resources,
and there are architectures much worse than ARM doing fine commercially;
I am not interested in that discussion at all. My point is about the
number of registers a load/store machine needs in order to make use of
a given pipeline depth. ARM is fundamentally limited here by having too
few registers while being a load/store machine; there is nothing one
can do about these figures.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
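For readers who want the register-set idea in compilable form, here is a minimal C sketch of it. It is not the code from the mac8.sa link; the names mac_single, mac_split and NACC are made up for illustration, NACC = 6 is simply the 6-stage figure used in the discussion, and integer data is assumed so that both versions give exactly the same sum. Whether the compiler actually keeps the partial sums in separate registers, and whether that buys anything on a given core, has to be checked on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of independent partial sums, taken from the
 * 6-stage example in the post; not a property of any real core. */
#define NACC 6

/* Straightforward loop: each multiply-add needs the previous sum. */
int64_t mac_single(const int32_t *s, const int32_t *c, size_t n)
{
    assert(n == 0 || (s != NULL && c != NULL));
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)s[i] * c[i];
    return acc;
}

/* NACC partial sums: consecutive multiply-adds write different
 * accumulators, so none waits on the one issued just before it. */
int64_t mac_split(const int32_t *s, const int32_t *c, size_t n)
{
    assert(n == 0 || (s != NULL && c != NULL));
    int64_t acc[NACC] = {0};
    size_t i = 0;

    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; k++)
            acc[k] += (int64_t)s[i + k] * c[i + k];

    for (; i < n; i++)          /* leftover samples */
        acc[0] += (int64_t)s[i] * c[i];

    int64_t total = 0;          /* add the partial sums at the end */
    for (size_t k = 0; k < NACC; k++)
        total += acc[k];
    return total;
}
```

With integer data the two functions return the same value for any input that does not overflow. With floating point, as in the FMADD case described above, splitting the sum changes the order of the additions, so the result can differ slightly in the last bits.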
On 08.11.2015 г. 00:46, rickman wrote:
> On 11/7/2015 5:18 PM, Dimiter_Popoff wrote:
>> On 07.11.2015 г. 18:45, rickman wrote:
>>> On 11/7/2015 11:35 AM, Dimiter_Popoff wrote:
>>>> On 07.11.2015 г. 17:47, Randy Yates wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to find information on the Silicon Labs/Energy Micro EFM32
>>>>> Cortex M3 processor instruction execution times, namely, the
>>>>> MLA/Multiply-Accumulate instruction, but others as well. I've found the
>>>>> instruction in the reference manual but nowhere are cycle times
>>>>> mentioned.
>>>>>
>>>>> This is surreal. Every assembly language reference manual I've ever used
>>>>> includes cycle counts for each instruction. Here they're nowhere to be
>>>>> found.
>>>>>
>>>>
>>>> Be prepared for surprises with the MAC instruction on a non-specialized
>>>> DSP like that.
>>>> Even if they specify 1 cycle throughput this can be unrealistic given
>>>> the few registers they have.
>>>> If they have a 6 stage pipeline it takes at least 18 registers for
>>>> MAC intermediate results only to bypass the data dependencies.
>>>> IOW, if you just write a loop with a counter you will need at least 6
>>>> cycles (plus perhaps some additional time for the mul) simply because
>>>> every multiply-add needs the result of the previous one to be able
>>>> to add to.
>>>
>>> You might want to rethink that. The accumulate operation (add) is
>>> typically one clock cycle while the multiply is sometimes multiple
>>> cycles. I don't know what the multiply time is in the CM3, I thought it
>>> was one cycle as well, but perhaps that is a pipelined time. Regardless,
>>> the multiply spits out a result on every clock which is then added to
>>> the accumulator on each clock producing a MAC result on each clock.
>>>
>>> I remember that in the CM4 they claimed to be able to get close to 1 MAC
>>> per clock with various optimizations.
>>>
>>
>> Hah, it appears I am the only one - not only in this group - to have
>> really gone through this.
>>
>> The multiply does spit say a result every cycle, OK. But this is at the
>> end of the pipeline; so each multiply has started 6 (to stay with my 6
>> stages example) cycles earlier than the result it spits. Now since
>> we accumulate the result in one register - and it is also at the input
>> of the pipeline for the multiply-add opcode - a new instruction cannot
>> begin going through the pipeline before one is finished, not without
>> some additional, DSP-ish trickery - which "normal" processor do not
>> have or if they do they talk about some "DSP engine" or sort of.
>
> I don't quite understand what you are saying. You seem to be saying the
> pipeline is 6 clock cycles long while that does not seem to be supported
> by the facts.
I am just sticking to the same example from the beginning, for clarity.
> Then you propose the inputs to the instruction have to be
> available at the *start* of the instruction (not sure what that even
> means really as instructions are fetched, decoded and executed, which
> one is the "start") which is not necessarily true. I don't know that
> pipelining the MAC instruction requires anything special from the CPU
> other than the various controls required for pipelining.
Well, I know this is easy to overlook on the surface; I had not thought
about it myself until I had to deal with it. But it is a general issue.

Operands enter the pipeline at its input; if one of these operands has
to be the output of the pipeline then, guess what, you have to wait for
the entire pipeline length to be walked through before you have all the
operands for the next operation.

Let us try the MAC example: at the pipeline input you need a sample, a
coefficient and the accumulated value: S, C and A. Assume that
calculating S*C+A takes as many steps as the pipeline is deep, say 6
cycles.

Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1.
But A1 will not be available for another 6 cycles, not before
S0*C0+A0 makes it to the end of the pipeline.
It is called a data dependency.
> I'm not in a position to debate this since I am not so familiar with the
> ARM instruction set, but I don't see any reason to use more registers
> for a simple instruction like the MAC than are actually required. I
> have never seen a problem with overlapping register usage in pipelined
> instruction sets. As long as the register is updated by the time it is
> used, it all works. Otherwise, what is the point of pipelining?
>
Well, I hope I explained it well enough this time :-). Pipelining is
powerful, but like anything else it has its limitations; the above
example summarizes them quite well.

Dimiter
On 08.11.2015 г. 01:07, Dimiter_Popoff wrote:
> Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1.

My mistake - obviously this should read
"Now we start with S0*C0+A0=A1; next cycle we do S1*C1+A1."

Dimiter
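The data dependency argument can be put into numbers with a toy cycle model in C. This is only a sketch of the reasoning in this thread, under its stated assumptions: a fixed pipeline depth, one issue per cycle, and no result forwarding, so an accumulator is usable again only "depth" cycles after the MAC that writes it was issued. The function name mac_cycles and the MAX_ACC bound are hypothetical, and nothing here describes the actual timing of the Cortex-M3 or any other real core.

```c
#include <assert.h>

#define MAX_ACC 16  /* hypothetical upper bound for this sketch */

/* Toy model: one MAC may issue per cycle, but only once the accumulator
 * it adds to has left the pipeline (depth cycles after its last issue).
 * Returns the cycle in which the last result becomes available. */
unsigned long mac_cycles(unsigned long n_macs, unsigned depth, unsigned n_acc)
{
    assert(depth >= 1 && n_acc >= 1 && n_acc <= MAX_ACC);

    unsigned long ready[MAX_ACC] = {0}; /* cycle each accumulator is valid */
    unsigned long cycle = 0;            /* earliest cycle for next issue */
    unsigned long last = 0;

    for (unsigned long i = 0; i < n_macs; i++) {
        unsigned k = (unsigned)(i % n_acc);   /* round-robin accumulator */
        unsigned long issue = cycle > ready[k] ? cycle : ready[k];
        ready[k] = issue + depth;             /* result leaves the pipeline */
        cycle = issue + 1;
        if (ready[k] > last)
            last = ready[k];
    }
    return last;
}
```

In this model, mac_cycles(100, 6, 1) gives 600 cycles, i.e. 6 cycles per MAC on a single accumulator, while mac_cycles(100, 6, 6) gives 105, close to one MAC per cycle. A core with forwarding or a dedicated MAC data path would not fit these assumptions.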