EmbeddedRelated.com

Accurate delay routine in assembly for ARM7, CortexM3 and M0

Started by Alexan_e June 7, 2012
SW Timing loops should generally not be used for any critical timing, there
are too many events that can affect them, so if the loop time is critical
you should find another way.

There are other bus masters in the system which may affect CPU timing if
they cause a stall in an instruction fetch, e.g. the DMA or Ethernet
controllers.

For short timing loops you also have to consider the effect of any buffered
writes which may need to be completed at the start of the loop.

Regards

Phil.


--- In l..., "Kevin" wrote:
>
> As far as I know the M3 does support ARM-mode, however M0 does not.

You are not that far wrong in practice - it is just that you have not used the correct terminology. The Cortex-M3 does support a mixed 32-bit / 16-bit instruction set but it is now unified and called Thumb-2 instead of the earlier separate mode 32-bit ARM / 16-bit Thumb combination. The Cortex-M0 supports 16-bit Thumb only.

The major difference between the functionality of the 32-bit instructions in Thumb-2 compared to ARM is that most instructions no longer have the conditional execution feature. This has been replaced by the new IT instruction. Apart from that you can do just about the same (and in some cases more) with the 32-bit Thumb-2 instruction set as you could with the earlier 32-bit ARM instruction set.
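To make that concrete (an illustration of my own, not from the thread): compilers use conditional execution for short branch-free selections like the one below. In ARM mode the condition is encoded in the instruction itself; in Thumb-2 the same conditional MOV is typically preceded by an IT instruction.

```c
#include <stdint.h>

/* A branch-free minimum: the kind of code where conditional execution
 * pays off.  ARM mode can emit e.g. CMP r0,r1 ; MOVGE r0,r1 (condition
 * encoded in the MOV); Thumb-2 typically emits CMP r0,r1 ; IT GE ;
 * MOVGE r0,r1 (condition supplied by the preceding IT instruction). */
int32_t min32(int32_t a, int32_t b)
{
    return (a < b) ? a : b;
}
```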

Regards,
Chris Burrows

CFB Software
Astrobe v4.2: Cortex-M3 Oberon Development System
http://www.astrobe.com
>NOP does nothing. NOP is not necessarily a time-consuming NOP. The processor might remove it from the pipeline before it reaches the execution stage.

It still has to fetch them, so only a few of the NOPs can be folded out "for free" behind other instructions. Once the execution end of the pipeline is empty, a stream of NOPs provides the expected one-cycle-per-NOP delay.

But you are right - this is just another unknown in the equation. It is not a trivial task indeed, especially with the "documentation" we have at hand.
JW
>SW Timing loops should generally not be used for any critical timing, there
>are too many events that can affect them, so if the loop time is critical
>you should find another way.

Oh, there is an implicit assumption in loop delays that they provide a *minimal* delay.

We all are grown ups, right? :-)

JW
I do not approve of using processor instructions to perform delays; what I always do is use a timer/counter, since it will not be affected if the
routine is interrupted during a wait.

The Cortex-M chips usually have 4 timers + RIT + TICK, which is a lot.

Anyway, I'd like to show you all a great way of doing timeout and
delay routines which supports counter overflow, so the timer does not have to be
reset between uses and can therefore be used in a multi-tasking/interrupt
environment.
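The overflow-safe comparison that makes this work can be sketched in a few lines of C (my own sketch; the function and parameter names are made up for illustration, and the "now" value would come from reading a free-running hardware timer):

```c
#include <stdint.h>

/* Wrap-safe timeout check: because the subtraction is done in unsigned
 * 32-bit arithmetic, the elapsed count is correct even if the free-running
 * counter wrapped past 0xFFFFFFFF between 'start' and 'now', so the timer
 * never needs to be reset. */
int timeout_expired(uint32_t start, uint32_t now, uint32_t interval)
{
    return (uint32_t)(now - start) >= interval;
}
```

Each caller keeps its own start value, so several tasks or interrupt handlers can share the same free-running counter.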

Thank you for your replies.

I had read about the __NOP() function defined in the CMSIS library before starting this thread, and I have tried to use it in a C
loop to get some delays, but the results were kind of strange, at least for my level of knowledge about the Cortex.

For anyone unfamiliar with the __NOP() definition, it is the following:

--------------------- these are defined for the Keil uVision compiler ------------------------------
#define __ASM __asm /*!< asm keyword for ARM Compiler */
#define __INLINE __inline /*!< inline keyword for ARM Compiler */
#define __STATIC_INLINE static __inline
----------------------------

And this is the function:

__attribute__( ( always_inline ) ) __STATIC_INLINE void __NOP(void)
{
__ASM volatile ("nop");
}

I have made the test in the software simulator of uVision (v4.53) using an LPC1313-01 (Cortex-M3) with the following loop:

volatile unsigned int i;

for(i=0;i<1000;i++)
{
__NOP();
}

Strangely the cpu cycles per loop change depending on the condition of the loop.

When I use i<1000 then the loop takes a total of 6006 cpu cycles which gives 6 cycles per loop (it is the same with i<100 )
When I use i<10000 then the loop takes a total of 70007 cpu cycles which gives 7 cycles per loop
When I use i<100000 then the loop takes a total of 800008 cpu cycles which gives 8 cycles per loop
When I use i<100000 then the loop takes a total of 8000008 cpu cycles which gives 8 cycles per loop and any higher number gives
the same result

I assume that the software simulation is accurate.
Can anyone explain why I get these different cpu clocks per loop depending on the number of loops?

I have also done the same test on an LPC2103 (ARM7TDMI), using the definition given at the top of this post and the same loop.

When I use i<1000 then the loop takes a total of 7995 cpu cycles which gives 8 cycles per loop
When I use i<10000 then the loop takes a total of 109998 cpu cycles which gives 11 cycles per loop
When I use i<100000 then the loop takes a total of 1099998 cpu cycles which gives 11 cycles per loop and any higher number gives
the same result.

Alex
There are a few things to consider when using the Keil environment which may
affect the results.

Firstly the software simulator is NOT cycle accurate, it's about 90%
accurate but doesn't simulate all possible delays correctly, and doesn't
simulate effects of memory accelerators etc used to get high speed flash
access.

Secondly with the Keil system the debugger is intrusive by default, i.e. if
you have periodic window update enabled then that causes bus accesses which
steal cycles from the CPU for the debugger interface to access the memory,
so turn it off for any time measurements.

I'm not sure if they model this effect in the simulator, as there is no real
need to, but I guess it depends on what CPU model is used.

You also need to ensure that you compile with semi-hosting disabled since
there may be background tasks using the CPU bandwidth to handle the
semi-hosting libraries. For short simulations they would usually not get
called, but a periodic interrupt to handle IO would affect longer
simulations more.

Regards

Phil.
On 7 Jun 2012, at 23:40, Alexan_e wrote:
> Thank you for your replies.
>
> I have read about the __NOP() function defined in the CMSIS library before starting this threads and I have tried to use it in a C
> loop to get some delays but the results were kind of strange , at least for the level of knowledge I have about Cortex.
>
> For anyone unfamiliar with the __NOP() define it is the following
>
> --------------------- these are defined for the keil uvision compiler ------------
> #define __ASM __asm /*!< asm keyword for ARM Compiler */
> #define __INLINE __inline /*!< inline keyword for ARM Compiler */
> #define __STATIC_INLINE static __inline
> ----------------------------------
>
> And this is the function:
>
> __attribute__( ( always_inline ) ) __STATIC_INLINE void __NOP(void)
> {
> __ASM volatile ("nop");
> }
>
> I have made the test in the software simulator of uvision (v4.53) using LPC1313-01 (CortexM3) with the following loop:
>
> volatile unsigned int i;
>
> for(i=0;i<1000;i++)
> {
> __NOP();

You don't need the NOP. What is it used for? Absolutely nothing.

>
> }
>
> Strangely the cpu cycles per loop change depending on the condition of the loop.

It is not strange at all. You have not examined what you are executing.

>
> When I use i<1000 then the loop takes a total of 6006 cpu cycles which gives 6 cycles per loop (it is the same with i<100 )
> When I use i<10000 then the loop takes a total of 70007 cpu cycles which gives 7 cycles per loop
> When I use i<100000 then the loop takes a total of 800008 cpu cycles which gives 8 cycles per loop
> When I use i<100000 then the loop takes a total of 8000008 cpu cycles which gives 8 cycles per loop and any higher number gives
> the same result

This indicates three different strategies of compilation, if your results are correct (I suspect the last one is missing a zero).

You may achieve (as in, I expect but have no intention of verifying) the same number of cycles per loop iteration for your terminal values if you happen to make the terminal value a volatile variable:
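For example, something along these lines (my own sketch of the idea, not Paul's original snippet):

```c
/* Making the terminal value a volatile variable forces the compiler to
 * reload it on every iteration, so it can no longer specialise the loop
 * on the size of the constant (immediate CMP vs MOVW vs literal-pool LDR). */
unsigned int delay_loop(unsigned int n)
{
    volatile unsigned int limit = n;
    volatile unsigned int i;

    for (i = 0; i < limit; i++)
    {
        /* __NOP() would go here on the target */
    }
    return i;  /* iteration count, handy as a sanity check */
}
```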



My assumption would be that the compiler will be forced into using a single form of loop for this; other compiler options, of course, may well scupper your finely tuned absolutely-no-use-in-practice delay function. As discussed, either write it 100% in assembly language and, even then, it is highly unlikely to be accurate, or use an accurate time source.

Alternatively, just don't bother :-)

-- Paul
You are right, I missed a 0 in the last line.
It was supposed to be: when I use i<1000000 the loop takes a total of 8000008 cpu cycles, which gives 8 cycles per loop.
I have checked the disassembly window (I know, I should have done it before) and it shows the difference between the loops.

for i<1000 (6 clocks)

0x00000716 BF00 NOP
0x00000718 1C40 ADDS r0,r0,#1
0x0000071A F5B07F7A CMP r0,#0x3E8
0x0000071E D3FA BCC 0x00000716

for i<10000 (7 clocks)

0x00000716 BF00 NOP
0x00000718 1C40 ADDS r0,r0,#1
0x0000071A F2427110 MOVW r1,#0x2710
0x0000071E 4288 CMP r0,r1
0x00000720 D3F9 BCC 0x00000716

for i<100000 (8 clocks)

0x00000716 BF00 NOP
0x00000718 1C40 ADDS r0,r0,#1
0x0000071A 4913 LDR r1,[pc,#76] ; @0x00000768
0x0000071C 4288 CMP r0,r1
0x0000071E D3FA BCC 0x00000716

I have tried the empty loop:
for(i=0;i<100000;i++); results in the following code and executes in 700008 clocks, 7 instead of 8 clocks per loop because of the missing NOP:

0x00000716 1C40 ADDS r0,r0,#1
0x00000718 4913 LDR r1,[pc,#76] ; @0x00000768
0x0000071A 4288 CMP r0,r1
0x0000071C D3FB BCC 0x00000716

I'm just trying to figure out a small delay routine (preferably accurate at the µs level) so that I can use it in a few libraries as a delay between commands.
I was using such a delay in my Atmel AVR libraries and it would be convenient for me to have a similar delay for Cortex (and ARM7TDMI).
A function like that will work without any change for any LPC cortex device and there will be no need to ensure that the timer used in the delay isn't used elsewhere in the project.
In addition there are about four different sets of timer register names in the headers of LPC 17xx, 177x/8x, 13xx, 13Uxx etc so I would have to either change the register names each time or use a set of defines.
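One way to sidestep both problems, sketched under my own assumptions (none of these names come from the thread): keep the delay logic generic over a function that reads whatever free-running down counter the part provides, e.g. the 24-bit SysTick current-value register that every Cortex-M has, so only the one-line reader changes per device family.

```c
#include <stdint.h>

typedef uint32_t (*tick_read_fn)(void);

/* Busy-wait for 'ticks' counts of a 24-bit DOWN counter (SysTick-style).
 * elapsed = start - now, masked to 24 bits, stays correct across a
 * counter wrap, so the timer never needs to be reloaded or reset. */
void delay_ticks(tick_read_fn read_tick, uint32_t ticks)
{
    uint32_t start = read_tick();
    while (((start - read_tick()) & 0x00FFFFFFu) < ticks)
        ;   /* spin */
}

/* For illustration off-target: a fake down counter that loses one
 * count per read; on a Cortex-M, read_tick would return SysTick->VAL. */
static uint32_t fake_count = 50u;
uint32_t fake_tick(void) { return --fake_count; }
```

On a real part, ticks would be microseconds times the core clock in MHz. As Phil and others point out, this still only guarantees a minimum delay: interrupts can only lengthen it.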

A 100%-assembly-language delay is kind of what I was looking for when I started the thread, but I couldn't fix the given one to work properly.

Alex
Alexan,

What happens to the accuracy of your µs delay function if it gets
interrupted in the loop?
