Reply by David Brown March 24, 20172017-03-24
On 24/03/17 20:26, Tauno Voipio wrote:
> On 24.3.17 16:21, David Brown wrote:
>> >> Finally, they might like to note that: >> >> void portable_delay_cycles(unsigned long n) { >> while (n--) { >> asm volatile (""); >> } >> } >> >> gives a faster loop, as well as actually being portable! >> >> The best code for accurate delays is usually made by having static >> inline functions, so that the compiler can calculate and compensate for >> loop entry/exit directly, and use whatever registers make most sense for >> efficient code. >> >> >> > > I would delete the cmp instruction and change the sub instruction > to a subs instruction.
That is what gcc generates for the C loop above (I said it gave a faster loop, but did not post the generated code).
> > One more vote to the C version of David. >
Yes. The " asm volatile(""); " trick might be new to some people - it tells the compiler "pretend something important is happening here, even though there is no code". It is cheaper than the traditional idea of making the loop variable volatile to force the compiler to keep the loop, or the alternative of using an assembly "nop" instruction.
Reply by Tauno Voipio March 24, 20172017-03-24
On 24.3.17 16:21, David Brown wrote:
> On 24/03/17 13:54, pozz wrote: >> Il 24/03/2017 12:57, David Brown ha scritto: >>> On 24/03/17 12:23, pozz wrote: >>>> Il 23/02/2017 17:15, pozz ha scritto: >>>>> During startup, I need a short and not precise delay, before >>>>> configuring >>>>> clocks, timers and other peripherals (at startup the CPU runs with >>>>> internal clock). >>>>> >>>>> What do you suggest? >>>>> >>>>> I think there's a simpler method than configuring a hardware timer. >>>>> >>>>> I need to check the status of an input pin, *after* enabling internal >>>>> pull-up. I'd like to introduce a short delay after enabling internal >>>>> pull-up, otherwise there's a risk I will read a transient level >>>>> (maybe 0 >>>>> or 1). >>>> >>>> I found some interesting code in Atmel Software Framework. It is based >>>> on the following function defined in delay.c file: >>>> >>>> __attribute__((optimize("-Os"))) >>>> __attribute__ ((section(".ramfunc"))) >>>> void portable_delay_cycles(unsigned long n) >>>> { >>>> (void)n; >>>> >>>> __asm ( >>>> "loop: DMB \n" >>>> "SUB r0, r0, #1 \n" >>>> "CMP r0, #0 \n" >>>> "BNE loop " >>>> ); >>>> } >>>> >>> >>> Someone should teach the Atmel folk about gcc inline assembly... >> >> What is wrong with that code? >> > > That is written in the old "basic" format for inline assembly, which for > most purposes has been replaced by the "extended" format. As it says in > the manual: > > "Using extended asm (see Extended Asm) typically produces smaller, > safer, and more efficient code, and in most cases it is a better > solution than basic asm." > > This would have been better written: > > __attribute__((section(".ramfunc")) > void portable_delay_cycles(unsigned long n) > { > asm volatile ( > " dmb\n" > " 1:\n" > " sub %[n], %[n], #1\n" > " cmp %[n], #0\n" > " bne 1b" > : [n] "+r" (n) :: "cc"); > } > > I would actually prefer such functions to be declared "static inline", > and not as a separately compiled function in a different segment, to > avoid function call overhead in the delay calculation. But I appreciate > the thought of putting it into ram like this. > > There is very little point in specifying an optimisation attribute for a > function containing nothing but assembly! > > When written as basic assembly, the compiler has fewer opportunities to > do much with the code. Extended assembly gives the compiler details of > exactly what you need, and exactly what you are doing. If the compiler > can move the code around (such as for a "static inline" function, or if > you have link-time optimisation enabled), then the compiler is able to > generate better code. With the basic assembly, the compiler must assume > that the assembly code will corrupt the volatile registers r0..r3, lr > and the condition codes. With extended assembly, the compiler knows > exactly which registers are in use - and it does not have to pick r0 for > the loop counter. It also means the compiler knows exactly how "n" is > going to be used. > > Also, there is no need to put a "data memory barrier" /inside/ the loop! > I don't quite see how Atmel count 7 clocks per cycle - I see 6 myself. > But putting the DMB outside the loop reduces it by two clocks per > cycle, thus increasing the resolution. > > Finally, they might like to note that: > > void portable_delay_cycles(unsigned long n) { > while (n--) { > asm volatile (""); > } > } > > gives a faster loop, as well as actually being portable! > > The best code for accurate delays is usually made by having static > inline functions, so that the compiler can calculate and compensate for > loop entry/exit directly, and use whatever registers make most sense for > efficient code. > > >
I would delete the cmp instruction and change the sub instruction to a subs instruction. One more vote to the C version of David. -- -TV
Reply by David Brown March 24, 20172017-03-24
On 24/03/17 13:54, pozz wrote:
> Il 24/03/2017 12:57, David Brown ha scritto: >> On 24/03/17 12:23, pozz wrote: >>> Il 23/02/2017 17:15, pozz ha scritto: >>>> During startup, I need a short and not precise delay, before >>>> configuring >>>> clocks, timers and other peripherals (at startup the CPU runs with >>>> internal clock). >>>> >>>> What do you suggest? >>>> >>>> I think there's a simpler method than configuring a hardware timer. >>>> >>>> I need to check the status of an input pin, *after* enabling internal >>>> pull-up. I'd like to introduce a short delay after enabling internal >>>> pull-up, otherwise there's a risk I will read a transient level >>>> (maybe 0 >>>> or 1). >>> >>> I found some interesting code in Atmel Software Framework. It is based >>> on the following function defined in delay.c file: >>> >>> __attribute__((optimize("-Os"))) >>> __attribute__ ((section(".ramfunc"))) >>> void portable_delay_cycles(unsigned long n) >>> { >>> (void)n; >>> >>> __asm ( >>> "loop: DMB \n" >>> "SUB r0, r0, #1 \n" >>> "CMP r0, #0 \n" >>> "BNE loop " >>> ); >>> } >>> >> >> Someone should teach the Atmel folk about gcc inline assembly... > > What is wrong with that code? >
That is written in the old "basic" format for inline assembly, which for most purposes has been replaced by the "extended" format. As it says in the manual: "Using extended asm (see Extended Asm) typically produces smaller, safer, and more efficient code, and in most cases it is a better solution than basic asm." This would have been better written: __attribute__((section(".ramfunc")) void portable_delay_cycles(unsigned long n) { asm volatile ( " dmb\n" " 1:\n" " sub %[n], %[n], #1\n" " cmp %[n], #0\n" " bne 1b" : [n] "+r" (n) :: "cc"); } I would actually prefer such functions to be declared "static inline", and not as a separately compiled function in a different segment, to avoid function call overhead in the delay calculation. But I appreciate the thought of putting it into ram like this. There is very little point in specifying an optimisation attribute for a function containing nothing but assembly! When written as basic assembly, the compiler has fewer opportunities to do much with the code. Extended assembly gives the compiler details of exactly what you need, and exactly what you are doing. If the compiler can move the code around (such as for a "static inline" function, or if you have link-time optimisation enabled), then the compiler is able to generate better code. With the basic assembly, the compiler must assume that the assembly code will corrupt the volatile registers r0..r3, lr and the condition codes. With extended assembly, the compiler knows exactly which registers are in use - and it does not have to pick r0 for the loop counter. It also means the compiler knows exactly how "n" is going to be used. Also, there is no need to put a "data memory barrier" /inside/ the loop! I don't quite see how Atmel count 7 clocks per cycle - I see 6 myself. But putting the DMB outside the loop reduces it by two clocks per cycle, thus increasing the resolution. Finally, they might like to note that: void portable_delay_cycles(unsigned long n) { while (n--) { asm volatile (""); } } gives a faster loop, as well as actually being portable! The best code for accurate delays is usually made by having static inline functions, so that the compiler can calculate and compensate for loop entry/exit directly, and use whatever registers make most sense for efficient code.
Reply by pozz March 24, 20172017-03-24
Il 24/03/2017 12:57, David Brown ha scritto:
> On 24/03/17 12:23, pozz wrote: >> Il 23/02/2017 17:15, pozz ha scritto: >>> During startup, I need a short and not precise delay, before configuring >>> clocks, timers and other peripherals (at startup the CPU runs with >>> internal clock). >>> >>> What do you suggest? >>> >>> I think there's a simpler method than configuring a hardware timer. >>> >>> I need to check the status of an input pin, *after* enabling internal >>> pull-up. I'd like to introduce a short delay after enabling internal >>> pull-up, otherwise there's a risk I will read a transient level (maybe 0 >>> or 1). >> >> I found some interesting code in Atmel Software Framework. It is based >> on the following function defined in delay.c file: >> >> __attribute__((optimize("-Os"))) >> __attribute__ ((section(".ramfunc"))) >> void portable_delay_cycles(unsigned long n) >> { >> (void)n; >> >> __asm ( >> "loop: DMB \n" >> "SUB r0, r0, #1 \n" >> "CMP r0, #0 \n" >> "BNE loop " >> ); >> } >> > > Someone should teach the Atmel folk about gcc inline assembly...
What is wrong with that code?
Reply by David Brown March 24, 20172017-03-24
On 24/03/17 12:23, pozz wrote:
> Il 23/02/2017 17:15, pozz ha scritto: >> During startup, I need a short and not precise delay, before configuring >> clocks, timers and other peripherals (at startup the CPU runs with >> internal clock). >> >> What do you suggest? >> >> I think there's a simpler method than configuring a hardware timer. >> >> I need to check the status of an input pin, *after* enabling internal >> pull-up. I'd like to introduce a short delay after enabling internal >> pull-up, otherwise there's a risk I will read a transient level (maybe 0 >> or 1). > > I found some interesting code in Atmel Software Framework. It is based > on the following function defined in delay.c file: > > __attribute__((optimize("-Os"))) > __attribute__ ((section(".ramfunc"))) > void portable_delay_cycles(unsigned long n) > { > (void)n; > > __asm ( > "loop: DMB \n" > "SUB r0, r0, #1 \n" > "CMP r0, #0 \n" > "BNE loop " > ); > } >
Someone should teach the Atmel folk about gcc inline assembly... They might also realise that "written in AVR32 assembly" and "portable" don't really go together!
> > In delay.h there are some macros: > > void portable_delay_cycles(unsigned long n); > > #define cpu_ms_2_cy(ms, f_cpu) \ > (((uint64_t)(ms) * (f_cpu) + (uint64_t)(7e3-1ul)) / (uint64_t)7e3) > #define cpu_us_2_cy(us, f_cpu) \ > (((uint64_t)(us) * (f_cpu) + (uint64_t)(7e6-1ul)) / (uint64_t)7e6) > > #define delay_cycles portable_delay_cycles > > #define cpu_delay_s(delay) delay_cycles(cpu_ms_2_cy(1000 * delay, F_CPU)) > #define cpu_delay_ms(delay) delay_cycles(cpu_ms_2_cy(delay, F_CPU)) > #define cpu_delay_us(delay) delay_cycles(cpu_us_2_cy(delay, F_CPU)) > > > However I didn't measure how precise are the delays generated by those > functions.
For a multi-MHz cpu clock, they should certainly be better than �s precision. The point of the function is to put the code in ram, so that its timing is not dependent on things like flash access delays. Of course, if interrupts are enabled, it's a different matter.
Reply by pozz March 24, 20172017-03-24
Il 23/02/2017 17:15, pozz ha scritto:
> During startup, I need a short and not precise delay, before configuring > clocks, timers and other peripherals (at startup the CPU runs with > internal clock). > > What do you suggest? > > I think there's a simpler method than configuring a hardware timer. > > I need to check the status of an input pin, *after* enabling internal > pull-up. I'd like to introduce a short delay after enabling internal > pull-up, otherwise there's a risk I will read a transient level (maybe 0 > or 1).
I found some interesting code in Atmel Software Framework. It is based on the following function defined in delay.c file: __attribute__((optimize("-Os"))) __attribute__ ((section(".ramfunc"))) void portable_delay_cycles(unsigned long n) { (void)n; __asm ( "loop: DMB \n" "SUB r0, r0, #1 \n" "CMP r0, #0 \n" "BNE loop " ); } In delay.h there are some macros: void portable_delay_cycles(unsigned long n); #define cpu_ms_2_cy(ms, f_cpu) \ (((uint64_t)(ms) * (f_cpu) + (uint64_t)(7e3-1ul)) / (uint64_t)7e3) #define cpu_us_2_cy(us, f_cpu) \ (((uint64_t)(us) * (f_cpu) + (uint64_t)(7e6-1ul)) / (uint64_t)7e6) #define delay_cycles portable_delay_cycles #define cpu_delay_s(delay) delay_cycles(cpu_ms_2_cy(1000 * delay, F_CPU)) #define cpu_delay_ms(delay) delay_cycles(cpu_ms_2_cy(delay, F_CPU)) #define cpu_delay_us(delay) delay_cycles(cpu_us_2_cy(delay, F_CPU)) However I didn't measure how precise are the delays generated by those functions.
Reply by mrfirmware March 23, 20172017-03-23
On Thursday, March 23, 2017 at 1:54:41 PM UTC-4, Tim Wescott wrote:
> On Thu, 23 Mar 2017 10:05:51 +0100, David Brown wrote: > > > Turning on -O2 can never make correct code (in the C sense) into > > incorrect code - it just asks the compiler to work harder to > > generate more efficient code that does exactly the same thing as > > before. > > Just wanted to double down on that -- if bumping up the optimization > level screws the pooch, then your code is at fault. > > -- > > Tim Wescott > Wescott Design Services > http://www.wescottdesign.com > > I'm looking for work -- see my website!
Yeah, in retrospect what was I thinking? I guess this is why I just read or write the port pin multiple times to sample reliably/create a pulse. Or snapshot a free running hw timer and wait for it to exceed some offset from the snapshot time.
Reply by Tim Wescott March 23, 20172017-03-23
On Thu, 23 Mar 2017 10:05:51 +0100, David Brown wrote:

> Turning on -O2 can never make correct code (in the C sense) into > incorrect code - it just asks the compiler to work harder to > generate more efficient code that does exactly the same thing as > before.
Just wanted to double down on that -- if bumping up the optimization level screws the pooch, then your code is at fault. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
Reply by David Brown March 23, 20172017-03-23
On 22/03/17 17:48, mrfirmware wrote:
> On Wednesday, March 22, 2017 at 12:29:29 PM UTC-4, mrfirmware wrote: >> On Thursday, February 23, 2017 at 12:00:23 PM UTC-5, John Speth wrote: >>> On 2/23/2017 8:15 AM, pozz wrote: >>>> During startup, I need a short and not precise delay, before configuring >>>> clocks, timers and other peripherals (at startup the CPU runs with >>>> internal clock). >>>> >>>> What do you suggest? >>>> >>>> I think there's a simpler method than configuring a hardware timer. >>>> >>>> I need to check the status of an input pin, *after* enabling internal >>>> pull-up. I'd like to introduce a short delay after enabling internal >>>> pull-up, otherwise there's a risk I will read a transient level (maybe 0 >>>> or 1). >>> >>> First thing after main() starts, configure your pin and run a spin loop >>> based delay, then read the pin. There's probably no need for a timer at >>> that stage of start up. >>> >>> void delay(void) >>> { >>> // Use volatile so the optimizer will not nullify this code >>> volatile int i; >>> for(i = 0; i < YOUR_DELAY; i++); >>> } >>> >>> JJS >> >> How is it legal for the compiler to optimize out the iterator? If it >> does, your compiler is broken. Don't confuse <optimized out> from gdb >> with "I removed your code." That for loop will execute because you told >> the compiler to loop for YOUR_DELAY counts. It must emit that code even >> if 'i' is not visible in the debugger. Volatile is only needed when the >> compiler cannot see any possible path for the variable to change with in >> the current scope of execution, e.g. ISR, another thread, or a hardware >> mapped address. >> >> As for a delay in this case I'd read that GPIO port N times to create >> the delay. The port read will have a longer execution duration and you >> may be able to empirically determine how many reads it takes before it >> stabilizes. Often times port reads can be rather CPU clock independent >> and depend more on the port peripheral clock so you may even be able to >> get reliable timing regardless of the CPU clock. > > Gah! I forgot about optimizers. Yes the -O2 flag will remove the for > loop. It's not correct C though to do that. The port read is immune to > optimizer magic. >
It is /entirely/ correct for the compiler to remove a delay loop like this if there is no "volatile" involved. For C, there are certain "observable behaviours" in a program. These are: 1. Accesses to volatile objects (reads or writes). Ordering is important, but timing is not. 2. Program start and exit. 3. Input and output of "interactive devices" through <stdio.h> functions like printf(), fread() and fwrite(). 4. The data written to files, at program termination. 5. Any function calls where the functions might do one of the above four things. The compiler has to generate code that produces the same results with respect to these "observable behaviours" as the "C abstract machine" does. And that is /it/. The compiler is free to do anything it wants, as long as these rules are followed. For embedded systems, this usually boils down to just "program start" and "volatile accesses". Even when you use "printf", the key point is that this will eventually use volatile accesses to send data to UART hardware. And function calls are considered "observable behaviour" if the compiler does not know for sure that they involve no volatile accesses. A loop that simple counts up does not access anything volatile - the compiler can freely remove it. The compiler can shuffle things around any way it wants, as long as it does not break the ordering of volatile accesses. It can remove or simplify any code it wants, as long as the results are that the same data is written out to volatile objects and the same reads are made of volatile objects. It can do this regardless of any optimisation settings. Using -O0 or no optimisation flags does /not/ turn off optimisations - it merely tells the compiler not to work too hard on optimising. Turning on -O2 can never make correct code (in the C sense) into incorrect code - it just asks the compiler to work harder to generate more efficient code that does exactly the same thing as before. And note that C has no concept of time - there is no way in C to express a delay, or to suggest that some code has to be faster or slower than other code. There is only ordering of volatiles and the expectation that a decent compiler won't insert extra slow code.
Reply by Hans-Bernhard Bröker March 22, 20172017-03-22
Am 22.03.2017 um 17:48 schrieb mrfirmware:

> Gah! I forgot about optimizers. Yes the -O2 flag will remove the for > loop. It's not correct C though to do that.
Actually in the case at hand, it can still be considered correct, depending how you interpret some rather involved wording in the standard.
> The port read is immune to optimizer magic.
If there were one in the code under consideration, it might be. But there isn't, so the entire loop is fair game.