EmbeddedRelated.com
Forums

Code size reduction migrating from PIC18 to Cortex M0

Started by Kvik May 24, 2012
On Jun 20, 1:23 pm, FreeRTOS info <noem...@given.com> wrote:
> On 20/06/2012 18:07, peter_gotka...@supergreatmail.com wrote:
> >
> > I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> > believe is just a superset of the M0) and the code size actually
> > INCREASED from about 62K to 129K.
>
> Did you look at the map file to see why? If using GCC, did you set the
> compile options to remove dead code (most linkers will do it
> automatically). If using GCC, did you avoid using libraries that were
> written for a much larger class of processor?
I actually spent a lot of time looking at the map file. The total "overhead" including the vector table, C startup, and the two library functions that I actually use (printf and memcpy) is around 2K, the rest is all my code.
> "The Cortex-M3 processor has a feature known as "bit-banding". This
> allows an individual bit in a memory-mapped mailbox or peripheral
> register to be set/cleared by a single store/load instruction to an
> bit-band aliased memory address, rather than using a conventional
> read/modify/write instruction sequence."
I've used bit banding in some spots, and although it's great for RAM variables it's not very elegant for the peripheral registers, since the header files define bits by mask value, not position, whereas any bit-banding C macro that I could come up with would require the bit position, not the mask. So while TIM1->CCER |= TIM_CCER_CC4E clearly enables the timer's CC4 output, BITBANDSET(TIM1->CCER, 12) is less intuitive. Instead of a macro I thought about making an ASM inline function that would use the CLZ instruction to do this efficiently, but for some reason gcc didn't want to inline any of my functions (C or ASM) in debug mode, so I just gave up at that point.
In article <aec90865-670b-4bc5-a27d-
04d6ff608e4d@d17g2000vbv.googlegroups.com>, 
peter_gotkatov@supergreatmail.com says...
>
> On May 24, 5:32 pm, Kvik <klaus.kragel...@gmail.com> wrote:
> > Hi
> >
> > We are digging deeper into the Cortex M0 processor versus a PIC18.
> >
> > Seemingly objective material (Coremark data) at page 32 of:
> >
> > http://ics.nxp.com/literature/presentations/microcontrollers/pdf/cort...
> >
> > List a reduction in code size from PIC18 to M0 by a factor 2.
> >
> > But, anyone with a real-life experience of the possible code size
> > reduction?
> >
> > Thanks
> >
> > Klaus
>
> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> believe is just a superset of the M0) and the code size actually
> INCREASED from about 62K to 129K. This was just plain C code without
> any processor-specific optimizations or tricks that was just cut &
> pasted from one compiler to the other. While the Cortex does get
> better density on things like 32 X 32 multiplies or divides it suffers
> horribly on simple control structures.
>
> For example, clearing a timer interrupt flag:
>
> On the PIC18 this takes 2 bytes:
> PIR1 &= ~TMR1IF;
> 2108:  BCF F9E.0
>
> On the Cortex M3 it takes 40 bytes:
> TIM1->SR &= ~TIM_SR_UIF;
> F6424200  movw r2, #0x2C00
> F2C40201  movt r2, #0x4001
> F6424300  movw r3, #0x2C00
> F2C40301  movt r3, #0x4001
> 8A1B      ldrh r3, [r3, #16]
> B29B      uxth r3, r3
> 4619      mov r1, r3
> F64F73FE  movw r3, #0xFFFE
> F2C00300  movt r3, #0
> EA010303  and.w r3, r1, r3
> 4619      mov r1, r3
> 460B      mov r3, r1
> 8213      strh r3, [r2, #16]
>
> A simple countdown:
>
> On the PIC18 it takes 6 bytes:
> if (--timeout) return;
> 210A:  DECF x3B,F
> 210C:  BZ 2110
> 210E:  BRA 2114
>
> On the Cortex M3 it takes 40 bytes:
> if (--timeout) return;
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> F10333FF  add.w r3, r3, #0xFFFFFFFF
> B2DA      uxtb r2, r3
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 735A      strb r2, [r3, #13]
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> 2B00      cmp r3, #0
> D128      bne 0x08000F92
>
> This may not be a very fair comparison since both compilers (CCS for
> the PIC and gcc for the Cortex) are set to non-optimized mode but even
> when gcc is set to optimize it only drops from 129K down to 104K which
> is not much of a savings and still worse than the PIC18. When I first
> started this exercise I was quite disappointed by the poor density so
> I tried a simple exercise: I took one single C function that had more
> than doubled in size and re-wrote it so as to take advantage of the
> Cortex strengths. I made heavy use of 32-bit variables, careful use of
> the "register" keyword, always accessing global variables through a
> pointer, combining bit shifts with other arithmetic operations, using
> bit-banding for IO registers wherever possible, etc. In the end I
> managed to get it down to almost half its size, but still couldn't
> match the PIC18.
>
> Perhaps the final answer depends on what kind of application you're
> writing. In my case it's very IO intensive with a lot of peripherals
> being used and a simple touchscreen UI with very little math involved.
> Perhaps the Cortex was not the best choice here.
I think there are two things going on here:

1. The GCC compiler isn't very good at producing compact code. I tried
one of your examples on IAR EW-ARM with optimization set to low (my
usual default):

    119          TIM1->SR &= TIM_SR_UIF;
    \   00000010 0x....         LDR.N  R0,??DataTable6_6  ;; 0x40010010
    \   00000012 0x8800         LDRH   R0,[R0, #+0]
    \   00000014 0xF010 0x0001  ANDS   R0,R0,#0x1
    \   00000018 0x....         LDR.N  R1,??DataTable6_6  ;; 0x40010010
    \   0000001A 0x8008         STRH   R0,[R1, #+0]

That's just 10 bytes, 4 times better than the GCC result.

2. Cortex IO registers may be 16 or 32 bits, and there are enough of
them that you need 32-bit pointers to get at them. Loading those
pointers is going to take more code. I suspect that the IAR compiler
would reduce the code expansion to about a factor of 1.5.

Since a lot of Cortex MCUs have up to 1MB of flash while the PIC18 maxes
out at 128KB, the ratio of program size to available flash may be better
on the Cortex than on the PIC18.

Mark Borgerson
On 20/06/2012 19:07, peter_gotkatov@supergreatmail.com wrote:
> On May 24, 5:32 pm, Kvik <klaus.kragel...@gmail.com> wrote:
>> Hi
>>
>> We are digging deeper into the Cortex M0 processor versus a PIC18.
>>
>> Seemingly objective material (Coremark data) at page 32 of:
>>
>> http://ics.nxp.com/literature/presentations/microcontrollers/pdf/cort...
>>
>> List a reduction in code size from PIC18 to M0 by a factor 2.
>>
>> But, anyone with a real-life experience of the possible code size
>> reduction?
>>
>> Thanks
>>
>> Klaus
>
> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> believe is just a superset of the M0) and the code size actually
> INCREASED from about 62K to 129K. This was just plain C code without
> any processor-specific optimizations or tricks that was just cut &
> pasted from one compiler to the other. While the Cortex does get
> better density on things like 32 X 32 multiplies or divides it suffers
> horribly on simple control structures.
>
> For example, clearing a timer interrupt flag:
>
> On the PIC18 this takes 2 bytes:
> PIR1 &= ~TMR1IF;
> 2108:  BCF F9E.0
>
> On the Cortex M3 it takes 40 bytes:
> TIM1->SR &= ~TIM_SR_UIF;
> F6424200  movw r2, #0x2C00
> F2C40201  movt r2, #0x4001
> F6424300  movw r3, #0x2C00
> F2C40301  movt r3, #0x4001
> 8A1B      ldrh r3, [r3, #16]
> B29B      uxth r3, r3
> 4619      mov r1, r3
> F64F73FE  movw r3, #0xFFFE
> F2C00300  movt r3, #0
> EA010303  and.w r3, r1, r3
> 4619      mov r1, r3
> 460B      mov r3, r1
> 8213      strh r3, [r2, #16]
>
> A simple countdown:
>
> On the PIC18 it takes 6 bytes:
> if (--timeout) return;
> 210A:  DECF x3B,F
> 210C:  BZ 2110
> 210E:  BRA 2114
>
> On the Cortex M3 it takes 40 bytes:
> if (--timeout) return;
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> F10333FF  add.w r3, r3, #0xFFFFFFFF
> B2DA      uxtb r2, r3
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 735A      strb r2, [r3, #13]
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> 2B00      cmp r3, #0
> D128      bne 0x08000F92
>
> This may not be a very fair comparison since both compilers (CCS for
> the PIC and gcc for the Cortex) are set to non-optimized mode but even
> when gcc is set to optimize it only drops from 129K down to 104K which
> is not much of a savings and still worse than the PIC18. When I first
> started this exercise I was quite disappointed by the poor density so
> I tried a simple exercise: I took one single C function that had more
> than doubled in size and re-wrote it so as to take advantage of the
> Cortex strengths. I made heavy use of 32-bit variables, careful use of
> the "register" keyword, always accessing global variables through a
> pointer, combining bit shifts with other arithmetic operations, using
> bit-banding for IO registers wherever possible, etc. In the end I
> managed to get it down to almost half its size, but still couldn't
> match the PIC18.
>
> Perhaps the final answer depends on what kind of application you're
> writing. In my case it's very IO intensive with a lot of peripherals
> being used and a simple touchscreen UI with very little math involved.
> Perhaps the Cortex was not the best choice here.
>
Saying you use a compiler but don't enable optimisation, then
complaining about the code generated, is like saying you drive a car but
never bother changing out of first gear and then complaining about the
lack of speed.

When you say you tried using the "register" keyword, I have to assume
you learned C from a 30 year old book. One thing that is worth learning
about modern toolchains (for the PIC, the Cortex, or whatever) is that
they generate better code from well-written C using a clear, modern
style, and using appropriate command-line switches. Don't try and
second-guess your tools by adding irrelevant keywords (like "register")
or "hand-optimising" by using extra pointers. Learn to use the tools
properly, then let them do their job.

When you say you use "gcc", which version? There are still some people
using ancient versions of gcc which were very poor for ARM code (which
has led to a long-lasting myth that gcc is bad for ARM).

To test your issues, I compiled this test code:

#include <stdint.h>

typedef struct {
    uint16_t padding[8];
    volatile uint16_t SR;
} TIM_t;

#define TIM1 (((TIM_t*)(0x40012c00)))
#define TIM_SR_UIF 0x0002

#define timeout (*((uint8_t*)(0x20000060)))

void test2(void) {
    if (--timeout) return;
    TIM1->SR &= ~TIM_SR_UIF;
}

I used gcc 4.6.1 (from CodeSourcery Lite version 2011.09-69), with flags
"-mcpu=cortex-m3 -mthumb -S". Even with no optimisation, I am failing to
generate code quite as bad as you have. With -Os (which is the norm for
embedded systems), I get:

test2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr   r2, .L3
        ldrb  r1, [r2, #0]  @ zero_extendqisi2
        subs  r0, r1, #1
        uxtb  r3, r0
        strb  r3, [r2, #0]
        cbnz  r3, .L1
        ldr   r2, .L3+4
        ldrh  ip, [r2, #16]
        bic   r1, ip, #2
        lsls  r0, r1, #16
        lsrs  r3, r0, #16
        strh  r3, [r2, #16]  @ movhi
.L1:
        bx    lr
.L4:
        .align 2
.L3:
        .word 536871008
        .word 1073818624

Real-world code will be even better, as the compiler can re-use base
pointers and otherwise optimise larger code sections.
On 20/06/2012 21:21, peter_gotkatov@supergreatmail.com wrote:
> On Jun 20, 1:23 pm, FreeRTOS info <noem...@given.com> wrote:
>> On 20/06/2012 18:07, peter_gotka...@supergreatmail.com wrote:
>>
>>> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
>>> believe is just a superset of the M0) and the code size actually
>>> INCREASED from about 62K to 129K.
>>
>> Did you look at the map file to see why? If using GCC, did you set the
>> compile options to remove dead code (most linkers will do it
>> automatically). If using GCC, did you avoid using libraries that were
>> written for a much larger class of processor?
>
> I actually spent a lot of time looking at the map file. The total
> "overhead" including the vector table, C startup, and the two library
> functions that I actually use (printf and memcpy) is around 2K, the
> rest is all my code.
That is unlikely to be true, but without knowing your code or the map file, there is no way to be sure. It is not a surprise that the code size has increased in moving from the PIC18 - differences here will vary wildly according to the type of code. But it /is/ a surprise that you only have 2K of startup, vector tables, and library code.
>
>> "The Cortex-M3 processor has a feature known as "bit-banding". This
>> allows an individual bit in a memory-mapped mailbox or peripheral
>> register to be set/cleared by a single store/load instruction to an
>> bit-band aliased memory address, rather than using a conventional
>> read/modify/write instruction sequence."
>
> I've used bit banding in some spots, and although it's great for RAM
> variables it's not very elegant for the peripheral registers since the
> header files define bits by value, not position whereas any bitbanding
> C macro that I could come up with would require the bit number, not
> position. So while TIM1->CCER |= TIM_CCER_CC4E clearly enables the
> timer's CC4 output, BITBANDSET(TIM1->CCER, 12) is less intuitive.
Code clarity is more important than code efficiency. But if code efficiency is important, then put such code in little "static inline" functions with appropriate comments.
> Instead of a macro I thought about making an ASM inline function that
> would use the CLZ instruction to do this efficiently but for some
> reason gcc didn't want to inline any of my functions (C or ASM) in
> debug mode so I just gave up at that point.
>
First off, you should not need to resort to assembly to get basic instructions working - the compiler should produce near-optimal code as long as you let it (by enabling optimisations and writing appropriate C code).

Secondly, don't use "ASM functions" - they are normally only needed by more limited compilers. If you need to use assembly with gcc, use gcc's extended "asm" syntax.

Finally, if you are not getting inlining when debugging it is because you have got incorrect compiler switches. You should not have different "debug" and "release" (or "optimised") builds - do a single build with the proper optimisation settings (typically -Os unless you know what you are doing) and "-g" to enable debugging. You never want to be releasing code that is built differently from the code you debugged.
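[Editor's note: the single-build advice above might look like this for an arm-none-eabi gcc toolchain. The flag set is illustrative, and the linker script name `stm32.ld` is a placeholder; `-g` puts debug info in the ELF only, not in the flash image, and `-ffunction-sections`/`--gc-sections` is the dead-code removal mentioned earlier in the thread.]

```shell
# One build used for both debugging and release: optimise for size,
# keep full debug info, let the linker discard unreferenced sections.
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -g -Wall \
    -ffunction-sections -fdata-sections -c main.c -o main.o
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -g \
    -Wl,--gc-sections -T stm32.ld main.o -o app.elf
```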
On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> It is not a surprise that the code
> size has increased in moving from the PIC18 - differences here will vary
> wildly according to the type of code. But it /is/ a surprise that you
> only have 2K of startup, vector tables, and library code.
Not all that surprising, here are the sizes in bytes:

.vectors    304
.init       508
__putchar    40
__vprintf  1498
memcpy       56
> > Instead of a macro I thought about making an ASM inline function that
> > would use the CLZ instruction to do this efficiently but for some
> > reason gcc didn't want to inline any of my functions (C or ASM) in
> > debug mode so I just gave up at that point.
>
> First off, you should not need to resort to assembly to get basic
> instructions working - the compiler should produce near-optimal code as
> long as you let it (by enabling optimisations and writing appropriate C
> code).
I've tried several ways of writing a countleadingzeroes() function that would use the Cortex CLZ instruction, but even with optimization turned on it still wouldn't do it.
> Secondly, don't use "ASM functions" - they are normally only needed by
> more limited compilers. If you need to use assembly with gcc, use gcc's
> extended "asm" syntax.
There are some things like the bootloader that need to be ASM functions in their own separate .S file anyway since they need to copy portions of themselves to RAM in order to execute. But a bootloader is a special case and I do agree that normal code shouldn't need to rely on ASM functions. I must say I'm not familiar with gcc's extended asm syntax and although I did look at it briefly it seemed like it was more complicated than a plain old .S file and it was mostly geared towards mixing C and ASM together in the same function and accessing variables by name etc. Not something I needed for a simple bootloader.
> Finally, if you are not getting inlining when debugging it is because
> you have got incorrect compiler switches. You should not have different
> "debug" and "release" (or "optimised") builds - do a single build with
> the proper optimisation settings (typically -Os unless you know what you
> are doing) and "-g" to enable debugging. You never want to be releasing
> code that is built differently from the code you debugged.
I was fighting with this for a while when I was first handed this
toolchain, and it seems that in debug mode there is no -O switch at all,
and in release mode it defaults to -O1. When I change this to -Os it
does produce the same code as the sample that you posted above from gcc
4.6.1 (mine is 4.4.4 by the way). However, even with manually adding the
-g switch I still don't get source annotation in the ELF file unless I
use debug mode. This effectively limits any development/debugging to
unoptimized code, which still has to fit into the 256K somehow.

As for using register keywords and accessing globals through pointers, I
normally don't do this (haven't used the register keyword in years) and
I certainly wouldn't be doing it at all if it didn't have such a
significant effect on the code size:

unsigned long a,b;

void test(void) {
B4B0      push {r4-r5, r7}
AF00      add r7, sp, #0
----------------------------------------
    register unsigned long x, y;
    a=b+5;
F2402314  movw r3, #0x214
F2C20300  movt r3, #0x2000
681B      ldr r3, [r3, #0]
F1030205  add.w r2, r3, #5
F240231C  movw r3, #0x21C
F2C20300  movt r3, #0x2000
601A      str r2, [r3, #0]
----------------------------------------
    x=y+5;
F1040505  add.w r5, r4, #5
----------------------------------------
}
46BD      mov sp, r7
BCB0      pop {r4-r5, r7}
4770      bx lr
BF00      nop
On 06/21/2012 05:32 PM, peter_gotkatov@supergreatmail.com wrote:

> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
The difference is due to the fact that a and b are global while x and y are local. Try removing the 'register' keyword but leaving everything else the same. You should get the same code (assuming optimization is enabled).
On 21/06/12 17:32, peter_gotkatov@supergreatmail.com wrote:
> On Jun 21, 4:09 am, David Brown<da...@westcontrol.removethisbit.com>
> wrote:
>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>> It is not a surprise that the code
>> size has increased in moving from the PIC18 - differences here will vary
>> wildly according to the type of code. But it /is/ a surprise that you
>> only have 2K of startup, vector tables, and library code.
>
> Not all that surprising, here are the sizes in bytes:
> .vectors    304
> .init       508
> __putchar    40
> __vprintf  1498
> memcpy       56
>
I still think it is surprising, because these library functions often pull in other library code (such as for floating point support), and quite often there is library code for small "helper" functions. But it depends on the configuration, and what is in the rest of your source code.
>>> Instead of a macro I thought about making an ASM inline function that
>>> would use the CLZ instruction to do this efficiently but for some
>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>> debug mode so I just gave up at that point.
>>
>> First off, you should not need to resort to assembly to get basic
>> instructions working - the compiler should produce near-optimal code as
>> long as you let it (by enabling optimisations and writing appropriate C
>> code).
>
> I've tried several ways of writing a countleadingzeroes() function that
> would use the Cortex CLZ instruction but even with optimization turned
> on it still wouldn't do it.
>
Did you try using the "__builtin_clz()" function described in the gcc manual? <http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html>
>> Secondly, don't use "ASM functions" - they are normally only needed by
>> more limited compilers. If you need to use assembly with gcc, use gcc's
>> extended "asm" syntax.
>
> There are some things like the bootloader that need to be ASM
> functions in their own separate .S file anyway since they need to copy
> portions of themselves to RAM in order to execute. But a bootloader is
> a special case and I do agree that normal code shouldn't need to rely
> on ASM functions.
I have written bootloaders for several microcontrollers (though not for a Cortex). I write them in C. I have written startup code for several microcontrollers, handling the setup of the stack, memory, the C environment, clearing bss, copying constants, etc. I write such code in C. You can't avoid writing two or three of the instructions in assembly, but usually it's not more than that. It's certainly not enough to bother with separate .S files - normally not even individual assembly functions (though sometimes I've used "naked" C functions as wrappers for a few lines of pure assembly). Usually I write startup code when I am unhappy with the code supplied by the toolchain vendor - which is invariably written in assembly. Re-writing it in C gives code that is far clearer, and smaller and faster (sometimes many times faster).
> I must say I'm not familiar with gcc's extended asm
> syntax and although I did look at it briefly it seemed like it was
> more complicated than a plain old .S file and it was mostly geared
> towards mixing C and ASM together in the same function and accessing
> variables by name etc. Not something I needed for a simple bootloader.
It is aimed at mixing C and assembly, yes. It lets you do the minimal work in assembly, while letting the compiler handle as much as possible, including optimising around your assembly code. Let the compiler do the things it is good at.
>
>> Finally, if you are not getting inlining when debugging it is because
>> you have got incorrect compiler switches. You should not have different
>> "debug" and "release" (or "optimised") builds - do a single build with
>> the proper optimisation settings (typically -Os unless you know what you
>> are doing) and "-g" to enable debugging. You never want to be releasing
>> code that is built differently from the code you debugged.
>
> I was fighting with this for a while when I was first handed this
> toolchain, and it seems that in debug mode, there is no -O switch at
> all and in release mode it defaults to -O1. When I change this -Os it
> does produce the same code as the sample that you posted above from
> gcc 4.6.1 (mine is 4.4.4 by the way). However even with manually
> adding the -g switch I still don't get source annotation in the ELF
> file unless I use debug mode. This effectively limits any development/
> debugging to unoptimized code, which still has to fit into the 256K
> somehow.
This is some limitation or misunderstanding of your IDE or other tools, not gcc. Most likely it is a misunderstanding rather than a limitation, but without knowing your particular toolchain it is hard to give specific help.

Most serious developers use something like -Os (which is -O2 with an emphasis on size). The only reason to use -O1 is if you have a very slow computer and a very large code base, as it is faster than -Os/-O2, or very occasionally in testing or debugging. The only reason to use no optimisation is because you don't understand your tools.

And sometimes it is useful to use higher optimisations, or enable specific optimisations, because of particular effects. They don't tend to have much effect on most code, but can make a big difference to particular parts (perhaps unrolling a loop, or re-arranging nested loops to fit cache line sizes, etc.). I tend to use "optimize" function attributes or pragmas for such special cases.
>
> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
>
The "register" keyword has always been ignored in gcc except in -O0 mode, unless of course you are using the extended syntax to specify a particular register.

<http://gcc.gnu.org/ml/gcc/2010-05/msg00113.html>

You are seeing a difference in the code because one set of variables is global, and must be accessed externally, while the other set is local and uses registers. And if you had enabled optimisations, the "x = y + 5;" would have been eliminated entirely because it has no effect. And if you had enabled warnings, as you always should, the compiler would have complained about the code using variables before they are initialised, and about setting a variable that has no effect.

mvh.,

David
In article <9b55cce9-96db-46f4-909a-1f6500deb237
@j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>
> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
> wrote:
> > On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> > It is not a surprise that the code
> > size has increased in moving from the PIC18 - differences here will vary
> > wildly according to the type of code. But it /is/ a surprise that you
> > only have 2K of startup, vector tables, and library code.
>
> Not all that surprising, here are the sizes in bytes:
> .vectors    304
> .init       508
> __putchar    40
> __vprintf  1498
> memcpy       56
>
> > > Instead of a macro I thought about making an ASM inline function that
> > > would use the CLZ instruction to do this efficiently but for some
> > > reason gcc didn't want to inline any of my functions (C or ASM) in
> > > debug mode so I just gave up at that point.
> >
> > First off, you should not need to resort to assembly to get basic
> > instructions working - the compiler should produce near-optimal code as
> > long as you let it (by enabling optimisations and writing appropriate C
> > code).
>
> I've tried several ways of writing a countleadingzeroes() function that
> would use the Cortex CLZ instruction but even with optimization turned
> on it still wouldn't do it.
>
> > Secondly, don't use "ASM functions" - they are normally only needed by
> > more limited compilers. If you need to use assembly with gcc, use gcc's
> > extended "asm" syntax.
>
> There are some things like the bootloader that need to be ASM
> functions in their own separate .S file anyway since they need to copy
> portions of themselves to RAM in order to execute. But a bootloader is
> a special case and I do agree that normal code shouldn't need to rely
> on ASM functions. I must say I'm not familiar with gcc's extended asm
> syntax and although I did look at it briefly it seemed like it was
> more complicated than a plain old .S file and it was mostly geared
> towards mixing C and ASM together in the same function and accessing
> variables by name etc. Not something I needed for a simple bootloader.
>
> > Finally, if you are not getting inlining when debugging it is because
> > you have got incorrect compiler switches. You should not have different
> > "debug" and "release" (or "optimised") builds - do a single build with
> > the proper optimisation settings (typically -Os unless you know what you
> > are doing) and "-g" to enable debugging. You never want to be releasing
> > code that is built differently from the code you debugged.
>
> I was fighting with this for a while when I was first handed this
> toolchain, and it seems that in debug mode, there is no -O switch at
> all and in release mode it defaults to -O1. When I change this -Os it
> does produce the same code as the sample that you posted above from
> gcc 4.6.1 (mine is 4.4.4 by the way). However even with manually
> adding the -g switch I still don't get source annotation in the ELF
> file unless I use debug mode. This effectively limits any development/
> debugging to unoptimized code, which still has to fit into the 256K
> somehow.
>
> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
(36 bytes of code)

I'd be surprised that your compiler did not complain about x and y not
being initialized before the addition. EW ARM warned me that y was used
before its value was set and that x was set but never used.

Here is the code it gave me, with some extra comments added afterwards:

    111  void test(void){
    112    register unsigned long x,y;
    113    a = b+5;
    \  test:
    \  00000000 0x....  LDR.N  R1,??DataTable8_6
    \  00000002 0x6809  LDR    R1,[R1, #+0]
    \  00000004 0x1D49  ADDS   R1,R1,#+5
    \  00000006 0x....  LDR.N  R2,??DataTable8_7
    \  00000008 0x6011  STR    R1,[R2, #+0]
    114    x = y+5;
    \  0000000A 0x1D40  ADDS   R0,R0,#+5   // R0 is y  // sum not saved
    115  }
    \  0000000C 0x4770  BX     LR          ;; return
    116

(14 bytes of code) Apparently, R0, R1, R2 are scratch registers for IAR
and don't need to be saved and restored.

Adding actual initialization to x and y and saving the result in b
produced the following:

    In section .text, align 2, keep-with-next
    110  void test(void){
    111    register unsigned long x=3,y=4;
    \  test:
    \  00000000 0x2003  MOVS   R0,#+3
    \  00000002 0x2104  MOVS   R1,#+4
    112    a = b+5;
    \  00000004 0x....  LDR.N  R2,??DataTable8_6
    \  00000006 0x6812  LDR    R2,[R2, #+0]
    \  00000008 0x1D52  ADDS   R2,R2,#+5
    \  0000000A 0x....  LDR.N  R3,??DataTable8_7
    \  0000000C 0x601A  STR    R2,[R3, #+0]
    113    x = y+5;
    \  0000000E 0x1D49  ADDS   R1,R1,#+5
    \  00000010 0x0008  MOVS   R0,R1       // x = sum
    114    b = x;  // this time save the result
    \  00000012 0x....  LDR.N  R1,??DataTable8_6
    \  00000014 0x6008  STR    R0,[R1, #+0]
    115  }
    \  00000016 0x4770  BX     LR          ;; return
    116

Still accomplished with scratch registers, no need to save any on the
stack. I changed from my default optimization of 'low' to 'none' and got
exactly the same code. Finally, I took out the 'register' keyword before
x and y, and got exactly the same result as above.

It seems that GCC just doesn't match up to IAR at producing compact code
at low optimization levels. OTOH, given that EW_ARM costs several
KBucks, it SHOULD do better!

Mark Borgerson
On 22/06/2012 02:42, Mark Borgerson wrote:
> In article <9b55cce9-96db-46f4-909a-1f6500deb237
> @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>>
>> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
>> wrote:
>>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>>> It is not a surprise that the code
>>> size has increased in moving from the PIC18 - differences here will vary
>>> wildly according to the type of code.  But it /is/ a surprise that you
>>> only have 2K of startup, vector tables, and library code.
>>
>> Not all that surprising, here are the sizes in bytes:
>> .vectors   304
>> .init      508
>> __putchar   40
>> __vprintf 1498
>> memcpy      56
>>
>>>> Instead of a macro I thought about making an ASM inline function that
>>>> would use the CLZ instruction to do this efficiently but for some
>>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>>> debug mode so I just gave up at that point.
>>>
>>> First off, you should not need to resort to assembly to get basic
>>> instructions working - the compiler should produce near-optimal code as
>>> long as you let it (by enabling optimisations and writing appropriate C
>>> code).
>>
>> I've tried several ways of writing a countleadingzeroes() function that
>> would use the Cortex CLZ instruction but even with optimization turned
>> on it still wouldn't do it.
>>
>>> Secondly, don't use "ASM functions" - they are normally only needed by
>>> more limited compilers.  If you need to use assembly with gcc, use gcc's
>>> extended "asm" syntax.
>>
>> There are some things like the bootloader that need to be ASM
>> functions in their own separate .S file anyway since they need to copy
>> portions of themselves to RAM in order to execute.  But a bootloader is
>> a special case and I do agree that normal code shouldn't need to rely
>> on ASM functions.  I must say I'm not familiar with gcc's extended asm
>> syntax and although I did look at it briefly it seemed like it was
>> more complicated than a plain old .S file and it was mostly geared
>> towards mixing C and ASM together in the same function and accessing
>> variables by name etc.  Not something I needed for a simple bootloader.
>>
>>> Finally, if you are not getting inlining when debugging it is because
>>> you have got incorrect compiler switches.  You should not have different
>>> "debug" and "release" (or "optimised") builds - do a single build with
>>> the proper optimisation settings (typically -Os unless you know what you
>>> are doing) and "-g" to enable debugging.  You never want to be releasing
>>> code that is built differently from the code you debugged.
>>
>> I was fighting with this for a while when I was first handed this
>> toolchain, and it seems that in debug mode, there is no -O switch at
>> all and in release mode it defaults to -O1.  When I change this to -Os
>> it does produce the same code as the sample that you posted above from
>> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
>> adding the -g switch I still don't get source annotation in the ELF
>> file unless I use debug mode.  This effectively limits any development/
>> debugging to unoptimized code, which still has to fit into the 256K
>> somehow.
>>
>> As for using register keywords and accessing globals through pointers,
>> I normally don't do this (haven't used the register keyword in years)
>> and I certainly wouldn't be doing it at all if it didn't have such a
>> significant effect on the code size:
>>
>> unsigned long a,b;
>>
>> void test(void) {
>>  B4B0      push   {r4-r5, r7}
>>  AF00      add    r7, sp, #0
>> ----------------------------------------
>>  register unsigned long x, y;
>>  a=b+5;
>>  F2402314  movw   r3, #0x214
>>  F2C20300  movt   r3, #0x2000
>>  681B      ldr    r3, [r3, #0]
>>  F1030205  add.w  r2, r3, #5
>>  F240231C  movw   r3, #0x21C
>>  F2C20300  movt   r3, #0x2000
>>  601A      str    r2, [r3, #0]
>> ----------------------------------------
>>  x=y+5;
>>  F1040505  add.w  r5, r4, #5
>> ----------------------------------------
>> }
>>  46BD      mov    sp, r7
>>  BCB0      pop    {r4-r5, r7}
>>  4770      bx     lr
>>  BF00      nop
>
> (36 bytes of code)
>
> I'd be surprised that your compiler did not complain about x and y not
> being initialized before the addition.  EW ARM warned me that y was used
> before its value was set and that x was set but never used.
>
> Here is the code it gave me----with some extra comments added afterwards
>
>     111          void test(void){
>     112            register unsigned long x,y;
>     113            a = b+5;
>   \                     test:
>   \   00000000   0x....       LDR.N    R1,??DataTable8_6
>   \   00000002   0x6809       LDR      R1,[R1, #+0]
>   \   00000004   0x1D49       ADDS     R1,R1,#+5
>   \   00000006   0x....       LDR.N    R2,??DataTable8_7
>   \   00000008   0x6011       STR      R1,[R2, #+0]
>     114            x = y+5;
>   \   0000000A   0x1D40       ADDS     R0,R0,#+5   // R0 is y
>                                                    // sum not saved
>     115          }
>   \   0000000C   0x4770       BX       LR  ;; return
>     116
> (14 bytes of code)
>
> Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
> be saved and restored.
>
> Adding actual initialization to x and y and saving the result in b
> produced the following:
>
>   In section .text, align 2, keep-with-next
>     110          void test(void){
>     111            register unsigned long x=3,y=4;
>   \                     test:
>   \   00000000   0x2003       MOVS     R0,#+3
>   \   00000002   0x2104       MOVS     R1,#+4
>     112            a = b+5;
>   \   00000004   0x....       LDR.N    R2,??DataTable8_6
>   \   00000006   0x6812       LDR      R2,[R2, #+0]
>   \   00000008   0x1D52       ADDS     R2,R2,#+5
>   \   0000000A   0x....       LDR.N    R3,??DataTable8_7
>   \   0000000C   0x601A       STR      R2,[R3, #+0]
>     113            x = y+5;
>   \   0000000E   0x1D49       ADDS     R1,R1,#+5
>   \   00000010   0x0008       MOVS     R0,R1       // x = sum
>     114            b = x;   // this time save the result
>   \   00000012   0x....       LDR.N    R1,??DataTable8_6
>   \   00000014   0x6008       STR      R0,[R1, #+0]
>     115          }
>   \   00000016   0x4770       BX       LR  ;; return
>     116
>
> Still accomplished with scratch registers----no need to save any on the
> stack.  I changed from my default optimization of 'low' to 'none'
> and got exactly the same code.
>
> Finally, I took out the 'register' key word before x and y----and
> got exactly the same result as above.
>
> It seems that GCC just doesn't match up to IAR at producing compact
> code at low optimization levels.  OTOH, given that EW_ARM costs
> several KBucks, it SHOULD do better!
>
>
The problems here don't lie with the compiler - they lie with the user.
I'm sure that EW_ARM produces better code than gcc (correctly used) in
some cases - but I am also sure that gcc can do better than EW_ARM in
other cases.  I really don't think there is going to be a big difference
in code generation quality - if that's why you paid K$ for EW, you've
probably wasted your money.  There are many reasons for choosing
different toolchains, but generally speaking I don't see a large
difference in code generation quality between the major toolchains
(including gcc) for 32-bit processors.  Occasionally you'll see major
differences in particular kinds of code, but for the most part it is the
user that makes the biggest difference.

One place where EW_ARM might score over the gcc setup this user has (he
hasn't yet said anything about the rest - is it home-made, CodeSourcery,
Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
switches correct, and avoid this "I don't know how to enable debugging
and optimisation" or "what's a warning?" nonsense.

It hardly needs saying, but when run properly, my brief test with gcc
produces the same code here as you get with EW_ARM, and the same
warnings about x and y.

I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
switch to disable "common" sections.  With common sections disabled,
definitions like "unsigned long a, b;" can only appear once in the
program for each global identifier, and the space is allocated directly
in the .bss inside the module that made the definition.  gcc can use
this extra information to take advantage of relative placement between
variables, and generate addressing via section anchors:

Command line:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
-fno-common

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L6
        ldr     r0, [r3, #4]
        adds    r2, r0, #5
        str     r2, [r3, #0]
        bx      lr
.L7:
        .align  2
.L6:
        .word   .LANCHOR0
        .size   test, .-test
        .global b
        .global a
        .bss
        .align  2
        .set    .LANCHOR0,. + 0
        .type   a, %object
        .size   a, 4
a:
        .space  4
        .type   b, %object
        .size   b, 4
b:
        .space  4
        .ident  "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"

It's all about learning to use the tools you have, rather than buying
more expensive tools.

mvh.,

David
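[On the countleadingzeroes() question raised above: GCC has a real builtin, `__builtin_clz`, which on Cortex-M3 should compile to a single CLZ instruction at -O1 and above, with no inline assembly needed. A host-checkable sketch; the wrapper name `clz32` is illustrative, not from the thread. Note that `__builtin_clz(0)` is undefined, so the wrapper handles zero explicitly:]

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value.  __builtin_clz is a GCC/Clang
   builtin; on ARMv7-M it maps to the CLZ instruction when optimizing.
   __builtin_clz(0) is undefined behavior, hence the explicit check. */
static inline uint32_t clz32(uint32_t x)
{
    return x ? (uint32_t)__builtin_clz(x) : 32u;
}

/* Example: the highest set bit of x is then 31 - clz32(x),
   which is the piece the BITBANDSET macro discussion needed. */
```

(The assumption here is a 32-bit `unsigned int`, which holds on ARM EABI and common hosts.)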
In article <svCdnfOCcv-WvHnSnZ2dnUVZ8rednZ2d@lyse.net>, 
david@westcontrol.removethisbit.com says...
>
> On 22/06/2012 02:42, Mark Borgerson wrote:
> > In article <9b55cce9-96db-46f4-909a-1f6500deb237
> > @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
> >>
> >> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
> >> wrote:
> >>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> >>> It is not a surprise that the code
> >>> size has increased in moving from the PIC18 - differences here will vary
> >>> wildly according to the type of code.  But it /is/ a surprise that you
> >>> only have 2K of startup, vector tables, and library code.
> >>
> >> Not all that surprising, here are the sizes in bytes:
> >> .vectors   304
> >> .init      508
> >> __putchar   40
> >> __vprintf 1498
> >> memcpy      56
> >>
> >>>> Instead of a macro I thought about making an ASM inline function that
> >>>> would use the CLZ instruction to do this efficiently but for some
> >>>> reason gcc didn't want to inline any of my functions (C or ASM) in
> >>>> debug mode so I just gave up at that point.
> >>>
> >>> First off, you should not need to resort to assembly to get basic
> >>> instructions working - the compiler should produce near-optimal code as
> >>> long as you let it (by enabling optimisations and writing appropriate C
> >>> code).
> >>
> >> I've tried several ways of writing a countleadingzeroes() function that
> >> would use the Cortex CLZ instruction but even with optimization turned
> >> on it still wouldn't do it.
> >>
> >>> Secondly, don't use "ASM functions" - they are normally only needed by
> >>> more limited compilers.  If you need to use assembly with gcc, use gcc's
> >>> extended "asm" syntax.
> >>
> >> There are some things like the bootloader that need to be ASM
> >> functions in their own separate .S file anyway since they need to copy
> >> portions of themselves to RAM in order to execute.  But a bootloader is
> >> a special case and I do agree that normal code shouldn't need to rely
> >> on ASM functions.  I must say I'm not familiar with gcc's extended asm
> >> syntax and although I did look at it briefly it seemed like it was
> >> more complicated than a plain old .S file and it was mostly geared
> >> towards mixing C and ASM together in the same function and accessing
> >> variables by name etc.  Not something I needed for a simple bootloader.
> >>
> >>> Finally, if you are not getting inlining when debugging it is because
> >>> you have got incorrect compiler switches.  You should not have different
> >>> "debug" and "release" (or "optimised") builds - do a single build with
> >>> the proper optimisation settings (typically -Os unless you know what you
> >>> are doing) and "-g" to enable debugging.  You never want to be releasing
> >>> code that is built differently from the code you debugged.
> >>
> >> I was fighting with this for a while when I was first handed this
> >> toolchain, and it seems that in debug mode, there is no -O switch at
> >> all and in release mode it defaults to -O1.  When I change this to -Os
> >> it does produce the same code as the sample that you posted above from
> >> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
> >> adding the -g switch I still don't get source annotation in the ELF
> >> file unless I use debug mode.  This effectively limits any development/
> >> debugging to unoptimized code, which still has to fit into the 256K
> >> somehow.
> >>
> >> As for using register keywords and accessing globals through pointers,
> >> I normally don't do this (haven't used the register keyword in years)
> >> and I certainly wouldn't be doing it at all if it didn't have such a
> >> significant effect on the code size:
> >>
> >> unsigned long a,b;
> >>
> >> void test(void) {
> >>  B4B0      push   {r4-r5, r7}
> >>  AF00      add    r7, sp, #0
> >> ----------------------------------------
> >>  register unsigned long x, y;
> >>  a=b+5;
> >>  F2402314  movw   r3, #0x214
> >>  F2C20300  movt   r3, #0x2000
> >>  681B      ldr    r3, [r3, #0]
> >>  F1030205  add.w  r2, r3, #5
> >>  F240231C  movw   r3, #0x21C
> >>  F2C20300  movt   r3, #0x2000
> >>  601A      str    r2, [r3, #0]
> >> ----------------------------------------
> >>  x=y+5;
> >>  F1040505  add.w  r5, r4, #5
> >> ----------------------------------------
> >> }
> >>  46BD      mov    sp, r7
> >>  BCB0      pop    {r4-r5, r7}
> >>  4770      bx     lr
> >>  BF00      nop
> >
> > (36 bytes of code)
> >
> > I'd be surprised that your compiler did not complain about x and y not
> > being initialized before the addition.  EW ARM warned me that y was used
> > before its value was set and that x was set but never used.
> >
> > Here is the code it gave me----with some extra comments added afterwards
> >
> >     111          void test(void){
> >     112            register unsigned long x,y;
> >     113            a = b+5;
> >   \                     test:
> >   \   00000000   0x....       LDR.N    R1,??DataTable8_6
> >   \   00000002   0x6809       LDR      R1,[R1, #+0]
> >   \   00000004   0x1D49       ADDS     R1,R1,#+5
> >   \   00000006   0x....       LDR.N    R2,??DataTable8_7
> >   \   00000008   0x6011       STR      R1,[R2, #+0]
> >     114            x = y+5;
> >   \   0000000A   0x1D40       ADDS     R0,R0,#+5   // R0 is y
> >                                                    // sum not saved
> >     115          }
> >   \   0000000C   0x4770       BX       LR  ;; return
> >     116
> > (14 bytes of code)
> >
> > Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
> > be saved and restored.
> >
> > Adding actual initialization to x and y and saving the result in b
> > produced the following:
> >
> >   In section .text, align 2, keep-with-next
> >     110          void test(void){
> >     111            register unsigned long x=3,y=4;
> >   \                     test:
> >   \   00000000   0x2003       MOVS     R0,#+3
> >   \   00000002   0x2104       MOVS     R1,#+4
> >     112            a = b+5;
> >   \   00000004   0x....       LDR.N    R2,??DataTable8_6
> >   \   00000006   0x6812       LDR      R2,[R2, #+0]
> >   \   00000008   0x1D52       ADDS     R2,R2,#+5
> >   \   0000000A   0x....       LDR.N    R3,??DataTable8_7
> >   \   0000000C   0x601A       STR      R2,[R3, #+0]
> >     113            x = y+5;
> >   \   0000000E   0x1D49       ADDS     R1,R1,#+5
> >   \   00000010   0x0008       MOVS     R0,R1       // x = sum
> >     114            b = x;   // this time save the result
> >   \   00000012   0x....       LDR.N    R1,??DataTable8_6
> >   \   00000014   0x6008       STR      R0,[R1, #+0]
> >     115          }
> >   \   00000016   0x4770       BX       LR  ;; return
> >     116
> >
> > Still accomplished with scratch registers----no need to save any on the
> > stack.  I changed from my default optimization of 'low' to 'none'
> > and got exactly the same code.
> >
> > Finally, I took out the 'register' key word before x and y----and
> > got exactly the same result as above.
> >
> > It seems that GCC just doesn't match up to IAR at producing compact
> > code at low optimization levels.  OTOH, given that EW_ARM costs
> > several KBucks, it SHOULD do better!
> >
> >
>
> The problems here don't lie with the compiler - they lie with the user.
> I'm sure that EW_ARM produces better code than gcc (correctly used) in
> some cases - but I am also sure that gcc can do better than EW_ARM in
> other cases.  I really don't think there is going to be a big difference
> in code generation quality - if that's why you paid K$ for EW, you've
> probably wasted your money.  There are many reasons for choosing
> different toolchains, but generally speaking I don't see a large
> difference in code generation quality between the major toolchains
> (including gcc) for 32-bit processors.  Occasionally you'll see major
> differences in particular kinds of code, but for the most part it is the
> user that makes the biggest difference.
>
> One place where EW_ARM might score over the gcc setup this user has (he
> hasn't yet said anything about the rest - is it home-made, CodeSourcery,
> Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
> switches correct, and avoid this "I don't know how to enable debugging
> and optimisation" or "what's a warning?" nonsense.
One of the reasons I like the EW_ARM system is that the IDE handles all
the compiler and linker flags with a pretty good GUI.  You can override
the GUI options with #pragma statements in the code----which I haven't
found reason to do for the most part.
>
>
> It hardly needs saying, but when run properly, my brief test with gcc
> produces the same code here as you get with EW_ARM, and the same
> warnings about x and y.
>
That's comforting in a way.  While I now use EW_ARM for most of my
current projects, I spent about 5 years using GCC_ARM on a project based
on Linux.  I would hate to think that I was producing crap code all that
time!  I had some experienced Linux users to set up my dev system and
show me how to generate good make files, so I probably got pretty good
results there.

I'm using EW_ARM for projects that don't have the resources of a Linux
OS, and I prefer it for these projects.
>
> I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
> switch to disable "common" sections.  With common sections disabled,
> definitions like "unsigned long a, b;" can only appear once in the
> program for each global identifier, and the space is allocated directly
> in the .bss inside the module that made the definition.  gcc can use
> this extra information to take advantage of relative placement between
> variables, and generate addressing via section anchors:
>
> Command line:
> arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
> -fno-common
>
> test:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r3, .L6
>         ldr     r0, [r3, #4]
>         adds    r2, r0, #5
>         str     r2, [r3, #0]
>         bx      lr
> .L7:
>         .align  2
> .L6:
>         .word   .LANCHOR0
>         .size   test, .-test
>         .global b
>         .global a
>         .bss
>         .align  2
>         .set    .LANCHOR0,. + 0
>         .type   a, %object
>         .size   a, 4
> a:
>         .space  4
>         .type   b, %object
>         .size   b, 4
> b:
>         .space  4
>         .ident  "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"
>
> It's all about learning to use the tools you have, rather than buying
> more expensive tools.
Which reminds me----when counting bytes in code like this, it's easy to
forget the bytes used in the constant tables that provide the addresses
of variables.  A 16-bit variable may require a 32-bit table entry.

I started with EW_ARM about three years before I started on the Linux
project.  The original compiler was purchased by the customer---who had
no preferences, but was developing a project with fairly limited
hardware resources.  They asked what compiler I'd like and I picked
EW-ARM.  At that time, I'd been using CodeWarrior for the M68K for many
years and EW_ARM had the same 'feel'.

When it came time to do the Linux project, the transition to GCC took
MUCH longer than the transition from CodeWarrior to EW_ARM.  Of course,
much of that was in setting up a virtual machine on the PC and learning
Linux so that I could use GCC.  One thing that I missed on the Linux
project is that I didn't have a debugger equivalent to C-Spy that is
integrated into EW_ARM.  Debugging on the Linux system was mostly "Save
everything and analyze later".

Of course, the original poster is discussing the type of code that few
Linux programmers write----direct interfacing to peripherals.  My recent
experience with Linux and digital cameras was pretty frustrating.  I was
dependent on others to provide the drivers--and they often didn't work
quite right with the particular camera I was using.  That's a story for
another time, though.
>
> mvh.,
>
> David
Mark Borgerson