EmbeddedRelated.com
Forums

Code size reduction migrating from PIC18 to Cortex M0

Started by Kvik May 24, 2012
On Jun 20, 1:23 pm, FreeRTOS info <noem...@given.com> wrote:
> On 20/06/2012 18:07, peter_gotka...@supergreatmail.com wrote:
> >
> > I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> > believe is just a superset of the M0) and the code size actually
> > INCREASED from about 62K to 129K.
>
> Did you look at the map file to see why? If using GCC, did you set the
> compile options to remove dead code (most linkers will do it
> automatically). If using GCC, did you avoid using libraries that were
> written for a much larger class of processor?
I actually spent a lot of time looking at the map file. The total "overhead" including the vector table, C startup, and the two library functions that I actually use (printf and memcpy) is around 2K, the rest is all my code.
> "The Cortex-M3 processor has a feature known as "bit-banding". This
> allows an individual bit in a memory-mapped mailbox or peripheral
> register to be set/cleared by a single store/load instruction to an
> bit-band aliased memory address, rather than using a conventional
> read/modify/write instruction sequence."
I've used bit banding in some spots, and although it's great for RAM variables it's not very elegant for the peripheral registers, since the header files define bits by mask value, not position, whereas any bit-banding C macro that I could come up with would require the bit position, not the mask. So while TIM1->CCER |= TIM_CCER_CC4E clearly enables the timer's CC4 output, BITBANDSET(TIM1->CCER, 12) is less intuitive. Instead of a macro I thought about making an ASM inline function that would use the CLZ instruction to do this efficiently, but for some reason gcc didn't want to inline any of my functions (C or ASM) in debug mode, so I just gave up at that point.
In article <aec90865-670b-4bc5-a27d-
04d6ff608e4d@d17g2000vbv.googlegroups.com>, 
peter_gotkatov@supergreatmail.com says...
>
> On May 24, 5:32 pm, Kvik <klaus.kragel...@gmail.com> wrote:
> > Hi
> >
> > We are digging deeper into the Cortex M0 processor versus a PIC18.
> >
> > Seemingly objective material (Coremark data) at page 32 of:
> >
> > http://ics.nxp.com/literature/presentations/microcontrollers/pdf/cort...
> >
> > List a reduction in code size from PIC18 to M0 by a factor 2.
> >
> > But, anyone with a real-life experience of the possible code size
> > reduction?
> >
> > Thanks
> >
> > Klaus
>
> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> believe is just a superset of the M0) and the code size actually
> INCREASED from about 62K to 129K. This was just plain C code without
> any processor-specific optimizations or tricks that was just cut &
> pasted from one compiler to the other. While the Cortex does get
> better density on things like 32 X 32 multiplies or divides it suffers
> horribly on simple control structures.
>
> For example, clearing a timer interrupt flag:
>
> On the PIC18 this takes 2 bytes:
> PIR1 &= ~TMR1IF;
> 2108:  BCF F9E.0
>
> On the Cortex M3 it takes 40 bytes:
> TIM1->SR &= ~TIM_SR_UIF;
> F6424200  movw r2, #0x2C00
> F2C40201  movt r2, #0x4001
> F6424300  movw r3, #0x2C00
> F2C40301  movt r3, #0x4001
> 8A1B      ldrh r3, [r3, #16]
> B29B      uxth r3, r3
> 4619      mov r1, r3
> F64F73FE  movw r3, #0xFFFE
> F2C00300  movt r3, #0
> EA010303  and.w r3, r1, r3
> 4619      mov r1, r3
> 460B      mov r3, r1
> 8213      strh r3, [r2, #16]
>
> A simple countdown:
>
> On the PIC18 it takes 6 bytes:
> if (--timeout) return;
> 210A:  DECF x3B,F
> 210C:  BZ 2110
> 210E:  BRA 2114
>
> On the Cortex M3 it takes 40 bytes:
> if (--timeout) return;
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> F10333FF  add.w r3, r3, #0xFFFFFFFF
> B2DA      uxtb r2, r3
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 735A      strb r2, [r3, #13]
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> 2B00      cmp r3, #0
> D128      bne 0x08000F92
>
> This may not be a very fair comparison since both compilers (CCS for
> the PIC and gcc for the Cortex) are set to non-optimized mode but even
> when gcc is set to optimize it only drops from 129K down to 104K which
> is not much of a savings and still worse than the PIC18. When I first
> started this exercise I was quite disappointed by the poor density so
> I tried a simple exercise: I took one single C function that had more
> than doubled in size and re-wrote it so as to take advantage of the
> Cortex strengths. I made heavy use of 32-bit variables, careful use of
> the "register" keyword, always accessing global variables through a
> pointer, combining bit shifts with other arithmetic operations, using
> bit-banding for IO registers wherever possible, etc. In the end I
> managed to get it down to almost half its size, but still couldn't
> match the PIC18.
>
> Perhaps the final answer depends on what kind of application you're
> writing. In my case it's very IO intensive with a lot of peripherals
> being used and a simple touchscreen UI with very little math involved.
> Perhaps the Cortex was not the best choice here.
I think there are two things going on here:

1. The GCC compiler isn't very good at producing compact code. I tried
one of your examples on IAR EW-ARM with optimization set to low (my
usual default):

    119          TIM1->SR &= TIM_SR_UIF;
    \   00000010 0x....         LDR.N  R0,??DataTable6_6  ;; 0x40010010
    \   00000012 0x8800         LDRH   R0,[R0, #+0]
    \   00000014 0xF010 0x0001  ANDS   R0,R0,#0x1
    \   00000018 0x....         LDR.N  R1,??DataTable6_6  ;; 0x40010010
    \   0000001A 0x8008         STRH   R0,[R1, #+0]

That's just 10 bytes, 4 times better than the GCC result.

2. Cortex IO registers may be 16 or 32 bits, and there are enough of
them that you need 32-bit pointers to get at them. Loading those
pointers is going to take more code. I suspect that the IAR compiler
would reduce the code expansion to about a factor of 1.5.

Since a lot of Cortex MCUs have up to 1MB of flash while the PIC18 maxes
out at 128KB, the ratio of program size to available flash may be better
on the Cortex than on the PIC18.

Mark Borgerson
On 20/06/2012 19:07, peter_gotkatov@supergreatmail.com wrote:
> On May 24, 5:32 pm, Kvik <klaus.kragel...@gmail.com> wrote:
>> Hi
>>
>> We are digging deeper into the Cortex M0 processor versus a PIC18.
>>
>> Seemingly objective material (Coremark data) at page 32 of:
>>
>> http://ics.nxp.com/literature/presentations/microcontrollers/pdf/cort...
>>
>> List a reduction in code size from PIC18 to M0 by a factor 2.
>>
>> But, anyone with a real-life experience of the possible code size
>> reduction?
>>
>> Thanks
>>
>> Klaus
>
> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
> believe is just a superset of the M0) and the code size actually
> INCREASED from about 62K to 129K. This was just plain C code without
> any processor-specific optimizations or tricks that was just cut &
> pasted from one compiler to the other. While the Cortex does get
> better density on things like 32 X 32 multiplies or divides it suffers
> horribly on simple control structures.
>
> For example, clearing a timer interrupt flag:
>
> On the PIC18 this takes 2 bytes:
> PIR1 &= ~TMR1IF;
> 2108:  BCF F9E.0
>
> On the Cortex M3 it takes 40 bytes:
> TIM1->SR &= ~TIM_SR_UIF;
> F6424200  movw r2, #0x2C00
> F2C40201  movt r2, #0x4001
> F6424300  movw r3, #0x2C00
> F2C40301  movt r3, #0x4001
> 8A1B      ldrh r3, [r3, #16]
> B29B      uxth r3, r3
> 4619      mov r1, r3
> F64F73FE  movw r3, #0xFFFE
> F2C00300  movt r3, #0
> EA010303  and.w r3, r1, r3
> 4619      mov r1, r3
> 460B      mov r3, r1
> 8213      strh r3, [r2, #16]
>
> A simple countdown:
>
> On the PIC18 it takes 6 bytes:
> if (--timeout) return;
> 210A:  DECF x3B,F
> 210C:  BZ 2110
> 210E:  BRA 2114
>
> On the Cortex M3 it takes 40 bytes:
> if (--timeout) return;
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> F10333FF  add.w r3, r3, #0xFFFFFFFF
> B2DA      uxtb r2, r3
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 735A      strb r2, [r3, #13]
> F2400360  movw r3, #0x60
> F2C20300  movt r3, #0x2000
> 7B5B      ldrb r3, [r3, #13]
> 2B00      cmp r3, #0
> D128      bne 0x08000F92
>
> This may not be a very fair comparison since both compilers (CCS for
> the PIC and gcc for the Cortex) are set to non-optimized mode but even
> when gcc is set to optimize it only drops from 129K down to 104K which
> is not much of a savings and still worse than the PIC18. When I first
> started this exercise I was quite disappointed by the poor density so
> I tried a simple exercise: I took one single C function that had more
> than doubled in size and re-wrote it so as to take advantage of the
> Cortex strengths. I made heavy use of 32-bit variables, careful use of
> the "register" keyword, always accessing global variables through a
> pointer, combining bit shifts with other arithmetic operations, using
> bit-banding for IO registers wherever possible, etc. In the end I
> managed to get it down to almost half its size, but still couldn't
> match the PIC18.
>
> Perhaps the final answer depends on what kind of application you're
> writing. In my case it's very IO intensive with a lot of peripherals
> being used and a simple touchscreen UI with very little math involved.
> Perhaps the Cortex was not the best choice here.
>
Saying you use a compiler but don't enable optimisation, then
complaining about the code generated, is like saying you drive a car but
never bother changing out of first gear and then complaining about the
lack of speed.

When you say you tried using the "register" keyword, I have to assume
you learned C from a 30 year old book. One thing that is worth learning
about modern toolchains (for the PIC, the Cortex, or whatever) is that
they generate better code from well-written C using a clear, modern
style, and using appropriate command-line switches. Don't try and
second-guess your tools by adding irrelevant keywords (like "register")
or "hand-optimising" by using extra pointers. Learn to use the tools
properly, then let them do their job.

When you say you use "gcc", which version? There are still some people
using ancient versions of gcc which were very poor for ARM code (which
has led to a long-lasting myth that gcc is bad for ARM).

To test your issues, I compiled this test code:

#include <stdint.h>

typedef struct {
    uint16_t padding[8];
    volatile uint16_t SR;
} TIM_t;

#define TIM1 (((TIM_t*)(0x40012c00)))
#define TIM_SR_UIF 0x0002

#define timeout (*((uint8_t*)(0x20000060)))

void test2(void) {
    if (--timeout) return;
    TIM1->SR &= ~TIM_SR_UIF;
}

I used gcc 4.6.1 (from CodeSourcery Lite version 2011.09-69), with flags
"-mcpu=cortex-m3 -mthumb -S". Even with no optimisation, I am failing to
generate code quite as bad as you have. With -Os (which is the norm for
embedded systems), I get:

test2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr   r2, .L3
        ldrb  r1, [r2, #0]  @ zero_extendqisi2
        subs  r0, r1, #1
        uxtb  r3, r0
        strb  r3, [r2, #0]
        cbnz  r3, .L1
        ldr   r2, .L3+4
        ldrh  ip, [r2, #16]
        bic   r1, ip, #2
        lsls  r0, r1, #16
        lsrs  r3, r0, #16
        strh  r3, [r2, #16]  @ movhi
.L1:
        bx    lr
.L4:
        .align 2
.L3:
        .word 536871008
        .word 1073818624

Real-world code will be even better, as the compiler can re-use base
pointers and otherwise optimise larger code sections.
On 20/06/2012 21:21, peter_gotkatov@supergreatmail.com wrote:
> On Jun 20, 1:23 pm, FreeRTOS info <noem...@given.com> wrote:
>> On 20/06/2012 18:07, peter_gotka...@supergreatmail.com wrote:
>>
>>> I've ported a fairly large app from a PIC18 to a Cortex M3 (which I
>>> believe is just a superset of the M0) and the code size actually
>>> INCREASED from about 62K to 129K.
>>
>> Did you look at the map file to see why? If using GCC, did you set the
>> compile options to remove dead code (most linkers will do it
>> automatically). If using GCC, did you avoid using libraries that were
>> written for a much larger class of processor?
>
> I actually spent a lot of time looking at the map file. The total
> "overhead" including the vector table, C startup, and the two library
> functions that I actually use (printf and memcpy) is around 2K, the
> rest is all my code.
That is unlikely to be true, but without knowing your code or the map file, there is no way to be sure. It is not a surprise that the code size has increased in moving from the PIC18 - differences here will vary wildly according to the type of code. But it /is/ a surprise that you only have 2K of startup, vector tables, and library code.
>
>> "The Cortex-M3 processor has a feature known as "bit-banding". This
>> allows an individual bit in a memory-mapped mailbox or peripheral
>> register to be set/cleared by a single store/load instruction to an
>> bit-band aliased memory address, rather than using a conventional
>> read/modify/write instruction sequence."
>
> I've used bit banding in some spots, and although it's great for RAM
> variables it's not very elegant for the peripheral registers since the
> header files define bits by value, not position whereas any bitbanding
> C macro that I could come up with would require the bit number, not
> position. So while TIM1->CCER |= TIM_CCER_CC4E clearly enables the
> timer's CC4 output, BITBANDSET(TIM1->CCER, 12) is less intuitive.
Code clarity is more important than code efficiency. But if code efficiency is important, then put such code in little "static inline" functions with appropriate comments.
> Instead of a macro I thought about making an ASM inline function that
> would use the CLZ instruction to do this efficiently but for some
> reason gcc didn't want to inline any of my functions (C or ASM) in
> debug mode so I just gave up at that point.
>
First off, you should not need to resort to assembly to get basic instructions working - the compiler should produce near-optimal code as long as you let it (by enabling optimisations and writing appropriate C code).

Secondly, don't use "ASM functions" - they are normally only needed by more limited compilers. If you need to use assembly with gcc, use gcc's extended "asm" syntax.

Finally, if you are not getting inlining when debugging it is because you have got incorrect compiler switches. You should not have different "debug" and "release" (or "optimised") builds - do a single build with the proper optimisation settings (typically -Os unless you know what you are doing) and "-g" to enable debugging. You never want to be releasing code that is built differently from the code you debugged.
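[Editor's note: the single-build advice above might look like this for an arm-none-eabi gcc toolchain. The flag set is illustrative, and the linker script name `stm32.ld` is a placeholder; `-g` puts debug info in the ELF only, not in the flash image, and `-ffunction-sections`/`--gc-sections` is the dead-code removal mentioned earlier in the thread.]

```shell
# One build used for both debugging and release: optimise for size,
# keep full debug info, let the linker discard unreferenced sections.
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -g -Wall \
    -ffunction-sections -fdata-sections -c main.c -o main.o
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -g \
    -Wl,--gc-sections -T stm32.ld main.o -o app.elf
```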
On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> It is not a surprise that the code
> size has increased in moving from the PIC18 - differences here will vary
> wildly according to the type of code. But it /is/ a surprise that you
> only have 2K of startup, vector tables, and library code.
Not all that surprising, here are the sizes in bytes:

.vectors    304
.init       508
__putchar    40
__vprintf  1498
memcpy       56
> > Instead of a macro I thought about making an ASM inline function that
> > would use the CLZ instruction to do this efficiently but for some
> > reason gcc didn't want to inline any of my functions (C or ASM) in
> > debug mode so I just gave up at that point.
>
> First off, you should not need to resort to assembly to get basic
> instructions working - the compiler should produce near-optimal code as
> long as you let it (by enabling optimisations and writing appropriate C
> code).
I've tried several ways of writing a countleadingzeroes() function that would use the Cortex CLZ instruction, but even with optimization turned on it still wouldn't do it.
> Secondly, don't use "ASM functions" - they are normally only needed by
> more limited compilers. If you need to use assembly with gcc, use gcc's
> extended "asm" syntax.
There are some things like the bootloader that need to be ASM functions in their own separate .S file anyway since they need to copy portions of themselves to RAM in order to execute. But a bootloader is a special case and I do agree that normal code shouldn't need to rely on ASM functions. I must say I'm not familiar with gcc's extended asm syntax and although I did look at it briefly it seemed like it was more complicated than a plain old .S file and it was mostly geared towards mixing C and ASM together in the same function and accessing variables by name etc. Not something I needed for a simple bootloader.
> Finally, if you are not getting inlining when debugging it is because
> you have got incorrect compiler switches. You should not have different
> "debug" and "release" (or "optimised") builds - do a single build with
> the proper optimisation settings (typically -Os unless you know what you
> are doing) and "-g" to enable debugging. You never want to be releasing
> code that is built differently from the code you debugged.
I was fighting with this for a while when I was first handed this
toolchain, and it seems that in debug mode there is no -O switch at all,
and in release mode it defaults to -O1. When I change this to -Os it
does produce the same code as the sample that you posted above from gcc
4.6.1 (mine is 4.4.4 by the way). However, even with manually adding the
-g switch I still don't get source annotation in the ELF file unless I
use debug mode. This effectively limits any development/debugging to
unoptimized code, which still has to fit into the 256K somehow.

As for using register keywords and accessing globals through pointers, I
normally don't do this (haven't used the register keyword in years) and
I certainly wouldn't be doing it at all if it didn't have such a
significant effect on the code size:

unsigned long a,b;

void test(void) {
B4B0      push {r4-r5, r7}
AF00      add r7, sp, #0
----------------------------------------
    register unsigned long x, y;
    a=b+5;
F2402314  movw r3, #0x214
F2C20300  movt r3, #0x2000
681B      ldr r3, [r3, #0]
F1030205  add.w r2, r3, #5
F240231C  movw r3, #0x21C
F2C20300  movt r3, #0x2000
601A      str r2, [r3, #0]
----------------------------------------
    x=y+5;
F1040505  add.w r5, r4, #5
----------------------------------------
}
46BD      mov sp, r7
BCB0      pop {r4-r5, r7}
4770      bx lr
BF00      nop
On 06/21/2012 05:32 PM, peter_gotkatov@supergreatmail.com wrote:

> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
The difference is due to the fact that a and b are global while x and y are local. Try removing the 'register' keyword but leaving everything else the same. You should get the same code (assuming optimization is enabled).
On 21/06/12 17:32, peter_gotkatov@supergreatmail.com wrote:
> On Jun 21, 4:09 am, David Brown<da...@westcontrol.removethisbit.com>
> wrote:
>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>> It is not a surprise that the code
>> size has increased in moving from the PIC18 - differences here will vary
>> wildly according to the type of code. But it /is/ a surprise that you
>> only have 2K of startup, vector tables, and library code.
>
> Not all that surprising, here are the sizes in bytes:
> .vectors    304
> .init       508
> __putchar    40
> __vprintf  1498
> memcpy       56
>
I still think it is surprising, because these library functions often pull in other library code (such as for floating point support), and quite often there is library code for small "helper" functions. But it depends on the configuration, and what is in the rest of your source code.
>>> Instead of a macro I thought about making an ASM inline function that
>>> would use the CLZ instruction to do this efficiently but for some
>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>> debug mode so I just gave up at that point.
>>
>> First off, you should not need to resort to assembly to get basic
>> instructions working - the compiler should produce near-optimal code as
>> long as you let it (by enabling optimisations and writing appropriate C
>> code).
>
> I've tried several ways of writing a countleadingzeroes() function that
> would use the Cortex CLZ instruction but even with optimization turned
> on it still wouldn't do it.
>
Did you try using the "__builtin_clz()" function described in the gcc manual? <http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html>
>> Secondly, don't use "ASM functions" - they are normally only needed by
>> more limited compilers. If you need to use assembly with gcc, use gcc's
>> extended "asm" syntax.
>
> There are some things like the bootloader that need to be ASM
> functions in their own separate .S file anyway since they need to copy
> portions of themselves to RAM in order to execute. But a bootloader is
> a special case and I do agree that normal code shouldn't need to rely
> on ASM functions.
I have written bootloaders for several microcontrollers (though not for a Cortex). I write them in C. I have written startup code for several microcontrollers, handling the setup of the stack, memory, the C environment, clearing bss, copying constants, etc. I write such code in C. You can't avoid writing two or three of the instructions in assembly, but usually it's not more than that. It's certainly not enough to bother with separate .S files - normally not even individual assembly functions (though sometimes I've used "naked" C functions as wrappers for a few lines of pure assembly). Usually I write startup code when I am unhappy with the code supplied by the toolchain vendor - which is invariably written in assembly. Re-writing it in C gives code that is far clearer, and smaller and faster (sometimes many times faster).
> I must say I'm not familiar with gcc's extended asm
> syntax and although I did look at it briefly it seemed like it was
> more complicated than a plain old .S file and it was mostly geared
> towards mixing C and ASM together in the same function and accessing
> variables by name etc. Not something I needed for a simple bootloader.
It is aimed at mixing C and assembly, yes. It lets you do the minimal work in assembly, while letting the compiler handle as much as possible, including optimising around your assembly code. Let the compiler do the things it is good at.
>
>> Finally, if you are not getting inlining when debugging it is because
>> you have got incorrect compiler switches. You should not have different
>> "debug" and "release" (or "optimised") builds - do a single build with
>> the proper optimisation settings (typically -Os unless you know what you
>> are doing) and "-g" to enable debugging. You never want to be releasing
>> code that is built differently from the code you debugged.
>
> I was fighting with this for a while when I was first handed this
> toolchain, and it seems that in debug mode, there is no -O switch at
> all and in release mode it defaults to -O1. When I change this -Os it
> does produce the same code as the sample that you posted above from
> gcc 4.6.1 (mine is 4.4.4 by the way). However even with manually
> adding the -g switch I still don't get source annotation in the ELF
> file unless I use debug mode. This effectively limits any development/
> debugging to unoptimized code, which still has to fit into the 256K
> somehow.
This is some limitation or misunderstanding of your IDE or other tools, not gcc. Most likely it is a misunderstanding rather than a limitation, but without knowing your particular toolchain it is hard to give specific help.

Most serious developers use something like -Os (which is -O2 with an emphasis on size). The only reason to use -O1 is if you have a very slow computer and a very large code base, as it is faster than -Os/-O2, or very occasionally in testing or debugging. The only reason to use no optimisation is because you don't understand your tools.

And sometimes it is useful to use higher optimisations, or enable specific optimisations, because of particular effects. They don't tend to have much effect on most code, but can make a big difference to particular parts (perhaps unrolling a loop, or re-arranging nested loops to fit cache line sizes, etc.). I tend to use "optimize" function attributes or pragmas for such special cases.
>
> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
>
The "register" keyword has always been ignored in gcc except in -O0 mode, unless of course you are using the extended syntax to specify a particular register.

<http://gcc.gnu.org/ml/gcc/2010-05/msg00113.html>

You are seeing a difference in the code because one set of variables is global, and must be accessed externally, while the other set is local and uses registers. And if you had enabled optimisations, the "x = y + 5;" would have been eliminated entirely because it has no effect. And if you had enabled warnings, as you always should, the compiler would have complained about the code using variables before they are initialised, and about setting a variable that has no effect.

mvh.,

David
In article <9b55cce9-96db-46f4-909a-1f6500deb237
@j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>
> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
> wrote:
> > On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> > It is not a surprise that the code
> > size has increased in moving from the PIC18 - differences here will vary
> > wildly according to the type of code. But it /is/ a surprise that you
> > only have 2K of startup, vector tables, and library code.
>
> Not all that surprising, here are the sizes in bytes:
> .vectors    304
> .init       508
> __putchar    40
> __vprintf  1498
> memcpy       56
>
> > > Instead of a macro I thought about making an ASM inline function that
> > > would use the CLZ instruction to do this efficiently but for some
> > > reason gcc didn't want to inline any of my functions (C or ASM) in
> > > debug mode so I just gave up at that point.
> >
> > First off, you should not need to resort to assembly to get basic
> > instructions working - the compiler should produce near-optimal code as
> > long as you let it (by enabling optimisations and writing appropriate C
> > code).
>
> I've tried several ways of writing a countleadingzeroes() function that
> would use the Cortex CLZ instruction but even with optimization turned
> on it still wouldn't do it.
>
> > Secondly, don't use "ASM functions" - they are normally only needed by
> > more limited compilers. If you need to use assembly with gcc, use gcc's
> > extended "asm" syntax.
>
> There are some things like the bootloader that need to be ASM
> functions in their own separate .S file anyway since they need to copy
> portions of themselves to RAM in order to execute. But a bootloader is
> a special case and I do agree that normal code shouldn't need to rely
> on ASM functions. I must say I'm not familiar with gcc's extended asm
> syntax and although I did look at it briefly it seemed like it was
> more complicated than a plain old .S file and it was mostly geared
> towards mixing C and ASM together in the same function and accessing
> variables by name etc. Not something I needed for a simple bootloader.
>
> > Finally, if you are not getting inlining when debugging it is because
> > you have got incorrect compiler switches. You should not have different
> > "debug" and "release" (or "optimised") builds - do a single build with
> > the proper optimisation settings (typically -Os unless you know what you
> > are doing) and "-g" to enable debugging. You never want to be releasing
> > code that is built differently from the code you debugged.
>
> I was fighting with this for a while when I was first handed this
> toolchain, and it seems that in debug mode, there is no -O switch at
> all and in release mode it defaults to -O1. When I change this -Os it
> does produce the same code as the sample that you posted above from
> gcc 4.6.1 (mine is 4.4.4 by the way). However even with manually
> adding the -g switch I still don't get source annotation in the ELF
> file unless I use debug mode. This effectively limits any development/
> debugging to unoptimized code, which still has to fit into the 256K
> somehow.
>
> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0      push {r4-r5, r7}
> AF00      add r7, sp, #0
> ----------------------------------------
>     register unsigned long x, y;
>     a=b+5;
> F2402314  movw r3, #0x214
> F2C20300  movt r3, #0x2000
> 681B      ldr r3, [r3, #0]
> F1030205  add.w r2, r3, #5
> F240231C  movw r3, #0x21C
> F2C20300  movt r3, #0x2000
> 601A      str r2, [r3, #0]
> ----------------------------------------
>     x=y+5;
> F1040505  add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD      mov sp, r7
> BCB0      pop {r4-r5, r7}
> 4770      bx lr
> BF00      nop
(36 bytes of code)

I'd be surprised that your compiler did not complain about x and y not
being initialized before the addition. EW ARM warned me that y was used
before its value was set and that x was set but never used.

Here is the code it gave me, with some extra comments added afterwards:

    111  void test(void){
    112    register unsigned long x,y;
    113    a = b+5;
    \  test:
    \  00000000 0x....  LDR.N  R1,??DataTable8_6
    \  00000002 0x6809  LDR    R1,[R1, #+0]
    \  00000004 0x1D49  ADDS   R1,R1,#+5
    \  00000006 0x....  LDR.N  R2,??DataTable8_7
    \  00000008 0x6011  STR    R1,[R2, #+0]
    114    x = y+5;
    \  0000000A 0x1D40  ADDS   R0,R0,#+5   // R0 is y  // sum not saved
    115  }
    \  0000000C 0x4770  BX     LR          ;; return
    116

(14 bytes of code) Apparently, R0, R1, R2 are scratch registers for IAR
and don't need to be saved and restored.

Adding actual initialization to x and y and saving the result in b
produced the following:

    In section .text, align 2, keep-with-next
    110  void test(void){
    111    register unsigned long x=3,y=4;
    \  test:
    \  00000000 0x2003  MOVS   R0,#+3
    \  00000002 0x2104  MOVS   R1,#+4
    112    a = b+5;
    \  00000004 0x....  LDR.N  R2,??DataTable8_6
    \  00000006 0x6812  LDR    R2,[R2, #+0]
    \  00000008 0x1D52  ADDS   R2,R2,#+5
    \  0000000A 0x....  LDR.N  R3,??DataTable8_7
    \  0000000C 0x601A  STR    R2,[R3, #+0]
    113    x = y+5;
    \  0000000E 0x1D49  ADDS   R1,R1,#+5
    \  00000010 0x0008  MOVS   R0,R1       // x = sum
    114    b = x;  // this time save the result
    \  00000012 0x....  LDR.N  R1,??DataTable8_6
    \  00000014 0x6008  STR    R0,[R1, #+0]
    115  }
    \  00000016 0x4770  BX     LR          ;; return
    116

Still accomplished with scratch registers, no need to save any on the
stack. I changed from my default optimization of 'low' to 'none' and got
exactly the same code. Finally, I took out the 'register' keyword before
x and y, and got exactly the same result as above.

It seems that GCC just doesn't match up to IAR at producing compact code
at low optimization levels. OTOH, given that EW_ARM costs several
KBucks, it SHOULD do better!

Mark Borgerson
On 22/06/2012 02:42, Mark Borgerson wrote:
> In article <9b55cce9-96db-46f4-909a-1f6500deb237
> @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>>
>> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
>> wrote:
>>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>>> It is not a surprise that the code
>>> size has increased in moving from the PIC18 - differences here will vary
>>> wildly according to the type of code.  But it /is/ a surprise that you
>>> only have 2K of startup, vector tables, and library code.
>>
>> Not all that surprising, here are the sizes in bytes:
>> .vectors   304
>> .init      508
>> __putchar   40
>> __vprintf 1498
>> memcpy      56
>>
>>>> Instead of a macro I thought about making an ASM inline function that
>>>> would use the CLZ instruction to do this efficiently but for some
>>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>>> debug mode so I just gave up at that point.
>>>
>>> First off, you should not need to resort to assembly to get basic
>>> instructions working - the compiler should produce near-optimal code as
>>> long as you let it (by enabling optimisations and writing appropriate C
>>> code).
>>
>> I've tried several ways of writing a countleadingzeroes() function that
>> would use the Cortex CLZ instruction but even with optimization turned
>> on it still wouldn't do it.
>>
>>> Secondly, don't use "ASM functions" - they are normally only needed by
>>> more limited compilers.  If you need to use assembly with gcc, use gcc's
>>> extended "asm" syntax.
>>
>> There are some things like the bootloader that need to be ASM
>> functions in their own separate .S file anyway since they need to copy
>> portions of themselves to RAM in order to execute.  But a bootloader is
>> a special case and I do agree that normal code shouldn't need to rely
>> on ASM functions.  I must say I'm not familiar with gcc's extended asm
>> syntax and although I did look at it briefly it seemed like it was
>> more complicated than a plain old .S file and it was mostly geared
>> towards mixing C and ASM together in the same function and accessing
>> variables by name etc.  Not something I needed for a simple bootloader.
>>
>>> Finally, if you are not getting inlining when debugging it is because
>>> you have got incorrect compiler switches.  You should not have different
>>> "debug" and "release" (or "optimised") builds - do a single build with
>>> the proper optimisation settings (typically -Os unless you know what you
>>> are doing) and "-g" to enable debugging.  You never want to be releasing
>>> code that is built differently from the code you debugged.
>>
>> I was fighting with this for a while when I was first handed this
>> toolchain, and it seems that in debug mode, there is no -O switch at
>> all and in release mode it defaults to -O1.  When I change this to -Os
>> it does produce the same code as the sample that you posted above from
>> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
>> adding the -g switch I still don't get source annotation in the ELF
>> file unless I use debug mode.  This effectively limits any development/
>> debugging to unoptimized code, which still has to fit into the 256K
>> somehow.
>>
>> As for using register keywords and accessing globals through pointers,
>> I normally don't do this (haven't used the register keyword in years)
>> and I certainly wouldn't be doing it at all if it didn't have such a
>> significant effect on the code size:
>>
>> unsigned long a,b;
>>
>> void test(void) {
>>  B4B0      push   {r4-r5, r7}
>>  AF00      add    r7, sp, #0
>> ----------------------------------------
>>  register unsigned long x, y;
>>  a=b+5;
>>  F2402314  movw   r3, #0x214
>>  F2C20300  movt   r3, #0x2000
>>  681B      ldr    r3, [r3, #0]
>>  F1030205  add.w  r2, r3, #5
>>  F240231C  movw   r3, #0x21C
>>  F2C20300  movt   r3, #0x2000
>>  601A      str    r2, [r3, #0]
>> ----------------------------------------
>>  x=y+5;
>>  F1040505  add.w  r5, r4, #5
>> ----------------------------------------
>> }
>>  46BD      mov    sp, r7
>>  BCB0      pop    {r4-r5, r7}
>>  4770      bx     lr
>>  BF00      nop
>
> (36 bytes of code)
>
> I'd be surprised that your compiler did not complain about x and y not
> being initialized before the addition.  EW ARM warned me that y was used
> before its value was set and that x was set but never used.
>
> Here is the code it gave me----with some extra comments added afterwards
>
>     111          void test(void){
>     112            register unsigned long x,y;
>     113            a = b+5;
>   \                     test:
>   \   00000000   0x....       LDR.N    R1,??DataTable8_6
>   \   00000002   0x6809       LDR      R1,[R1, #+0]
>   \   00000004   0x1D49       ADDS     R1,R1,#+5
>   \   00000006   0x....       LDR.N    R2,??DataTable8_7
>   \   00000008   0x6011       STR      R1,[R2, #+0]
>     114            x = y+5;
>   \   0000000A   0x1D40       ADDS     R0,R0,#+5   // R0 is y
>                                                    // sum not saved
>     115          }
>   \   0000000C   0x4770       BX       LR  ;; return
>     116
> (14 bytes of code)
>
> Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
> be saved and restored.
>
> Adding actual initialization to x and y and saving the result in b
> produced the following:
>
>   In section .text, align 2, keep-with-next
>     110          void test(void){
>     111            register unsigned long x=3,y=4;
>   \                     test:
>   \   00000000   0x2003       MOVS     R0,#+3
>   \   00000002   0x2104       MOVS     R1,#+4
>     112            a = b+5;
>   \   00000004   0x....       LDR.N    R2,??DataTable8_6
>   \   00000006   0x6812       LDR      R2,[R2, #+0]
>   \   00000008   0x1D52       ADDS     R2,R2,#+5
>   \   0000000A   0x....       LDR.N    R3,??DataTable8_7
>   \   0000000C   0x601A       STR      R2,[R3, #+0]
>     113            x = y+5;
>   \   0000000E   0x1D49       ADDS     R1,R1,#+5
>   \   00000010   0x0008       MOVS     R0,R1       // x = sum
>     114            b = x;   // this time save the result
>   \   00000012   0x....       LDR.N    R1,??DataTable8_6
>   \   00000014   0x6008       STR      R0,[R1, #+0]
>     115          }
>   \   00000016   0x4770       BX       LR  ;; return
>     116
>
> Still accomplished with scratch registers----no need to save any on the
> stack.  I changed from my default optimization of 'low' to 'none'
> and got exactly the same code.
>
> Finally, I took out the 'register' key word before x and y----and
> got exactly the same result as above.
>
> It seems that GCC just doesn't match up to IAR at producing compact
> code at low optimization levels.  OTOH, given that EW_ARM costs
> several KBucks, it SHOULD do better!
>
>
The problems here don't lie with the compiler - they lie with the user.
I'm sure that EW_ARM produces better code than gcc (correctly used) in
some cases - but I am also sure that gcc can do better than EW_ARM in
other cases.  I really don't think there is going to be a big difference
in code generation quality - if that's why you paid K$ for EW, you've
probably wasted your money.  There are many reasons for choosing
different toolchains, but generally speaking I don't see a large
difference in code generation quality between the major toolchains
(including gcc) for 32-bit processors.  Occasionally you'll see major
differences in particular kinds of code, but for the most part it is the
user that makes the biggest difference.

One place where EW_ARM might score over the gcc setup this user has (he
hasn't yet said anything about the rest - is it home-made, CodeSourcery,
Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
switches correct, and avoid this "I don't know how to enable debugging
and optimisation" or "what's a warning?" nonsense.

It hardly needs saying, but when run properly, my brief test with gcc
produces the same code here as you get with EW_ARM, and the same
warnings about x and y.

I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
switch to disable "common" sections.  With common sections disabled,
definitions like "unsigned long a, b;" can only appear once in the
program for each global identifier, and the space is allocated directly
in the .bss inside the module that made the definition.  gcc can use
this extra information to take advantage of relative placement between
variables, and generate addressing via section anchors:

Command line:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
-fno-common

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L6
        ldr     r0, [r3, #4]
        adds    r2, r0, #5
        str     r2, [r3, #0]
        bx      lr
.L7:
        .align  2
.L6:
        .word   .LANCHOR0
        .size   test, .-test
        .global b
        .global a
        .bss
        .align  2
        .set    .LANCHOR0,. + 0
        .type   a, %object
        .size   a, 4
a:
        .space  4
        .type   b, %object
        .size   b, 4
b:
        .space  4
        .ident  "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"

It's all about learning to use the tools you have, rather than buying
more expensive tools.

mvh.,

David
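[On the countleadingzeroes() question raised above: GCC has a real builtin, `__builtin_clz`, which on Cortex-M3 should compile to a single CLZ instruction at -O1 and above, with no inline assembly needed. A host-checkable sketch; the wrapper name `clz32` is illustrative, not from the thread. Note that `__builtin_clz(0)` is undefined, so the wrapper handles zero explicitly:]

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value.  __builtin_clz is a GCC/Clang
   builtin; on ARMv7-M it maps to the CLZ instruction when optimizing.
   __builtin_clz(0) is undefined behavior, hence the explicit check. */
static inline uint32_t clz32(uint32_t x)
{
    return x ? (uint32_t)__builtin_clz(x) : 32u;
}

/* Example: the highest set bit of x is then 31 - clz32(x),
   which is the piece the BITBANDSET macro discussion needed. */
```

(The assumption here is a 32-bit `unsigned int`, which holds on ARM EABI and common hosts.)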
In article <svCdnfOCcv-WvHnSnZ2dnUVZ8rednZ2d@lyse.net>, 
david@westcontrol.removethisbit.com says...
>
> On 22/06/2012 02:42, Mark Borgerson wrote:
> > In article <9b55cce9-96db-46f4-909a-1f6500deb237
> > @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
> >>
> >> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
> >> wrote:
> >>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> >>> It is not a surprise that the code
> >>> size has increased in moving from the PIC18 - differences here will vary
> >>> wildly according to the type of code.  But it /is/ a surprise that you
> >>> only have 2K of startup, vector tables, and library code.
> >>
> >> Not all that surprising, here are the sizes in bytes:
> >> .vectors   304
> >> .init      508
> >> __putchar   40
> >> __vprintf 1498
> >> memcpy      56
> >>
> >>>> Instead of a macro I thought about making an ASM inline function that
> >>>> would use the CLZ instruction to do this efficiently but for some
> >>>> reason gcc didn't want to inline any of my functions (C or ASM) in
> >>>> debug mode so I just gave up at that point.
> >>>
> >>> First off, you should not need to resort to assembly to get basic
> >>> instructions working - the compiler should produce near-optimal code as
> >>> long as you let it (by enabling optimisations and writing appropriate C
> >>> code).
> >>
> >> I've tried several ways of writing a countleadingzeroes() function that
> >> would use the Cortex CLZ instruction but even with optimization turned
> >> on it still wouldn't do it.
> >>
> >>> Secondly, don't use "ASM functions" - they are normally only needed by
> >>> more limited compilers.  If you need to use assembly with gcc, use gcc's
> >>> extended "asm" syntax.
> >>
> >> There are some things like the bootloader that need to be ASM
> >> functions in their own separate .S file anyway since they need to copy
> >> portions of themselves to RAM in order to execute.  But a bootloader is
> >> a special case and I do agree that normal code shouldn't need to rely
> >> on ASM functions.  I must say I'm not familiar with gcc's extended asm
> >> syntax and although I did look at it briefly it seemed like it was
> >> more complicated than a plain old .S file and it was mostly geared
> >> towards mixing C and ASM together in the same function and accessing
> >> variables by name etc.  Not something I needed for a simple bootloader.
> >>
> >>> Finally, if you are not getting inlining when debugging it is because
> >>> you have got incorrect compiler switches.  You should not have different
> >>> "debug" and "release" (or "optimised") builds - do a single build with
> >>> the proper optimisation settings (typically -Os unless you know what you
> >>> are doing) and "-g" to enable debugging.  You never want to be releasing
> >>> code that is built differently from the code you debugged.
> >>
> >> I was fighting with this for a while when I was first handed this
> >> toolchain, and it seems that in debug mode, there is no -O switch at
> >> all and in release mode it defaults to -O1.  When I change this to -Os
> >> it does produce the same code as the sample that you posted above from
> >> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
> >> adding the -g switch I still don't get source annotation in the ELF
> >> file unless I use debug mode.  This effectively limits any development/
> >> debugging to unoptimized code, which still has to fit into the 256K
> >> somehow.
> >>
> >> As for using register keywords and accessing globals through pointers,
> >> I normally don't do this (haven't used the register keyword in years)
> >> and I certainly wouldn't be doing it at all if it didn't have such a
> >> significant effect on the code size:
> >>
> >> unsigned long a,b;
> >>
> >> void test(void) {
> >>  B4B0      push   {r4-r5, r7}
> >>  AF00      add    r7, sp, #0
> >> ----------------------------------------
> >>  register unsigned long x, y;
> >>  a=b+5;
> >>  F2402314  movw   r3, #0x214
> >>  F2C20300  movt   r3, #0x2000
> >>  681B      ldr    r3, [r3, #0]
> >>  F1030205  add.w  r2, r3, #5
> >>  F240231C  movw   r3, #0x21C
> >>  F2C20300  movt   r3, #0x2000
> >>  601A      str    r2, [r3, #0]
> >> ----------------------------------------
> >>  x=y+5;
> >>  F1040505  add.w  r5, r4, #5
> >> ----------------------------------------
> >> }
> >>  46BD      mov    sp, r7
> >>  BCB0      pop    {r4-r5, r7}
> >>  4770      bx     lr
> >>  BF00      nop
> >
> > (36 bytes of code)
> >
> > I'd be surprised that your compiler did not complain about x and y not
> > being initialized before the addition.  EW ARM warned me that y was used
> > before its value was set and that x was set but never used.
> >
> > Here is the code it gave me----with some extra comments added afterwards
> >
> >     111          void test(void){
> >     112            register unsigned long x,y;
> >     113            a = b+5;
> >   \                     test:
> >   \   00000000   0x....       LDR.N    R1,??DataTable8_6
> >   \   00000002   0x6809       LDR      R1,[R1, #+0]
> >   \   00000004   0x1D49       ADDS     R1,R1,#+5
> >   \   00000006   0x....       LDR.N    R2,??DataTable8_7
> >   \   00000008   0x6011       STR      R1,[R2, #+0]
> >     114            x = y+5;
> >   \   0000000A   0x1D40       ADDS     R0,R0,#+5   // R0 is y
> >                                                    // sum not saved
> >     115          }
> >   \   0000000C   0x4770       BX       LR  ;; return
> >     116
> > (14 bytes of code)
> >
> > Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
> > be saved and restored.
> >
> > Adding actual initialization to x and y and saving the result in b
> > produced the following:
> >
> >   In section .text, align 2, keep-with-next
> >     110          void test(void){
> >     111            register unsigned long x=3,y=4;
> >   \                     test:
> >   \   00000000   0x2003       MOVS     R0,#+3
> >   \   00000002   0x2104       MOVS     R1,#+4
> >     112            a = b+5;
> >   \   00000004   0x....       LDR.N    R2,??DataTable8_6
> >   \   00000006   0x6812       LDR      R2,[R2, #+0]
> >   \   00000008   0x1D52       ADDS     R2,R2,#+5
> >   \   0000000A   0x....       LDR.N    R3,??DataTable8_7
> >   \   0000000C   0x601A       STR      R2,[R3, #+0]
> >     113            x = y+5;
> >   \   0000000E   0x1D49       ADDS     R1,R1,#+5
> >   \   00000010   0x0008       MOVS     R0,R1       // x = sum
> >     114            b = x;   // this time save the result
> >   \   00000012   0x....       LDR.N    R1,??DataTable8_6
> >   \   00000014   0x6008       STR      R0,[R1, #+0]
> >     115          }
> >   \   00000016   0x4770       BX       LR  ;; return
> >     116
> >
> > Still accomplished with scratch registers----no need to save any on the
> > stack.  I changed from my default optimization of 'low' to 'none'
> > and got exactly the same code.
> >
> > Finally, I took out the 'register' key word before x and y----and
> > got exactly the same result as above.
> >
> > It seems that GCC just doesn't match up to IAR at producing compact
> > code at low optimization levels.  OTOH, given that EW_ARM costs
> > several KBucks, it SHOULD do better!
> >
> >
>
> The problems here don't lie with the compiler - they lie with the user.
> I'm sure that EW_ARM produces better code than gcc (correctly used) in
> some cases - but I am also sure that gcc can do better than EW_ARM in
> other cases.  I really don't think there is going to be a big difference
> in code generation quality - if that's why you paid K$ for EW, you've
> probably wasted your money.  There are many reasons for choosing
> different toolchains, but generally speaking I don't see a large
> difference in code generation quality between the major toolchains
> (including gcc) for 32-bit processors.  Occasionally you'll see major
> differences in particular kinds of code, but for the most part it is the
> user that makes the biggest difference.
>
> One place where EW_ARM might score over the gcc setup this user has (he
> hasn't yet said anything about the rest - is it home-made, CodeSourcery,
> Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
> switches correct, and avoid this "I don't know how to enable debugging
> and optimisation" or "what's a warning?" nonsense.
One of the reasons I like the EW_ARM system is that the IDE handles all
the compiler and linker flags with a pretty good GUI.  You can override
the GUI options with #pragma statements in the code----which I haven't
found reason to do for the most part.
>
>
> It hardly needs saying, but when run properly, my brief test with gcc
> produces the same code here as you get with EW_ARM, and the same
> warnings about x and y.
>
That's comforting in a way.  While I now use EW_ARM for most of my
current projects, I spent about 5 years using GCC_ARM on a project based
on Linux.  I would hate to think that I was producing crap code all that
time!  I had some experienced Linux users to set up my dev system and
show me how to generate good make files, so I probably got pretty good
results there.

I'm using EW_ARM for projects that don't have the resources of a Linux
OS, and I prefer it for these projects.
>
> I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
> switch to disable "common" sections.  With common sections disabled,
> definitions like "unsigned long a, b;" can only appear once in the
> program for each global identifier, and the space is allocated directly
> in the .bss inside the module that made the definition.  gcc can use
> this extra information to take advantage of relative placement between
> variables, and generate addressing via section anchors:
>
> Command line:
> arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
> -fno-common
>
> test:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r3, .L6
>         ldr     r0, [r3, #4]
>         adds    r2, r0, #5
>         str     r2, [r3, #0]
>         bx      lr
> .L7:
>         .align  2
> .L6:
>         .word   .LANCHOR0
>         .size   test, .-test
>         .global b
>         .global a
>         .bss
>         .align  2
>         .set    .LANCHOR0,. + 0
>         .type   a, %object
>         .size   a, 4
> a:
>         .space  4
>         .type   b, %object
>         .size   b, 4
> b:
>         .space  4
>         .ident  "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"
>
> It's all about learning to use the tools you have, rather than buying
> more expensive tools.
Which reminds me----when counting bytes in code like this, it's easy to
forget the bytes used in the constant tables that provide the addresses
of variables.  A 16-bit variable may require a 32-bit table entry.

I started with EW_ARM about three years before I started on the Linux
project.  The original compiler was purchased by the customer---who had
no preferences, but was developing a project with fairly limited
hardware resources.  They asked what compiler I'd like and I picked
EW-ARM.  At that time, I'd been using CodeWarrior for the M68K for many
years and EW_ARM had the same 'feel'.

When it came time to do the Linux project, the transition to GCC took
MUCH longer than the transition from CodeWarrior to EW_ARM.  Of course,
much of that was in setting up a virtual machine on the PC and learning
Linux so that I could use GCC.  One thing that I missed on the Linux
project is that I didn't have a debugger equivalent to C-Spy that is
integrated into EW_ARM.  Debugging on the Linux system was mostly "Save
everything and analyze later".

Of course, the original poster is discussing the type of code that few
Linux programmers write----direct interfacing to peripherals.  My recent
experience with Linux and digital cameras was pretty frustrating.  I was
dependent on others to provide the drivers--and they often didn't work
quite right with the particular camera I was using.  That's a story for
another time, though.
>
> mvh.,
>
> David
Mark Borgerson