Reply by Mark Borgerson June 24, 2012
In article <398b2c50-5de0-4eb0-8775-d4a200ad7a30
@a16g2000vby.googlegroups.com>, dp@tgi-sci.com says...
> On Jun 24, 1:37 am, David Brown <david.br...@removethis.hesbynett.no>
> wrote:
>> ....
>> ARM code will be more compact (and /much/ faster) for multiplying
>> floats. Code comparisons on wildly different architectures are heavily
>> dependent on the sort of code used.
>
> Hi David,
> have you tried how fast ARM does at say 64 (or 32) bit MAC in a filter
> loop?
> Not so long ago I had to reach the limit for a power CPU (MPC5200B)
> which is specified at 2 cycles per MAC. Was not trivial at all, doing
> it in a loop DSP-like took about 10 cycles mainly because of data
> dependencies. Had to spread things over many registers until I got
> there, 2.1 cycles (with load/store included). That for a filter with
> hundreds of taps.
> How many FP registers do ARM have? It took using 24 out of the 32
> to get to 2.1 cycles (although using the same technique over 18
> registers yielded about 2.3 cycles, 15 was dramatically worse,
> data dependencies began to kick in - likely a 6 stage pipeline).
>
> Dimiter
>
> ------------------------------------------------------
> Dimiter Popoff               Transgalactic Instruments
> http://www.tgi-sci.com
> ------------------------------------------------------
> http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
I don't think the ARM Cortex-M4 chips have 64-bit FPUs, and their clock speeds top out at about 160MHz. You're not going to get anything near the MPC5200B performance. The closest you might get with an ARM-based chip might be one of the TI OMAP chips, which have a fixed/floating DSP coprocessor.

Mark Borgerson
Reply by David Brown June 24, 2012
On 24/06/12 12:19, dp wrote:
> On Jun 24, 1:37 am, David Brown<david.br...@removethis.hesbynett.no>
> wrote:
>> ....
>> ARM code will be more compact (and /much/ faster) for multiplying
>> floats. Code comparisons on wildly different architectures are heavily
>> dependent on the sort of code used.
>
> Hi David,
> have you tried how fast ARM does at say 64 (or 32) bit MAC in a filter
> loop?
> Not so long ago I had to reach the limit for a power CPU (MPC5200B)
> which is specified at 2 cycles per MAC. Was not trivial at all, doing
> it in a loop DSP-like took about 10 cycles mainly because of data
> dependencies. Had to spread things over many registers until I got
> there, 2.1 cycles (with load/store included). That for a filter with
> hundreds of taps.
> How many FP registers do ARM have? It took using 24 out of the 32
> to get to 2.1 cycles (although using the same technique over 18
> registers yielded about 2.3 cycles, 15 was dramatically worse,
> data dependencies began to kick in - likely a 6 stage pipeline).
>
> Dimiter
I haven't tried anything exactly like that (I love doing that sort of thing, but seldom have the need). On the PPC cores I have used (e200z7 recently), it can make a big difference to the speed of the code when things are spread out over many registers. Most PPC cores have quite long pipelines, and some have super-scalar execution, speculative execution or loads, etc., which makes getting everything right here a big challenge. You also need to take the cache into account - getting the computation flow to match the cache flow is vital.

In comparison, the Cortex-M0 is very simple. It has pipelining, but not nearly as deep, and it does not have a cache to consider. Some Cortex-M devices have a bit of cache, and may also have tightly-coupled memory, so there you have to consider the flow of data into and out of the CPU core. But I expect it is easier to get close to peak performance from an M0 than from a typical PPC core.
Reply by dp June 24, 2012
On Jun 24, 1:37 am, David Brown <david.br...@removethis.hesbynett.no>
wrote:
> ....
> ARM code will be more compact (and /much/ faster) for multiplying
> floats. Code comparisons on wildly different architectures are heavily
> dependent on the sort of code used.
Hi David,
have you tried how fast ARM does at say 64 (or 32) bit MAC in a filter
loop?
Not so long ago I had to reach the limit for a power CPU (MPC5200B)
which is specified at 2 cycles per MAC. Was not trivial at all, doing
it in a loop DSP-like took about 10 cycles mainly because of data
dependencies. Had to spread things over many registers until I got
there, 2.1 cycles (with load/store included). That for a filter with
hundreds of taps.
How many FP registers do ARM have? It took using 24 out of the 32
to get to 2.1 cycles (although using the same technique over 18
registers yielded about 2.3 cycles, 15 was dramatically worse,
data dependencies began to kick in - likely a 6 stage pipeline).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
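Dimiter's register-spreading trick translates to C roughly as follows - a hedged sketch with invented names, not his actual MPC5200B code: four independent accumulators break the serial dependency chain of a naive MAC loop, so consecutive multiply-adds don't stall on each other (the compiler still has to keep them all in registers, which is where the 15/18/24-register observations come from):

```c
#include <assert.h>

/* Hypothetical sketch of a 4-accumulator FIR inner loop.  Each
 * accumulator carries an independent multiply-add chain, so the
 * pipeline is not stalled waiting for the previous MAC's result. */
static double fir(const double *x, const double *h, int ntaps)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    int i;
    for (i = 0; i + 3 < ntaps; i += 4) {   /* four independent chains */
        a0 += x[i]     * h[i];
        a1 += x[i + 1] * h[i + 1];
        a2 += x[i + 2] * h[i + 2];
        a3 += x[i + 3] * h[i + 3];
    }
    for (; i < ntaps; i++)                 /* leftover taps */
        a0 += x[i] * h[i];
    return (a0 + a1) + (a2 + a3);
}
```

Whether this actually reaches 2-cycles-per-MAC depends on the core's FP latency and how many accumulators the compiler can keep live, exactly as described above.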
Reply by David Brown June 23, 2012
On 23/06/12 17:23, hamilton wrote:
> This discussion is kind of silly.
>
> Has anyone tried a "simple" program with multiplying two or four
> floating point numbers and displaying the result to an LCD ??
>
> As has been mentioned, you don't buy a V8 just to watch the cylinders go
> up and down.
>
> You buy a V8 to use it.
>
> How much code space would the PIC18 take to multiply two/four floats ??
>
> hamilton
It is certainly true that you can only get a real-world test using real-world code. The trouble is, real-world code is not very suitable for discussing in a newsgroup. So the best we can do is look at some simple sample functions, and work from there. And when a poster is having such trouble generating good code from a simple "a = b + 5" function, it makes sense to start with that and work up.

You bring up another point here, of course - while the PIC18 may have compact code for setting a bit or adding a couple of 8-bit numbers, the ARM code will be more compact (and /much/ faster) for multiplying floats. Code comparisons on wildly different architectures are heavily dependent on the sort of code used.
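For anyone who wants to run hamilton's challenge on both toolchains, a minimal sketch (names invented here; sizes will vary by compiler, library, and FPU, so treat this as a test input, not a benchmark result):

```c
#include <stdio.h>
#include <stddef.h>

/* Multiply two floats.  On a PIC18 this drags in a software
 * floating-point library; on a Cortex-M4 with hardware FPU the
 * multiply itself is a single instruction.  Build with -Os on each
 * toolchain and compare the map files. */
float fmul(float a, float b)
{
    return a * b;
}

/* Format the result into a caller-supplied "LCD line" buffer,
 * standing in for whatever LCD driver the target actually has. */
void lcd_show(char *lcd, size_t n, float v)
{
    snprintf(lcd, n, "%.2f", (double)v);
}
```

Note that most of the code space in such a test goes to the printf-style formatting, not the multiply - which is part of hamilton's point.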
Reply by Mark Borgerson June 23, 2012
In article <cOidnQo3hoKmCnjSnZ2dnUVZ7tOdnZ2d@lyse.net>, 
david.brown@removethis.hesbynett.no says...
> On 23/06/12 01:03, Mark Borgerson wrote:
>> In article<svCdnfOCcv-WvHnSnZ2dnUVZ8rednZ2d@lyse.net>,
>> david@westcontrol.removethisbit.com says...
>>>
>>> On 22/06/2012 02:42, Mark Borgerson wrote:
>>>> In article<9b55cce9-96db-46f4-909a-1f6500deb237
>>>> It seems that GCC just doesn't match up to IAR at producing compact
>>>> code at low optimization levels. OTOH, given that EW_ARM costs
>>>> several KBucks, it SHOULD do better!
>>>
>>> The problems here don't lie with the compiler - they lie with the user.
>>> I'm sure that EW_ARM produces better code than gcc (correctly used) in
>>> some cases - but I am also sure that gcc can do better than EW_ARM in
>>> other cases. I really don't think there is going to be a big difference
>>> in code generation quality - if that's why you paid K$ for EW, you've
>>> probably wasted your money. There are many reasons for choosing
>>> different toolchains, but generally speaking I don't see a large
>>> difference in code generation quality between the major toolchains
>>> (including gcc) for 32-bit processors. Occasionally you'll see major
>>> differences in particular kinds of code, but for the most part it is the
>>> user that makes the biggest difference.
>>>
>>> One place where EW_ARM might score over the gcc setup this user has (he
>>> hasn't yet said anything about the rest - is it home-made, CodeSourcery,
>>> Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
>>> switches correct, and avoid this "I don't know how to enable debugging
>>> and optimisation" or "what's a warning?" nonsense.
>>
>> One of the reasons I like the EW_ARM system is that the IDE handles all
>> the compiler and linker flags with a pretty good GUI. You can override
>> the GUI options with #pragma statements in the code----which I haven't
>> found reason to do for the most part.
>
> As I say, we don't know what toolchain package the poster here was
> using, but there certainly are gcc-based toolchain packages available
> that handle this fine. We use Code Sourcery for a couple of
> different processors - they package gcc along with libraries, debugger
> support, and Eclipse to give similar ease-of-use. Although Code
> Sourcery is the package I am most familiar with, I know that others
> such as Code Red are similar. I don't know what the poster uses that
> makes it apparently so hard to get it right.
>
> Personally, I prefer to use makefiles and explicit compiler flags (or
> pragmas / function attributes as needed). I think that gives better
> control, more replicable results, and is more suitable for re-use on
> different projects, different development hosts, and different tool
> versions. But that's a matter of taste - and I don't recommend it as
> the first step for someone unfamiliar with the tools.
I agree. I was glad I had other programmers and sample files to help me through the initial setup of GCC-ARM.
>>>
>>> It hardly needs saying, but when run properly, my brief test with gcc
>>> produces the same code here as you get with EW_ARM, and the same
>>> warnings about x and y.
>>>
>> That's comforting in a way. While I now use EW_ARM for most of my
>> current projects, I spent about 5 years using GCC_ARM on a project
>> based on Linux. I would hate to think that I was producing crap code
>> all that time! I had some experienced Linux users to set up my dev
>> system and show me how to generate good make files, so I probably
>> got pretty good results there.
>
> gcc has very extensive static error checking and warning mechanisms, and
> they've been getting better with each version. It doesn't have MISRA
> rule checking, which I believe EW has, but otherwise it is top-class.
> Of course, you have to enable the warnings!
EW does have MISRA rule checking---but I haven't started to use it yet.
>> I'm using EW_ARM for projects that don't have the resources of a
>> Linux OS, and I prefer it for these projects.
>
> Here is the key point that makes EW worth the money for /you/ - you
> prefer it. When choosing tools and judging value for money, questions
> of code generation quality are normally secondary to what the developer
> finds most productive - developer time outweighs tool costs.
I found that to be very true when I switched from a very low-end PCB layout program to PADS PCB. The $4K cost of that system has been paid back many times over in time saved in the design and layout of PC boards for customers. Heck, I've even learned to trust the autorouter when it is properly set up. ("Trust, but verify" still applies, though). The autorouter really helps with those fine-pitch QFP STM32 chips! I was able to pull an MSP430 from an existing design and plug in an STM32F205 in just a few days.
>
<< Snip Example Code>>
>>>
>>> It's all about learning to use the tools you have, rather than buying
>>> more expensive tools.
>>
>> Which reminds me----when counting bytes in code like this, it's easy to
>> forget the bytes used in the constant tables that provide the addresses
>> of variables. A 16-bit variable may require a 32-bit table entry.
>
> Indeed - people trying to "hand-optimise" their code often miss out
> details like that. ("I'll use a 16-bit variable instead of a 32-bit
> variable to save memory space...".)
>
> If you can use section anchors (like above), or a "small data section"
> (as used by the PPC ABI, though not the ARM, for some reason), then you
> can avoid most of the individual storage and loads of addresses.
It took me a while when I first started looking at disassembled ARM code to realize that constants were being loaded using PC-relative offsets. When I finally figured that out it took me back to the mid-80's when I was writing Macintosh code using position-independent modules. What a rush of nostalgia that was!
>> I started with EW_ARM about three years before I started on the Linux
>> project. The original compiler was purchased by the customer---who had
>> no preferences, but was developing a project with fairly limited
>> hardware resources. They asked what compiler I'd like and I picked EW-
>> ARM. At that time, I'd been using CodeWarrior for the M68K for many
>> years and EW_ARM had the same 'feel'. When it came time to do the
>> Linux project, the transition to GCC took MUCH longer than the
>> transition from CodeWarrior to EW_ARM. Of course, much of that was in
>> setting up a virtual machine on the PC and learning Linux so that I
>> could use GCC.
>
> Learning to develop Linux programs is a lot more than just learning gcc,
> as you've found out.
>
> One gets used to one's tools. I've been using gcc for embedded
> development for some 15 years, and have used it on perhaps 8 different
> processor architectures. So for me, gcc is always the obvious choice
> for new devices, since I am most familiar with it.
>
> I actually think there is a fair similarity between modern CodeWarrior
> and gcc - CW supports many gcc extensions such as the inline assembly
> syntax and several attributes. On the IDE side, of course, gcc has no
> IDE - it's a compiler. But gcc is often used with Eclipse, which is
> what CW now uses for most targets. (The "classic" CW IDE was horrible -
> if EW has a similar feel, then I'll remember not to buy it!)
I think I mis-stated part of that. It is really the editor and project file window that are similar between EW and CodeWarrior. The other parts of EW are much better, with fairly straightforward menus and dialog boxes for setting compiler, linker, and debugger options.

I was using CodeWarrior to develop code for the Persistor micro data loggers. That was a pretty tightly constrained development environment, and Persistor provided most of the setup files and initial project files. If you have to start from scratch, I agree that the old CodeWarrior was a true PITA to get set up. The Persistor logger didn't have a true debug capability, so I can't comment on the capabilities of CodeWarrior in that regard. I really like the integrated C-Spy debugger in EW, though.
>> One thing that I missed on the Linux project is that I didn't have a
>> debugger equivalent to C-Spy that is integrated into EW_ARM. Debugging
>> on the Linux system was mostly "Save everything and analyze later".
>
> There are /lots/ of debugging options for Linux development, that can be
> much more powerful than C-Spy (depending on the type of programming you
> are doing, of course). However, it all involves a lot more learning and
> experimenting than the ease-of-use of an integrated debugger in an IDE.
One of the problems with debugging the Linux-based system was that the code was controlling an autonomous parafoil supply delivery system. When the system was operating, it started 20,000 feet above and several miles away from the programmer! Thus the "record everything and analyze later" paradigm.

I did develop a simulated version that grabbed the GPS and other sensor values via hooks and substituted simulated values based on a simple flight model. That sim ran on the target hardware while it sat on the bench. I got really tired of listening to the whining control servos! Unfortunately, the flight model wasn't up to simulating GPS loss, sticky servos, and all the things that can happen to parafoils stressed beyond their flight limits.

Flight algorithms were tested in sims running on Borland CPP Builder on a PC. That allowed good debugging, lots of intermediate variable recording, and graphic displays. The CPP Builder sims were written to use the same C control code that ran on the target hardware.
>> Of course, the original poster is discussing the type of code that few
>> Linux programmers write----direct interfacing to peripherals. My recent
>> experience with Linux and digital cameras was pretty frustrating. I was
>> dependent on others to provide the drivers--and they often didn't work
>> quite right with the particular camera I was using. That's a story for
>> another time, though.
>
> Indeed.
Mark Borgerson
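The sensor-hook arrangement Mark describes can be sketched as follows (all names are invented for illustration, not taken from his system): the control code reads the GPS through a function pointer, so the bench build can swap in a flight-model stub while the same C control code runs unchanged on the target hardware.

```c
/* Hypothetical sketch of the "grab sensor values via hooks" pattern. */
typedef struct { double lat, lon, alt; } gps_fix;

/* Stand-in for the real GPS driver on the target hardware. */
static gps_fix real_gps_read(void)
{
    gps_fix f = { 0.0, 0.0, 0.0 };
    return f;
}

/* Simple flight-model stub for bench runs. */
static gps_fix sim_gps_read(void)
{
    gps_fix f = { 45.0, -120.0, 6096.0 };  /* roughly 20,000 ft */
    return f;
}

/* The hook: flight code only ever calls gps_read(), and never knows
 * whether it is talking to hardware or to the simulator. */
static gps_fix (*gps_read)(void) = real_gps_read;
```

The bench build simply assigns `gps_read = sim_gps_read;` at startup, which keeps the control loop identical across the real and simulated configurations.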
Reply by hamilton June 23, 2012
This discussion is kind of silly.

Has anyone tried a "simple" program with multiplying two or four 
floating point numbers and displaying the result to an LCD ??

As has been mentioned, you don't buy a V8 just to watch the cylinders go 
up and down.

You buy a V8 to use it.

How much code space would the PIC18 take to multiply two/four floats ??

hamilton

Reply by David Brown June 23, 2012
On 23/06/12 01:03, Mark Borgerson wrote:
> In article<svCdnfOCcv-WvHnSnZ2dnUVZ8rednZ2d@lyse.net>,
> david@westcontrol.removethisbit.com says...
>>
>> On 22/06/2012 02:42, Mark Borgerson wrote:
>>> In article<9b55cce9-96db-46f4-909a-1f6500deb237
>>> It seems that GCC just doesn't match up to IAR at producing compact
>>> code at low optimization levels. OTOH, given that EW_ARM costs
>>> several KBucks, it SHOULD do better!
>>
>> The problems here don't lie with the compiler - they lie with the user.
>> I'm sure that EW_ARM produces better code than gcc (correctly used) in
>> some cases - but I am also sure that gcc can do better than EW_ARM in
>> other cases. I really don't think there is going to be a big difference
>> in code generation quality - if that's why you paid K$ for EW, you've
>> probably wasted your money. There are many reasons for choosing
>> different toolchains, but generally speaking I don't see a large
>> difference in code generation quality between the major toolchains
>> (including gcc) for 32-bit processors. Occasionally you'll see major
>> differences in particular kinds of code, but for the most part it is the
>> user that makes the biggest difference.
>>
>> One place where EW_ARM might score over the gcc setup this user has (he
>> hasn't yet said anything about the rest - is it home-made, CodeSourcery,
>> Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
>> switches correct, and avoid this "I don't know how to enable debugging
>> and optimisation" or "what's a warning?" nonsense.
>
> One of the reasons I like the EW_ARM system is that the IDE handles all
> the compiler and linker flags with a pretty good GUI. You can override
> the GUI options with #pragma statements in the code----which I haven't
> found reason to do for the most part.
As I say, we don't know what toolchain package the poster here was using, but there certainly are gcc-based toolchain packages available that handle this fine. We use Code Sourcery for a couple of different processors - they package gcc along with libraries, debugger support, and Eclipse to give similar ease-of-use. Although Code Sourcery is the package I am most familiar with, I know that others such as Code Red are similar. I don't know what the poster uses that makes it apparently so hard to get it right.

Personally, I prefer to use makefiles and explicit compiler flags (or pragmas / function attributes as needed). I think that gives better control, more replicable results, and is more suitable for re-use on different projects, different development hosts, and different tool versions. But that's a matter of taste - and I don't recommend it as the first step for someone unfamiliar with the tools.
>>
>> It hardly needs saying, but when run properly, my brief test with gcc
>> produces the same code here as you get with EW_ARM, and the same
>> warnings about x and y.
>>
> That's comforting in a way. While I now use EW_ARM for most of my
> current projects, I spent about 5 years using GCC_ARM on a project
> based on Linux. I would hate to think that I was producing crap code
> all that time! I had some experienced Linux users to set up my dev
> system and show me how to generate good make files, so I probably
> got pretty good results there.
gcc has very extensive static error checking and warning mechanisms, and they've been getting better with each version. It doesn't have MISRA rule checking, which I believe EW has, but otherwise it is top-class. Of course, you have to enable the warnings!
> I'm using EW_ARM for projects that don't have the resources of a
> Linux OS, and I prefer it for these projects.
Here is the key point that makes EW worth the money for /you/ - you prefer it. When choosing tools and judging value for money, questions of code generation quality are normally secondary to what the developer finds most productive - developer time outweighs tool costs.
>>
>> I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
>> switch to disable "common" sections. With this disabled, definitions
>> like "unsigned long a, b;" can only appear once in the program for each
>> global identifier, and the space is allocated directly in the .bss
>> inside the module that made the definition. gcc can use this extra
>> information to take advantage of relative placement between variables,
>> and generate addressing via section anchors:
>>
>> Command line:
>> arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
>> -fno-common
>>
>> test:
>>         @ args = 0, pretend = 0, frame = 0
>>         @ frame_needed = 0, uses_anonymous_args = 0
>>         @ link register save eliminated.
>>         ldr     r3, .L6
>>         ldr     r0, [r3, #4]
>>         adds    r2, r0, #5
>>         str     r2, [r3, #0]
>>         bx      lr
>> .L7:
>>         .align  2
>> .L6:
>>         .word   .LANCHOR0
>>         .size   test, .-test
>>         .global b
>>         .global a
>>         .bss
>>         .align  2
>>         .set    .LANCHOR0,. + 0
>>         .type   a, %object
>>         .size   a, 4
>> a:
>>         .space  4
>>         .type   b, %object
>>         .size   b, 4
>> b:
>>         .space  4
>>         .ident  "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"
>>
>> It's all about learning to use the tools you have, rather than buying
>> more expensive tools.
>
> Which reminds me----when counting bytes in code like this, it's easy to
> forget the bytes used in the constant tables that provide the addresses
> of variables. A 16-bit variable may require a 32-bit table entry.
Indeed - people trying to "hand-optimise" their code often miss out details like that. ("I'll use a 16-bit variable instead of a 32-bit variable to save memory space...".) If you can use section anchors (like above), or a "small data section" (as used by the PPC ABI, though not the ARM, for some reason), then you can avoid most of the individual storage and loads of addresses.
> I started with EW_ARM about three years before I started on the Linux
> project. The original compiler was purchased by the customer---who had
> no preferences, but was developing a project with fairly limited
> hardware resources. They asked what compiler I'd like and I picked EW-
> ARM. At that time, I'd been using CodeWarrior for the M68K for many
> years and EW_ARM had the same 'feel'. When it came time to do the
> Linux project, the transition to GCC took MUCH longer than the
> transition from CodeWarrior to EW_ARM. Of course, much of that was in
> setting up a virtual machine on the PC and learning Linux so that I
> could use GCC.
Learning to develop Linux programs is a lot more than just learning gcc, as you've found out.

One gets used to one's tools. I've been using gcc for embedded development for some 15 years, and have used it on perhaps 8 different processor architectures. So for me, gcc is always the obvious choice for new devices, since I am most familiar with it.

I actually think there is a fair similarity between modern CodeWarrior and gcc - CW supports many gcc extensions such as the inline assembly syntax and several attributes. On the IDE side, of course, gcc has no IDE - it's a compiler. But gcc is often used with Eclipse, which is what CW now uses for most targets. (The "classic" CW IDE was horrible - if EW has a similar feel, then I'll remember not to buy it!)
> One thing that I missed on the Linux project is that I didn't have a
> debugger equivalent to C-Spy that is integrated into EW_ARM. Debugging
> on the Linux system was mostly "Save everything and analyze later".
There are /lots/ of debugging options for Linux development, that can be much more powerful than C-Spy (depending on the type of programming you are doing, of course). However, it all involves a lot more learning and experimenting than the ease-of-use of an integrated debugger in an IDE.
> Of course, the original poster is discussing the type of code that few
> Linux programmers write----direct interfacing to peripherals. My recent
> experience with Linux and digital cameras was pretty frustrating. I was
> dependent on others to provide the drivers--and they often didn't work
> quite right with the particular camera I was using. That's a story for
> another time, though.
Indeed.

mvh.,

David
>>
>> mvh.,
>>
>> David
>
> Mark Borgerson
Reply by Mark Borgerson June 22, 2012
In article <svCdnfOCcv-WvHnSnZ2dnUVZ8rednZ2d@lyse.net>, 
david@westcontrol.removethisbit.com says...
> On 22/06/2012 02:42, Mark Borgerson wrote:
>> In article <9b55cce9-96db-46f4-909a-1f6500deb237
>> @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>>>
>>> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
>>> wrote:
>>>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>>>> It is not a surprise that the code
>>>> size has increased in moving from the PIC18 - differences here will vary
>>>> wildly according to the type of code. But it /is/ a surprise that you
>>>> only have 2K of startup, vector tables, and library code.
>>>
>>> Not all that surprising, here are the sizes in bytes:
>>> .vectors    304
>>> .init       508
>>> __putchar    40
>>> __vprintf  1498
>>> memcpy       56
>>>
>>>>> Instead of a macro I thought about making an ASM inline function that
>>>>> would use the CLZ instruction to do this efficiently but for some
>>>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>>>> debug mode so I just gave up at that point.
>>>>
>>>> First off, you should not need to resort to assembly to get basic
>>>> instructions working - the compiler should produce near-optimal code as
>>>> long as you let it (by enabling optimisations and writing appropriate C
>>>> code).
>>>
>>> I've tried several ways of writing a counleadingzeroes() function that
>>> would use the Cortex CLZ instruction but even with optimization turned
>>> on it still wouldn't do it.
>>>
>>>> Secondly, don't use "ASM functions" - they are normally only needed by
>>>> more limited compilers. If you need to use assembly with gcc, use gcc's
>>>> extended "asm" syntax.
>>>
>>> There are some things like the bootloader that need to be ASM
>>> functions in their own separate .S file anyway since they need to copy
>>> portions of themselves to RAM in order to execute. But a bootloader is
>>> a special case and I do agree that normal code shouldn't need to rely
>>> on ASM functions. I must say I'm not familiar with gcc's extended asm
>>> syntax and although I did look at it briefly it seemed like it was
>>> more complicated than a plain old .S file and it was mostly geared
>>> towards mixing C and ASM together in the same function and accessing
>>> variables by name etc. Not something I needed for a simple bootloader.
>>>
>>>> Finally, if you are not getting inlining when debugging it is because
>>>> you have got incorrect compiler switches. You should not have different
>>>> "debug" and "release" (or "optimised") builds - do a single build with
>>>> the proper optimisation settings (typically -Os unless you know what you
>>>> are doing) and "-g" to enable debugging. You never want to be releasing
>>>> code that is built differently from the code you debugged.
>>>
>>> I was fighting with this for a while when I was first handed this
>>> toolchain, and it seems that in debug mode, there is no -O switch at
>>> all and in release mode it defaults to -O1. When I change this to -Os it
>>> does produce the same code as the sample that you posted above from
>>> gcc 4.6.1 (mine is 4.4.4 by the way). However even with manually
>>> adding the -g switch I still don't get source annotation in the ELF
>>> file unless I use debug mode. This effectively limits any development/
>>> debugging to unoptimized code, which still has to fit into the 256K
>>> somehow.
>>>
>>> As for using register keywords and accessing globals through pointers,
>>> I normally don't do this (haven't used the register keyword in years)
>>> and I certainly wouldn't be doing it at all if it didn't have such a
>>> significant effect on the code size:
>>>
>>> unsigned long a,b;
>>>
>>> void test(void) {
>>>   B4B0      push   {r4-r5, r7}
>>>   AF00      add    r7, sp, #0
>>> ----------------------------------------
>>>   register unsigned long x, y;
>>>   a=b+5;
>>>   F2402314  movw   r3, #0x214
>>>   F2C20300  movt   r3, #0x2000
>>>   681B      ldr    r3, [r3, #0]
>>>   F1030205  add.w  r2, r3, #5
>>>   F240231C  movw   r3, #0x21C
>>>   F2C20300  movt   r3, #0x2000
>>>   601A      str    r2, [r3, #0]
>>> ----------------------------------------
>>>   x=y+5;
>>>   F1040505  add.w  r5, r4, #5
>>> ----------------------------------------
>>> }
>>>   46BD      mov    sp, r7
>>>   BCB0      pop    {r4-r5, r7}
>>>   4770      bx     lr
>>>   BF00      nop
>>
>> (36 bytes of code)
>>
>> I'd be surprised that your compiler did not complain about x and y not
>> being initialized before the addition. EW ARM warned me that y was used
>> before its value was set and that x was set but never used.
>>
>> Here is the code it gave me----with some extra comments added afterwards
>>
>>   111  void test(void){
>>   112    register unsigned long x,y;
>>   113    a = b+5;
>>   \  test:
>>   \   00000000 0x....  LDR.N  R1,??DataTable8_6
>>   \   00000002 0x6809  LDR    R1,[R1, #+0]
>>   \   00000004 0x1D49  ADDS   R1,R1,#+5
>>   \   00000006 0x....  LDR.N  R2,??DataTable8_7
>>   \   00000008 0x6011  STR    R1,[R2, #+0]
>>   114    x = y+5;
>>   \   0000000A 0x1D40  ADDS   R0,R0,#+5   // R0 is y
>>                                           // sum not saved
>>   115  }
>>   \   0000000C 0x4770  BX     LR          ;; return
>>   116
>>
>> (14 bytes of code)
>>
>> Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
>> be saved and restored.
>>
>> Adding actual initialization to x and y and saving the result in b
>> produced the following:
>>
>>   In section .text, align 2, keep-with-next
>>   110  void test(void){
>>   111    register unsigned long x=3,y=4;
>>   \  test:
>>   \   00000000 0x2003  MOVS   R0,#+3
>>   \   00000002 0x2104  MOVS   R1,#+4
>>   112    a = b+5;
>>   \   00000004 0x....  LDR.N  R2,??DataTable8_6
>>   \   00000006 0x6812  LDR    R2,[R2, #+0]
>>   \   00000008 0x1D52  ADDS   R2,R2,#+5
>>   \   0000000A 0x....  LDR.N  R3,??DataTable8_7
>>   \   0000000C 0x601A  STR    R2,[R3, #+0]
>>   113    x = y+5;
>>   \   0000000E 0x1D49  ADDS   R1,R1,#+5
>>   \   00000010 0x0008  MOVS   R0,R1       // x = sum
>>   114    b = x;   // this time save the result
>>   \   00000012 0x....  LDR.N  R1,??DataTable8_6
>>   \   00000014 0x6008  STR    R0,[R1, #+0]
>>   115  }
>>   \   00000016 0x4770  BX     LR          ;; return
>>   116
>>
>> Still accomplished with scratch registers----no need to save any on the
>> stack. I changed from my default optimization of 'low' to 'none'
>> and got exactly the same code.
>>
>> Finally, I took out the 'register' key word before x and y----and
>> got exactly the same result as above.
>>
>> It seems that GCC just doesn't match up to IAR at producing compact
>> code at low optimization levels. OTOH, given that EW_ARM costs
>> several KBucks, it SHOULD do better!
>
> The problems here don't lie with the compiler - they lie with the user.
> I'm sure that EW_ARM produces better code than gcc (correctly used) in
> some cases - but I am also sure that gcc can do better than EW_ARM in
> other cases. I really don't think there is going to be a big difference
> in code generation quality - if that's why you paid K$ for EW, you've
> probably wasted your money. There are many reasons for choosing
> different toolchains, but generally speaking I don't see a large
> difference in code generation quality between the major toolchains
> (including gcc) for 32-bit processors. Occasionally you'll see major
> differences in particular kinds of code, but for the most part it is the
> user that makes the biggest difference.
>
> One place where EW_ARM might score over the gcc setup this user has (he
> hasn't yet said anything about the rest - is it home-made, CodeSourcery,
> Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
> switches correct, and avoid this "I don't know how to enable debugging
> and optimisation" or "what's a warning?" nonsense.
One of the reasons I like the EW_ARM system is that the IDE handles all the compiler and linker flags with a pretty good GUI. You can override the GUI options with #pragma statements in the code----which I haven't found reason to do for the most part.
>
> It hardly needs saying, but when run properly, my brief test with gcc
> produces the same code here as you get with EW_ARM, and the same
> warnings about x and y.
>
That's comforting in a way. While I now use EW_ARM for most of my current projects, I spent about 5 years using GCC_ARM on a project based on Linux. I would hate to think that I was producing crap code all that time! I had some experienced Linux users to set up my dev system and show me how to generate good make files, so I probably got pretty good results there. I'm using EW_ARM for projects that don't have the resources of a Linux OS, and I prefer it for these projects.
>
> I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
> switch to disable "common" sections.  With this disabled, definitions
> like "unsigned long a, b;" can only appear once in the program for each
> global identifier, and the space is allocated directly in the .bss
> inside the module that made the definition.  gcc can use this extra
> information to take advantage of relative placement between variables,
> and generate addressing via section anchors:
>
> Command line:
> arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
> -fno-common
>
> test:
> 	@ args = 0, pretend = 0, frame = 0
> 	@ frame_needed = 0, uses_anonymous_args = 0
> 	@ link register save eliminated.
> 	ldr	r3, .L6
> 	ldr	r0, [r3, #4]
> 	adds	r2, r0, #5
> 	str	r2, [r3, #0]
> 	bx	lr
> .L7:
> 	.align	2
> .L6:
> 	.word	.LANCHOR0
> 	.size	test, .-test
> 	.global	b
> 	.global	a
> 	.bss
> 	.align	2
> 	.set	.LANCHOR0,. + 0
> 	.type	a, %object
> 	.size	a, 4
> a:
> 	.space	4
> 	.type	b, %object
> 	.size	b, 4
> b:
> 	.space	4
> 	.ident	"GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"
>
> It's all about learning to use the tools you have, rather than buying
> more expensive tools.
Which reminds me----when counting bytes in code like this, it's easy to
forget the bytes used in the constant tables that provide the addresses
of variables.  A 16-bit variable may require a 32-bit table entry.

I started with EW_ARM about three years before I started on the Linux
project.  The original compiler was purchased by the customer---who had
no preferences, but was developing a project with fairly limited
hardware resources.  They asked what compiler I'd like and I picked
EW-ARM.  At that time, I'd been using CodeWarrior for the M68K for many
years and EW_ARM had the same 'feel'.

When it came time to do the Linux project, the transition to GCC took
MUCH longer than the transition from CodeWarrior to EW_ARM.  Of course,
much of that was in setting up a virtual machine on the PC and learning
Linux so that I could use GCC.  One thing that I missed on the Linux
project is that I didn't have a debugger equivalent to C-Spy that is
integrated into EW_ARM.  Debugging on the Linux system was mostly "Save
everything and analyze later".

Of course, the original poster is discussing the type of code that few
Linux programmers write----direct interfacing to peripherals.  My recent
experience with Linux and digital cameras was pretty frustrating.  I was
dependent on others to provide the drivers--and they often didn't work
quite right with the particular camera I was using.  That's a story for
another time, though.
>
> mvh.,
>
> David
Mark Borgerson
Reply by David Brown June 22, 2012
On 22/06/2012 02:42, Mark Borgerson wrote:
> In article <9b55cce9-96db-46f4-909a-1f6500deb237
> @j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>>
>> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
>> wrote:
>>> On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
>>> It is not a surprise that the code
>>> size has increased in moving from the PIC18 - differences here will vary
>>> wildly according to the type of code.  But it /is/ a surprise that you
>>> only have 2K of startup, vector tables, and library code.
>>
>> Not all that surprising, here are the sizes in bytes:
>> .vectors    304
>> .init       508
>> __putchar    40
>> __vprintf  1498
>> memcpy       56
>>
>>>> Instead of a macro I thought about making an ASM inline function that
>>>> would use the CLZ instruction to do this efficiently but for some
>>>> reason gcc didn't want to inline any of my functions (C or ASM) in
>>>> debug mode so I just gave up at that point.
>>>
>>> First off, you should not need to resort to assembly to get basic
>>> instructions working - the compiler should produce near-optimal code as
>>> long as you let it (by enabling optimisations and writing appropriate C
>>> code).
>>
>> I've tried several ways of writing a counleadingzeroes() function that
>> would use the Cortex CLZ instruction but even with optimization turned
>> on it still wouldn't do it.
>>
>>> Secondly, don't use "ASM functions" - they are normally only needed by
>>> more limited compilers.  If you need to use assembly with gcc, use gcc's
>>> extended "asm" syntax.
>>
>> There are some things like the bootloader that need to be ASM
>> functions in their own separate .S file anyway since they need to copy
>> portions of themselves to RAM in order to execute.  But a bootloader is
>> a special case and I do agree that normal code shouldn't need to rely
>> on ASM functions.  I must say I'm not familiar with gcc's extended asm
>> syntax and although I did look at it briefly it seemed like it was
>> more complicated than a plain old .S file and it was mostly geared
>> towards mixing C and ASM together in the same function and accessing
>> variables by name etc.  Not something I needed for a simple bootloader.
>>
>>> Finally, if you are not getting inlining when debugging it is because
>>> you have got incorrect compiler switches.  You should not have different
>>> "debug" and "release" (or "optimised") builds - do a single build with
>>> the proper optimisation settings (typically -Os unless you know what you
>>> are doing) and "-g" to enable debugging.  You never want to be releasing
>>> code that is built differently from the code you debugged.
>>
>> I was fighting with this for a while when I was first handed this
>> toolchain, and it seems that in debug mode, there is no -O switch at
>> all and in release mode it defaults to -O1.  When I change this to -Os it
>> does produce the same code as the sample that you posted above from
>> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
>> adding the -g switch I still don't get source annotation in the ELF
>> file unless I use debug mode.  This effectively limits any development/
>> debugging to unoptimized code, which still has to fit into the 256K
>> somehow.
>>
>> As for using register keywords and accessing globals through pointers,
>> I normally don't do this (haven't used the register keyword in years)
>> and I certainly wouldn't be doing it at all if it didn't have such a
>> significant effect on the code size:
>>
>> unsigned long a,b;
>>
>> void test(void) {
>> B4B0        push {r4-r5, r7}
>> AF00        add r7, sp, #0
>> ----------------------------------------
>> register unsigned long x, y;
>> a=b+5;
>> F2402314    movw r3, #0x214
>> F2C20300    movt r3, #0x2000
>> 681B        ldr r3, [r3, #0]
>> F1030205    add.w r2, r3, #5
>> F240231C    movw r3, #0x21C
>> F2C20300    movt r3, #0x2000
>> 601A        str r2, [r3, #0]
>> ----------------------------------------
>> x=y+5;
>> F1040505    add.w r5, r4, #5
>> ----------------------------------------
>> }
>> 46BD        mov sp, r7
>> BCB0        pop {r4-r5, r7}
>> 4770        bx lr
>> BF00        nop
>
> (36 bytes of code)
>
> I'd be surprised that your compiler did not complain about x and y not
> being initialized before the addition.  EW ARM warned me that y was used
> before its value was set and that x was set but never used.
>
> Here is the code it gave me----with some extra comments added afterwards
>
>    111          void test(void){
>    112            register unsigned long x,y;
>    113            a = b+5;
>    \                     test:
>    \   00000000   0x....   LDR.N    R1,??DataTable8_6
>    \   00000002   0x6809   LDR      R1,[R1, #+0]
>    \   00000004   0x1D49   ADDS     R1,R1,#+5
>    \   00000006   0x....   LDR.N    R2,??DataTable8_7
>    \   00000008   0x6011   STR      R1,[R2, #+0]
>    114            x = y+5;
>    \   0000000A   0x1D40   ADDS     R0,R0,#+5   // R0 is y
>                                                 // sum not saved
>    115          }
>    \   0000000C   0x4770   BX       LR          ;; return
>    116
> (14 bytes of code)
>
> Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
> be saved and restored.
>
> Adding actual initialization to x and y and saving the result in b
> produced the following:
>
>    In section .text, align 2, keep-with-next
>    110          void test(void){
>    111            register unsigned long x=3,y=4;
>    \                     test:
>    \   00000000   0x2003   MOVS     R0,#+3
>    \   00000002   0x2104   MOVS     R1,#+4
>    112            a = b+5;
>    \   00000004   0x....   LDR.N    R2,??DataTable8_6
>    \   00000006   0x6812   LDR      R2,[R2, #+0]
>    \   00000008   0x1D52   ADDS     R2,R2,#+5
>    \   0000000A   0x....   LDR.N    R3,??DataTable8_7
>    \   0000000C   0x601A   STR      R2,[R3, #+0]
>    113            x = y+5;
>    \   0000000E   0x1D49   ADDS     R1,R1,#+5
>    \   00000010   0x0008   MOVS     R0,R1       // x = sum
>    114            b = x;   // this time save the result
>    \   00000012   0x....   LDR.N    R1,??DataTable8_6
>    \   00000014   0x6008   STR      R0,[R1, #+0]
>    115          }
>    \   00000016   0x4770   BX       LR          ;; return
>    116
>
> Still accomplished with scratch registers----no need to save any on the
> stack.  I changed from my default optimization of 'low' to 'none'
> and got exactly the same code.
>
> Finally, I took out the 'register' key word before x and y----and
> got exactly the same result as above.
>
> It seems that GCC just doesn't match up to IAR at producing compact
> code at low optimization levels.  OTOH, given that EW_ARM costs
> several KBucks, it SHOULD do better!
>
The problems here don't lie with the compiler - they lie with the user.

I'm sure that EW_ARM produces better code than gcc (correctly used) in
some cases - but I am also sure that gcc can do better than EW_ARM in
other cases.  I really don't think there is going to be a big difference
in code generation quality - if that's why you paid K$ for EW, you've
probably wasted your money.  There are many reasons for choosing
different toolchains, but generally speaking I don't see a large
difference in code generation quality between the major toolchains
(including gcc) for 32-bit processors.  Occasionally you'll see major
differences in particular kinds of code, but for the most part it is the
user that makes the biggest difference.

One place where EW_ARM might score over the gcc setup this user has (he
hasn't yet said anything about the rest - is it home-made, CodeSourcery,
Code Red, etc.?) is that EW_ARM might make it easier to get the compiler
switches correct, and avoid this "I don't know how to enable debugging
and optimisation" or "what's a warning?" nonsense.

It hardly needs saying, but when run properly, my brief test with gcc
produces the same code here as you get with EW_ARM, and the same
warnings about x and y.

I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common"
switch to disable "common" sections.  With this disabled, definitions
like "unsigned long a, b;" can only appear once in the program for each
global identifier, and the space is allocated directly in the .bss
inside the module that made the definition.  gcc can use this extra
information to take advantage of relative placement between variables,
and generate addressing via section anchors:

Command line:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os
-fno-common

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L6
	ldr	r0, [r3, #4]
	adds	r2, r0, #5
	str	r2, [r3, #0]
	bx	lr
.L7:
	.align	2
.L6:
	.word	.LANCHOR0
	.size	test, .-test
	.global	b
	.global	a
	.bss
	.align	2
	.set	.LANCHOR0,. + 0
	.type	a, %object
	.size	a, 4
a:
	.space	4
	.type	b, %object
	.size	b, 4
b:
	.space	4
	.ident	"GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"

It's all about learning to use the tools you have, rather than buying
more expensive tools.

mvh.,

David
Reply by Mark Borgerson June 21, 2012
In article <9b55cce9-96db-46f4-909a-1f6500deb237
@j9g2000vbk.googlegroups.com>, peter_gotkatov@supergreatmail.com says...
>
> On Jun 21, 4:09 am, David Brown <da...@westcontrol.removethisbit.com>
> wrote:
> > On 20/06/2012 21:21, peter_gotka...@supergreatmail.com wrote:
> > It is not a surprise that the code
> > size has increased in moving from the PIC18 - differences here will vary
> > wildly according to the type of code.  But it /is/ a surprise that you
> > only have 2K of startup, vector tables, and library code.
>
> Not all that surprising, here are the sizes in bytes:
> .vectors    304
> .init       508
> __putchar    40
> __vprintf  1498
> memcpy       56
>
> > > Instead of a macro I thought about making an ASM inline function that
> > > would use the CLZ instruction to do this efficiently but for some
> > > reason gcc didn't want to inline any of my functions (C or ASM) in
> > > debug mode so I just gave up at that point.
> >
> > First off, you should not need to resort to assembly to get basic
> > instructions working - the compiler should produce near-optimal code as
> > long as you let it (by enabling optimisations and writing appropriate C
> > code).
>
> I've tried several ways of writing a counleadingzeroes() function that
> would use the Cortex CLZ instruction but even with optimization turned
> on it still wouldn't do it.
>
> > Secondly, don't use "ASM functions" - they are normally only needed by
> > more limited compilers.  If you need to use assembly with gcc, use gcc's
> > extended "asm" syntax.
>
> There are some things like the bootloader that need to be ASM
> functions in their own separate .S file anyway since they need to copy
> portions of themselves to RAM in order to execute.  But a bootloader is
> a special case and I do agree that normal code shouldn't need to rely
> on ASM functions.  I must say I'm not familiar with gcc's extended asm
> syntax and although I did look at it briefly it seemed like it was
> more complicated than a plain old .S file and it was mostly geared
> towards mixing C and ASM together in the same function and accessing
> variables by name etc.  Not something I needed for a simple bootloader.
>
> > Finally, if you are not getting inlining when debugging it is because
> > you have got incorrect compiler switches.  You should not have different
> > "debug" and "release" (or "optimised") builds - do a single build with
> > the proper optimisation settings (typically -Os unless you know what you
> > are doing) and "-g" to enable debugging.  You never want to be releasing
> > code that is built differently from the code you debugged.
>
> I was fighting with this for a while when I was first handed this
> toolchain, and it seems that in debug mode, there is no -O switch at
> all and in release mode it defaults to -O1.  When I change this to -Os it
> does produce the same code as the sample that you posted above from
> gcc 4.6.1 (mine is 4.4.4 by the way).  However even with manually
> adding the -g switch I still don't get source annotation in the ELF
> file unless I use debug mode.  This effectively limits any development/
> debugging to unoptimized code, which still has to fit into the 256K
> somehow.
>
> As for using register keywords and accessing globals through pointers,
> I normally don't do this (haven't used the register keyword in years)
> and I certainly wouldn't be doing it at all if it didn't have such a
> significant effect on the code size:
>
> unsigned long a,b;
>
> void test(void) {
> B4B0        push {r4-r5, r7}
> AF00        add r7, sp, #0
> ----------------------------------------
> register unsigned long x, y;
> a=b+5;
> F2402314    movw r3, #0x214
> F2C20300    movt r3, #0x2000
> 681B        ldr r3, [r3, #0]
> F1030205    add.w r2, r3, #5
> F240231C    movw r3, #0x21C
> F2C20300    movt r3, #0x2000
> 601A        str r2, [r3, #0]
> ----------------------------------------
> x=y+5;
> F1040505    add.w r5, r4, #5
> ----------------------------------------
> }
> 46BD        mov sp, r7
> BCB0        pop {r4-r5, r7}
> 4770        bx lr
> BF00        nop
(36 bytes of code)

I'd be surprised that your compiler did not complain about x and y not
being initialized before the addition.  EW ARM warned me that y was used
before its value was set and that x was set but never used.

Here is the code it gave me----with some extra comments added afterwards

   111          void test(void){
   112            register unsigned long x,y;
   113            a = b+5;
   \                     test:
   \   00000000   0x....   LDR.N    R1,??DataTable8_6
   \   00000002   0x6809   LDR      R1,[R1, #+0]
   \   00000004   0x1D49   ADDS     R1,R1,#+5
   \   00000006   0x....   LDR.N    R2,??DataTable8_7
   \   00000008   0x6011   STR      R1,[R2, #+0]
   114            x = y+5;
   \   0000000A   0x1D40   ADDS     R0,R0,#+5   // R0 is y
                                                // sum not saved
   115          }
   \   0000000C   0x4770   BX       LR          ;; return
   116

(14 bytes of code)

Apparently, R0, R1, R2 are scratch registers for IAR and don't need to
be saved and restored.

Adding actual initialization to x and y and saving the result in b
produced the following:

   In section .text, align 2, keep-with-next
   110          void test(void){
   111            register unsigned long x=3,y=4;
   \                     test:
   \   00000000   0x2003   MOVS     R0,#+3
   \   00000002   0x2104   MOVS     R1,#+4
   112            a = b+5;
   \   00000004   0x....   LDR.N    R2,??DataTable8_6
   \   00000006   0x6812   LDR      R2,[R2, #+0]
   \   00000008   0x1D52   ADDS     R2,R2,#+5
   \   0000000A   0x....   LDR.N    R3,??DataTable8_7
   \   0000000C   0x601A   STR      R2,[R3, #+0]
   113            x = y+5;
   \   0000000E   0x1D49   ADDS     R1,R1,#+5
   \   00000010   0x0008   MOVS     R0,R1       // x = sum
   114            b = x;   // this time save the result
   \   00000012   0x....   LDR.N    R1,??DataTable8_6
   \   00000014   0x6008   STR      R0,[R1, #+0]
   115          }
   \   00000016   0x4770   BX       LR          ;; return
   116

Still accomplished with scratch registers----no need to save any on the
stack.  I changed from my default optimization of 'low' to 'none'
and got exactly the same code.

Finally, I took out the 'register' key word before x and y----and
got exactly the same result as above.

It seems that GCC just doesn't match up to IAR at producing compact
code at low optimization levels.  OTOH, given that EW_ARM costs
several KBucks, it SHOULD do better!

Mark Borgerson