On 05/08/15 12:25, Simon Clubley wrote:
> On 2015-08-04, Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
>> On 04/08/15 18:59, Tim Wescott wrote:
>>> Which do you think is quicker on a PC, with the latest gnu compiler:
>>>
>>> double a = something;
>>> double b = something else;
>>>
>>> if (a >= 0.0 && b < 0.0)
>>>
>>> >> or <<
>>>
>>> if (a * b <= 0)
>>>
>>> Thank you for your time.
>>
>> With many of these questions, the answer may depend on the
>> surrounding context and the compiler optimisation level,
>> and what is/isn't in the cache.
>>
>> And may change with the next compiler release. Or with
>> a "trivial" change to the context.
>
> You have no idea just how true that is. :-( :-(
>
> Still annoyed about this one because it's just happened to me (last
> night) and I sunk quite some time into it because I wrongly thought
> initially I had done something stupid to my code while changing it.
>
> MIPS gcc cross-compiler, targeting an M4K core, -Os in effect.
>
> Made a change to a routine (called from lots of places) which _reduced_
> the amount of code in the routine. End result: the final binary _grew_ from
> 3520 bytes to 4844 bytes and hence smashed through the 4K SRAM limit
> available for the code to execute in.
>
> Even though -Os was in use, it appears gcc decided to inline my new
> smaller routine; I guess the final binary size wasn't considered by gcc
> for some reason.
>
> -fno-inline fixed the problem. (As did -O1 :-))

Or use __attribute__((noinline)) on the function in question. gcc uses a number of heuristics and hints to decide whether or not a function should be inlined. These include optimisation options (such as -Os, -O2 or -O3), knowledge about the code usage (a single-use static function is almost always inlined with modern gcc), and estimates of how the total code size will change by inlining the code.
It is often not clear whether inlining will cause code to grow or shrink, since inlining eliminates function call overhead (which can be large in some cases) and can enable many other optimisations (keeping data in registers before and after the call, constant propagation, etc.). gcc usually does a reasonable job, but certainly not always. It sometimes gets things wrong due to bugs (suboptimal code generation is still correct code generation, so such issues are harder to spot and a lower priority for fixing), limited functionality (such as overly simplistic models of code size), or simply because the balancing heuristics about when a function is "too big to inline" or "small enough to always inline, even with -Os" don't match your particular requirements. More recent versions of gcc generally have better tuning (and fewer bugs) than older ones, but it can never be right for everyone. You can fiddle with many of the heuristic parameters manually, but usually the best method is a few "noinline" or "always_inline" (or even "flatten") attributes on critical functions.

Look at it this way - it is one of the many little quirks that keep our jobs from getting boring!

> This was with the MIPS sourced version of the compiler which is quite
> old by now (gcc 4.4.6 IIRC) so I don't know if it's a problem on current
> gcc versions.
>
> The message for Tim is that you need to see the generated code in the
> final binary at the optimisation level you are _actually_ using before
> you can decide which one is best. (And don't assume higher optimisation
> level equals "better". :-))
>
> Simon.
Slightly OT: speed of operation on a PC
Started by ●August 4, 2015
Reply by ●August 5, 2015
Reply by ●August 5, 2015
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
> HP's Dynamo C compiler from the 1990s is a remarkable demonstration
> of that kind of thing:
<snip>
Many modern C/C++ compilers (e.g. GCC, Visual Studio, Clang, ICC) have similar features, usually called "profile-guided optimization". Having to generate and export the execution profile, however, means it's not usually an option for smaller embedded systems.

-a
Reply by ●August 5, 2015
On 05/08/15 13:23, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
> Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
>> HP's Dynamo C compiler from the 1990s is a remarkable demonstration
>> of that kind of thing:
> <snip>
> Many modern C/C++ compilers (eg. GCC, Visual Studio, Clang, ICC) have
> similar features, usually called "profile guided optimization." Having
> to generate and export the execution profile however means it's not
> usually an option for smaller embedded systems.

There is an additional problem with profile-guided optimisation. It is based on the idea that you measure how often functions (and to a lesser extent, loops) are executed, and thus where the optimisation effort should be concentrated. This plays badly with modern compiler optimisation techniques such as inlining, function cloning, partial inlining, hot/cold partitioning and link-time optimisation, all of which blur the concept of separate functions and the match between source-code functions and object-code functions.
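For reference, gcc's version of the train-then-rebuild workflow looks roughly like this (file names are made up for illustration; the flags -fprofile-generate and -fprofile-use are gcc's actual PGO options):

```shell
# 1. Build an instrumented binary that records execution counts.
gcc -O2 -fprofile-generate app.c -o app

# 2. Run it on representative input; the run writes *.gcda profile files.
./app < training-input.dat

# 3. Rebuild, letting the optimiser consult the recorded profile.
gcc -O2 -fprofile-use app.c -o app
```

The middle step is the one that is awkward on a small embedded target: the instrumented binary is larger and slower, and the profile data has to be stored and exported somehow.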
Reply by ●August 5, 2015
On 8/5/2015 2:14 AM, Clifford Heath wrote:
> On 05/08/15 17:44, Don Y wrote:
>> On 8/4/2015 10:43 PM, glen herrmannsfeldt wrote:
>>> Don Y <this@is.not.me.com> wrote:
>>>
>>> (snip, I wrote)
>>>>> The 7447 has nice high current 15 volt OC drivers.
>>>>> Usual OC gates have 5V transistors.
>>>
>>>> The HV output wasn't required. The silliness was the effort
>>>> to save a package of *gates* (or an AOI, etc.) -- at the expense
>>>> of puzzling anyone who had to look at the circuit thereafter.
>>>> (Can you rattle off the decoding functions of any particular
>>>> 7seg decoder, "off the top of your head"? e.g., does 6 have a
>>>> tail? 9?)
>>>
>>> This one I still remember from years ago. The 7447 does not have
>>> tails, the 74247 does.
>>
>> Your memory is far better than mine! That was the *one* instance where
>> I used a TTL 7segment decoder in my career.
>
> The 7447 was the first IC I ever bought, with a single digit display.
> It absorbed my pocket money saved for a month, so you can imagine my
> dismay when I found I'd bought a common-cathode display that wouldn't
> work with the 7447 - and the supplier wouldn't take either of them back.
> That was after fretting for a while about how to get 5V, when batteries
> only came in 1.5V increments. I really needed a mentor or a book.

Mine would probably have been a `138 (ubiquitous "address decoder" for small systems), or `74 and `157's (DRAM controller).

> I never did end up building the digital dice (die?) circuit I had designed,
> but I did design and build a lot of other stuff.

Most of my "discrete" logic designs were specialty CPUs. So, any "junk logic" was typically implemented in bipolar ROMs (e.g., microcode store).
Reply by ●August 5, 2015
On Tuesday, August 4, 2015 at 10:59:08 AM UTC-7, Tim Wescott wrote:
> Which do you think is quicker on a PC, with the latest gnu compiler:
>
> double a = something;
> double b = something else;
>
> if (a >= 0.0 && b < 0.0)
>
> >> or <<
>
> if (a * b <= 0)
>
> Thank you for your time.

I just wrote a test program. The first statement is 10% faster than the second. You can argue all you want, but nothing beats benchmarking.
Reply by ●August 5, 2015
On 05.08.2015 19:31, edward.ming.lee@gmail.com wrote:
> I just wrote a test program. The first statement is 10% faster than
> the second. You can argue all you want, but nothing beats
> benchmarking.

That's easy to believe, but still wrong. Benchmarking a micro-optimization like that, particularly if done in any context other than the actual project, is clearly pointless. _Any_ difference between the benchmark platform and the actual one will render the benchmark result useless. A 10% difference like that can easily be inverted by just about any unrelated change to the surrounding source code, or any change to the tools and their settings.

The old truth still holds: premature optimization _is_ the root of all evil.
Reply by ●August 5, 2015
On 8/5/2015 1:31 PM, edward.ming.lee@gmail.com wrote:
> On Tuesday, August 4, 2015 at 10:59:08 AM UTC-7, Tim Wescott wrote:
>> Which do you think is quicker on a PC, with the latest gnu compiler:
>>
>> double a = something;
>> double b = something else;
>>
>> if (a >= 0.0 && b < 0.0)
>>
>> >> or <<
>>
>> if (a * b <= 0)
>>
>> Thank you for your time.
>
> I just wrote a test program. The first statement is 10% faster than the second. You can argue all you want, but nothing beats benchmarking.

What code is generated? Seems like 10% is close enough that "faster" may change with different hardware and tools.

--
Rick
Reply by ●August 5, 2015
On Wednesday, August 5, 2015 at 5:19:31 PM UTC-7, rickman wrote:
> On 8/5/2015 1:31 PM, edward.ming.lee@gmail.com wrote:
>> On Tuesday, August 4, 2015 at 10:59:08 AM UTC-7, Tim Wescott wrote:
>>> Which do you think is quicker on a PC, with the latest gnu compiler:
>>>
>>> double a = something;
>>> double b = something else;
>>>
>>> if (a >= 0.0 && b < 0.0)
>>>
>>> >> or <<
>>>
>>> if (a * b <= 0)
>>>
>>> Thank you for your time.
>>
>> I just wrote a test program. The first statement is 10% faster than the second. You can argue all you want, but nothing beats benchmarking.
>
> What code is generated? Seems like 10% is close enough that "faster"
> may change with different hardware and tools.

I timed two loops, looking at the user time.

t1.c:
    for(i=0; i<10000000; i++)
    {
        if (a >= 0.0 && b < 0.0)
            j++;
    }

t2.c:
    for(i=0; i<10000000; i++)
    {
        if (a * b <= 0)
            j++;
    }

Actually, t1 is 10% faster if a & b are integers and 20% faster if a & b are doubles.

----------------------------------

        .file   "t1.c"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset 5, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register 5
        andl    $-8, %esp
        subl    $32, %esp
        movl    $0, 8(%esp)
        jmp     .L2
.L6:
        fldl    16(%esp)
        fldz
        fxch    %st(1)
        fucomip %st(1), %st
        fstp    %st(0)
        jb      .L3
.L7:
        fldz
        fldl    24(%esp)
        fxch    %st(1)
        fucomip %st(1), %st
        fstp    %st(0)
        jbe     .L3
.L8:
        addl    $1, 12(%esp)
.L3:
        addl    $1, 8(%esp)
.L2:
        cmpl    $9999999, 8(%esp)
        jle     .L6
        leave
        .cfi_restore 5
        .cfi_def_cfa 4, 4
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
        .section        .note.GNU-stack,"",@progbits

----------------------------------

        .file   "t2.c"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset 5, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register 5
        andl    $-8, %esp
        subl    $32, %esp
        movl    $0, 8(%esp)
        jmp     .L2
.L5:
        fldl    16(%esp)
        fmull   24(%esp)
        fldz
        fucomip %st(1), %st
        fstp    %st(0)
        jb      .L3
.L6:
        addl    $1, 12(%esp)
.L3:
        addl    $1, 8(%esp)
.L2:
        cmpl    $9999999, 8(%esp)
        jle     .L5
        leave
        .cfi_restore 5
        .cfi_def_cfa 4, 4
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
        .section        .note.GNU-stack,"",@progbits
Reply by ●August 5, 2015
On 06/08/15 12:45, edward.ming.lee@gmail.com wrote:
> On Wednesday, August 5, 2015 at 5:19:31 PM UTC-7, rickman wrote:
>> On 8/5/2015 1:31 PM, edward.ming.lee@gmail.com wrote:
>>> On Tuesday, August 4, 2015 at 10:59:08 AM UTC-7, Tim Wescott wrote:
>>>> Which do you think is quicker on a PC, with the latest gnu compiler:
>>>> double a = something;
>>>> double b = something else;
>>>> if (a >= 0.0 && b < 0.0)
>>>> >> or <<
>>>> if (a * b <= 0)
>>>> Thank you for your time.
>>> I just wrote a test program. The first statement is 10% faster than the second. You can argue all you want, but nothing beats benchmarking.
>> What code is generated? Seems like 10% is close enough that "faster"
>> may change with different hardware and tools.
> I timed two loops, looking at the user time.

If you enable the optimiser, neither of these executes the loop. Since a and b don't change inside the loop, j will always be either unchanged or incremented by 10000000, and the compiler figures that out and removes the loop.

Clifford Heath.

> t1.c:
>     for(i=0; i<10000000; i++)
>     {
>         if (a >= 0.0 && b < 0.0)
>             j++;
>     }
>
> t2.c:
>     for(i=0; i<10000000; i++)
>     {
>         if (a * b <= 0)
>             j++;
>     }
>
> Actually, t1 is 10% faster if a & b are integers and 20% faster if a & b are doubles.
>
> <snip assembly listings quoted above>
Reply by ●August 6, 2015
On Wed, 5 Aug 2015 11:23:06 +0000 (UTC), Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
> Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
>> HP's Dynamo C compiler from the 1990s is a remarkable demonstration
>> of that kind of thing:
> <snip>
> Many modern C/C++ compilers (eg. GCC, Visual Studio, Clang, ICC) have
> similar features, usually called "profile guided optimization." Having
> to generate and export the execution profile however means it's not
> usually an option for smaller embedded systems.

A fundamental difference is that PGO requires a training run, and after that the code is static. A more dynamic system, like Dynamo, will rebuild the code as usage changes. Somewhat similar to rebuilding with PGO with a new training dataset, but on the fly.