EmbeddedRelated.com
The 2026 Embedded Online Conference

Same code, same data, different results

Started by Tim Wescott October 6, 2015
On 10/6/2015 8:35 PM, Tim Wescott wrote:
>
> So now I'm getting four different results from three different software
> loads and two different circumstances. I can't go into detail, but I'm
> going to give a general story 'cause I'm looking for general things to
> look for:
>
> Under Linux 32-bit I get behavior A (correct operation)
>
> Under Linux 64-bit I get behavior B (correct operation, just different)
>
> Under Wine running a 32-bit Windows program I get behavior B
>
> So -- other than picking through the code line by line, can you guys
> suggest anything that I can do or look for in specific?
Looks like Linux 32 is your gold standard. Rather than trying to debug this by looking hard at one use case with standard test methods, is it feasible for your program to print information from internal points of the program to a file, so you can compare the output across the different platforms? This may help you narrow down the section of code that is failing. If the differences pop up in different areas at random, that suggests it is not a coding error per se, but rather a problem between the program and its environment.

-- Rick
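
One way to sketch the trace-file idea in C is below. The names `trace_format` and `trace_double` are hypothetical and not from the original code, and the sketch assumes `double` is a 64-bit IEEE type. It writes the raw bit pattern in hex next to a readable value, since the bit pattern is the field that can be diffed reliably between platforms.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one labelled double as a trace line.
 * The hex field is the raw bit pattern, so two platforms produce the
 * same hex text only if they computed the same bits.
 * Assumption: double is a 64-bit IEEE type. */
int trace_format(char *buf, size_t len, const char *label, double value)
{
    uint64_t bits;

    assert(sizeof value == sizeof bits);
    memcpy(&bits, &value, sizeof bits);     /* copy the bits, no conversion */
    return snprintf(buf, len, "%s %016" PRIx64 " %.17g\n",
                    label, bits, value);
}

/* Append one trace line to an already-open file; returns 0 on success. */
int trace_double(FILE *fp, const char *label, double value)
{
    char line[128];
    int n = trace_format(line, sizeof line, label, value);

    if (n < 0 || (size_t)n >= sizeof line)
        return -1;                          /* label too long for the buffer */
    return fputs(line, fp) < 0 ? -1 : 0;
}
```

Calling `trace_double(fp, "stage3.gain", gain)` at the same internal points in each build gives files that can be compared with an ordinary diff tool; the first line whose hex field differs marks where the builds diverge.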
On Tue, 06 Oct 2015 22:19:53 -0500, Tim Wescott
<seemywebsite@myfooter.really> wrote:

>On Wed, 07 Oct 2015 01:36:33 +0000, Rob Gaddi wrote:
>
>> On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott wrote:
>>
>>> This is about code that clings to "embedded" by it's fingernails --
>>> it's running on a fast PC-compatible single-board computer, under
>>> Windows, as a DLL. So it's not exactly some little thing shoehorned
>>> into 4kB of flash.
Exactly what kind of processor is the target using, and from which manufacturer? There might be some minor differences, e.g. in IEEE floating point, such as the handling of denormalized values.
>>>
>>> At any rate:
>>>
>>> I have a rather complicated algorithm that I've coded up, to do
>>> marvelous stuff for my customer. It recently grew quite a bit, and in
>>> the process I've introduced some subtle bugs. I'm looking for ideas on
>>> things to look for to see if I can figure out what's going on.
>>>
>>> Here's the deal:
>>>
>>> First, some time this spring I got a shiny new machine, and went ahead
>>> and loaded 64-bit Linux onto it, with all its 64-bit appurtenances.
>>> This did not, at the time, cause problems.
>>>
>>> I coded up a bunch of changes, tested it on my 64-bit machine, and
>>> happily shipped it off to my customer -- who reported that it broke,
>>> horribly.
>>>
>>> Oh drat. On top of this, at some point the MinGW stream library broke,
>>> so my test code no longer worked under Wine -- I could only test with
>>> the Linux version.
>>>
>>> After much trial and tribulation, I managed to get Linux 32 and 64-bit
>>> versions, and Windows 32-bit versions all working. I tracked down my
>>> problems (size_t and unsigned int are not the same size in gcc 64 bit
>>> for Linux), fixed them, and shipped.
>>>
>>> So now I'm getting four different results from three different software
>>> loads and two different circumstances. I can't go into detail, but I'm
>>> going to give a general story 'cause I'm looking for general things to
>>> look for:
>>>
>>> Under Linux 32-bit I get behavior A (correct operation)
>>>
>>> Under Linux 64-bit I get behavior B (correct operation, just different)
>>>
>>> Under Wine running a 32-bit Windows program I get behavior B
>>>
>>> My customer calls my DLL from Labview. Nine times out of ten he gets
>>> some correct behavior -- he's not sophisticated enough that I can know
>>> whether it's A, B or something else. The tenth time the thing fails to
>>> work correctly.
>>>
>>> So, I suspect that I've got some uninitialized memory someplace. But,
>>> I'm running the Linux versions under Valgrind and it's not finding any
>>> problems (Valgrind is great, by the way -- great enough that for my
>>> embedded ARM stuff I do unit testing under Linux and Valgrind).
>>>
>>> I'm going through the code with a fine-toothed comb, and so far I've
>>> only found a few very minor problems that border on the stylistic,
>>> although one of the changes that I made did improve things a bit.
>>>
>>> So -- other than picking through the code line by line, can you guys
>>> suggest anything that I can do or look for in specific?
>>>
>>> Also, does anyone know of a Linux tool that'll randomly populate the
>>> heap with junk then call a program? I suspect that I'm not seeing the
>>> "sometimes it is, sometimes not" behavior that my customer is because
>>> of the different environment, not because Linux is magically fixing my
>>> bugs. Suggestions on how to make the bugs apparent would be helpful.
>>>
>>> Thanks for reading, suggestions welcome -- I'm becoming a candidate for
>>> a rubber room over this one.
>>
>> Without getting into the A/B specifics, is the difference something that
>> could be chalked up to floating point error?
>
>Between A and B, yes. In fact, it was tweaks to some floating point
>calculations to make them more kosher that caused the change in the
>Windows version.
>
>However, the customer's one out of ten problem is, I'm pretty sure,
>different -- first, because it's a failure and not just a little
>difference, and second, he's running the same file through all the time,
>and occasionally it's spitting up. I don't know what could cause that in
>my code other than using an uninitialized variable.
Does an interrupt sometimes occur while your code is running, and sometimes not? Any bug in the interrupt processing (either HW or SW) could cause such problems.
>
>It may possibly be a bug on his side, but I don't want to start pointing
>at his side of things unless I'm pretty certain of mine.
Tim Wescott wrote:
> This is about code that clings to "embedded" by it's fingernails -- it's
> running on a fast PC-compatible single-board computer, under Windows, as
> a DLL. So it's not exactly some little thing shoehorned into 4kB of
> flash.
>
> At any rate:
>
> I have a rather complicated algorithm that I've coded up, to do marvelous
> stuff for my customer. It recently grew quite a bit, and in the process
> I've introduced some subtle bugs. I'm looking for ideas on things to
> look for to see if I can figure out what's going on.
>
> Here's the deal:
>
> First, some time this spring I got a shiny new machine, and went ahead
> and loaded 64-bit Linux onto it, with all its 64-bit appurtenances. This
> did not, at the time, cause problems.
>
> I coded up a bunch of changes, tested it on my 64-bit machine, and
> happily shipped it off to my customer -- who reported that it broke,
> horribly.
>
> Oh drat. On top of this, at some point the MinGW stream library broke,
> so my test code no longer worked under Wine -- I could only test with the
> Linux version.
>
That's a big red flag, unless you know what causes this. Now, it could easily be some interaction between WINE and MinGW, but the stream library (I presume you mean the stuff in file.h? stdio.h?) should *always* work. If you mean the C++ >> stuff, then that should pretty much always work, too. Can you mock out your .dll and see if it still breaks in WINE?
> After much trial and tribulation, I managed to get Linux 32 and 64-bit
> versions, and Windows 32-bit versions all working. I tracked down my
> problems (size_t and unsigned int are not the same size in gcc 64 bit for
> Linux), fixed them, and shipped.
>
> So now I'm getting four different results from three different software
> loads and two different circumstances. I can't go into detail, but I'm
> going to give a general story 'cause I'm looking for general things to
> look for:
>
> Under Linux 32-bit I get behavior A (correct operation)
>
> Under Linux 64-bit I get behavior B (correct operation, just different)
>
> Under Wine running a 32-bit Windows program I get behavior B
>
Really fishy. How hard would it be to change *every* int in the whole shebang to an int32_t or uint32_t from stdint.h? This should not matter because ints are all 4 bytes in gcc unless they aren't. But the story about size_t makes me think....
> My customer calls my DLL from Labview. Nine times out of ten he gets
> some correct behavior -- he's not sophisticated enough that I can know
> whether it's A, B or something else. The tenth time the thing fails to
> work correctly.
>
> So, I suspect that I've got some uninitialized memory someplace. But,
> I'm running the Linux versions under Valgrind and it's not finding any
> problems (Valgrind is great, by the way -- great enough that for my
> embedded ARM stuff I do unit testing under Linux and Valgrind).
>
> I'm going through the code with a fine-toothed comb, and so far I've only
> found a few very minor problems that border on the stylistic, although
> one of the changes that I made did improve things a bit.
>
Verifying that everything is initialized by hand shouldn't be that horrible.
> So -- other than picking through the code line by line, can you guys
> suggest anything that I can do or look for in specific?
>
This ain't a line by line - this is a Heisenbug. I'd seriously consider mocking the whole thing out, routine by routine - but you can't tell if behavior B is the precursor to failure or not. And to really get it to fail, your customer is in the loop. Can you get a test vector from the customer?
> Also, does anyone know of a Linux tool that'll randomly populate the heap
> with junk then call a program? I suspect that I'm not seeing the
> "sometimes it is, sometimes not" behavior that my customer is because of
> the different environment, not because Linux is magically fixing my
> bugs. Suggestions on how to make the bugs apparent would be helpful.
>
??? First line of main() is "char *p = malloc()"?
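
A rough sketch of the "junk in the heap" idea, done inside the program rather than with an external tool, is an allocation wrapper that hands out pre-filled memory. The name `junk_malloc` and the fill byte are hypothetical choices for illustration. On Linux, glibc also documents a `MALLOC_PERTURB_` environment variable that perturbs allocated and freed memory without any code change.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define JUNK_BYTE 0xA5   /* arbitrary non-zero fill pattern (assumption) */

/* Hypothetical stand-in for malloc(): every byte of the returned block
 * is set to JUNK_BYTE, so code that reads memory before writing it sees
 * obvious garbage instead of whatever the allocator happened to leave. */
void *junk_malloc(size_t size)
{
    void *p;

    assert(size > 0);
    p = malloc(size);
    if (p != NULL)
        memset(p, JUNK_BYTE, size);
    return p;
}
```

Building a test version with allocations routed through such a wrapper, and then repeating the run with a different fill byte, tends to turn a "one time in ten" dependence on leftover heap contents into something that shows up on every run.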
> Thanks for reading, suggestions welcome -- I'm becoming a candidate for a
> rubber room over this one.
>
-- Les Cargill
On 2015-10-07, Tim Wescott <seemywebsite@myfooter.really> wrote:
> So now I'm getting four different results from three different software
> loads and two different circumstances. I can't go into detail, but I'm
> going to give a general story 'cause I'm looking for general things to
> look for:
>
> Under Linux 32-bit I get behavior A (correct operation)
>
> Under Linux 64-bit I get behavior B (correct operation, just different)
>
> Under Wine running a 32-bit Windows program I get behavior B
>
> My customer calls my DLL from Labview.
If I'm understanding correctly, your customer is using your code with LabVIEW under Windows, yet you're not testing your code under Windows or with LabVIEW? I can understand there might be difficulty testing with LabVIEW (e.g., expensive hardware involved), but do you have reason to believe that testing with Wine is the same as testing with Windows? Is there some reason you can't test it under Windows using the same version as the customer?
On 07/10/15 05:19, Tim Wescott wrote:
> On Wed, 07 Oct 2015 01:36:33 +0000, Rob Gaddi wrote:
>> Without getting into the A/B specifics, is the difference something that
>> could be chalked up to floating point error?
>
> Between A and B, yes. In fact, it was tweaks to some floating point
> calculations to make them more kosher that caused the change in the
> Windows version.
>
Floating point can be difficult to get /exactly/ the same between different systems. In particular, even on the same x86 CPU, you can get slightly different results (but still within IEEE specs) depending on whether the calculations are done in the SSE registers, with SIMD instructions, or with x87 floating point.

Your two options are to make sure you use the same compiler, the same flags, the same code, and avoid passing floating point data directly between functions, or to accept that floating point results are not bit-for-bit repeatable. (The issue with passing floats and doubles into and out of functions is that Windows and Linux use different calling conventions, as do 32-bit and 64-bit.)
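
If the second option is taken (accepting that results are not bit-for-bit repeatable), regression tests need a tolerance-based comparison instead of `==`. A minimal sketch follows; the function name and the choice of a combined relative and absolute tolerance are assumptions for illustration, not something from the thread.

```c
#include <assert.h>
#include <float.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical comparison: returns 1 when a and b agree to within
 * rel_tol of the larger magnitude, or to within abs_tol (useful when
 * both values are near zero). NaN never matches anything. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff, scale, limit;

    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (a != a || b != b)
        return 0;                   /* at least one NaN */
    if (a == b)
        return 1;                   /* exact match, including equal infinities */
    if (abs_d(a) > DBL_MAX || abs_d(b) > DBL_MAX)
        return 0;                   /* an infinity against anything else */

    diff = abs_d(a - b);
    scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    limit = rel_tol * scale;
    if (limit < abs_tol)
        limit = abs_tol;
    return diff <= limit;
}
```

With something like this in the test harness, "behavior A" and "behavior B" can be checked against the same reference output, and only differences larger than the chosen tolerance are flagged as failures.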
On 07/10/15 02:35, Tim Wescott wrote:

> After much trial and tribulation, I managed to get Linux 32 and 64-bit
> versions, and Windows 32-bit versions all working. I tracked down my
> problems (size_t and unsigned int are not the same size in gcc 64 bit for
> Linux), fixed them, and shipped.
>
There are bigger differences than that. In particular, "long" is 64-bit in 64-bit Linux, but only 32-bit in 64-bit Windows. And size_t is going to be 64-bit in 64-bit Linux and Windows, but 32-bit in 32-bit Linux and Windows. And while size_t may happen to be the same size as unsigned int on some combinations, don't forget that it is not necessarily the same type.

Your best way forward here is to treat your programming as carefully as you would for embedded programming. Never make any assumptions about the relationships between types, other than as given by the law (the C or C++ standards, as appropriate). When you want something of a particular size, use the <stdint.h> types.

The other big question here is what language(s) you are using, and what compiler(s) you are using. That would be helpful to know.

If you are using up-to-date gcc or clang, you have a variety of "sanitize" options that can help. AFAIK some of them only run on 64-bit Linux, since they use memory management tricks that require the wider address space and greater flexibility, but they should still be helpful.
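
As a small illustration of the point about type sizes, the sketch below prints the widths that change between 32-bit and 64-bit builds and shows a checked conversion from size_t to a fixed-width type. The function names are hypothetical, and the language of the original code was not stated in the thread, so C is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Print the widths that differ between 32-bit and 64-bit builds.
 * Running this once per build documents what each target really uses. */
void print_type_sizes(FILE *fp)
{
    fprintf(fp, "int=%u long=%u size_t=%u pointer=%u\n",
            (unsigned)sizeof(int), (unsigned)sizeof(long),
            (unsigned)sizeof(size_t), (unsigned)sizeof(void *));
}

/* Hypothetical checked narrowing: store n in a uint32_t only if it fits.
 * Returns 0 on success, -1 if the value would be truncated. */
int size_to_u32(size_t n, uint32_t *out)
{
    assert(out != NULL);
    if (n > UINT32_MAX)
        return -1;
    *out = (uint32_t)n;
    return 0;
}
```

Routing every size_t-to-integer conversion through one checked function makes silent truncation on a 64-bit build an explicit, testable error instead of a platform-dependent surprise.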
David Brown <david.brown@hesbynett.no> wrote:
> On 07/10/15 05:19, Tim Wescott wrote:
>> On Wed, 07 Oct 2015 01:36:33 +0000, Rob Gaddi wrote:
(snip)
>>> Without getting into the A/B specifics, is the difference
>>> something that could be chalked up to floating point error?

>> Between A and B, yes. In fact, it was tweaks to some floating point
>> calculations to make them more kosher that caused the change in the
>> Windows version.

> Floating point can be difficult to get /exactly/ the same between
> different systems. In particular, even on the same x86 cpu, you can get
> slightly different results (but still within IEEE specs) if the
> calculations are done in the SSE registers, with SIMD instructions,
Another one I remember from some time ago, I believe on Windows, is not initializing the x87 control register, such that rounding modes and precision are different between runs. (That is, they are whatever the previous program left.) If you use a test system that detects all attempts to use memory that hasn't been given a value, it likely won't notice x87 registers.

-- glen
On Wed, 07 Oct 2015 10:25:42 +0200, David Brown wrote:

> On 07/10/15 02:35, Tim Wescott wrote:
>
>> After much trial and tribulation, I managed to get Linux 32 and 64-bit
>> versions, and Windows 32-bit versions all working. I tracked down my
>> problems (size_t and unsigned int are not the same size in gcc 64 bit
>> for Linux), fixed them, and shipped.
>>
>>
> There are bigger differences than that. In particular, "long" is 64-bit
> in 64-bit Linux, but only 32-bit in 64-bit Windows. And size_t is going
> to be 64-bit in 64-bit Linux and Windows, but 32-bit in 32-bit Linux and
> Windows. And while size_t may happen to be the same size as unsigned
> int on some combinations, don't forget that it is not necessarily the
> same type.
>
> Your best way forward here is to treat your programming as carefully as
> you would for embedded programming. Never make any assumptions about
> the relationships between types, other than as given by the law (the C
> or C++ standards, as appropriate). When you want something of a
> particular size, use the <stdint.h> types.
>
>
> The other big question here is what language(s) you are using, and what
> compiler(s) you are using. That would be helpful to know.
>
> If you are using up-to-date gcc or clang, you have a variety of
> "sanitize" options that can help. AFAIK some of them only run on 64-bit
> Linux, since they use memory management tricks that require the wider
> address space and greater flexibility, but they should still be helpful.
gcc 4.8.4 (it's what came with Ubuntu). "-fsanitize=address" and "-fsanitize=thread" seem to be the only sanitize options -- and it passes those.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On Tue, 06 Oct 2015 20:27:28 -0700, Paul Rubin wrote:

> Tim Wescott <seemywebsite@myfooter.really> writes:
>>> Have you run the code with undefined behaviour and address sanitizers
>>> turned on?
>> The only such sanitizer I know of is valgrind. Do you have other tools
>> to suggest?
>
> I mean compiler flags like -fsanitize=address. Clang has some other
> ones like bounds checking, but I've only used GCC.
>
> You could look at Frama-C (frama-c.com) which is sort of a lint on
> steroids. I've never tried it myself but have been wanting to look into
> it.
I did not know about this feature -- thanks.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 07/10/15 05:27, Paul Rubin wrote:
> Tim Wescott <seemywebsite@myfooter.really> writes:
>>> Have you run the code with undefined behaviour and address sanitizers
>>> turned on?
>> The only such sanitizer I know of is valgrind. Do you have other tools
>> to suggest?
>
> I mean compiler flags like -fsanitize=address. Clang has some other
> ones like bounds checking, but I've only used GCC.
>
> You could look at Frama-C (frama-c.com) which is sort of a lint on
> steroids. I've never tried it myself but have been wanting to look into
> it.
>
I haven't heard of Frama-C before, but I've found the website, and it is now high up on my list of things to read as soon as I get the time. Thanks for the pointer.