This is about code that clings to "embedded" by its fingernails -- it's running on a fast PC-compatible single-board computer, under Windows, as a DLL. So it's not exactly some little thing shoehorned into 4kB of flash.

At any rate:

I have a rather complicated algorithm that I've coded up, to do marvelous stuff for my customer. It recently grew quite a bit, and in the process I've introduced some subtle bugs. I'm looking for ideas on things to look for to see if I can figure out what's going on.

Here's the deal:

First, some time this spring I got a shiny new machine, and went ahead and loaded 64-bit Linux onto it, with all its 64-bit appurtenances. This did not, at the time, cause problems.

I coded up a bunch of changes, tested it on my 64-bit machine, and happily shipped it off to my customer -- who reported that it broke, horribly.

Oh drat. On top of this, at some point the MinGW stream library broke, so my test code no longer worked under Wine -- I could only test with the Linux version.

After much trial and tribulation, I managed to get Linux 32 and 64-bit versions, and Windows 32-bit versions all working. I tracked down my problems (size_t and unsigned int are not the same size in gcc 64 bit for Linux), fixed them, and shipped.

So now I'm getting four different results from three different software loads and two different circumstances. I can't go into detail, but I'm going to give a general story 'cause I'm looking for general things to look for:

Under Linux 32-bit I get behavior A (correct operation)

Under Linux 64-bit I get behavior B (correct operation, just different)

Under Wine running a 32-bit Windows program I get behavior B

My customer calls my DLL from Labview. Nine times out of ten he gets some correct behavior -- he's not sophisticated enough that I can know whether it's A, B or something else. The tenth time the thing fails to work correctly.

So, I suspect that I've got some uninitialized memory someplace. But I'm running the Linux versions under Valgrind and it's not finding any problems (Valgrind is great, by the way -- great enough that for my embedded ARM stuff I do unit testing under Linux and Valgrind).

I'm going through the code with a fine-toothed comb, and so far I've only found a few very minor problems that border on the stylistic, although one of the changes that I made did improve things a bit.

So -- other than picking through the code line by line, can you guys suggest anything specific that I can do or look for?

Also, does anyone know of a Linux tool that'll randomly populate the heap with junk then call a program? I suspect that I'm not seeing the "sometimes it is, sometimes not" behavior that my customer is because of the different environment, not because Linux is magically fixing my bugs. Suggestions on how to make the bugs apparent would be helpful.

Thanks for reading, suggestions welcome -- I'm becoming a candidate for a rubber room over this one.

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Same code, same data, different results
Started by ●October 6, 2015
Reply by ●October 6, 2015
On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott wrote:
> Under Linux 32-bit I get behavior A (correct operation)
>
> Under Linux 64-bit I get behavior B (correct operation, just different)
>
> Under Wine running a 32-bit Windows program I get behavior B

Without getting into the A/B specifics, is the difference something that could be chalked up to floating point error?

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
Reply by ●October 6, 2015
On 07.10.2015 03:35, Tim Wescott wrote:
> Thanks for reading, suggestions welcome -- I'm becoming a candidate for a
> rubber room over this one.

Hi Tim,

I do not think you can get real help on this from anything but banging your head against the rubber walls for as much time as it takes.

I am in a similar state at the moment - I have been chasing for hours WHY the same code, after some modifications within the loop, produces one line too many at times. It is way simpler than yours, it only has to list a page of parameters, but well, it has yet to work so I can go to sleep.

Hopefully knowing you are not alone banging your head against the walls is some consolation, as help is not on the horizon... :D.

Dimiter
Reply by ●October 6, 2015
On 07/10/15 11:35, Tim Wescott wrote:
> So -- other than picking through the code line by line, can you guys
> suggest anything that I can do or look for in specific?

When chasing a really difficult "Heisenbug" like this one that involved data structures, I've sometimes had joy by adding invariant-checking code - it can be as costly and slow as you like - to verify that everything which should add up at any point in time actually does. Sort-of like an "fsck" for the data structure.

Every such bug has a symptom, and that symptom is associated with some specific precursor. If you can test for that precursor condition before the completion of the algorithm, you have a tool to narrow down the source of the problem. Add a few calls to the invariant-checker in assorted places and run the code until you start to narrow down where it's failing.

Clifford Heath.
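To make the invariant-checker idea concrete, here is a minimal sketch in C. Everything in it is hypothetical and not taken from Tim's code: the ring buffer `ring_t`, its cached `sum`, and the names `ring_check`, `ring_init` and `ring_push` are invented purely to show the shape of an "fsck" routine that is called before and after each mutation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure: a fixed-capacity ring buffer with a cached sum. */
#define RING_CAPACITY 8

typedef struct {
    double data[RING_CAPACITY];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of valid elements */
    double sum;    /* cached running sum of the valid elements */
} ring_t;

/* The "fsck": returns 1 if every invariant holds, 0 otherwise.
 * It is deliberately slow and exhaustive. */
int ring_check(const ring_t *r)
{
    if (r->head >= RING_CAPACITY) return 0;
    if (r->count > RING_CAPACITY) return 0;

    double sum = 0.0;
    for (size_t i = 0; i < r->count; i++) {
        double v = r->data[(r->head + i) % RING_CAPACITY];
        if (v != v) return 0;          /* NaN never compares equal to itself */
        sum += v;
    }

    /* The cached sum must agree with a fresh recomputation. */
    double diff = sum - r->sum;
    if (diff < 0.0) diff = -diff;
    return diff <= 1e-9;
}

void ring_init(ring_t *r)
{
    r->head = 0;
    r->count = 0;
    r->sum = 0.0;
}

void ring_push(ring_t *r, double v)
{
    assert(ring_check(r));             /* precursor test on entry */
    if (r->count == RING_CAPACITY) {
        r->sum -= r->data[r->head];    /* drop the oldest element */
        r->data[r->head] = v;
        r->head = (r->head + 1) % RING_CAPACITY;
    } else {
        r->data[(r->head + r->count) % RING_CAPACITY] = v;
        r->count++;
    }
    r->sum += v;
    assert(ring_check(r));             /* and again on exit */
}
```

The checks vanish when the code is built with NDEBUG defined, so they can stay in the source. In a DLL called from another environment it may be more useful to log the failure and return an error code than to abort; that choice is left open here.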
Reply by ●October 7, 2015
Tim Wescott wrote:
>...
>I'm going through the code with a fine-toothed comb...
>...
>...other than picking through the code line by line...
>

Let Gimpel's PC-Lint do that for you.

http://www.gimpel.com/html/index.htm

R.W.
Reply by ●October 7, 2015
On Wed, 07 Oct 2015 01:36:33 +0000, Rob Gaddi wrote:
> Without getting into the A/B specifics, is the difference something that
> could be chalked up to floating point error?

Between A and B, yes. In fact, it was tweaks to some floating point calculations to make them more kosher that caused the change in the Windows version.

However, the customer's one out of ten problem is, I'm pretty sure, different -- first, because it's a failure and not just a little difference, and second, he's running the same file through all the time, and occasionally it's spitting up.

I don't know what could cause that in my code other than using an uninitialized variable. It may possibly be a bug on his side, but I don't want to start pointing at his side of things unless I'm pretty certain of mine.

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
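As a small illustration of how two builds can both be "correct, just different": floating-point addition is not associative, so a compiler or platform that evaluates the same expression in a different order (or with different intermediate precision) can legitimately produce a different answer. The sketch below is hypothetical; `sum_left`, `sum_right` and `nearly_equal` are made-up names, not anything from Tim's code. It also shows a tolerance-based comparison of the kind that could be used when diffing outputs from the 32-bit and 64-bit builds, with the tolerances left as an assumption to be chosen per algorithm.

```c
#include <assert.h>

/* Same three operands, two evaluation orders. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }

static double d_abs(double x)           { return x < 0.0 ? -x : x; }
static double d_max(double x, double y) { return x > y ? x : y; }

/* Returns 1 if x and y agree to within a relative or absolute tolerance.
 * Both tolerances are assumptions the caller must pick. */
int nearly_equal(double x, double y, double rel_tol, double abs_tol)
{
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    double diff  = d_abs(x - y);
    double scale = d_max(d_abs(x), d_abs(y));
    return diff <= d_max(rel_tol * scale, abs_tol);
}
```

With a = 1e20, b = -1e20 and c = 1.0, `sum_left` cancels the large terms first and keeps the 1.0, while `sum_right` loses the 1.0 when it is added to -1e20. A difference of that kind is deterministic for a given build, which fits Tim's point that it would not explain a failure that shows up one run in ten on the same input file.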
Reply by ●October 7, 2015
On Tue, 06 Oct 2015 19:35:24 -0700, Paul Rubin wrote:
> Have you run the code with undefined behaviour and address sanitizers
> turned on?

The only such sanitizer I know of is valgrind. Do you have other tools to suggest?

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by ●October 7, 2015
Tim Wescott <seemywebsite@myfooter.really> writes:
> So -- other than picking through the code line by line, can you guys
> suggest anything that I can do or look for in specific?

The most obvious debugging suggestion once you're digging into it is to log intermediate results in the algorithm in all environments, run with the same inputs, and compare the logs to see where the intermediate results diverge. There are even some automated tools to bisect for the behaviour: see some of the links at

https://en.wikipedia.org/wiki/Delta_Debugging

The old "Ask Igor" site was amazing and now you can download the code that ran it (linked from the article above).

> Also, does anyone know of a Linux tool that'll randomly populate the
> heap with junk then call a program?

Interesting question. Nothing immediately comes to mind but I'm not up to date with this stuff. Couldn't you just patch the program to do that, e.g. wrap malloc etc.? Also turn on all the stack guard and other sanitization options in GCC and maybe also try Clang (which has some different sanitization features iirc).

Generally this sounds like run of the mill debugging. It's time consuming but don't get discouraged. The basic method is to run the program under GDB until you can see that it's going wrong, then reason backwards to figure out what made it go wrong.
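A minimal sketch of the "wrap malloc" idea in C, under stated assumptions: `junk_seed`, `junk_fill` and `junk_malloc` are hypothetical names, and the fill uses a tiny hand-rolled generator so that a failing run can be reproduced from its seed. On the Linux side two existing facilities may already cover this without any code changes: glibc honors a MALLOC_PERTURB_ environment variable that fills allocated and freed blocks with a byte pattern, and Valgrind's Memcheck has --malloc-fill and --free-fill options. The wrapper is mainly of interest for the Windows DLL build, where neither is available.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* State of a small linear congruential generator; a fixed-width type
 * keeps the junk sequence identical on 32-bit and 64-bit builds. */
static uint32_t junk_state = 1u;

/* Choose the junk sequence, so a failing run can be repeated. */
void junk_seed(uint32_t seed)
{
    junk_state = seed;
}

/* Overwrite n bytes at p with pseudo-random junk. */
void junk_fill(void *p, size_t n)
{
    assert(p != NULL || n == 0);
    unsigned char *bytes = p;
    for (size_t i = 0; i < n; i++) {
        junk_state = junk_state * 1103515245u + 12345u;
        bytes[i] = (unsigned char)(junk_state >> 16);
    }
}

/* Drop-in replacement for malloc that never hands back clean memory. */
void *junk_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        junk_fill(p, n);
    return p;
}
```

One way to use it would be a debug-only `#define malloc(n) junk_malloc(n)` placed after the standard headers, then looping the test over many seeds until the one-in-ten failure appears. `junk_fill` can also be pointed at local buffers and structs before they are used. If the code allocates with C++ `new`, the same idea would need a replacement `operator new`, which is not shown here.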
Reply by ●October 7, 2015
Tim Wescott <seemywebsite@myfooter.really> writes:
>> Have you run the code with undefined behaviour and address sanitizers
>> turned on?
> The only such sanitizer I know of is valgrind. Do you have other tools
> to suggest?

I mean compiler flags like -fsanitize=address. Clang has some other ones like bounds checking, but I've only used GCC.

You could look at Frama-C (frama-c.com) which is sort of a lint on steroids. I've never tried it myself but have been wanting to look into it.
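For reference, a small sketch of the kind of defect these flags are meant to catch. The functions `sum_buggy` and `sum_ok` and the file name `demo.c` are hypothetical. Note that AddressSanitizer targets out-of-bounds and use-after-free accesses rather than reads of uninitialized memory; for the latter, Valgrind or Clang's MemorySanitizer (-fsanitize=memory) would be the tools to try.

```c
#include <assert.h>
#include <stddef.h>

/* Build with the sanitizers enabled (GCC 4.9 or later, or Clang):
 *
 *   gcc -g -O1 -fno-omit-frame-pointer -fsanitize=address,undefined demo.c
 */

/* Deliberate bug: "<=" reads one element past the end of v. */
int sum_buggy(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i <= n; i++)
        total += v[i];
    return total;
}

/* Corrected loop bound. */
int sum_ok(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

Calling `sum_buggy` on a heap array of exactly n ints in a sanitized build should stop with a heap-buffer-overflow report that names the offending line, whereas an ordinary build will often return a plausible-looking number that depends on whatever happens to sit after the array. That sort of environment-dependent result is one candidate for behaviour that differs between a Linux test harness and a DLL loaded into Labview.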







