EmbeddedRelated.com
The 2026 Embedded Online Conference

Same code, same data, different results

Started by Tim Wescott October 6, 2015
On 07/10/15 22:27, Tim Wescott wrote:
> On Wed, 07 Oct 2015 10:25:42 +0200, David Brown wrote:
>
>> On 07/10/15 02:35, Tim Wescott wrote:
>>
>>> After much trial and tribulation, I managed to get Linux 32 and 64-bit
>>> versions, and Windows 32-bit versions all working. I tracked down my
>>> problems (size_t and unsigned int are not the same size in gcc 64 bit
>>> for Linux), fixed them, and shipped.
>>>
>>
>> There are bigger differences than that. In particular, "long" is 64-bit
>> in 64-bit Linux, but only 32-bit in 64-bit Windows. And size_t is going
>> to be 64-bit in 64-bit Linux and Windows, but 32-bit in 32-bit Linux and
>> Windows. And while size_t may happen to be the same size as unsigned
>> int on some combinations, don't forget that it is not necessarily the
>> same type.
>>
>> Your best way forward here is to treat your programming as carefully as
>> you would for embedded programming. Never make any assumptions about
>> the relationships between types, other than as given by the law (the C
>> or C++ standards, as appropriate). When you want something of a
>> particular size, use the <stdint.h> types.
>>
>>
>> The other big question here is what language(s) you are using, and what
>> compiler(s) you are using. That would be helpful to know.
>>
>> If you are using up-to-date gcc or clang, you have a variety of
>> "sanitize" options that can help. AFAIK some of them only run on 64-bit
>> Linux, since they use memory management tricks that require the wider
>> address space and greater flexibility, but they should still be helpful.
>
> gcc 4.8.4 (it's what came with Ubuntu). "-fsanitize=address" and
> "-fsanitize=thread" seem to be the only sanitize options -- and it passes
> those.
>
It should not be hard to install a newer gcc and get access to a lot more sanitize options. I haven't tried them myself, but they might help you out. <https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#index-fsanitize_003dundefined-680>
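The advice quoted above about type sizes can be made concrete with a small sketch. It assumes a C++11 compiler; the names `print_type_widths` and `narrow_to_u32` are hypothetical and are not taken from Tim's code. It prints the widths that vary between the builds under discussion and shows one cautious way to narrow a `size_t` to a fixed-width type.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Fixed-width types have the same size on every platform that provides them.
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits");
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "int64_t must be 64 bits");

// Prints the widths that can differ between 32-bit and 64-bit Linux and
// Windows builds. (Hypothetical helper, for illustration only.)
void print_type_widths() {
    std::printf("unsigned int: %u bits\n",
                static_cast<unsigned>(sizeof(unsigned int) * CHAR_BIT));
    std::printf("long:         %u bits\n",
                static_cast<unsigned>(sizeof(long) * CHAR_BIT));
    std::printf("size_t:       %u bits\n",
                static_cast<unsigned>(sizeof(std::size_t) * CHAR_BIT));
    std::printf("void*:        %u bits\n",
                static_cast<unsigned>(sizeof(void*) * CHAR_BIT));
}

// Narrows a size_t to 32 bits only after checking that the value fits,
// instead of assuming size_t and unsigned int are interchangeable.
std::uint32_t narrow_to_u32(std::size_t n) {
    assert(n <= UINT32_MAX);
    return static_cast<std::uint32_t>(n);
}
```

Calling `print_type_widths()` once in each build (32-bit Linux, 64-bit Linux, 32-bit Windows) would show which assumptions differ. The `assert` disappears when `NDEBUG` is defined, so a release build would need an explicit check.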
"David Brown" <david.brown@hesbynett.no> wrote in message 
news:mv562t$am8$1@dont-email.me...
> On 07/10/15 05:27, Paul Rubin wrote:
>> Tim Wescott <seemywebsite@myfooter.really> writes:
>>>> Have you run the code with undefined behaviour and address sanitizers
>>>> turned on?
>>> The only such sanitizer I know of is valgrind. Do you have other tools
>>> to suggest?
>>
>> I mean compiler flags like -fsanitize=address. Clang has some other
>> ones like bounds checking, but I've only used GCC.
>>
>> You could look at Frama-C (frama-c.com) which is sort of a lint on
>> steroids. I've never tried it myself but have been wanting to look into
>> it.
>>
>
> I haven't heard of Frama-C before, but I've found the website, and it is
> now high up on my list of things to read as soon as I get the time.
> Thanks for the pointer.
>
Coming rather late to this, but occasional random FP errors once occurred in a system I wrote. It turned out that the standard ISR preamble/postamble did not save the FP state properly. That was normally harmless, because I tended to avoid FP in ISRs (for reasons which are now probably considered prehistoric).
On Wed, 7 Oct 2015 08:54:38 +0000 (UTC), glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:

>Another one I remember some time ago, I believe on Windows, is
>not initializing the x87 control register, such that rounding modes
>and precision are different between runs. (That is, what the previous
>program left.)
Tim said he was using MinGW (GCC), and I don't recall that being an issue in either MinGW or Cygwin's GCC. It definitely was a problem in Microsoft's compilers, but I recall it only in the 16-bit versions. All the 32-bit and later compilers do initialize the x87.

However, the x87 is initialized only if it is used. Since the Pentium 4 (SSE2), compilers have defaulted to using SIMD for most floating point; the x87 isn't used unless you specifically enable it, e.g., to get extended-precision transcendentals. That can cause problems if the program does not use the x87 but calls libraries which do.

This may have nothing whatsoever to do with Tim's problem, but it's good practice always to initialize the x87 even if you don't plan on using it, because libraries are allowed to assume that the program has done so.
>If you use a test system that detects all attempts to use memory
>that hasn't been given a value, it likely won't notice x87 registers.
Or any other registers. <grin>

George
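One portable way to act on the advice above about not inheriting floating-point state from the caller is sketched here. It uses only the standard `<cfenv>` functions `fegetenv` and `fesetenv` with `FE_DFL_ENV`; exactly which hardware state that covers (x87 control word, SSE control register, precision control) is implementation-specific, so treat this as an assumption to verify on the actual toolchain. `FpEnvGuard` and `scaled_sum` are hypothetical names.

```cpp
#include <cassert>
#include <cfenv>

// Saves the caller's floating-point environment, installs the default one,
// and restores the caller's environment on scope exit.
// (Hypothetical helper; what FE_DFL_ENV resets depends on the C library.)
class FpEnvGuard {
public:
    FpEnvGuard() : active_(false) {
        if (std::fegetenv(&saved_) == 0 && std::fesetenv(FE_DFL_ENV) == 0) {
            active_ = true;
        }
    }
    ~FpEnvGuard() {
        if (active_) {
            std::fesetenv(&saved_);
        }
    }
    FpEnvGuard(const FpEnvGuard&) = delete;
    FpEnvGuard& operator=(const FpEnvGuard&) = delete;

    bool active() const { return active_; }

private:
    std::fenv_t saved_;
    bool active_;
};

// Hypothetical exported entry point: the arithmetic runs with the default
// rounding mode regardless of what the calling program left behind.
double scaled_sum(double a, double b, double scale) {
    FpEnvGuard guard;
    assert(guard.active());
    return (a + b) * scale;
}
```

Placing such a guard at the top of every function a DLL exports would at least rule out "whatever the previous code left in the control register" as a source of run-to-run differences.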
On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott
<seemywebsite@myfooter.really> wrote:

>This is about code that clings to "embedded" by its fingernails -- it's
>running on a fast PC-compatible single-board computer, under Windows, as
>a DLL. So it's not exactly some little thing shoehorned into 4kB of
>flash.
>
>At any rate:
>
>I have a rather complicated algorithm that I've coded up, to do marvelous
>stuff for my customer. It recently grew quite a bit, and in the process
>I've introduced some subtle bugs. I'm looking for ideas on things to
>look for to see if I can figure out what's going on.
>
>Here's the deal:
>
>First, some time this spring I got a shiny new machine, and went ahead
>and loaded 64-bit Linux onto it, with all its 64-bit appurtenances. This
>did not, at the time, cause problems.
>
>I coded up a bunch of changes, tested it on my 64-bit machine, and
>happily shipped it off to my customer -- who reported that it broke,
>horribly.
>
>Oh drat. On top of this, at some point the MinGW stream library broke,
>so my test code no longer worked under Wine -- I could only test with the
>Linux version.
>
>After much trial and tribulation, I managed to get Linux 32 and 64-bit
>versions, and Windows 32-bit versions all working. I tracked down my
>problems (size_t and unsigned int are not the same size in gcc 64 bit for
>Linux), fixed them, and shipped.
>
>So now I'm getting four different results from three different software
>loads and two different circumstances. I can't go into detail, but I'm
>going to give a general story 'cause I'm looking for general things to
>look for:
>
>Under Linux 32-bit I get behavior A (correct operation)
>
>Under Linux 64-bit I get behavior B (correct operation, just different)
>
>Under Wine running a 32-bit Windows program I get behavior B
>
>My customer calls my DLL from Labview. Nine times out of ten he gets
>some correct behavior -- he's not sophisticated enough that I can know
>whether it's A, B or something else. The tenth time the thing fails to
>work correctly.
>
>So, I suspect that I've got some uninitialized memory someplace. But,
>I'm running the Linux versions under Valgrind and it's not finding any
>problems (Valgrind is great, by the way -- great enough that for my
>embedded ARM stuff I do unit testing under Linux and Valgrind).
>
>I'm going through the code with a fine-toothed comb, and so far I've only
>found a few very minor problems that border on the stylistic, although
>one of the changes that I made did improve things a bit.
>
>So -- other than picking through the code line by line, can you guys
>suggest anything that I can do or look for in specific?
>
>Also, does anyone know of a Linux tool that'll randomly populate the heap
>with junk then call a program? I suspect that I'm not seeing the
>"sometimes it is, sometimes not" behavior that my customer is because of
>the different environment, not because Linux is magically fixing my
>bugs. Suggestions on how to make the bugs apparent would be helpful.
>
>Thanks for reading, suggestions welcome -- I'm becoming a candidate for a
>rubber room over this one.
Memory alignment? #pragma pack in a third-party lib? I vaguely recall memory page boundaries in shared DLLs and padding structs accordingly, but that may not apply here.

Cheers
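To illustrate the packing question raised above, here is a minimal sketch of how `#pragma pack` changes a struct's layout, which matters when a struct crosses a DLL boundary and the two sides were compiled with different packing. The struct and function names are hypothetical; `#pragma pack(push, 1)` is accepted by GCC, Clang and MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Default layout: the compiler usually pads after `tag` so that `value`
// is naturally aligned (commonly giving a size of 8, but not guaranteed).
struct NaturalRecord {
    std::uint8_t  tag;
    std::uint32_t value;
};

// Same members with padding suppressed.
#pragma pack(push, 1)
struct PackedRecord {
    std::uint8_t  tag;
    std::uint32_t value;
};
#pragma pack(pop)

static_assert(sizeof(PackedRecord) == 5, "packed layout has no padding");
static_assert(sizeof(NaturalRecord) >= sizeof(PackedRecord),
              "natural layout is never smaller than the packed one");

// Copies a 5-byte image into the packed struct and returns its value field
// in host byte order. If the caller had built the image with the natural
// layout instead, `value` would be read from the wrong offset.
std::uint32_t read_packed_value(const unsigned char* bytes) {
    assert(bytes != nullptr);
    PackedRecord rec;
    std::memcpy(&rec, bytes, sizeof rec);
    return rec.value;
}
```

A mismatch of this kind produces wrong values rather than a crash, so it is a plausible candidate when results differ between builds without any memory checker complaining.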
On 09/10/15 02:51, Martin Riddle wrote:

>
> Memory Alignment? #pragma Pack in a third party lib?
> I vaguely recall memory page boundaries in shared dll's and padding
> structs accordingly, but that may not apply here.
>
That reminds me of one possible issue with DLLs in Windows. gcc generates code that keeps the stack aligned on 16-byte boundaries, to allow better cache line usage and faster SIMD instructions. But Windows and MS compilers use 4-byte stack alignment on 32-bit Windows. There is no problem within a mingw-compiled program, since the startup code sorts out the stack alignment. But if your functions are called from somewhere else, as exported DLL functions or as callbacks from Windows, then the stack alignment may be bad.

The way to fix this is to apply the gcc function attribute "force_align_arg_pointer" to functions that could be called from outside. Alternatively, you can use the "-mstackrealign" compiler flag to make all functions properly align the stack if they need it (the 16-byte alignment is only actually necessary for some SIMD instructions, but these could be generated for code that moves a lot of data around).

Stack misalignments are more likely to cause a crash than other incorrect behaviour, but perhaps exception handling or other error trapping is hiding the real issue.

There are also gcc options for controlling the details of floating point, which may have different default settings on Windows and Linux or in 32-bit and 64-bit modes, leading to marginal differences in some calculation results.

<https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html>
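As a rough sketch of the attribute mentioned above, an exported function might be declared as follows. The macro `REALIGN_STACK` and the function `sum_samples` are hypothetical; the attribute itself is the documented GCC x86 function attribute, and the macro expands to nothing on other compilers or targets. A real Windows DLL would also need its usual export decoration, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>

// Apply the attribute only where GCC-compatible compilers define it (x86).
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#  define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#  define REALIGN_STACK
#endif

// Hypothetical entry point called from foreign code (for example a host
// application built with a different compiler). The attribute asks the
// compiler to realign the stack in the prologue, so any SIMD code generated
// for the body does not depend on the caller's stack alignment.
extern "C" REALIGN_STACK double sum_samples(const double* samples,
                                            std::size_t count) {
    assert(samples != nullptr || count == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        total += samples[i];
    }
    return total;
}
```

Marking only the externally visible entry points keeps the realignment cost off internal calls, whereas "-mstackrealign" applies the same idea to every function.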
On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott wrote:

> This is about code that clings to "embedded" by its fingernails -- it's
> running on a fast PC-compatible single-board computer, under Windows, as
> a DLL. So it's not exactly some little thing shoehorned into 4kB of
> flash.
>
> At any rate:
>
> I have a rather complicated algorithm that I've coded up, to do
> marvelous stuff for my customer. It recently grew quite a bit, and in
> the process I've introduced some subtle bugs. I'm looking for ideas on
> things to look for to see if I can figure out what's going on.
>
> Here's the deal:
>
> First, some time this spring I got a shiny new machine, and went ahead
> and loaded 64-bit Linux onto it, with all its 64-bit appurtenances.
> This did not, at the time, cause problems.
>
> I coded up a bunch of changes, tested it on my 64-bit machine, and
> happily shipped it off to my customer -- who reported that it broke,
> horribly.
>
> Oh drat. On top of this, at some point the MinGW stream library broke,
> so my test code no longer worked under Wine -- I could only test with
> the Linux version.
>
> After much trial and tribulation, I managed to get Linux 32 and 64-bit
> versions, and Windows 32-bit versions all working. I tracked down my
> problems (size_t and unsigned int are not the same size in gcc 64 bit
> for Linux), fixed them, and shipped.
>
> So now I'm getting four different results from three different software
> loads and two different circumstances. I can't go into detail, but I'm
> going to give a general story 'cause I'm looking for general things to
> look for:
>
> Under Linux 32-bit I get behavior A (correct operation)
>
> Under Linux 64-bit I get behavior B (correct operation, just different)
>
> Under Wine running a 32-bit Windows program I get behavior B
>
> My customer calls my DLL from Labview. Nine times out of ten he gets
> some correct behavior -- he's not sophisticated enough that I can know
> whether it's A, B or something else. The tenth time the thing fails to
> work correctly.
>
> So, I suspect that I've got some uninitialized memory someplace. But,
> I'm running the Linux versions under Valgrind and it's not finding any
> problems (Valgrind is great, by the way -- great enough that for my
> embedded ARM stuff I do unit testing under Linux and Valgrind).
>
> I'm going through the code with a fine-toothed comb, and so far I've
> only found a few very minor problems that border on the stylistic,
> although one of the changes that I made did improve things a bit.
>
> So -- other than picking through the code line by line, can you guys
> suggest anything that I can do or look for in specific?
>
> Also, does anyone know of a Linux tool that'll randomly populate the
> heap with junk then call a program? I suspect that I'm not seeing the
> "sometimes it is, sometimes not" behavior that my customer is because of
> the different environment, not because Linux is magically fixing my
> bugs. Suggestions on how to make the bugs apparent would be helpful.
>
> Thanks for reading, suggestions welcome -- I'm becoming a candidate for
> a rubber room over this one.
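For anyone following this saga, I have resorted to weirdness: I've overloaded new to pack the allocated space with random data before the constructor gets to it.

I haven't found a problem so far in 12000 runs, each with a different random number generator seed. So either it's a Windows thing that Wine does not replicate, or it's in my customer's code.

    #include <cstdlib>  // malloc, rand
    #include <new>      // std::bad_alloc

    // Deliberately no "throw ()" specification: that would promise the
    // function never throws, and the bad_alloc below would then terminate
    // the program instead of propagating.
    void * operator new (std::size_t size)
    {
      void * p = malloc(size);
      if (p == 0) { throw std::bad_alloc(); }
      // Fill the block with junk so uninitialized members become visible.
      for (std::size_t n = 0; n < size; ++n)
      {
        static_cast<char *>(p)[n] = rand();
      }
      return p;
    }

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com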
Tim Wescott <seemywebsite@myfooter.really> writes:
> I haven't found a problem so far in 12000 runs, each with a different
> random number generator seed. So either it's a Windows thing that Wine
> does not replicate, or it's in my customer's code.
I agree with the people who suggested you duplicate the customer's environment as much as possible, so you have a better chance of reproducing the problem.

Have you figured out what causes the discrepancy between the 32 and 64 bit versions on your own system? If it's significant and you're not doing something numerically unstable, that seems worth chasing down.

I think if the problem were uninitialized data, valgrind memcheck should have found it. How much code are you talking about?
On 09.10.2015 at 19:25, Tim Wescott wrote:
>
> For anyone following this saga, I have resorted to weirdness: I've
> overloaded new to pack the allocated space with random data before the
> constructor gets to it.
That does beg one question: just how sure are you that the effect is even happening on the heap, and not the stack?
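One crude way to probe the stack side of that question, in the same spirit as filling the heap with junk, is to dirty a region of stack immediately before calling the code under test, so that its uninitialized locals start out holding pseudo-random bytes instead of whatever was there before. This is only a sketch under assumptions: `scribble_stack` is a hypothetical helper, the frame size and depth are arbitrary, and an optimizing compiler or a different calling pattern may mean the dirtied region is not the one the tested code reuses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Writes pseudo-random bytes into `frames` nested stack frames of about
// 1 KiB each and returns a checksum of what was written. The `volatile`
// qualifier keeps the compiler from discarding the stores.
unsigned scribble_stack(unsigned frames) {
    assert(frames <= 256);  // keep total stack use to roughly 256 KiB
    volatile unsigned char junk[1024];
    unsigned checksum = 0;
    for (std::size_t i = 0; i < sizeof junk; ++i) {
        junk[i] = static_cast<unsigned char>(std::rand() & 0xFF);
    }
    for (std::size_t i = 0; i < sizeof junk; ++i) {
        checksum += junk[i];
    }
    if (frames > 1) {
        checksum += scribble_stack(frames - 1);  // dirty deeper frames too
    }
    return checksum;
}
```

In a test harness one might call `std::srand(seed); scribble_stack(64);` before each invocation of the function under test and compare results across seeds. Identical results for every seed would suggest, though not prove, that no uninitialized stack variable influences the output.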
On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott
<seemywebsite@myfooter.really> wrote:

>This is about code that clings to "embedded" by its fingernails -- it's
>running on a fast PC-compatible single-board computer, under Windows, as
>a DLL. So it's not exactly some little thing shoehorned into 4kB of
>flash.
>
>At any rate:
>
>I have a rather complicated algorithm that I've coded up, to do marvelous
>stuff for my customer. It recently grew quite a bit, and in the process
>I've introduced some subtle bugs. I'm looking for ideas on things to
>look for to see if I can figure out what's going on.
>
>Here's the deal:
>
>First, some time this spring I got a shiny new machine, and went ahead
>and loaded 64-bit Linux onto it, with all its 64-bit appurtenances. This
>did not, at the time, cause problems.
>
>I coded up a bunch of changes, tested it on my 64-bit machine, and
>happily shipped it off to my customer -- who reported that it broke,
>horribly.
>
>Oh drat. On top of this, at some point the MinGW stream library broke,
>so my test code no longer worked under Wine -- I could only test with the
>Linux version.
>
>After much trial and tribulation, I managed to get Linux 32 and 64-bit
>versions, and Windows 32-bit versions all working. I tracked down my
>problems (size_t and unsigned int are not the same size in gcc 64 bit for
>Linux), fixed them, and shipped.
>
>So now I'm getting four different results from three different software
>loads and two different circumstances. I can't go into detail, but I'm
>going to give a general story 'cause I'm looking for general things to
>look for:
>
>Under Linux 32-bit I get behavior A (correct operation)
>
>Under Linux 64-bit I get behavior B (correct operation, just different)
>
>Under Wine running a 32-bit Windows program I get behavior B
>
>My customer calls my DLL from Labview. Nine times out of ten he gets
>some correct behavior -- he's not sophisticated enough that I can know
>whether it's A, B or something else. The tenth time the thing fails to
>work correctly.
Does it crash every 10th time you run it with the same parameters, or what?

Anyway, if the fatal problem occurs on some Windows machine (desktop or embedded Windows?), why do you insist on using Linux, or a Windows emulator on Linux, to figure out what is wrong on a Windows system? If you can't get exactly the same configuration, at least use a native Windows version on your own test machine.

Mixing different versions of MS compilers can get you into trouble. If the .EXE and the .DLL are compiled with different versions of the compiler, you may encounter problems, such as when dynamic memory is allocated in the .EXE and freed in the .DLL. You should find out which compiler (and version) LabView was compiled with, and preferably use the same compiler (with the same version and settings) for compiling your DLL. We have had lots of problems due to differing compiler versions and settings. Look carefully at which compiler settings LabView suggests for your DLL. Make sure you use the same LabView version as your customer.

Are you using DllMain to attach to the process (and thread, if you are using multithreading)? If multithreaded, are all the libraries multithread-safe? Some standard C functions are not thread-safe and require special caution in a multithreaded environment.

Another can of worms is that one system may be truly multicore and the other not when running multithreaded applications. If the application has bugs, the different scheduling algorithms in Windows and Linux could give different results.

An unrelated device driver in the final target system could handle interrupts improperly (such as failing to save and restore some registers), which would generate random problems. In the final target system, preferably disable all device drivers to check whether that affects the result of your code.
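The allocate-in-one-module, free-in-another hazard described above is commonly avoided by having the DLL export both the allocation and the release function, so that every block is freed by the same runtime that allocated it. The following is a sketch of that pattern only; the `ResultBuffer` type and the `result_buffer_*` functions are hypothetical and do not come from Tim's DLL or from LabView's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical buffer handed across the module boundary.
struct ResultBuffer {
    double*     values;
    std::size_t count;
};

// Allocates inside the library, zero-initialized. Returns nullptr on failure.
extern "C" ResultBuffer* result_buffer_create(std::size_t count) {
    ResultBuffer* buf =
        static_cast<ResultBuffer*>(std::malloc(sizeof(ResultBuffer)));
    if (buf == nullptr) {
        return nullptr;
    }
    buf->values = static_cast<double*>(std::calloc(count, sizeof(double)));
    if (buf->values == nullptr && count != 0) {
        std::free(buf);
        return nullptr;
    }
    buf->count = count;
    return buf;
}

// Bounds-checked read access for callers.
extern "C" double result_buffer_get(const ResultBuffer* buf, std::size_t i) {
    assert(buf != nullptr && i < buf->count);
    return buf->values[i];
}

// Releases inside the library: the caller never calls free() on these
// pointers itself, so a mismatched runtime heap is never involved.
extern "C" void result_buffer_destroy(ResultBuffer* buf) {
    if (buf == nullptr) {
        return;
    }
    std::free(buf->values);
    std::free(buf);
}
```

The caller treats the pointer as an opaque handle: obtain it from `result_buffer_create`, read through `result_buffer_get`, and hand it back to `result_buffer_destroy`.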
>So, I suspect that I've got some uninitialized memory someplace.
I don't think so. Uninitialized memory alone should not make the program report different results depending on the time of day (or phase of the moon) when it is given _exactly_ the same parameters and sequences. After all, both the Windows and Linux virtual memory systems hand zeroed pages to the dynamic memory manager, so a block carved from a freshly delivered virtual memory page will be zeroed whether malloc() or calloc() is called. Only when a block has first been released with free() and is then returned by a later malloc() may it contain leftover data. I always use calloc() to get properly initialized dynamic memory in all cases.

While you may have identified some of the potential problems, there are dozens of alternative explanations for your problems.
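The distinction drawn above between fresh and recycled blocks can be sketched as follows. The function names are hypothetical. Because reading the contents of a block returned by `malloc()` means reading indeterminate values, the sketch inspects only the `calloc()` result, which the C standard guarantees to be zero-filled even when the allocator reuses a block that was just dirtied and freed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Returns true if every byte in [p, p + n) is zero.
bool all_zero(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

// Dirties a block, frees it, then asks calloc() for a block of the same
// size. The allocator may well hand back the same memory, yet calloc()
// must still return it zeroed. A malloc() call in the same position would
// carry no such guarantee. (An optimizer is allowed to drop the memset as
// a dead store; that does not change the calloc() guarantee.)
bool calloc_zeroes_recycled_block(std::size_t n) {
    assert(n > 0);
    unsigned char* dirty = static_cast<unsigned char*>(std::malloc(n));
    if (dirty == nullptr) {
        return false;
    }
    std::memset(dirty, 0xA5, n);
    std::free(dirty);

    unsigned char* clean = static_cast<unsigned char*>(std::calloc(n, 1));
    if (clean == nullptr) {
        return false;
    }
    const bool zeroed = all_zero(clean, n);
    std::free(clean);
    return zeroed;
}
```

This also hints at why a long-running host process is a harsher environment than a fresh test executable: by the time the DLL is called, the heap has probably seen many allocate and free cycles, so recycled blocks are more likely.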
On Fri, 9 Oct 2015 23:58:39 +0200, Hans-Bernhard Bröker
<HBBroeker@t-online.de> wrote:

>On 09.10.2015 at 19:25, Tim Wescott wrote:
>>
>> For anyone following this saga, I have resorted to weirdness: I've
>> overloaded new to pack the allocated space with random data before the
>> constructor gets to it.
>
>That does beg one question: just how sure are you that the effect is
>even happening on the heap, and not the stack?
Not to mention about half a dozen other possibilities (see my other post:-).