EmbeddedRelated.com
The 2026 Embedded Online Conference

Same code, same data, different results

Started by Tim Wescott October 6, 2015
On Sat, 10 Oct 2015 01:08:11 +0300, upsidedown wrote:

> On Tue, 06 Oct 2015 19:35:13 -0500, Tim Wescott
> <seemywebsite@myfooter.really> wrote:
>
>>This is about code that clings to "embedded" by its fingernails -- it's
>>running on a fast PC-compatible single-board computer, under Windows, as
>>a DLL. So it's not exactly some little thing shoehorned into 4kB of
>>flash.
>>
>>At any rate:
>>
>>I have a rather complicated algorithm that I've coded up, to do
>>marvelous stuff for my customer. It recently grew quite a bit, and in
>>the process I've introduced some subtle bugs. I'm looking for ideas on
>>things to look for to see if I can figure out what's going on.
>>
>>Here's the deal:
>>
>>First, some time this spring I got a shiny new machine, and went ahead
>>and loaded 64-bit Linux onto it, with all its 64-bit appurtenances.
>>This did not, at the time, cause problems.
>>
>>I coded up a bunch of changes, tested it on my 64-bit machine, and
>>happily shipped it off to my customer -- who reported that it broke,
>>horribly.
>>
>>Oh drat. On top of this, at some point the MinGW stream library broke,
>>so my test code no longer worked under Wine -- I could only test with
>>the Linux version.
>>
>>After much trial and tribulation, I managed to get Linux 32 and 64-bit
>>versions, and Windows 32-bit versions all working. I tracked down my
>>problems (size_t and unsigned int are not the same size in gcc 64 bit
>>for Linux), fixed them, and shipped.
>>
>>So now I'm getting four different results from three different software
>>loads and two different circumstances. I can't go into detail, but I'm
>>going to give a general story 'cause I'm looking for general things to
>>look for:
>>
>>Under Linux 32-bit I get behavior A (correct operation)
>>
>>Under Linux 64-bit I get behavior B (correct operation, just different)
>>
>>Under Wine running a 32-bit Windows program I get behavior B
>>
>>My customer calls my DLL from Labview. Nine times out of ten he gets
>>some correct behavior -- he's not sophisticated enough that I can know
>>whether it's A, B or something else. The tenth time the thing fails to
>>work correctly.
>
> Does it crash every 10th time you run it with the same parameters or
> what?
On average, yes.
> Anyway, if the fatal problem occurs on some Windows machine (desktop or
> embedded Windows?), why do you insist on using Linux or some Windows
> emulator on Linux to try to figure out what is wrong in a Windows system?
Because I don't have any spare pots of money lying around with which to buy a new machine, Labview, and Windows. It may come to that, but I'd rather avoid the expense.
> If you can't get exactly the same configuration, at least use some
> native Windows version on your own test machine.
>
> Using different versions of MS compilers, you can end up in trouble.
> If the .EXE and .DLL are compiled with different versions of the
> compiler, you may encounter problems, such as when allocating dynamic
> memory in the .exe and freeing it in the .dll.
We are not there, thankfully.
> You should find out what compiler (and version) LabView has been
> compiled with and preferably use the same compiler (with the same version
> and settings) for compiling your DLL. We have had lots of problems due
> to different compiler versions and settings.
Thank you, no, I am not a Microsoft shop. If it comes to that I'll send them code for them to compile.
> Look carefully at what compiler settings LabView suggests for your DLL.
> Make sure you use the same LabView version as your customer.
>
> Are you using DllMain to attach to the process (and thread, if you are
> using multithreading)?
I have no clue. I am emailing a dll to my customer.
> If multithreaded, are all the libraries thread-safe? Some standard
> C functions are not thread-safe and require special caution if used in
> a multithreaded environment.
>
> Another kettle of worms is that one system is truly multicore and the
> other is not, when running multithreaded applications.
>
> With multithreaded applications, different scheduling algorithms
> could give different results in Windows and Linux, if there are some
> bugs in the application.
This may be a worthwhile path to pursue. My understanding is that they have my DLL running in its own thread, but what's happening outside of that is unknown.
> An unrelated device driver in the final target system could handle
> interrupts improperly (such as failing to save and restore some
> registers in interrupts), which will generate random problems. In the
> final target system, disable preferably all device drivers to check if
> it affects the result of your code.
>
>>So, I suspect that I've got some uninitialized memory someplace.
>
> I don't think so. If the program reports different results depending on
> the time of day (or phase of the moon) with _exactly_ the same
> parameters and sequences, this should not happen.
>
> After all both Windows and Linux virtual memory systems will create
> zeroed pages for the dynamic memory manager, so if a virtual memory page
> is delivered to the C dynamic memory manager, it will be zeroed no
> matter if malloc() or calloc() is called. Only if a block of memory in a
> process has first been free() and then you call malloc () it may contain
> some random data, but I always use calloc() to get properly initialized
> dynamic memory areas in all cases.
>
> While you may have identified some of the potential problems, there are
> dozens of alternative explanations to your problems.
Alternative explanations are good -- it'll help me figure out what's going on.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
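The size_t / unsigned int mismatch mentioned in the original post can bite in more than one way. The thread does not say which form the actual bug took, so the following is only a minimal sketch of one common variant, where a "not found" sentinel is squeezed through an unsigned int. The names `NOT_FOUND`, `find_byte`, `is_missing_buggy` and `is_missing_correct` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel, in the style of C++'s std::string::npos. */
#define NOT_FOUND ((size_t)-1)

/* Returns the index of the first `key` in buf, or NOT_FOUND. */
size_t find_byte(const unsigned char *buf, size_t len, unsigned char key)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == key)
            return i;
    }
    return NOT_FOUND;
}

/* Buggy: the result is narrowed to unsigned int before the comparison. */
int is_missing_buggy(const unsigned char *buf, size_t len, unsigned char key)
{
    unsigned int pos = (unsigned int)find_byte(buf, len, key);
    /* pos is widened back to size_t here. With a 32-bit unsigned int and
       a 64-bit size_t, 0xFFFFFFFF never equals 0xFFFFFFFFFFFFFFFF, so a
       missing key is never reported. With both at 32 bits it "works". */
    return pos == NOT_FOUND;
}

/* Correct: keep the value in the type the function returns. */
int is_missing_correct(const unsigned char *buf, size_t len, unsigned char key)
{
    size_t pos = find_byte(buf, len, key);
    return pos == NOT_FOUND;
}
```

On a 32-bit build both versions agree; on a 64-bit Linux build the buggy one silently stops detecting the missing case, which is the kind of "same code, same data, different results" behavior described above.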
Tim Wescott <seemywebsite@myfooter.really> writes:
> Because I don't have any spare pots of money lying around with which to
> buy a new machine, Labview, and Windows.
> It may come to that, but I'd rather avoid the expense.
You can get a Windows remote desktop (not a super powerful one) for 3 months for free here: https://www.runabove.com/deskaas.xml

There are also tons of commercial providers who rent stuff like that for a few cents an hour. I guess that doesn't help with Labview, but can't you mock that out?
> This may be a worthwhile path to pursue. My understanding is that they
> have my DLL running in its own thread, but what's happening outside of
> that is unknown.
How does their code communicate with yours?
> Alternative explanations are good -- it'll help me figure out what's
> going on.
Are you anywhere near the customer physically, so you can debug at their site, and is that workable? If yes that may be the simplest.
On Tue, 06 Oct 2015 22:19:53 -0500, Tim Wescott wrote:

> However, the customer's one out of ten problem is, I'm pretty sure,
> different -- first, because it's a failure and not just a little
> difference, and second, he's running the same file through all the time,
> and occasionally it's spitting up. I don't know what could cause that in
> my code other than using an uninitialized variable.
>
> It may possibly be a bug on his side, but I don't want to start pointing
> at his side of things unless I'm pretty certain of mine.
If you're mixing MinGW-compiled code and MSVC-compiled code in the same process, one factor to consider is that the ABIs aren't quite the same. Alignment constraints can be different, and some versions of MinGW have 80-bit "long double" whereas MSVC has 64-bit "long double".

Also, don't try to transfer ownership of heap blocks between DLLs (or between DLLs and the EXE). Whichever module created a block with malloc/calloc/realloc is the only one which can safely call realloc() or free() on it. The reason is that each module can link to a different version of the MSVCRT DLL, each with a separate heap. Similar issues may exist for other types, e.g. FILE*.

If it was an uninitialised data bug in your code, I think valgrind would find it.

If you have numerically-unstable floating-point algorithms, try calling fesetround() explicitly (if that's available).
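One common way to follow the heap-ownership advice is for the DLL to export both a create and a destroy function, so that every block is allocated and freed by the same C runtime. The sketch below is an illustration only; `result_buf` and its functions are hypothetical names, not part of the code discussed in this thread, and the export decoration a real DLL would need is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result container owned by the DLL. */
typedef struct result_buf {
    size_t  len;
    double *samples;
} result_buf;

/* Allocate inside the DLL. calloc() zero-fills, so no field starts out
   uninitialized (all-bits-zero is 0.0 on IEEE 754 platforms). */
result_buf *result_buf_create(size_t len)
{
    result_buf *rb = calloc(1, sizeof *rb);
    if (rb == NULL)
        return NULL;
    rb->samples = calloc(len ? len : 1, sizeof *rb->samples);
    if (rb->samples == NULL) {
        free(rb);
        return NULL;
    }
    rb->len = len;
    return rb;
}

/* Bounds-checked store, so callers never touch the raw pointer. */
void result_buf_set(result_buf *rb, size_t index, double value)
{
    assert(rb != NULL && index < rb->len);
    rb->samples[index] = value;
}

/* Free inside the DLL too. The caller never calls free() itself. */
void result_buf_destroy(result_buf *rb)
{
    if (rb == NULL)
        return;
    free(rb->samples);
    free(rb);
}
```

The caller (LabVIEW or an EXE) only ever holds the pointer and hands it back to `result_buf_destroy`, so it does not matter which C runtime the caller itself was built against.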
On Sat, 10 Oct 2015 03:01:39 +0100, Nobody <nobody@nowhere.invalid>
wrote:

>If you're mixing MinGW-compiled code and MSVC-compiled code in the same
>process, one factor to consider is that the ABIs aren't quite the same.
>Alignment constraints can be different, and some versions of MinGW
>have 80-bit "long double" whereas MSVC has 64-bit "long double".
That's version dependent. Up to 2005, MSVC supported the 80-bit extended type. Versions from 2005 onward still support using the x87 with appropriate architecture switches, but they do not support loading or storing extended values in memory.

George
On Sat, 10 Oct 2015 00:32:36 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Sat, 10 Oct 2015 03:01:39 +0100, Nobody <nobody@nowhere.invalid>
>wrote:
>
>>If you're mixing MinGW-compiled code and MSVC-compiled code in the same
>>process, one factor to consider is that the ABIs aren't quite the same.
>>Alignment constraints can be different, and some versions of MinGW
>>have 80-bit "long double" whereas MSVC has 64-bit "long double".
>
>That's version dependent. Up to 2005, MSVC supported the 80-bit
>extended type. Versions from 2005 onward still support using the x87
>with appropriate architecture switches, but they do not support
>loading or storing extended values in memory.
I don't believe any of the 32 bit versions of MSVC ever supported 80-bit long doubles. AFAIK, all of the 16-bit versions did, but those are not relevant for linking with Mingw.
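Whatever the history, it is easy to check what a given compiler actually does with long double. The following is a small sketch using only the standard <float.h> macros; the function names are made up for illustration. Compiling it once with each toolchain that ends up in the same process shows whether the two agree.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* 53 means long double is effectively a double; 64 means x87 extended
   precision, which is what GCC on x86 typically uses. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Returns 1 if long double and double have the same size and precision. */
int long_double_matches_double(void)
{
    return LDBL_MANT_DIG == DBL_MANT_DIG
        && sizeof(long double) == sizeof(double);
}

/* Formats a one-line report into out. Returns what snprintf returns. */
int describe_long_double(char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    return snprintf(out, out_size,
                    "long double: %u bytes, %d mantissa bits; "
                    "double: %u bytes, %d mantissa bits",
                    (unsigned)sizeof(long double), LDBL_MANT_DIG,
                    (unsigned)sizeof(double), DBL_MANT_DIG);
}
```

If the two toolchains disagree, the simplest defence is to keep long double out of every struct and function signature that crosses the DLL boundary and use double there instead.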
On Fri, 09 Oct 2015 17:40:16 -0700, Paul Rubin wrote:

> Tim Wescott <seemywebsite@myfooter.really> writes:
>> Because I don't have any spare pots of money lying around with which to
>> buy a new machine, Labview, and Windows.
>> It may come to that, but I'd rather avoid the expense.
>
> You can get a Windows remote desktop (not a super powerful one) for 3
> months for free here: https://www.runabove.com/deskaas.xml
>
> There are also tons of commercial providers who rent stuff like that for
> a few cents an hour. I guess that doesn't help with Labview, but can't
> you mock that out?
>
>> This may be a worthwhile path to pursue. My understanding is that they
>> have my DLL running in its own thread, but what's happening outside of
>> that is unknown.
>
> How does their code communicate with yours?
I tried to keep that part of the interface fairly simple -- they call an "update" function which returns a status, the status indicates when various bits of information are available, and there are various functions to fetch the information that's available. All of the guts are well encapsulated in the DLL, and the problem is happening in the guts -- not in the interface part.
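An interface of the shape described here (an update call that returns a status, plus fetch functions gated by that status) might look roughly like the sketch below. Every name in it (`engine`, `engine_update`, `ENGINE_ESTIMATE_READY`, and so on) is hypothetical, and the "algorithm" is a stand-in that just averages four samples; the real interface is not shown in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status bits returned by engine_update(). */
enum {
    ENGINE_STATUS_NONE    = 0,
    ENGINE_ESTIMATE_READY = 1 << 0
};

/* All state lives in one struct, so nothing is left in globals. */
typedef struct engine {
    unsigned status;
    double   sum;
    size_t   count;
    double   estimate;
} engine;

/* Explicitly initialize every field; an uninitialized member here would
   produce exactly the kind of run-to-run variation discussed above. */
void engine_init(engine *e)
{
    assert(e != NULL);
    e->status   = ENGINE_STATUS_NONE;
    e->sum      = 0.0;
    e->count    = 0;
    e->estimate = 0.0;
}

/* Feed one sample; returns the status bits. Stand-in algorithm: every
   fourth sample, publish the mean of the last four. */
unsigned engine_update(engine *e, double sample)
{
    assert(e != NULL);
    e->sum += sample;
    e->count++;
    if (e->count == 4) {
        e->estimate = e->sum / 4.0;
        e->sum      = 0.0;
        e->count    = 0;
        e->status  |= ENGINE_ESTIMATE_READY;
    }
    return e->status;
}

/* Fetch the estimate if one is available. Returns 1 on success, 0 if
   nothing is ready. Clears the ready bit once the value is read. */
int engine_get_estimate(engine *e, double *out)
{
    assert(e != NULL && out != NULL);
    if (!(e->status & ENGINE_ESTIMATE_READY))
        return 0;
    *out = e->estimate;
    e->status &= ~(unsigned)ENGINE_ESTIMATE_READY;
    return 1;
}
```

A harness that replays the customer's input file through such an interface many times, comparing the outputs between runs, is one way to reproduce a one-in-ten failure without needing LabVIEW.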
>> Alternative explanations are good -- it'll help me figure out what's
>> going on.
>
> Are you anywhere near the customer physically, so you can debug at their
> site, and is that workable? If yes that may be the simplest.
14 hour flight and an international border. It's not totally out of reason, but buying a cheap Windows machine and Labview would be easier and cheaper.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 07.10.2015 3:35, Tim Wescott wrote:
> Thanks for reading, suggestions welcome -- I'm becoming a candidate for a
> rubber room over this one.

Just to throw in my two cents.

1. Reentrancy of your code is certainly not an issue?
2. I presume that endianness at your and customer's machines is taken into account? (Just mentioning that because it's not exactly impossible.)

Regards,
Evgeny.
On Sun, 11 Oct 2015 11:49:59 +0300, Evgeny Filatov wrote:

> On 07.10.2015 3:35, Tim Wescott wrote:
>> Thanks for reading, suggestions welcome -- I'm becoming a candidate for
>> a rubber room over this one.
>
> Just to throw in my two cents.
>
> 1. Reentrancy of your code is certainly not an issue?
In the sense that it should only ever be called from one thread, yes. In the sense that it is 100% reentrant -- I suspect not.
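The difference between "only called from one thread" and "fully reentrant" usually comes down to hidden static state. The sketch below contrasts the two with a hypothetical one-pole smoothing filter; `smooth_static`, `smooth_ctx` and `smoother` are illustrative names, not anything from the DLL in question.

```c
#include <assert.h>
#include <stddef.h>

/* Not reentrant: the filter state hides in a function-level static, so
   every caller (and every thread) shares one history. */
double smooth_static(double x)
{
    static double state = 0.0;
    state = 0.5 * state + 0.5 * x;
    return state;
}

/* Reentrant: the caller owns the state and passes it in. */
typedef struct smoother {
    double state;
} smoother;

double smooth_ctx(smoother *s, double x)
{
    assert(s != NULL);
    s->state = 0.5 * s->state + 0.5 * x;
    return s->state;
}
```

Two independent data streams run through `smooth_ctx` with separate `smoother` objects stay independent. Run through `smooth_static`, the second stream's first output already depends on the first stream's history, and if the host ever calls in from a second thread the result also depends on scheduling.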
> 2. I presume that endianness at your and customer's machines is taken
> into account? (Just mentioning that because it's not exactly
> impossible.)
If that was an issue I would expect the problems to be immediate and horrible, not merely false results every once in a while.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Tim Wescott <seemywebsite@myfooter.really> writes:
>> Are you anywhere near the customer physically, so you can debug at their
>> site, and is that workable? If yes that may be the simplest.
> 14 hour flight and an international border.
Yuggh.
> It's not totally out of reason, but buying a cheap Windows machine and
> Labview would be easier and cheaper.
One issue there is it's best if you can run the customer's entire application on your Windows computer: would they allow that? If not, maybe you can get remote access to their machine somehow. Labview is pretty expensive too, if you have to buy it. I wonder if there are cloud instances available.
On 12/10/15 07:00, Tim Wescott wrote:
> On Sun, 11 Oct 2015 11:49:59 +0300, Evgeny Filatov wrote:
>> On 07.10.2015 3:35, Tim Wescott wrote:
>>> Thanks for reading, suggestions welcome -- I'm becoming a candidate for
>>> a rubber room over this one.
>> Just to throw in my two cents.
...
>> 2. I presume that endianness at your and customer's machines is taken
>> into account? (Just mentioning that because it's not exactly
>> impossible.)
> If that was an issue I would expect the problems to be immediate and
> horrible, not merely false results every once in a while.
The first-ever port of Unix to a big-endian architecture was done at the University of Melbourne. The bootstrap loaders of the time would print the name of the file to be loaded - "unix" - then load it into memory and jump to the start.

The first (and only) thing that the Interdata 8/32 (big-endian) port said?

> nuxi
(... silence)

I think that "nuxi!" should be adopted as the correct expletive for anyone who encounters endian problems. :)

Clifford Heath.
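The "nuxi" output is what you get when the two bytes of each 16-bit word in "unix" trade places. A minimal sketch that reproduces it, together with a run-time byte-order probe (both function names are made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap the two bytes of every 16-bit pair in buf, in place. A trailing
   odd byte is left alone. */
void swap_byte_pairs(char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 1 < len; i += 2) {
        char tmp   = buf[i];
        buf[i]     = buf[i + 1];
        buf[i + 1] = tmp;
    }
}

/* Returns 1 if the least significant byte of a 16-bit value is stored
   first in memory (little-endian), 0 otherwise. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Calling `swap_byte_pairs` on the four characters of "unix" yields "nuxi", and calling it again restores the original.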