The Hardest Bug I Never Solved
I agreed to four hours.
Four hours to help hunt down and kill a bug. A terrible malicious bug that was eating away at this project, wreaking havoc upon the foundations of a critical feature, and draining time randomly from every one of eight firmware engineers on this project. Quite honestly, I can’t remember the last time it took more than an hour or two for me to locate, isolate, and fix a firmware bug. Surely I could help find and solve this issue within four hours?
This is the story of the hardest bug I never solved.
The product afflicted with this nasty bug was an IoT style project with sensors, a user interface, and data connections. When I arrived onto the project most of the foundational code was in place and operating. The unique RTOS selected for this project was humming along and driving the selected MCU, an STM32F427. SPI buses were clocking and moving millions of bytes. The MCU’s DMA controller was offloading CPU time and moving data along to its final destination. The SDIO bus was moving bits one at a time and enabling the device’s primary data connection. A unique set of on-target unit-tests were available and passing. And it was working. Most of the time.
Sometimes, during the startup and initialization of the firmware, it would “crash.” More precisely: the MCU would find itself mysteriously in the reset handler, reboot, and then once again find itself in the reset handler: repeat forever. In the meantime, the firmware team had learned to avoid this bug. This bug could be “fixed” by adjusting various static buffer sizes until this “crash” went away. Sometimes it would take 30 seconds to find the magic numbers. Sometimes it would take many minutes to find the magic numbers. This bug was a time-leach and needed to be solved, not avoided.
Given the above information, I conjectured that a buffer involved with DMA transactions must be misaligned. I spent most of my four hours tracking down all said-buffers and double checking the MCU’s datasheet for DMA buffer alignment requirements. Every buffer I examined checked out and every buffer I examined was aligned to the MCU’s requirements. Additionally, I carefully inspected associated code for buffer overflows, although the symptoms did not exactly point to such a flaw. I did find fragile code needing improvement, but the targeted bug remained alive, gnawing at me and the foundations of this project. My four hours came and went. I reported to the customer I had failed and would be returning to my primary assignment per our agreement.
The customer then asked another experienced firmware engineer on the team to hunt down our nasty time-leach. Another experienced firmware engineer came and went.
And then another.
At this point I didn’t feel too bad. At this point I was in very good company.
Finally the case was assigned to Unnikrishnan Sreekumar (Unni), one of the “junior” engineers on the team. I say “junior” only because Unni was junior relative to most of the engineers on this project. His skill-set is certainly not junior.
A day or two later the case was solved. The bug was squashed, the time-leach was obliterated forever. I was literally singing hallelujah when the news appeared as a pull-request from Unni in our revision control system.
Ultimately this time-leach was, as everyone guessed, caused by misaligned data. In this particular case, it was caused when the firmware relocated the MCU’s ISR vector table into RAM. The code-in-question properly accounted for the alignment requirements found in the ARM datasheet for the vector table offset register (VTOR). However, that same datasheet also contains a critical sentence:
“See your vendor documentation for the alignment details of your device.”
One overlooked sentence in a datasheet - the curse of firmware engineers everywhere.
Following that lead, Unni found that the STM32F427 MCU required 512 bytes alignment for the interrupt vector table. The project’s code was enforcing a minimum alignment of 256 bytes, hence sometimes the linker happened to place this vector into a 512 byte aligned location and sometimes into a 256 byte aligned location. When the ISR vector was incorrectly aligned, one of the other interrupts would hit the reset handler instead of the actual interrupt, and our “crash” would occur.
With that nagging issue solved, the firmware team moved forward with increased confidence, adding features and delivering firmware with improved efficiency.
And that is the story of the hardest bug I never solved.
I have noticed a trend in my scars from creating forehead-shaped holes in the walls when looking for bugs. They fall into a couple of categories. First there are the dumb ones - off-by-one, using a bit number instead of converting it to a bit mask, etc. Usually when I am tired. The next set are the "I better re-think this whole thing" - fortunately, those are rare.
THEN - there are the ones that take 3 days.... Everything looks good. Register bits are correct. Setup is correct, Things seem to work - except when they rarely don't. Or everything was working, but one minor change went out the window. Just need to keep simplifying and poking at it.
These almost always boil down to one or two lines of code, or a GUI check box that is wrong, or reading the same line in a datasheet for the NNNth time before the meaning flips to reality. Or keep reading forums til you come across a breadcrumb trail...
I had a race condition in a 68332 - two SPI devices on the same port. Three days. Turns out the 68332 have "shadow" registers and I had the equivalent of a race condition causing the equivalent of a priority inversion. One line to fix, and 50 lines of comments on what causes the bug and why the one line fixed it.
Another rare but nasty cause of those "3-day" issues are erratas. Not the ones in the data sheet, but the ones you--the customer--find that eventually make it in to the next rev of the data sheet. I tracked a strange glitch in a Motorola 56000 DSP down once to what had to be a bug in its operation. I discussed it with the Motorola field rep, and they assured me it couldn't be a flaw in the chip. Needless to say, I was a little peeved when 6 months later a new rev of the chip came out with the exact bug I found now added (and fixed) on the errata pages of the new data sheet.
Yep. What were the details? 56000 is an interesting chip. One system I worked on used it as a fast bit-banger. Learned the asm in a morning.
Microchip I2C ADC (forget the number) - chased a bug for a full 24 hours (in one day, so logical 3 days). the ADC returned 16 bits. Every once in a while, the LS byte == 00. Turns out you could do repeated reads (which I knew) - but the HW bug only happened on the first read. I ended up throwing the first read away, then averaging the next 3..4 reads. I wrote a report && gave it to them at the next embedded systems show.
One lesson from that: I had seen the bug before, but had a lot still to do, so i ignored it. Then - just before delivery, it popped up again. Lesson: deal with these things when they pop up.
Also found a HW bug in the 68000 - udiv had a race condition. I had a test case that would fail reliably, but could not force the error. I heard manufacturing had found some temperature sensitive chips. The CPU board was on an extender card, so I ran the test in a loop && hit the chip with a heat gun - FAIL!! hit it with cold spray - WORKED!! I gave motorola a full report. Turns out they changed where udiv started executing from the beginning of a clock cycle, to the half cycle. Sometimes the udiv took a bit more than a half-cycle to execute, so it bled over into the next full cycle. Turns out the test maching on the production line was not working properly and did not catch the bug. Date codes 8303 thru 8318.
I had a cartoon from datamation where they would print a medieval portrait and asked for a caption. The one i saved I put it on the front page of the report. The winning caption? "My software is perfect. It must be the hardware". hhehehe - that felt good ...
Do firmware && you will probably find one of these things in your career.
OMG, that was 25 years ago, and the memory is fading fast :-) IIRC it had something to do with DMA transfers between external (host bus) and internal memory hanging when a parallel (but not overlapping) internal-to-internal DMA transfer was taking place on a different DMA channel, because the DMA block was ignoring the DMA-start bit occasionally. IIRC my work-around was to reset the DMA channel before starting it. Slowed down coefficient input from the external host a few cycles, but wasn't a showstopper in the end. As I'm sure you recall Motorola was a huge player in those days (now On Semi), the design team for the 56000 was in Israel, and no one wanted to admit the chip had other problems, until that errata (and a few more) showed up in the next rev as 'fixed'. Still a lovely chip though...Harvard architecture, fixed-point arithmetic, butterfly DMAs, and fast in its day.
Quite agree with you about the freeze spray and heat gun debugging techniques. Two of the most useful tools in the embedded engineer's arsenal :-) Interesting about that udiv bug. The 68000 was the first processor I designed in to a product (and later upgraded to a 68020). The world would be a better place if IBM had chosen that chip and not the Intel 8088 for the PC. Proper network big-endian order, relatively orthogonal instruction set, vectored priority interrupts, flat address space, and user/supervisor mode. Of course Apple used it in the original Mac, but by then it was too late and segmented little-endian stuff was here to stay. Sigh. (Hope I'm not starting a thread war here :-) )
Yeah - I love the 68K/coldfire architecture. Clean, simple, logical. I really like the NetBurner stuff - coldfire + uC/OS - recommend it highly. really nice platform for real-world applications && a good way to learn RTOS.
Because I found the 68k udiv bug, we were a beta site for the 68020. I have a copy of my log page showing the bug and one of the chips framed on the wall.
And yes, the world would be (positively) different if they chose the 68k. And if they chose CP/M ... If I had a time machine, that would one thing I would change. You are not starting a flame war - just a bunch of kvetching old farts ranting ... :^)
One lesson from that: I had seen the bug before, but had a lot still to do, so i ignored it. Then - just before delivery, it popped up again. Lesson: deal with these things when they pop up
Could not agree more! Thanks for commenting!
Yes, erratas have been on my mind lately too:
Yep - but by that time, you are dedicated on that chip - hardware is cut && now it's Cursing Time.
"We will just fix it in software" ... mumble growl kick the chair pull hair gnash howl, and other fancy words. Time to break out "vocabulary". And - the real pisser is when you yourself has chosen the wretched thing because it has the fancy thing you really need, and it has the bad errata on the very thing you really need.
Another class of firmware bug is the mines left for you by the previous developer that only show up when there is a respin or die shrink of the hardware (usually the CPU). I had one case where I was doing support for a small manufacturer. They only produced a few hundred boards a year, so occasionally a part would go EOL and get replaced by an "improved" product or go through a die shrink. A lot of times the cost would go down, but there were always changes, and hidden in the change documentation there was usually something that affected the product, but wasn't obvious at first glance - especially if you didn't write the code and had not poked into it much. In this case it was a venerable 8051 OTP part that talked to another board using a serial interface. At one point they must have been thinking about using one of the external interrupts as an RTS signal, because they had it routed to the I/O connector, but they never used it, so it was unconnected. In the firmware they masked off the interrupt in a control register and must have felt that was "good enough", because they did not set the vector for it in the interrupt table to something safe. It worked, so no problem, right? Yeah, until the die shrink when that interrupt was made non-maskable. Random lockups at the customer sites were the result. My predecessor was dumbfounded. I found it in a day. Added an ISR for that interrupt and put in an engineering change to have that line grounded on boards going out or being serviced in the future. That particular job was full of those kind of things. Glad I'm not there anymore.
To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.
Registering will allow you to participate to the forums on ALL the related sites and give you access to all pdf downloads.