The Hardest Bug I Never Solved
I agreed to four hours.
Four hours to help hunt down and kill a bug. A terrible malicious bug that was eating away at this project, wreaking havoc upon the foundations of a critical feature, and draining time randomly from every one of eight firmware engineers on this project. Quite honestly, I can’t remember the last time it took more than an hour or two for me to locate, isolate, and fix a firmware bug. Surely I could help find and solve this issue within four hours?
This is the story of the hardest bug I never solved.
The product afflicted with this nasty bug was an IoT style project with sensors, a user interface, and data connections. When I arrived onto the project most of the foundational code was in place and operating. The unique RTOS selected for this project was humming along and driving the selected MCU, an STM32F427. SPI buses were clocking and moving millions of bytes. The MCU’s DMA controller was offloading CPU time and moving data along to its final destination. The SDIO bus was moving bits one at a time and enabling the device’s primary data connection. A unique set of on-target unit-tests were available and passing. And it was working. Most of the time.
Sometimes, during the startup and initialization of the firmware, it would “crash.” More precisely: the MCU would find itself mysteriously in the reset handler, reboot, and then once again find itself in the reset handler: repeat forever. In the meantime, the firmware team had learned to avoid this bug. This bug could be “fixed” by adjusting various static buffer sizes until this “crash” went away. Sometimes it would take 30 seconds to find the magic numbers. Sometimes it would take many minutes to find the magic numbers. This bug was a time-leach and needed to be solved, not avoided.
Given the above information, I conjectured that a buffer involved with DMA transactions must be misaligned. I spent most of my four hours tracking down all said-buffers and double checking the MCU’s datasheet for DMA buffer alignment requirements. Every buffer I examined checked out and every buffer I examined was aligned to the MCU’s requirements. Additionally, I carefully inspected associated code for buffer overflows, although the symptoms did not exactly point to such a flaw. I did find fragile code needing improvement, but the targeted bug remained alive, gnawing at me and the foundations of this project. My four hours came and went. I reported to the customer I had failed and would be returning to my primary assignment per our agreement.
The customer then asked another experienced firmware engineer on the team to hunt down our nasty time-leach. Another experienced firmware engineer came and went.
And then another.
At this point I didn’t feel too bad. At this point I was in very good company.
Finally the case was assigned to Unnikrishnan Sreekumar (Unni), one of the “junior” engineers on the team. I say “junior” only because Unni was junior relative to most of the engineers on this project. His skill-set is certainly not junior.
A day or two later the case was solved. The bug was squashed, the time-leach was obliterated forever. I was literally singing hallelujah when the news appeared as a pull-request from Unni in our revision control system.
Ultimately this time-leach was, as everyone guessed, caused by misaligned data. In this particular case, it was caused when the firmware relocated the MCU’s ISR vector table into RAM. The code-in-question properly accounted for the alignment requirements found in the ARM datasheet for the vector table offset register (VTOR). However, that same datasheet also contains a critical sentence:
“See your vendor documentation for the alignment details of your device.”
One overlooked sentence in a datasheet - the curse of firmware engineers everywhere.
Following that lead, Unni found that the STM32F427 MCU required 512 bytes alignment for the interrupt vector table. The project’s code was enforcing a minimum alignment of 256 bytes, hence sometimes the linker happened to place this vector into a 512 byte aligned location and sometimes into a 256 byte aligned location. When the ISR vector was incorrectly aligned, one of the other interrupts would hit the reset handler instead of the actual interrupt, and our “crash” would occur.
With that nagging issue solved, the firmware team moved forward with increased confidence, adding features and delivering firmware with improved efficiency.
And that is the story of the hardest bug I never solved.
Previous post by Matthew Eshleman:
Tenderfoot: How to Write a Great Bug Report
I have noticed a trend in my scars from creating forehead-shaped holes in the walls when looking for bugs. They fall into a couple of categories. First there are the dumb ones - off-by-one, using a bit number instead of converting it to a bit mask, etc. Usually when I am tired. The next set are the "I better re-think this whole thing" - fortunately, those are rare.
THEN - there are the ones that take 3 days.... Everything looks good. Register bits are correct. Setup is correct, Things seem to work - except when they rarely don't. Or everything was working, but one minor change went out the window. Just need to keep simplifying and poking at it.
These almost always boil down to one or two lines of code, or a GUI check box that is wrong, or reading the same line in a datasheet for the NNNth time before the meaning flips to reality. Or keep reading forums til you come across a breadcrumb trail...
I had a race condition in a 68332 - two SPI devices on the same port. Three days. Turns out the 68332 have "shadow" registers and I had the equivalent of a race condition causing the equivalent of a priority inversion. One line to fix, and 50 lines of comments on what causes the bug and why the one line fixed it.
To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.
Registering will allow you to participate to the forums on ALL the related sites and give you access to all pdf downloads.