EmbeddedRelated.com

How to avoid a task not executed in a real-time OS?

Started by Robert Willy January 21, 2019
On 22/01/2019 22:10, Hans-Bernhard Bröker wrote:
> On 22.01.2019 at 09:19, David Brown wrote:
>> On 21/01/2019 13:40, Robert Willy wrote:
>>> Hi,
>>> I was asked the question in the title some time ago. I had some real-time
>>> embedded system experience but with RTOS. A watch-dog can avoid a task not
>>> called in a real-time system. But in a RTOS, what is the right answer for it?
>>
>> I have heard the use of watchdogs as being like hitting a dead man on
>> the head with a hammer in the hope that it will wake him.
>
> That's overstating it a teeny little bit. ;-)
>
> It's more like hitting a newly dead heart with a hefty jolt of
> electricity to possibly make it restart --- a procedure that is quite
> definitely not recommended to be used on a non-dead one.
Perhaps. But like a defibrillator, the watchdog does nothing to deal with the actual cause of the problem.
> And just like the bite of a watchdog, that sometimes actually does work.
>
>> If the problem is in software, however, it will just lead to the
>> same situation again and again.
>
> That's by no means certain. It all depends on how it happened that the
> software got itself stuck in a situation that didn't occur during
> testing (or the software would never have been released into the wild,
> right?) But somehow, right now it did.
>
> If e.g. a once-in-a-blue-moon "forbidden" excession of design
> limitations on some input was the reason, a watch-dog reset cures the
> problem until the next event of that kind --- i.e. possibly forever.
We can say without doubt that the watchdog does not cure the problem. If software causes a hang that triggers the watchdog, there is a bug in the software. That applies regardless of how it happened, how good or bad your testing was, what the input values were, etc. (Note that if something exceeded /specified/ design limitations, then that is outside the realm of the software.) Since the watchdog does not magically fix the software, the problem remains.

Clearly, not all systems need the same level of quality, reliability, and robustness. You don't design and test your "amusing" singing birthday card to the same levels as you do for your submarine control system. So sometimes a watchdog reset on a software hang is a good enough way to handle the symptoms of some kinds of software bugs. You balance the cost of the unreliability against the cost of fixing it - engineering is about making things good enough, not perfect.

But you do need to be aware of what the watchdog actually does, and does not do. Some developers use it as a crutch to avoid the effort of writing correct code, or of testing appropriately. "If there is an error on the communication line, it will lead to a timeout - the watchdog will reset the system, so that's fine." "The tasks will only have a conflict and a deadlock if the user presses the button at the same time as the screen is updating - that is unlikely to happen, and the watchdog will fix it if it does." Some use it as a crutch to avoid debugging and fixing problems: "The software hung during testing, but the watchdog restarted it fine. We don't think it will happen at the customer's site." Or "A watch-dog can avoid a task not called in a real-time system", as the OP claimed.

That does /not/ mean I don't recommend a watchdog (though frequently I do not enable them - I'd rather the customer reported the problem so we can fix it properly). You just have to know /why/ you have a watchdog, and use it appropriately.
David Brown <david.brown@hesbynett.no> writes:
> You don't design and test your "amusing" singing birthday card to the > same levels as you do for your submarine control system.
Obligatory: https://www.washingtonpost.com/news/morning-mix/wp/2017/09/25/the-navys-adding-a-new-piece-of-a-equipment-to-nuclear-submarines-xbox-controllers
On 23/01/2019 22:13, Paul Rubin wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> You don't design and test your "amusing" singing birthday card to the
>> same levels as you do for your submarine control system.
>
> Obligatory:
>
> https://www.washingtonpost.com/news/morning-mix/wp/2017/09/25/the-navys-adding-a-new-piece-of-a-equipment-to-nuclear-submarines-xbox-controllers
We used a Playstation controller for a whole submarine, not just a periscope. (It was an ROV - a remotely operated vehicle - so no people in it.)
A watchdog timer is really a hardware-assisted, time-based assertion in the code. As such, it is just a part of the larger software development strategy known as Design by Contract (DbC).

The value of identifying the watchdog timer as an *assertion* is that it informs you what to expect from it. For example, you can't expect an assertion to "avoid" or "fix" a problem (like in the OP "avoid a task not executed"). This is because assertions neither handle nor prevent errors, in the same way as fuses in electrical circuits don't prevent accidents or abuse. In fact, a fuse is an intentionally introduced weak spot in the circuit that is designed to fail sooner than anything else, so actually the whole circuit with a fuse is less robust than without it.

Now, regarding the use of watchdog timers in the context of an RTOS: you should service the watchdog from the context of the task being monitored. A common mistake is to service the watchdog from a periodic timer service. RTOS timers typically run in the ISR context, so they might keep running and servicing the watchdog while a task is starving. Another mistake along these lines is to service the watchdog from various RTOS callbacks, also known as "hooks", which might also run in a different context than your task.

Once you use a watchdog timer, you need to carefully design (and test!) the behavior of the system when the watchdog expires. Here again, identifying the watchdog as an assertion helps, because you can use your general strategy of handling failed assertions. I've written more about this in the blog: ["A nail for a fuse"](https://embeddedgurus.com/state-space/2009/11/a-nail-for-a-fuse/).

I am always amazed by embedded designs in which developers go to great lengths to apply memory protection (an MPU or MMU) or watchdogs, while at the same time they don't sprinkle their code with basic assertions that perform rudimentary sanity checks.

Even more bizarre to me is when developers use assertions, but *disable* them in the production release (while keeping the MPU and the watchdogs). I'm sure the readers of this forum never do such an illogical thing, and always ship their products with carefully designed assertions, right?
On 2019-01-28 at 16:17, StateMachineCOM wrote:
> A watchdog timer is really a hardware-assisted, time-based assertion
> in the code. As such, it is just a part of the larger software
> development strategy known as Design by Contract (DbC).
>
> [...]
>
> I am always amazed by embedded designs, where developers go to great
> lengths to apply memory protection (MPU or MMU) or watchdogs, while at
> the same time they don't sprinkle their code with basic code
> assertions that perform rudimentary sanity checks.
>
> Even more bizarre to me is when developers use assertions, but
> *disable* them in the production release (while keeping the MPU and
> the watchdogs.) I'm sure the readers of this forum never do such an
> illogical thing, and always ship the products with carefully designed
> assertions, right?
Assertions are there to check that your code is sane. They are designed to be removed in production code. Assertions are not the same thing as checking your input. You definitely need to check your input, but once validated, it does not need revalidation. If the input is not valid, intelligent handling and recovery of the erroneous input is preferred over some rough action generated by an assertion failure.
> Assertions are not the same thing as checking your input.
Absolutely. You need to very carefully distinguish between erroneous behavior (a.k.a. a bug) and an exceptional condition, which is rare but can arise legitimately. Assertions are for errors. I've written specifically about this in the Dr. Dobb's article "An Exception or a Bug?" [http://www.drdobbs.com/an-exception-or-a-bug/184401686]
> Assertions are there to check that your code is sane.
> They are designed to be removed in production code.
I'm challenging exactly this well-beaten point of view, because it suggests that you stop checking the sanity of the production code. That would work only if *all* errors were completely removed during debugging. Are they really all removed in YOUR code?

Also, relevant for the OP: are you really suggesting leaving the watchdog in the production code while disabling other assertions? If so, WHY?

I'm looking forward to an interesting discussion...
On 1/28/19 3:31 PM, StateMachineCOM wrote:
>> Assertions are not the same thing as checking your input.
>
> Absolutely. You need to very carefully distinguish between the
> erroneous behavior (a.k.a. bug) and exceptional condition, which is
> rare but can arise legitimately. Assertions are for errors. I've
> written specifically about it in the Dr.Dobb's article "An Exception
> or a Bug?" [http://www.drdobbs.com/an-exception-or-a-bug/184401686 ]
>
>> Assertions are there to check that your code is sane. They are
>> designed to be removed in production code.
>
> I'm exactly challenging this beaten-path point of view, because it
> suggests to stop checking the sanity of the production code. This
> would work if *all* errors are completely removed during debugging.
> Are they really removed in YOUR code?
>
> And also, relevant for the OP, are you really suggesting to leave the
> watchdog in the production code while disabling other assertions. If
> so, WHY?
>
> I'm looking forward to interesting discussion...
A generally very sensible article.

I'm all for having error checking in production code, but I don't call those 'assertions'. I don't like the idea of leaving _assertions_ in, though, because (a) abort() or a hard reset is a mighty big hammer to apply that broadly, and (b) it deprives me of a very useful facility for debugging, because I can't use as many of them as I want if they all have to be left in the production builds.

I have a few macros like yours that supply a finer-grained set of options.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
On 29/1/19 8:22 am, Phil Hobbs wrote:
> On 1/28/19 3:31 PM, StateMachineCOM wrote:
>> [...]
>>
>> And also, relevant for the OP, are you really suggesting to leave the
>> watchdog in the production code while disabling other assertions. If
>> so, WHY?
>
> A generally very sensible article.
>
> I'm all for having error checking in production code, but I don't call
> those 'assertions'. I don't like the idea of leaving _assertions_ in,
> though, because (a) abort() or a hard reset is a mighty big hammer to
> apply that broadly, and (b) it deprives me of a very useful facility
> for debugging, because I can't use as many of them as I want if they
> all have to be left in the production builds.
We had a set of assert macros that would abort in the test environment, but return an error code when run in production so the caller needed to explicitly ignore or handle the error condition. That gives you proper feedback during testing but proper error handling in prod. Clifford Heath.
On 1/28/19 6:17 PM, Clifford Heath wrote:
> On 29/1/19 8:22 am, Phil Hobbs wrote:
>> [...]
>>
>> I'm all for having error checking in production code, but I don't
>> call those 'assertions'. I don't like the idea of leaving
>> _assertions_ in, though, because (a) abort() or a hard reset is a
>> mighty big hammer to apply that broadly, and (b) it deprives me of a
>> very useful facility for debugging, because I can't use as many of
>> them as I want if they all have to be left in the production builds.
>
> We had a set of assert macros that would abort in the test
> environment, but return an error code when run in production so the
> caller needed to explicitly ignore or handle the error condition. That
> gives you proper feedback during testing but proper error handling in
> prod.
>
> Clifford Heath.
I'm talking mostly about things like enforcing class invariants and so on. Putting those in inline functions, for instance, can be a big performance and code size hit, and once testing is done, you can be pretty sure they won't fire in production.

Memory corruption, null pointers, deadlocks, etc. definitely have to have run time checks. So it's nice to leave assert() for debug and roll your own macro set for runtime. That way you can have the fault tolerance of defensive programming without hiding bugs. (Maguire is still a good read.)

Most of my code is embedded or else console-mode simulations, so I don't really do a lot of error recovery.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
> @Clifford Heath
> We had a set of assert macros that would abort in the test
> environment, but return an error code when run in production
> so the caller needed to explicitly ignore or handle the
> error condition. That gives you proper feedback during
> testing but proper error handling in prod.
Seriously? Do you really believe that the error codes are checked and proper actions taken in *all* cases? Isn't this just kicking the can down the road and into some other code, which is ill-prepared to "handle" your bugs?
> @Phil Hobbs
> So it's nice to leave assert() for debug and roll
> your own macro set for runtime.
I'm not sure what you are proposing by "rolling your own" for production code. What are those "other versions" of assert macros supposed to do in production code?

For the OP, what is your advice specific to watchdog timers? Would you switch the watchdog off for production code? In that case, is it worth implementing a watchdog only for debugging? On the other hand, if you recommend keeping the watchdog in production code, why do you choose the watchdog and suppress other assertions? What's so special about the watchdog, and what should be done when it expires in production code?

The main point remains: bugs don't miraculously go away just because you stop checking for them. Do they?

Miro Samek
state-machine.com
