EmbeddedRelated.com

How to avoid a task not executed in a real-time OS?

Started by Robert Willy January 21, 2019
On 22/01/2019 22:10, Hans-Bernhard Bröker wrote:
> On 22.01.2019 at 09:19, David Brown wrote:
>> On 21/01/2019 13:40, Robert Willy wrote:
>>> Hi,
>>> I was asked the question in the title some time ago. I had some real-time
>>> embedded system experience but with RTOS. A watch-dog can avoid a task not
>>> called in a real-time system. But in a RTOS, what is the right answer for it?
>>
>> I have heard the use of watchdogs as being like hitting a dead man on
>> the head with a hammer in the hope that it will wake him.
>
> That's overstating it a teeny little bit. ;-)
>
> It's more like hitting a newly dead heart with a hefty jolt of
> electricity to possibly make it restart --- a procedure that is quite
> definitely not recommended to be used on a non-dead one.
Perhaps. But like a defibrillator, the watchdog does nothing to deal with the actual cause of the problem.
> And just like the bite of a watchdog, that sometimes actually does work.
>
>> If the problem is in software, however, it will just lead to the
>> same situation again and again.
>
> That's by no means certain. It all depends on how it happened that the
> software got itself stuck in a situation that didn't occur during
> testing (or the software would never have been released into the wild,
> right?) But somehow, right now it did.
>
> If e.g. a once-in-a-blue-moon "forbidden" excession of design
> limitations on some input was the reason, a watch-dog reset cures the
> problem until the next event of that kind --- i.e. possibly forever.
We can say without doubt that the watchdog does not cure the problem. If software causes a hang that triggers the watchdog, there is a bug in the software. That applies regardless of how it happened, how good or bad your testing was, what the input values were, etc. (Note that if something exceeded /specified/ design limitations, then that is outside the realm of the software.) Since the watchdog does not magically fix the software, the problem remains.

Clearly, not all systems need the same level of quality, reliability, and robustness. You don't design and test your "amusing" singing birthday card to the same levels as you do for your submarine control system. So sometimes a watchdog reset on a software hang is a good enough way to handle the symptoms of some kinds of software bugs. You balance the cost of the unreliability against the cost of fixing it - engineering is about making things good enough, not perfect.

But you do need to be aware of what the watchdog actually does, and does not do. Some developers use it as a crutch to avoid the effort of writing correct code, or of testing appropriately. "If there is an error on the communication line, it will lead to a timeout - the watchdog will reset the system, so that's fine." "The tasks will only have a conflict and a deadlock if the user presses the button at the same time as the screen is updating - that is unlikely to happen, and the watchdog will fix it if it does." Some use it as a crutch to avoid debugging and fixing problems: "The software hung during testing, but the watchdog restarted it fine. We don't think it will happen at the customer's site." Or "A watch-dog can avoid a task not called in a real-time system", as the OP claimed.

That does /not/ mean I don't recommend a watchdog (though frequently I do not enable them - I'd rather the customer reported the problem so we can fix it properly). You just have to know /why/ you have a watchdog, and use it appropriately.
David Brown <david.brown@hesbynett.no> writes:
> You don't design and test your "amusing" singing birthday card to the > same levels as you do for your submarine control system.
Obligatory: https://www.washingtonpost.com/news/morning-mix/wp/2017/09/25/the-navys-adding-a-new-piece-of-a-equipment-to-nuclear-submarines-xbox-controllers
On 23/01/2019 22:13, Paul Rubin wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> You don't design and test your "amusing" singing birthday card to the
>> same levels as you do for your submarine control system.
>
> Obligatory:
>
> https://www.washingtonpost.com/news/morning-mix/wp/2017/09/25/the-navys-adding-a-new-piece-of-a-equipment-to-nuclear-submarines-xbox-controllers
We used a Playstation controller for a whole submarine, not just a periscope. (It was an ROV - a remotely operated vehicle - so no people in it.)
A watchdog timer is really a hardware-assisted, time-based assertion in the code. As such, it is just a part of the larger software development strategy known as Design by Contract (DbC).

The value of identifying the watchdog timer as an *assertion* is that it informs you what to expect from it. For example, you can't expect an assertion to "avoid" or "fix" a problem (like in the OP "avoid a task not executed"). This is because assertions neither handle nor prevent errors, in the same way as fuses in electrical circuits don't prevent accidents or abuse. In fact, a fuse is an intentionally introduced weak spot in the circuit that is designed to fail sooner than anything else, so actually the whole circuit with a fuse is less robust than without it.

Now, regarding the use of watchdog timers in the context of an RTOS: you should service the watchdog from the context of the task being monitored. A common mistake is to service the watchdog from a periodic timer service. RTOS timers typically run in the ISR context, so they might keep running and servicing the watchdog while a task is starving. Another mistake along these lines is to service the watchdog from various RTOS callbacks, also known as "hooks", which might also run in a different context than your task.

Once you use a watchdog timer, you need to carefully design (and test!) the behavior of the system when the watchdog expires. Here again, identifying the watchdog as an assertion helps, because you can use your general strategy of handling failed assertions. I've written more about this in the blog: ["A nail for a fuse"](https://embeddedgurus.com/state-space/2009/11/a-nail-for-a-fuse/).

I am always amazed by embedded designs in which developers go to great lengths to apply memory protection (an MPU or MMU) or watchdogs, while at the same time they don't sprinkle their code with basic assertions that perform rudimentary sanity checks.

Even more bizarre to me is when developers use assertions, but *disable* them in the production release (while keeping the MPU and the watchdogs). I'm sure the readers of this forum never do such an illogical thing, and always ship their products with carefully designed assertions, right?
On 2019-01-28 at 16:17, StateMachineCOM wrote:
> A watchdog timer is really a hardware-assisted, time-based assertion
> in the code. As such, it is just a part of the larger software
> development strategy known as Design by Contract (DbC).
>
> [...]
>
> I am always amazed by embedded designs, where developers go to great
> lengths to apply memory protection (MPU or MMU) or watchdogs, while at
> the same time they don't sprinkle their code with basic code
> assertions that perform rudimentary sanity checks.
>
> Even more bizarre to me is when developers use assertions, but
> *disable* them in the production release (while keeping the MPU and
> the watchdogs.) I'm sure the readers of this forum never do such an
> illogical thing, and always ship the products with carefully designed
> assertions, right?
Assertions are there to check that your code is sane. They are designed to be removed in production code. Assertions are not the same thing as checking your input. You definitely need to check your input, but once validated, it does not need revalidation. If the input is not valid, intelligent handling and recovery of the erroneous input is preferred over some rough action generated by an assertion failure.
> Assertions are not the same thing as checking your input.
Absolutely. You need to very carefully distinguish between erroneous behavior (a.k.a. a bug) and an exceptional condition, which is rare but can arise legitimately. Assertions are for errors. I've written specifically about this in the Dr. Dobb's article "An Exception or a Bug?" [http://www.drdobbs.com/an-exception-or-a-bug/184401686]
> Assertions are there to check that your code is sane.
> They are designed to be removed in production code.
I'm challenging exactly this well-beaten point of view, because it suggests that you stop checking the sanity of the production code. That would work only if *all* errors were completely removed during debugging. Are they really all removed in YOUR code?

Also, relevant for the OP: are you really suggesting leaving the watchdog in the production code while disabling other assertions? If so, WHY?

I'm looking forward to an interesting discussion...
On 1/28/19 3:31 PM, StateMachineCOM wrote:
>> Assertions are not the same thing as checking your input.
>
> Absolutely. You need to very carefully distinguish between the
> erroneous behavior (a.k.a. bug) and exceptional condition, which is
> rare but can arise legitimately. Assertions are for errors. I've
> written specifically about it in the Dr.Dobb's article "An Exception
> or a Bug?" [http://www.drdobbs.com/an-exception-or-a-bug/184401686 ]
>
>> Assertions are there to check that your code is sane. They are
>> designed to be removed in production code.
>
> I'm exactly challenging this beaten-path point of view, because it
> suggests to stop checking the sanity of the production code. This
> would work if *all* errors are completely removed during debugging.
> Are they really removed in YOUR code?
>
> And also, relevant for the OP, are you really suggesting to leave the
> watchdog in the production code while disabling other assertions. If
> so, WHY?
>
> I'm looking forward to interesting discussion...
A generally very sensible article.

I'm all for having error checking in production code, but I don't call those 'assertions'. I don't like the idea of leaving _assertions_ in, though, because (a) abort() or a hard reset is a mighty big hammer to apply that broadly, and (b) it deprives me of a very useful facility for debugging, because I can't use as many of them as I want if they all have to be left in the production builds.

I have a few macros like yours that supply a finer-grained set of options.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
On 29/1/19 8:22 am, Phil Hobbs wrote:
> On 1/28/19 3:31 PM, StateMachineCOM wrote:
>> [...]
>>
>> And also, relevant for the OP, are you really suggesting to leave the
>> watchdog in the production code while disabling other assertions. If
>> so, WHY?
>
> A generally very sensible article.
>
> I'm all for having error checking in production code, but I don't call
> those 'assertions'. I don't like the idea of leaving _assertions_ in,
> though, because (a) abort() or a hard reset is a mighty big hammer to
> apply that broadly, and (b) it deprives me of a very useful facility
> for debugging, because I can't use as many of them as I want if they
> all have to be left in the production builds.
We had a set of assert macros that would abort in the test environment, but return an error code when run in production so the caller needed to explicitly ignore or handle the error condition. That gives you proper feedback during testing but proper error handling in prod. Clifford Heath.
On 1/28/19 6:17 PM, Clifford Heath wrote:
> On 29/1/19 8:22 am, Phil Hobbs wrote:
>> [...]
>>
>> I'm all for having error checking in production code, but I don't
>> call those 'assertions'. I don't like the idea of leaving
>> _assertions_ in, though, because (a) abort() or a hard reset is a
>> mighty big hammer to apply that broadly, and (b) it deprives me of a
>> very useful facility for debugging, because I can't use as many of
>> them as I want if they all have to be left in the production builds.
>
> We had a set of assert macros that would abort in the test
> environment, but return an error code when run in production so the
> caller needed to explicitly ignore or handle the error condition. That
> gives you proper feedback during testing but proper error handling in
> prod.
>
> Clifford Heath.
I'm talking mostly about things like enforcing class invariants and so on. Putting those in inline functions, for instance, can be a big performance and code size hit, and once testing is done, you can be pretty sure they won't fire in production.

Memory corruption, null pointers, deadlocks, etc. definitely have to have run time checks. So it's nice to leave assert() for debug and roll your own macro set for runtime. That way you can have the fault tolerance of defensive programming without hiding bugs. (Maguire is still a good read.)

Most of my code is embedded or else console-mode simulations, so I don't really do a lot of error recovery.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
> @Clifford Heath
> We had a set of assert macros that would abort in the test
> environment, but return an error code when run in production
> so the caller needed to explicitly ignore or handle the
> error condition. That gives you proper feedback during
> testing but proper error handling in prod.
Seriously? Do you really believe that the error codes are checked and proper actions taken in *all* cases? Isn't this just kicking the can down the road and into some other code, which is ill-prepared to "handle" your bugs?
> @Phil Hobbs
> So it's nice to leave assert() for debug and roll
> your own macro set for runtime.
I'm not sure what you are proposing by "rolling your own" for production code. What are those "other versions" of assert macros supposed to do in production code?

For the OP, what is your advice specific to watchdog timers? Would you switch the watchdog off for production code? In that case, is it worth implementing a watchdog only for debugging? On the other hand, if you recommend keeping the watchdog in production code, why do you choose the watchdog and suppress other assertions? What's so special about the watchdog, and what should be done when it expires in production code?

The main point remains: bugs don't miraculously go away just because you stop checking for them. Do they?

Miro Samek
state-machine.com
