EmbeddedRelated.com

What is the CAN bus-off state?

Started by learn 8 years ago | 6 replies | latest reply 4 years ago | 27279 views

Our target embedded board is based on the Renesas RH850F1L microcontroller.  Our embedded language is C.  The target board controls the driver memory seat in a minivan.

We use Vector CANoe to fetch diagnostic information from the target board.

I need to mature the CAN Interior Bus Off Performance Diagnostic Trouble Code (DTC).  One possible cause of this DTC is CAN bus issues.
The criterion for maturing this DTC is the CAN bus-off state.  What is the CAN bus-off state?

Reply by Ivan Cibrario Bertolotti, October 19, 2020

In order to monitor bus health (and also their own health), CAN controllers must keep two counters, called the transmit and receive error counters.  They start at zero and are incremented (upon error) and decremented (whenever the controller performs a successful tx/rx) according to a set of rules specified by the CAN standard.

The values of these counters affect the error handling mode of the controller (error-active versus error-passive) and, ultimately (when the transmit error counter exceeds the value 255), the transition to the bus-off state.  Roughly speaking, while it is in this state the controller switches off from the bus.  Mandatorily, it stops transmitting and acknowledging frames.  Whether or not it keeps receiving frames depends on the implementation.
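
As a rough illustration of how the two counters map to the controller's mode, here is a minimal C sketch. The type and function names are hypothetical (they do not come from any Renesas or Vector API), and the thresholds are the commonly quoted ones: error-passive once either counter exceeds 127, bus-off once the transmit error counter exceeds 255. It is a stateless mapping only; a real controller stays bus-off until its recovery sequence completes.

```c
#include <stdint.h>

/* Fault confinement states of a CAN node. */
typedef enum {
    CAN_ERROR_ACTIVE,
    CAN_ERROR_PASSIVE,
    CAN_BUS_OFF
} can_fc_state_t;

/* Hypothetical helper: derive the state from the two error counters.
 * The counters are passed as 16-bit values because the transmit error
 * counter must be able to exceed 255 for bus-off to be detected. */
can_fc_state_t can_fc_state(uint16_t tec, uint16_t rec)
{
    if (tec > 255u) {
        return CAN_BUS_OFF;        /* node must stop transmitting */
    }
    if (tec > 127u || rec > 127u) {
        return CAN_ERROR_PASSIVE;  /* node may only send passive error flags */
    }
    return CAN_ERROR_ACTIVE;       /* normal operation */
}
```

In a diagnostic routine you would read the counters from the controller's status registers (register names depend on the device) and pass them to a function like this before deciding whether to mature the DTC.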

The relevant sections of the CAN standard (ISO 11898-1, 2015) are:

  • Section 6, definition of bus-off state
  • Section 12, fault confinement and error counters
  • Section 12.1.4.4 specifically states what a CAN controller shall do when it enters the bus-off state

You may want to refer to them for more information.

All the best,

Reply by gillhern321, October 19, 2020

Basically, you have a buffer (counter) issue: the limit of 255 has been reached. An interrupt is set for that node for an error message (no response needed), but the error message tries to write to that counter address and it is full. The interrupt can no longer complete, because no receive is acknowledged while the buffer is full, so the interrupt stays active. Eventually all (error) interrupts are assigned, the CPU cannot process them, and it comes to a halt. You must have something similar (I'm not a CAN man :), so same issue, different format).

I saw a similar problem with the PDP 11/70s (UNIX kernel 1.9, could have been 2.0). stderr was set to the console (back in the day), the buffer size was 50k, and the interrupt daemon would send all system errors to the buffer file and to stdout (which was the console). When the 50k buffer was full, interrupts would sit in the proc queue using up all the available interrupts; the system would hang, then crash.

Three potential actions you could take:

1). Increase the buffer (counter) size (not a good thing).

2). Set a counter limit test case: when 255 is reached, zero it out and keep processing (again, not a good thing).

3). Set a counter limit test case: when the buffer count is reached, parse the error messages, total up identical messages, and append them to a separate log file with the error message and the total of 255 errors (one line item). Then zero out the counter and flag OBD to the driver console or, if in lab testing, send it to stdout (console). Do the same for all your nodes; the same potential error exists for each node being processed or initiated. Review the code that is issuing the error states (interrupts) and correct it. Document it and move on to the next issue.

When you're ready to compile for official release, you can turn it off or leave it on as a feature (smile).

I hope the concept helps. Ivan sounds like a CAN man pro, so I definitely concur with his statements and would review his recommendation.


Sincerely, Gill (old C slammer)

Reply by Ivan Cibrario Bertolotti, October 19, 2020

Hi Gill,

the analogy between CAN error handling and interrupt handling escaped me at first, but it is indeed quite interesting.  It highlights a nice analogy between interrupt overload and an excessive number of bus errors.  They seem far away from each other at first sight, but definitely are not.

In the case of CAN we can say that the value of the error counters roughly represents the ratio between the number of errors that a node detects, and the number of operations it performs successfully.  If the ratio is too high, it is an indication that "something is wrong" and a corrective action is needed (like it happens on interrupt overload).
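
To make the "ratio" idea concrete, here is a small simulation sketch in C. It assumes the basic rules usually quoted for the transmit error counter (plus 8 on a failed transmission, minus 1 on a successful one, never below zero) and ignores the special cases in the standard. The function names and the fixed failure pattern are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

#define TEC_STEP_ON_ERROR 8u    /* assumed increment per failed transmission */
#define TEC_BUS_OFF_LIMIT 255u  /* bus-off when the counter exceeds this */

/* Update a simulated transmit error counter after one attempt. */
uint16_t tec_update(uint16_t tec, int tx_ok)
{
    if (tx_ok) {
        return (tec > 0u) ? (uint16_t)(tec - 1u) : 0u;
    }
    return (uint16_t)(tec + TEC_STEP_ON_ERROR);
}

/* Simulate a node whose every `period`-th transmission fails.
 * Returns the attempt number at which the counter exceeds 255,
 * or 0 if that does not happen within max_attempts. */
unsigned attempts_until_bus_off(unsigned period, unsigned max_attempts)
{
    uint16_t tec = 0u;

    if (period == 0u) {
        return 0u;  /* invalid pattern, treated as "never" */
    }
    for (unsigned i = 1u; i <= max_attempts; ++i) {
        int tx_ok = (i % period) != 0u;
        tec = tec_update(tec, tx_ok);
        if (tec > TEC_BUS_OFF_LIMIT) {
            return i;
        }
    }
    return 0u;
}
```

Under these assumptions, a node whose every transmission fails crosses the limit on attempt 32, a node failing every second attempt crosses it on attempt 74, and a node failing only once in nine attempts never does, because eight successes cancel one error.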

As also stated by Texane, in CAN the bus-off state is meant to report to the upper software layers a potentially serious error with bus communication (if the bus or other nodes on the bus are faulty), and also to disconnect a node from the bus (if the node itself is faulty) when just going error-passive is not enough.

Concerning potential actions, besides the ones already mentioned in your post, I would also have a look at:

- Local configuration of CAN bus timing parameters (bit rate, but also the sampling point position, defined by the TSEG1 and TSEG2 segment lengths) on ALL nodes on the bus (not just the one that goes bus-off).

- I say all nodes because (due to the error globalization mechanism of CAN) any error detected by any node on the bus is immediately broadcast to (and counted by) all other nodes, with only a few exceptions.

- Proper bus termination, especially if the bus is working at a high bit rate (> 125 kb/s).

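- It may be useful to remark that a CAN bus works only if there are at least TWO functioning nodes connected to the bus (due to the frame acknowledge mechanism).  If there is only ONE node on the bus, whenever that node tries to transmit there is an error (no ACK) and its transmit error counter climbs quickly.  Strictly speaking, missing ACKs alone only take the node to error-passive, where it keeps retransmitting; any additional errors (for instance from bad wiring or termination) can then push it on to bus-off.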
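
To compare the timing configuration across nodes (the first item above), it helps to compute the resulting bit rate and sampling point from each node's settings. The following C sketch assumes the usual model of one sync quantum followed by TSEG1 and TSEG2 quanta. The struct and function names are hypothetical, and the values are the real segment lengths in time quanta; many controllers store "length minus one" in their registers, so check the hardware manual before plugging in raw register values.

```c
#include <stdint.h>

/* Hypothetical description of one node's bit timing.
 * tseg1 covers the propagation segment plus phase segment 1,
 * tseg2 is phase segment 2; both are in time quanta. */
typedef struct {
    uint32_t clock_hz;   /* CAN peripheral clock */
    uint32_t prescaler;  /* clock divider, must be non-zero */
    uint32_t tseg1;
    uint32_t tseg2;
} can_bit_timing_t;

/* Time quanta per bit: 1 sync quantum + TSEG1 + TSEG2. */
uint32_t can_quanta_per_bit(const can_bit_timing_t *t)
{
    return 1u + t->tseg1 + t->tseg2;
}

/* Nominal bit rate in bit/s (integer division). */
uint32_t can_bit_rate(const can_bit_timing_t *t)
{
    return t->clock_hz / (t->prescaler * can_quanta_per_bit(t));
}

/* Sampling point position in tenths of a percent of the bit time
 * (the sample is taken at the end of TSEG1), rounded down. */
uint32_t can_sample_point_permille(const can_bit_timing_t *t)
{
    return (1000u * (1u + t->tseg1)) / can_quanta_per_bit(t);
}
```

For example, a 40 MHz clock with a prescaler of 5, TSEG1 = 12 and TSEG2 = 3 gives 16 quanta per bit, 500000 bit/s and a sampling point of about 81.2 %. Nodes whose results differ noticeably from the rest of the bus are good candidates for the source of the errors.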


By the way...

I have fond memories of PDP 11/70s, we had DEC RSX-11 on them. :)

Best regards, Ivan


Reply by learn, October 19, 2020

Thank you for the easy-to-understand explanation.  How do I acquire the CAN standard (ISO 11898-1, 2015)?  Do I have to purchase it?  It's quite expensive.

Reply by Ivan Cibrario Bertolotti, October 19, 2020

The short answer is yes, especially if you would like to build/certify a CAN-based product.  I would recommend two things:  1) Check if your company or academic institution has a subscription to ISO literature; in that case you may be able to download the standard for free or at a discount.  2) Depending on what you would like to do, you do not necessarily have to purchase the whole standard.  For instance, if you are mainly interested in the data link layer, part 1 is enough.

There are copies of ISO 11898-1, 2003 floating around the Internet, but I would rather not use an obsolete version of the standard.  Moreover, the 2003 edition does not cover CAN FD.

Regards,
Ivan

Reply by texane, October 19, 2020

Hi,

When a CAN node detects errors, it sends error frames, which disturb the bus traffic and can eventually block the other nodes' valid frames completely.

To prevent that, a node sending too many error frames will ultimately go to a bus-off state, in which it no longer participates in the traffic.

Generally, bus-off is an error condition pointing to a potential issue in the node itself (misconfiguration ...).

Cheers,
