Developing software for a safety-related embedded system for the first time
I spend most of my working life with organisations that develop software for high-reliability, real-time embedded systems. Some of these systems are created in compliance with IEC 61508, ISO 26262, DO-178C or similar international standards.
When working with organisations that are developing software for their first safety-related design, I’m often asked to identify the key issues that distinguish this process from the techniques used to develop “ordinary” embedded software.
This is never an easy question to answer, not least because every organisation faces different challenges. However, in this article I’ve pulled together a list of steps that may provide some “food for thought” for organisations that are considering the development of their first safety-related embedded system.
Here’s my list:
- Document the system concept and scope
- Document the key system requirements
- Document the main hazards / threats / risks
- Consider relevant international standards
- Make an initial choice of software platform
- Make an initial choice of target microcontroller(s)
- Document your plans for system (run-time) monitoring
- Plan the first prototype
To be clear: the goal of these steps is to give the team a feel for the key development requirements in a safety-related design (without getting bogged down in the details at this stage).
In this article, I’ll consider the entries from this list. To keep things manageable, I’ll use the design of a controller for a washing machine as a “running example”.
Let’s start at the beginning …
1. Document the system concept and scope
At the start of the project, we need to record some basic information about the system.
For example: what is the system (product) called? What (in summary – just one paragraph) is it required to do? What other systems (if any) does our system need to interact with? Who will use the system? What qualifications / experience will the users have?
In the case of our domestic washing machine, our key requirement is that it can wash the clothes for a family, in a home environment: it must do so safely, without consuming excessive resources (power, water) and without generating too much noise.
Our washing machine won’t have to interact with any other systems (we will assume).
The system will be used in a home environment, by "unqualified individuals".
ASIDE: Looking further ahead
Our goal is to produce a reliable, real-time embedded system that can be [i] fully tested and verified during development; and [ii] monitored for faults when in use.
During the development, we can only conduct an effective test and verification (T&V) process for any system if we have a complete requirements specification to work from (since the requirements specification is the only “benchmark” against which the “correctness” – or otherwise – of the system may be assessed).
During the development process, your team will therefore need to produce a Software Requirements Document (SoRD). This is probably in addition to a System Requirements Document (SRD) and a Hardware Requirements Document (HRD), but these documents may be the responsibility of a different team.
In some cases, your team may be required to produce both a “high-level” SoRD (laying out what the software must do) and a “low-level” SoRD (laying out the algorithms to be used, etc). You will at least need a low-level SoRD in any safety-related project.
[How can you tell the difference, in practice, between high-level and low-level SoRDs? If a developer is presented with a low-level SoRD, he or she should be able to design and implement the code without needing to ask for clarification.]
A recent blog on this site by Stephen Friederichs makes some useful comments about writing requirements documents: I won’t repeat Stephen’s material here.
Note. In my experience, people sometimes spend a lot of time worrying about the tools they should use for recording the requirements. Use of a Word document (or similar) is as good as anything for your first design, with each requirement given a unique number. This is all that you need. The process that takes the time is not entering the requirements - it’s identifying the requirements in the first place.
2. Document the key system requirements
Writing a good SoRD takes time. At this stage in the project, I suggest that you do NOT attempt to create such a document: instead, I suggest that you focus on identifying the key system requirements, and record these very informally.
For example, this is how we might begin to record the requirements for our washing-machine controller:
- The user will select a wash program (e.g. ‘Cotton’) on the selector dial.
- The user will press the ‘Start’ switch.
- The door lock will be engaged.
- The water valve will be opened to allow water into the wash drum.
- If the wash program involves detergent, the detergent hatch will be opened. When the detergent has been released, the detergent hatch will be closed.
- When the ‘full water level’ is sensed, the water valve will be closed.
- If the program involves warm water, the water heater will be switched on.
- When the water reaches the correct temperature, the water heater will be switched off.
- The washer motor will be turned on to rotate the drum. The motor will go through a series of movements (at various speeds) to wash the clothes. (The precise set of movements carried out depends on the wash program that the user has selected.)
- At the end of the wash cycle, the motor will be stopped.
- The pump will then be switched on to drain the drum.
- When the drum is empty, the pump will be switched off.
- The door lock will then be released.
- During the operation various LEDs will be used to indicate where the system is in the wash cycle.
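Even at this informal stage, it can help to check that the list really does describe a closed sequence of steps. What follows is only a minimal sketch, assuming one state per step and a set of boolean inputs sampled once per tick; every name in it (wash_state_t, wash_inputs_t, wash_next_state and so on) is hypothetical, and the detergent hatch and LED steps are left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states, roughly one per step in the informal list. */
typedef enum {
    WASH_IDLE,
    WASH_LOCK_DOOR,
    WASH_FILL,
    WASH_HEAT,
    WASH_AGITATE,
    WASH_DRAIN,
    WASH_UNLOCK_DOOR,
    WASH_FAULT          /* entered if the state variable is corrupted */
} wash_state_t;

/* Hypothetical snapshot of the inputs, sampled once per tick. */
typedef struct {
    bool start_pressed;
    bool door_locked;
    bool drum_full;
    bool drum_empty;
    bool water_at_temp;
    bool program_uses_heat;
    bool wash_moves_done;
} wash_inputs_t;

/* Pure transition function: no I/O, so it can be tested on a PC. */
wash_state_t wash_next_state(wash_state_t s, const wash_inputs_t *in)
{
    assert(in != NULL);  /* development-time check only */

    switch (s) {
    case WASH_IDLE:
        return in->start_pressed ? WASH_LOCK_DOOR : WASH_IDLE;
    case WASH_LOCK_DOOR:
        return in->door_locked ? WASH_FILL : WASH_LOCK_DOOR;
    case WASH_FILL:
        if (!in->drum_full) {
            return WASH_FILL;
        }
        return in->program_uses_heat ? WASH_HEAT : WASH_AGITATE;
    case WASH_HEAT:
        return in->water_at_temp ? WASH_AGITATE : WASH_HEAT;
    case WASH_AGITATE:
        return in->wash_moves_done ? WASH_DRAIN : WASH_AGITATE;
    case WASH_DRAIN:
        return in->drum_empty ? WASH_UNLOCK_DOOR : WASH_DRAIN;
    case WASH_UNLOCK_DOOR:
        return in->door_locked ? WASH_UNLOCK_DOOR : WASH_IDLE;
    case WASH_FAULT:
    default:
        return WASH_FAULT;  /* stay in the fault state until reset */
    }
}
```

Keeping the transition logic separate from the hardware access is a design choice made for this sketch: it means the sequence can be exercised on a desktop machine long before the target hardware exists.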
This will be enough to get us started.
3. Document the main hazards / threats / risks
Early in the development cycle for any safety-related embedded system, we need to consider potential threats and hazards. This will include an assessment of the risks posed to users of the system or to those in the vicinity. The role of our system design process is then to include mechanisms in our design that will reduce such risks to an acceptable level.
From this perspective, a washing machine (the running example in this article) consists of a powerful electric motor enclosed in a metal casing. As a normal part of the device operation, the electric motor is used to rotate a heavy metal drum at high speed. Access to this potentially-dangerous mechanism is controlled by a door with an electronic locking mechanism.
The device is used in a domestic environment. There is a risk of injury if access is obtained to the drum while it is rotating. Such injuries could potentially be severe (including loss of a limb), or even life-threatening, particularly for a small child.
The device is connected to a pressurised water supply. The drum is filled with water as a normal part of its operation. There is a risk of flooding if the door is opened at the wrong time: we will assume that this is a “nuisance issue” (rather than a safety issue). However, a combination of water and an electrical supply must always be treated with caution.
In summary: a key threat to users that can be identified is failure of the door lock while the drum is rotating. A key design challenge would be to ensure that the risk of this event happening is reduced to an acceptable level.
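To make this threat concrete, here is a minimal sketch of one possible software interlock: the lock may be released only after the measured drum speed has been zero, with the motor commanded off, for a number of consecutive ticks. The names, the 200-tick threshold and the assumption that a drum-speed measurement is available are all hypothetical; a real design would also need to consider hardware measures and sensor failures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: e.g. 2 seconds at an assumed 10 ms tick. */
#define DRUM_STOPPED_TICKS_REQUIRED 200u

typedef struct {
    uint32_t stopped_ticks;  /* consecutive ticks with the drum at rest */
} door_interlock_t;

/*
 * Call once per tick. Returns true only when the drum has been at rest
 * (zero measured speed, motor commanded off) for the required time.
 * Any sign of movement restarts the count.
 */
bool door_interlock_update(door_interlock_t *il,
                           uint32_t drum_rpm,
                           bool motor_commanded_on)
{
    assert(il != NULL);  /* development-time check only */

    if ((drum_rpm != 0u) || motor_commanded_on) {
        il->stopped_ticks = 0u;
        return false;
    }

    if (il->stopped_ticks < DRUM_STOPPED_TICKS_REQUIRED) {
        il->stopped_ticks++;
    }

    return il->stopped_ticks >= DRUM_STOPPED_TICKS_REQUIRED;
}
```

The function refuses by default: it returns true only after a sustained period of positive evidence that the drum is at rest, so a single noisy reading cannot release the lock.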
4. Consider relevant international standards
The focus of this article is on the development of safety-related embedded systems. Most such designs will be produced in compliance with IEC 61508, ISO 26262, DO-178C or similar international standards. It is important to begin to consider the impact of any relevant standards at an early stage in the project.
Manufacturers of washing machines (and those supplying components for use in such devices) need to comply with various international safety standards, including in this case IEC 60335-1 and IEC 60730-1.
A key challenge is presented by Clause 19 in IEC 60335-1. This clause requires that electronic circuits must be designed and applied in such a way that a fault condition will not render the appliance unsafe with regard to electric shock, fire hazard, mechanical hazard or dangerous malfunction.
The effort required to demonstrate compliance with this core clause (and the standard as a whole) depends on the class of the control functions being developed, as defined in IEC 60730: the options are Class A, Class B or Class C.
- Class A control functions are not intended to be relied upon for the safety of the application (IEC 60730, H.2.22.1).
- Class B control functions are intended to prevent an appliance from entering an unsafe state; however, failure of the control function will not lead directly to a hazardous situation (IEC 60730, H.2.22.2).
- Class C control functions are intended to prevent special hazards such as explosion; failure of such functions could directly cause a hazard in the appliance (IEC 60730, H.2.22.3).
In this case, our washer controller will fall into Class B, because failure of the door lock (one of the most serious potential failures) will not lead directly to any injury.
5. Make an initial choice of software platform
The designs that I am involved with usually involve use of a “Time-Triggered” (TT) architecture.
In most cases, the starting point for a successful TT design is a “bare metal” software platform: that is, the system will not usually employ a conventional “RTOS”, Linux™ or Windows®. In this software platform, a single interrupt will be used, linked to the periodic overflow of a timer. A ‘polling’ process will then allow interaction with peripherals.
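The general shape of such a platform can be sketched in a few lines of C. This is not the code from any particular TT platform: it is a minimal illustration, assuming a hypothetical tt_tick_isr() that would be attached to the timer overflow interrupt, a small fixed task table, and tasks that run to completion. The two sample tasks exist only so that the sketch can be exercised on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 4u

typedef struct {
    void   (*run)(void);  /* task function: must run to completion */
    uint32_t delay;       /* ticks until the next release */
    uint32_t period;      /* ticks between releases (must be > 0) */
} tt_task_t;

static tt_task_t tt_tasks[MAX_TASKS];
static size_t tt_task_count;
static volatile uint32_t tt_pending_ticks;  /* set by ISR, used by dispatcher */

/* The only interrupt in the system: it just records that a tick occurred. */
void tt_tick_isr(void)
{
    tt_pending_ticks++;
}

/* Returns 0 on success, -1 if the table is full or the arguments are bad. */
int tt_add_task(void (*run)(void), uint32_t delay, uint32_t period)
{
    if ((tt_task_count >= MAX_TASKS) || (run == NULL) || (period == 0u)) {
        return -1;
    }
    tt_tasks[tt_task_count++] = (tt_task_t){ run, delay, period };
    return 0;
}

/*
 * Called repeatedly from the main loop. On a real MCU the update of
 * tt_pending_ticks would need a short critical section, and the CPU
 * would normally sleep between ticks; both are omitted here.
 */
void tt_dispatch(void)
{
    assert(tt_task_count <= MAX_TASKS);  /* development-time check only */

    while (tt_pending_ticks > 0u) {
        tt_pending_ticks--;
        for (size_t i = 0; i < tt_task_count; i++) {
            if (tt_tasks[i].delay == 0u) {
                tt_tasks[i].run();
                tt_tasks[i].delay = tt_tasks[i].period - 1u;
            } else {
                tt_tasks[i].delay--;
            }
        }
    }
}

/* Sample tasks (hypothetical): each one simply counts its own releases. */
uint32_t sample_fast_runs;
uint32_t sample_slow_runs;
void sample_fast_task(void) { sample_fast_runs++; }
void sample_slow_task(void) { sample_slow_runs++; }
```

Because the tasks are released in a fixed order from a single tick source, the sequence of task executions is known in advance, which is the property that the rest of the development process relies on.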
Time-triggered (TT) architectures built on this foundation have been used for many years in industries such as aerospace, because they have been found to provide the basis for safe and reliable systems.
Recently the wider benefits of this approach to software development have been more generally recognised. For example, according to international standard IEC 61508 (2010), the use of a TT architecture greatly reduces the effort required to test and certify a system.
By the time we are going through the process discussed in this article, we will usually have decided that the system will be based on a TT architecture.
There are, however, various different TT platforms that can be used, some of which are listed here.
Turning again to our washing machine.
One of the permitted architectures for a Class B control system is a single-MCU design with periodic self-test (IEC 60730, H.2.16.6).
In the case of our washer, our initial assessment is that Platform TT03 will meet our requirements.
6. Make an initial choice of target microcontroller(s)
When developing a safety-related embedded system, we – clearly – need to select an appropriate MCU. The most appropriate choice of MCU will depend on the type of system we wish to produce. We view the hardware platform that results from the choice of one or more MCUs as a “Processing Unit” (PU).
- A “Class B” PU can be based on a single-core MCU supported by an appropriate code library. An NXP LPC1769 MCU may be suitable for use in such a PU.
- A “SIL 2” PU can be based on a single-core MCU supported by a Safety Manual (or equivalent documentation). An STM32 MCU may be suitable for use in such a PU.
- A “SIL 3” PU can be created in different ways. A suitable PU may consist of a dual-core (lockstep) MCU supported by a Safety Manual (or equivalent documentation): a TMS570 family MCU may be suitable for use in such a PU. Alternatively, a SIL 3 PU may be based on a combination of two SIL 2 MCUs (such as STM32 MCUs).
In the case of our washing-machine controller, we’ll select an LPC1769 MCU as our target microcontroller.
7. Document your plans for system (run-time) monitoring
One of the main reasons for developing safety-related systems using a TT architecture is that such designs are easy to model during the development process. Using such models we can ensure that we are able to meet key system requirements (such as response times, task jitter and maximum CPU load).
[I’ll not consider the modelling process here – that will be the subject of a future article.]
All TT models are – inevitably – based on various assumptions:
- We have an operational CPU in each MCU
- We are running the correct program on each MCU
- We have an operational scheduler on each MCU
- We can transfer data between tasks on the same MCU without corruption
- We can transfer data between MCUs without corruption
- We have operational peripherals on each MCU
- We know the worst-case execution times (WCETs) of all tasks on each MCU
- We know the execution sequence of the tasks in each operating mode on each MCU
We therefore need to incorporate monitoring mechanisms that will allow us to test these assumptions at run time (and we need to decide what we will do if these assumptions are not met).
In the case of our washing machine, some low-level Power-On Self Tests (POSTs) will be required. NXP Application Note AN10918 describes in detail how to perform POST operations that are in compliance with IEC 60335 on an LPC1769 microcontroller. The Application Note is accompanied by a complete code library. We will assume that this library would be employed in our washing-machine controller.
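To give a feel for what one such test involves, here is a minimal sketch of a check on the “correct program” assumption: a CRC-32 is computed over the program image and compared with a reference value stored at build time. This is not the API of the NXP library mentioned above; the function names are hypothetical, and the bitwise CRC-32 (the common reflected form, polynomial 0xEDB88320) is chosen for brevity rather than speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR both 0xFFFFFFFF). Slow but small; a table-driven or
 * hardware CRC would normally be used on the target. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    assert((data != NULL) || (len == 0u));  /* development-time check only */

    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Returns true if the image matches the reference CRC recorded at build
 * time. How that reference is stored is left open in this sketch. */
bool image_is_intact(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

A check of this kind would typically run at power-on, with the system refusing to start (or entering a fail-safe mode) if the comparison fails.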
In the washer controller, we also assume that we would monitor both the task execution times and the task sequences: please take a look at Platform TT03 for some examples of the ways in which we could achieve this.
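As a rough illustration of the idea (and not the mechanism used in Platform TT03), the sketch below checks each completed task against a table that records the expected task order and a per-task execution-time budget. The names, the microsecond units and the assumption that the scheduler can measure each task's elapsed time are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per task release, in the order predicted by the model. */
typedef struct {
    uint8_t  task_id;
    uint32_t wcet_us;  /* execution-time budget taken from the model */
} task_budget_t;

typedef struct {
    const task_budget_t *expected;  /* expected sequence for this mode */
    size_t count;                   /* number of entries in the sequence */
    size_t next;                    /* index of the task expected next */
    bool   fault;                   /* latched once any check fails */
} task_monitor_t;

void task_monitor_init(task_monitor_t *m,
                       const task_budget_t *expected, size_t count)
{
    assert(m != NULL);  /* development-time check only */
    m->expected = expected;
    m->count = count;
    m->next = 0u;
    m->fault = (expected == NULL) || (count == 0u);
}

/*
 * Called by the scheduler each time a task completes. Returns true while
 * the system is behaving as modelled; returns false (and keeps returning
 * false) after a sequence error or an execution-time overrun.
 */
bool task_monitor_report(task_monitor_t *m, uint8_t task_id,
                         uint32_t elapsed_us)
{
    assert(m != NULL);  /* development-time check only */

    if (m->fault) {
        return false;
    }

    const task_budget_t *e = &m->expected[m->next];
    if ((task_id != e->task_id) || (elapsed_us > e->wcet_us)) {
        m->fault = true;
        return false;
    }

    m->next = (m->next + 1u) % m->count;
    return true;
}
```

The fault is latched deliberately in this sketch: once the system has departed from its model, the decision about what to do next (for example, stopping the drum and keeping the door locked) belongs to a separate fail-safe routine rather than to the monitor itself.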
8. Plan the first prototype
Following discussions outlined in the earlier parts of this article, I usually recommend that organisations construct a first – basic – prototype of their design.
This recommendation often surprises people (who expect to be advised that they should embark immediately on a “big bang” Waterfall-type design).
In my experience, development of an early prototype of the software framework helps to focus minds. It also means that key timing information (for example, how long it is likely to take to power-up the system, change modes or execute some key activities) can be obtained early in the project lifecycle.
Wherever possible, I’d avoid designing a custom PCB to support this first prototype: instead, I’d work with one or more low-cost evaluation boards, wired together as required.
Unless your system is trivial, development of such a prototype will probably take 1-2 months. During this time, the team will develop a much better understanding of the system requirements.
You will find a complete set of example code for the washing-machine example on this page (TTRD15a).
In this article I’ve pulled together a list of steps that are intended to provide some “food for thought” for organisations that are considering the development of their first safety-related product.
I typically consider these steps as part of a one-day introductory workshop with companies that are new to safety-related designs. In these workshops, we allow one (large) whiteboard for each step. We then photograph the result and move on to the next step. The goal is that – by the end of the day – the team will be ready to create a first system prototype.
Previous post by Michael J. Pont:
How to test a Tesla?
Next post by Michael J. Pont:
The three laws of safe embedded systems
"There is a risk of flooding if the door is opened at the wrong time: we will assume that this is a nuisance issue (rather than a safety issue). However, a combination of water and an electrical supply must always be treated with caution."
Would you believe a well-recognized haemo-dialysis machine did not take care of a similar situation? They have electronics un-isolated from the hydraulic section. For some reason, if the tubing pops up/breaks liquid can seep into the electronics and damage it.