An Introduction To Software Failure Modes Effects Analysis - Regarding Functional Safety

Introduction

The Software Failure Modes Effects Analysis (SFMEA) [God00] is a method for analyzing the safety characteristics of critical systems that are based on software. It builds on the Failure Modes Effects Analysis (FMEA), which is widely known and used in industry [Wan11]. The FMEA has been used intensively to evaluate the safety of critical hardware systems in the automotive, aerospace and military sectors. Applying it to software, however, has proven problematic. Software differs from hardware in that almost all software errors are design errors or logical errors. In that sense software does not "fail": it does exactly what it was programmed to do. The SFMEA has not been very popular in the past, and only a few papers provide comprehensive examples of how the analysis is carried out for software. Nevertheless, Goddard [God00] and Bowles and Wan [BW01] showed that the SFMEA can efficiently uncover potential hazards in software projects that other analytical approaches missed. Furthermore, analyzing the software of a given system also reveals certain hardware errors that cause failures within the software. The SFMEA can be used for embedded systems as well as for large software projects [BW01]. Some standards, e.g. IEC 61508, also mention the SFMEA for the evaluation of critical systems.

The SFMEA procedure

The SFMEA can be carried out in a top-down or a bottom-up manner. In my thesis the SFMEA is complemented with a Fault Tree Analysis (FTA). The FTA itself follows a top-down approach. Since it is preferable to combine two different viewpoints, only the bottom-up approach is considered for the SFMEA. An SFMEA consists of the following steps. First, the scope of the planned analysis is defined. This includes the viewpoints to be considered during the analysis: functionality, maintainability, usability, serviceability, vulnerability and interfaces. Next, the resources needed to carry out the analysis have to be identified. Furthermore, several terms have to be defined and rated; their scaling and rating can differ depending on the software that is being analyzed. Those terms are defined as follows:

The detectability describes how detectable a given failure mode is. It is a rating of how likely the system is to detect the given failure mode.
The severity rates how dangerous or catastrophic the effect of a failure could be.
The likelihood rates how likely a given failure is to occur.

After defining the ratings for the detectability, severity and likelihood, a table or template for the SFMEA has to be created or chosen.
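As a minimal sketch of what such a template might look like, the following Python snippet defines one row type with the columns discussed in this post and checks each rating against a scale. The 1 to 10 scale, the column names and the example row are assumptions made for illustration; they are not prescribed by [God00]. The sketch also assumes that a higher rating is always worse, so a high detectability rating means the failure mode is hard to detect.

```python
from dataclasses import dataclass

# Hypothetical scale: every rating is an integer from 1 (best) to 10 (worst).
RATING_MIN, RATING_MAX = 1, 10


@dataclass
class SfmeaRow:
    """One row of an SFMEA table (the column names are assumptions)."""
    component: str
    failure_mode: str
    root_cause: str
    effect: str
    severity: int       # 1 = negligible, 10 = catastrophic
    likelihood: int     # 1 = very unlikely, 10 = almost certain
    detectability: int  # 1 = almost always detected, 10 = practically undetectable

    def __post_init__(self):
        # Reject ratings that fall outside the agreed scale.
        for name in ("severity", "likelihood", "detectability"):
            value = getattr(self, name)
            if not RATING_MIN <= value <= RATING_MAX:
                raise ValueError(
                    f"{name} must be in {RATING_MIN}..{RATING_MAX}, got {value}"
                )


# Invented example row, not taken from a real analysis.
row = SfmeaRow(
    component="speed controller",
    failure_mode="setpoint variable overflows",
    root_cause="missing range check on input",
    effect="motor runs at maximum speed",
    severity=9,
    likelihood=3,
    detectability=4,
)
```

Fixing the scale in one place keeps the ratings comparable across all rows, whichever scale the team agrees on.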

Once these preparations are finished, the actual analysis begins. Potential failure modes are researched and characterized. Failure modes are the ways in which a component or system might fail. For each failure mode the root cause should be identified. Various techniques can be applied to find the failure modes, including architectural considerations, a system-level preliminary hazard analysis, a review of the requirements and especially the analysis of critical variables. Goddard [God00] describes these methods in more detail.

In the next step the effects of the failure modes are determined. Every failure found is rated by its likelihood, detectability and severity. The product of those three values is called the Risk Priority Number (RPN), which reflects the importance of a given entry. Then preventive measures and countermeasures that lower the threat of a given failure are identified. Once the countermeasures are identified, the RPN and the three underlying ratings can be revised, which usually leads to a better (lower) RPN and thus to a lower risk. If the threat is still too high, the procedure can be repeated to identify further countermeasures and protective measures. The main difference between an SFMEA and an FMEA is that different viewpoints and failure modes are used to analyze the underlying system.
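The RPN calculation and the revision step can be illustrated with a few lines of Python. This is only a sketch: the failure modes, the ratings, the 1 to 10 scale and the countermeasure are invented for illustration, and a higher rating is assumed to be worse for all three values.

```python
def rpn(severity, likelihood, detectability):
    """Risk Priority Number: the product of the three ratings."""
    return severity * likelihood * detectability


# Invented entries: (failure mode, severity, likelihood, detectability).
entries = [
    ("log file fills the disk", 3, 5, 2),
    ("stale sensor value is used", 7, 4, 6),
    ("setpoint variable overflows", 9, 3, 4),
]

# Rank the entries so that the highest RPN is handled first.
ranked = sorted(entries, key=lambda e: rpn(*e[1:]), reverse=True)
for name, sev, lik, det in ranked:
    print(f"{rpn(sev, lik, det):4d}  {name}")

# Re-rate one entry after a hypothetical countermeasure: a timestamp check
# that makes stale values less likely and easier to detect.
before = rpn(7, 4, 6)
after = rpn(7, 2, 2)
print(f"stale sensor value is used: RPN {before} -> {after}")
```

In this sketch the severity stays the same after the countermeasure, because the effect of the failure is unchanged; only its likelihood and detectability improve.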

See an example of an SFMEA table at the end of this blog post.

Fault Tree Analysis

An FTA is a top-down analysis approach. An FTA for software works in a similar way to an ordinary FTA; the only differences are the types of events and the modes used. FTAs are especially useful for finding the root causes of specific failures, in particular when a failure is caused by a combination of multiple errors. An FTA starts with gathering the necessary documents, such as requirements and design documents. Then the FTA team brainstorms failure events. These failures are placed at the top of the tree.

The team then tries to identify the causes of the failures. The causes that lead to a failure event can be combined by logical "and" or "or" blocks. This process creates smaller sub-trees for each failure event. The FTA can thus identify the root causes of failures and also the path to mitigating these root errors. By looking at the risk and severity of each sub-path, starting from the bottom, an overall risk and severity score can be derived by following the path of the error mitigation. The team can then revise the applicable requirements and design documents.
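How causes combine through "and" and "or" blocks can be sketched as a small tree structure in Python. The class names, the helper functions and the example events below are hypothetical and serve only as an illustration. The sketch evaluates whether a failure event occurs for a given set of root causes and lists the minimal combinations of root causes that trigger it (often called minimal cut sets).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Basic:
    """A root cause (leaf of the tree)."""
    name: str


@dataclass
class Gate:
    """A failure event whose causes are joined by an 'and' or 'or' block."""
    name: str
    kind: str  # "and" or "or"
    inputs: List[Union["Gate", Basic]]

    def __post_init__(self):
        if self.kind not in ("and", "or"):
            raise ValueError(f"unknown gate kind: {self.kind!r}")


def occurs(node, active):
    """Return True if the event occurs given the set of active root causes."""
    if isinstance(node, Basic):
        return node.name in active
    results = [occurs(child, active) for child in node.inputs]
    return all(results) if node.kind == "and" else any(results)


def cut_sets(node):
    """Return the minimal sets of root causes that trigger the event."""
    if isinstance(node, Basic):
        return [frozenset([node.name])]
    child_sets = [cut_sets(child) for child in node.inputs]
    if node.kind == "or":
        # Any cut set of any child is enough on its own.
        combined = [cs for sets in child_sets for cs in sets]
    else:
        # "and": one cut set of every child must hold at the same time.
        combined = [frozenset()]
        for sets in child_sets:
            combined = [a | b for a in combined for b in sets]
    # Drop duplicates and supersets so that only minimal sets remain.
    unique = set(combined)
    minimal = [cs for cs in unique if not any(other < cs for other in unique)]
    return sorted(minimal, key=lambda cs: (len(cs), sorted(cs)))


# Invented example tree with the failure event at the top.
top = Gate("motor overspeed", "or", [
    Gate("bad setpoint accepted", "and", [
        Basic("sensor delivers stale value"),
        Basic("plausibility check missing"),
    ]),
    Basic("integer overflow in setpoint"),
])

for cs in cut_sets(top):
    print(sorted(cs))
```

A set with a single root cause corresponds to a single point of failure, while a set with several root causes describes a failure that only arises from a combination of errors.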

FTA and SFMEA comparison

A bottom-up SFMEA can be used to complement an FTA. Together, the two methods yield a detailed safety analysis. Nicodemos et al. [GLAS12] analyzed multiple approaches to combining the SFMEA and the FTA and applied both methods to space-critical software projects. The authors show that the FTA can be used to determine gaps in the fulfilment of the defined requirements and that the SFMEA can be used to analyze failures in the software requirements definition itself. Ann Marie Neufelder, the chairperson of the IEEE 1633 Recommended Practices for Reliable Software working group, also mentions that SFMEA and FTA can be combined to complement each other [Neu16].

The different analysis approaches of an FTA and an SFMEA:

SFMEA vs FTA

An SFMEA is especially useful for identifying failure modes and single points of failure. The analysts require deep knowledge of the software. The FTA is especially useful for identifying failures that are caused by the combination of multiple events; an SFMEA fails to detect these failures.

If you want to know more, feel free to read my master's thesis (and feel free to contact me if you do not have access to the resources).

Example SFMEA Table

Example: SFMEA Table

References

[God00] P. L. Goddard. “Software FMEA techniques.” In: Annual Reliability and Maintainability Symposium. 2000 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.00CH37055). 2000, pp. 118–123. DOI: 10.1109/RAMS.2000.816294.

[Wan11] M. H. Wang. “A cost-based FMEA decision tool for product quality design and management.” In: Proceedings of 2011 IEEE International Conference on Intelligence and Securit

[BW01] J. B. Bowles, C. Wan. “Software failure modes and effects analysis for a small embedded control system.” In: Annual Reliability and Maintainability Symposium. 2001 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.01CH37179). 2001, pp. 1–6. DOI: 10.1109/RAMS.2001.902433.

[GLAS12] F. G. Nicodemos, C. Lahoz, M. A. D. Abdala, O. Saotome. “Using Combined SFTA and SFMECA Techniques for Space Critical Software.” In: (Jan. 2012), pp. 12–.

[Neu16] A. M. Neufelder. “How to apply software reliability engineering.” SoftRel. 2016. URL: http://www.softrel.com/softwarereliability.pdf.