The Un-repeatable Fault
Testing Complex and Embedded Systems
What set of conditions could cause this event to occur?
When we have elicited all we can from the customer about the fault, it is time to proceed further in our analysis. This next step requires investigating the design to understand how the described failure symptom could occur. We break down the hardware and software, and the interactions within them, to understand the improper behavior of the features as the customer experiences it. If the investigator is in the automotive, pharmaceutical, or food industries, they can turn immediately to the Design Failure Mode and Effects Analysis (DFMEA) and the Process Failure Mode and Effects Analysis (PFMEA). If our investigator is lucky, they may find pointers to the cause of the issue in these documents.
To be successful, we need to perform a rigorous and systematic critique of the design, with enough follow-up to ensure that any correctable issues have been resolved. Usually, this approach means that we trace the symptom (usually an output) back until we discern potential causes. Note that this approach is very close to a logical fallacy called affirming the consequent, where we infer a particular antecedent (cause) from an observed consequent (effect). The reason this approach causes problems is that the effect may derive from more than one cause. However, we are suggesting only that we compile a list of candidate causes. We prioritize these candidates by likelihood once we think we have enough information to do so. Alternatively, we can use our candidate cause list and induce the observed failures in a controlled environment to test the theory of the root cause. Even with this testing, our conclusions remain vulnerable to error, since a demonstration of the failure is not necessarily a demonstration of the cause. One method that tries to deal with this fallacy is used with electronic parts and has the following steps:
1. Reproduce the observed failure if possible (let’s assume success)
2. Hypothesize an electronic or mechanical cause for the failure
3. Open the unit and test the hypothesis
4. If the hypothesis appears to be correct, then repair the part
5. Attempt to reproduce the observed failure
6. If we fail to reproduce the failure, we can have some confidence that we did indeed discover the cause of the failure
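The six steps above can be sketched as a small loop. This is only an illustration, not a method taken from the book: the `Unit` dictionary, the `Hypothesis` class, the `investigate` function, and the example defects are all hypothetical names chosen for the sketch, and a real unit would be inspected and repaired on the bench rather than in code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model of a unit under investigation: defect name -> defect present?
Unit = Dict[str, bool]


@dataclass
class Hypothesis:
    name: str
    inspect: Callable[[Unit], bool]  # step 3: open the unit and check the hypothesis
    repair: Callable[[Unit], None]   # step 4: repair the suspected part


def investigate(unit: Unit,
                reproduce: Callable[[Unit], bool],
                hypotheses: List[Hypothesis],
                attempts: int = 3) -> str:
    """Return the name of the likely cause, 'not reproduced', or 'unresolved'."""

    def fails() -> bool:
        # Try several times, because an intermittent fault may not show every run.
        return any(reproduce(unit) for _ in range(attempts))

    if not fails():                  # step 1: reproduce the observed failure
        return "not reproduced"
    for h in hypotheses:             # step 2: candidate causes, most likely first
        if not h.inspect(unit):      # step 3: hypothesis does not hold, try the next
            continue
        h.repair(unit)               # step 4: repair the part
        if not fails():              # step 5: attempt to reproduce the failure again
            return h.name            # step 6: some confidence in this cause
    return "unresolved"


def defect_hypothesis(defect: str) -> Hypothesis:
    # Illustrative helper: inspection reads the defect flag, repair clears it.
    return Hypothesis(defect,
                      lambda u: u[defect],
                      lambda u: u.update({defect: False}))


unit = {"cracked capacitor": False, "corroded connector": True}
reproduce = lambda u: u["corroded connector"]  # assumed failure mechanism
result = investigate(unit, reproduce,
                     [defect_hypothesis("cracked capacitor"),
                      defect_hypothesis("corroded connector")])
print(result)  # corroded connector
```

Note that the sketch returns a cause only when the failure disappears after the repair, which matches the limited claim in step 6: we gain some confidence, not proof.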
Another potential solution is to have failed material sent back for analysis. However, we are limited in what we can do with the failed material unless the failure is a hardware failure. If the product is part of a larger system, then removing the product from the system may remove the stimuli from the “failing” component. If the failure is a hard failure, then a review of the failing part and the nature of the failure provides evidence pointing to the source of the failure.
If you have not seen the fault, assume you haven’t found the trigger
It is not time to give up or say “there is no problem.” Customers never want to hear their suppliers tell them it is all in their heads. Time spent with the customer analyzing the application may be helpful. Finding the scenario where the problem seems to be more commonplace, and traveling to investigate the problem where it exists, are options for determining the cause. Do not forget that some problems can be related to geography; in other words, we are talking about temperature, humidity, rough roads, electromagnetic interference, and other environmental noise. We may even have to resort to a systematic replacement of components to find the guilty part, a task made even more difficult if the ‘part’ is actually software.
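The systematic replacement of components mentioned above can be sketched as a one-at-a-time swap against known-good parts. This is a minimal illustration under stated assumptions: `find_guilty_part`, the slot names, and the part labels are hypothetical, the failure is assumed to be reproducible on demand, and a single component is assumed to be at fault. A fault that depends on the interaction of two parts would not be found this way.

```python
from typing import Callable, Dict, Optional

# Hypothetical configuration: slot name -> identifier of the part fitted there.
Config = Dict[str, str]


def find_guilty_part(installed: Config,
                     known_good: Config,
                     fails: Callable[[Config], bool]) -> Optional[str]:
    """Swap in one known-good part at a time; return the slot whose swap cures the fault."""
    for slot in list(installed):
        original = installed[slot]
        installed[slot] = known_good[slot]  # replace exactly one component
        still_fails = fails(installed)      # retest the whole system
        installed[slot] = original          # restore, so only one variable changes per trial
        if not still_fails:
            return slot
    return None  # no single swap removed the failure


installed = {"sensor": "S-field", "harness": "H-field", "software": "build-field"}
known_good = {"sensor": "S-ref", "harness": "H-ref", "software": "build-ref"}
fails = lambda cfg: cfg["harness"] == "H-field"  # assumed failure mechanism for the demo

guilty = find_guilty_part(installed, known_good, fails)
print(guilty)  # harness
```

Treating the software build as one more slot reflects the point in the text: the swap itself is easy to express, but obtaining a trustworthy known-good build is the hard part.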
Pries, K. H., & Quigley, J. M. (2011). Testing complex and embedded systems. Boca Raton: CRC Press, pp. 40-41.